Wednesday, March 29, 2023

Brutal Efficiency

In a December 2006 interview with CNET, Sun Microsystems Chief Technology Officer Greg Papadopoulos repeated the 1943 statement attributed to IBM’s then CEO, Thomas J. Watson, that the world only needed five computers. Papadopoulos was referring to the large service providers that were just starting to emerge. 2006 was also the year Amazon Web Services, now synonymous with cloud computing, released its S3 storage service and its EC2 compute service.

Papadopoulos also noted that the large service providers, due to their scale and their investment in automation, were capable of driving “brutal efficiencies.” The web-scale services (web search, e-commerce, etc.) drove very high levels of utilization, and Papadopoulos believed the service providers would follow that model. That is exactly what happened with the hyperscale public cloud providers. They drive extreme levels of efficiency through secure virtualization and continuous capacity management. As a result, hyperscale service providers are now the standard for IT efficiency.

In the past, these levels of utilization and efficiency have been difficult to achieve in on-prem organizational IT. VMware provided the hypervisor software that drove a wave of consolidation and efficiency improvements, but efficiency gains have stagnated since. The inability to operate on-premises organizational IT in a highly efficient manner is a large driver of moving on-premises software to SaaS providers, and on-premises compute to cloud providers. But in most cases, “lifting and shifting” heavy, traditional applications to the cloud proves more costly than operating them on-prem.

Another issue is that current organizational sustainability goals require new considerations about IT efficiency. In fact, in some cases, migrating on-prem software to SaaS and lifting and shifting custom applications to cloud providers is done simply to outsource an organization’s electricity consumption so it can better meet its sustainability goals.

But what happens when the two are in conflict? When the cost of running customer workloads in the cloud is higher than on-prem, but there is a desire to maximize the efficiency of IT to meet sustainability goals? The answer is that private clouds and on-prem IT must operate with efficiency goals similar to those of public clouds. Another consideration is when an organization’s real estate consolidation initiatives mean owned data centers go against its real estate strategy. This usually means owned IT resources are hosted in colocation facilities. Also, organizations looking to build a true hybrid cloud often want to move owned IT resources to cloud-connected colocation facilities. But unlike an owned data center, where there might be plenty of available space, every square foot of a colo costs money, so improving efficiency reduces colocation costs.

There is another factor driving the need to improve on-prem IT efficiency. Newer, denser CPUs and memory consume more power. Straightforward “one-for-one” replacement strategies will force either fewer servers per rack or power and cooling investments in the data center. The cloud providers have no problem configuring servers with hundreds of cores and terabytes of RAM, then loading dozens of virtual machines from many different customers on the same server. But many traditional IT shops fear high consolidation ratios due to the “too many eggs in one basket” philosophy. Of course, the number of eggs that can be tolerated in one basket does grow over time, but not at the rate of Moore’s Law.
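
To make the power math concrete, here is a rough Python sketch of why one-for-one replacement with denser servers can blow a rack’s power budget. The per-rack and per-server figures are assumptions for illustration, not vendor specifications:

```python
# Rough sketch: how many servers fit in a rack's power budget, old vs. new hardware.
# All figures below are assumed for illustration, not vendor specifications.

def servers_per_rack(rack_power_budget_kw: float, server_draw_kw: float) -> int:
    """Whole servers that fit under a rack's power budget."""
    return int(rack_power_budget_kw // server_draw_kw)

rack_budget_kw = 12.0   # assumed usable power budget per rack
old_server_kw = 0.5     # assumed draw of an older two-socket server
new_server_kw = 1.5     # assumed draw of a denser, high-core-count server

print(servers_per_rack(rack_budget_kw, old_server_kw))  # 24 older servers per rack
print(servers_per_rack(rack_budget_kw, new_server_kw))  # only 8 denser servers per rack
```

The denser servers do far more work per box, but unless consolidation ratios rise to match, the rack either empties out or the facility needs more power and cooling.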

IT organizations need to look at VMs and servers the same way storage administrators looked at thin provisioning on all-flash arrays. When less expensive hard drive systems dominated organizational data storage, it was easy to just thick provision everything. After all, it ensured performance and minimized issues and management effort. But all-flash was considerably more expensive per TB, so thin provisioning was necessary. Performance of all-flash was not an issue, so thick-provisioned, eager-zeroed VMDKs, used to maximize performance as data in a VMDK grew, were no longer necessary. But thin provisioning did impact management. It was scary. What happened if something went wrong? What happened if there was a runaway data-writing process? Could it fill the capacity of multiple thin-provisioned volumes and take down multiple apps? But for the last eight years, all-flash arrays have been used and managed within IT organizations. At its optimum, this means a thin-provisioned VM on a VMware datastore, on a thin-provisioned LUN on the storage array. So there is an experience base in “thin everywhere” and “thin on thin” (VMware thin provisioning on storage array thin provisioning) operations.
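
As a rough illustration of the “thin on thin” capacity math, here is a minimal Python sketch. The capacities and the 80% alert threshold in the comments are assumptions, not any vendor’s guidance:

```python
# Minimal sketch of thin-provisioning capacity math: logical capacity allocated to
# consumers vs. physical capacity actually consumed. Figures are assumed for illustration.

def oversubscription_ratio(allocated_tb: float, physical_tb: float) -> float:
    """Logical capacity promised to consumers divided by physical capacity installed."""
    return allocated_tb / physical_tb

def written_utilization(written_tb: float, physical_tb: float) -> float:
    """Fraction of physical capacity actually consumed by written data."""
    return written_tb / physical_tb

allocated_tb = 500.0   # sum of thin-provisioned VMDKs across datastores (assumed)
written_tb = 180.0     # data actually written so far (assumed)
physical_tb = 250.0    # usable capacity of the all-flash array (assumed)

print(f"oversubscription: {oversubscription_ratio(allocated_tb, physical_tb):.1f}:1")  # 2.0:1
print(f"physical utilization: {written_utilization(written_tb, physical_tb):.0%}")     # 72%
# Watching written/physical capacity and alerting before an assumed threshold (say 80%)
# is what makes "thin on thin" safe to operate day to day.
```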

With each generation of CPUs adding low-level virtualization features and increasing instruction-level parallelism, both at a Moore’s Law rate that exceeds software’s ability to consume them, we should be seeing higher vCPU-to-core ratios. But increasingly, we are seeing lower vCPU-to-core ratios due to the desire to avoid performance issues. While VMware memory sharing (transparent page sharing) is not used often due to security concerns, VMware memory overcommit features are safe and well understood, yet likely underutilized. While memory sharing is off by default on ESXi, it is on by default in VMware Cloud on AWS, as are ballooning and memory compression. VMware Cloud on AWS seeks to drive very high levels of efficiency.

In essence, there are equivalents of “thin provisioning” virtual CPUs on physical CPU cores, thin provisioning virtual RAM on physical RAM, and thin provisioning virtual networks on physical networks. Another term for thin provisioning in these cases is oversubscription, and we manage oversubscription with tools like QoS. Similar tools exist for storage (VMware Storage I/O Control, storage array QoS, etc.) and for CPU and memory (VMware resource allocation shares, reservations, and limits, etc.). But we need deep visibility into storage IOPS, CPU usage, and memory consumption if we want to drive higher levels of oversubscription in these resources. And we must if we want more efficiency.
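
Carrying the thin-provisioning analogy to compute, a small Python sketch of per-host vCPU-to-core and vRAM-to-physical-RAM oversubscription ratios might look like the following. The host inventory and allocation numbers are invented for illustration:

```python
# Sketch: compute oversubscription ratios per host, the "thin provisioning" view of
# CPU and RAM. The inventory below is invented for illustration.

from dataclasses import dataclass

@dataclass
class Host:
    name: str
    cores: int
    ram_gb: int
    vcpus_allocated: int
    vram_gb_allocated: int

hosts = [
    Host("esx01", cores=64, ram_gb=1024, vcpus_allocated=160, vram_gb_allocated=1536),
    Host("esx02", cores=64, ram_gb=1024, vcpus_allocated=96, vram_gb_allocated=800),
]

for h in hosts:
    cpu_ratio = h.vcpus_allocated / h.cores        # vCPUs per physical core
    ram_ratio = h.vram_gb_allocated / h.ram_gb     # configured vRAM vs. physical RAM
    print(f"{h.name}: {cpu_ratio:.1f} vCPUs per core, {ram_ratio:.2f}x RAM overcommit")
```

The point is not the specific ratios, but that these numbers should be measured and managed continuously, the same way a storage administrator watches a thin-provisioned pool.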

CPU, hypervisor, and network consolidation and virtualization features have increased dramatically over the last decade, affording business IT customers the opportunity to significantly increase consolidation, including higher vCPU-to-core ratios.

While host-level CPU utilization below 50% is typical in VMware environments, it is also not unusual to see VMs over-configured with vRAM. This is often due to ISV recommendations, which tend to be over-specified to ensure expected performance.

What is needed is visibility into the virtual and physical infrastructure to identify inefficient configurations and adjust them to drive higher levels of utilization. A visibility tool must monitor the environment continuously, because after an initial “right-sizing” (reducing allocated resources to only what is needed), resource requirements may change and grow, requiring later adjustments. The good news is that IT management tools have improved significantly over the last decade, allowing efficient “as a service” approaches to business IT.
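
As a sketch of what one right-sizing rule inside such a tool might look like, here is a minimal Python example that recommends a vRAM size from observed peak usage plus headroom. The 25% headroom and 4 GB rounding step are assumptions, not a published best practice:

```python
# Minimal right-sizing sketch: recommend a vRAM size from monitored peak usage plus
# headroom, rounded up to a sensible granularity. Thresholds are assumed, not best practice.

import math

def recommend_vram_gb(peak_used_gb: float, headroom: float = 0.25, step_gb: int = 4) -> int:
    """Peak observed usage plus headroom, rounded up to the next multiple of step_gb."""
    target = peak_used_gb * (1.0 + headroom)
    return max(step_gb, math.ceil(target / step_gb) * step_gb)

# Example: a VM configured with 64 GB of vRAM whose monitored peak usage is 18 GB.
configured_gb = 64
peak_gb = 18.0
print(f"configured {configured_gb} GB, recommended {recommend_vram_gb(peak_gb)} GB")  # 24 GB
# Re-run against fresh monitoring data on a schedule; right-sizing is not a one-time task.
```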

The incremental improvements in IT efficiency of the past are no longer sufficient. The potential for significant improvements in IT efficiency now exists. When properly implemented, the right tools allow both lower costs and the achievement of sustainability goals.