Wednesday, May 24, 2017

Thoughts on HyperConverged, and the Future of HyperConverged (Part 2)

So how did we get here? Where did HCI come from?

If we look back at the history of HCI, it seems to have evolved from the idea of using clustered, "whitebox" x86 servers to create a clustered storage system. There were a number of early entrants in the space, some dating back to 2006. Another vector was the idea of a "Virtual Storage Appliance" or VSA: software which ran in a VM, connected to the server's local hard disk drives, and presented that internal storage to the guest VMs over the internal IP network. The first VSA was from LeftHand in 2007. But the real hyper-converged push started around 2009 with the founding of integrated HCI players Nutanix and SimpliVity.

We also have to look at where the HCI market is today. It is arguably dominated by three primary players: Nutanix; SimpliVity (now part of HPE); and VMware VSAN. They represent the lion's share of the HCI market, and we will come back to them.

If you look at the earlier clustered storage companies, they offered either a scale-out NAS (a kind of commodity alternative to Isilon), a scale-out block storage solution, or a scale-out unified storage solution. These early players came into existence when "grid computing" was the buzzterm of the day, and these architectures were also called "grid storage".

In 2009 Nutanix was founded. There were other virtual storage appliance start-ups, such as Virsto Software (which eventually became VMware VSAN), but it is fair to define the official beginning of the hyper-converged era as August 2011, when Nutanix emerged from stealth. The same month, VMware released vSphere 5, which included its first implementation of a VSA (vSphere Storage Appliance). SimpliVity would emerge from stealth one year later, in August 2012. VMware's VSA did not gain traction, and six months later, in February 2013, VMware announced its intent to acquire Virsto, which signaled VMware's serious interest in HCI.

As Nutanix and SimpliVity started to grow, and with VMware's very public acquisition of Virsto, and obvious plans to enter the HCI market, many of the earlier clustered storage vendors and virtual storage appliance vendors redefined themselves as hyper-converged players. Several new industry buzzterms were developed: "Server SAN"; "Virtual SAN"; and "Software Defined Storage", or "SDS".

Many of the early clustered storage system vendors redefined themselves as SDS or HCI players, moving their clustered storage software from bare-metal to run in VMs, and allowing their clustered storage software to run alongside guest VMs on the same server. VSA vendors added more sophisticated clustering, replication, and scalability to their products.

From this, it is fair to say modern HCI owes itself to three parents: commodity clustered storage systems; virtual storage appliances; and purpose-built integrated HCI systems.

To me, the most interesting thing is that many of the earlier clustered storage or "grid storage" players had little to no success, while the HCI players saw significant early success. Part of this may have been how each targeted the market. Clustered/grid storage had historically been seen as targeting the high-performance and academic community for technical computing use cases. HCI targeted business organizations and VMware virtualization workloads.

But what cannot be dismissed is the reality that the early clustered storage systems did not provide the level of performance and reliability required for enterprise workloads. The early clustered storage systems were not designed for transactional, random I/O workloads. They were better suited for sequential I/O. The early HCI players focused on addressing write latency and random I/O with aggressive write and read caching. They also focused on ease of use and on eliminating the need for storage administrators to provision storage to VMware administrators.

At this point it is interesting to note that there were other players aggressively targeting VMware virtualized workloads. Tintri had come out of stealth five months before Nutanix with its VMware-optimized storage platform. It too targeted the VMware admin and sought to use its product to bypass the traditional storage management team in an organization.

So that is the history lesson and the end of Part 2.

Sunday, May 21, 2017

Thoughts on HyperConverged, and the Future of HyperConverged (Part 1)

Almost two years ago I made some observations on HyperConverged Infrastructure, and where I thought it needed to go to be successful. I posted these to Twitter at the time. I still stand by some of those observations; for others I am not as sure. But I have done a lot more thinking about the HCI phenomenon, and believe change is coming to HCI.

To this point, I recently saw an update of the Gartner Hype Cycle, which showed HCI at the zenith of the "Peak of Inflated Expectations". I agree with this. The question is, what comes next? Probably a vendor shake-out.

But another question to ask is "What comes after HCI?" The idea that HCI is the end-game for IT infrastructure is a naive assumption. There may be better architectures being worked on by start-ups as I write this.

These were my original observations on HCI:

HCI must support multiple hypervisors, and no hypervisor (i.e., Containers, Hadoop, Oracle RAC, etc.).

At the time, Microsoft was pushing Hyper-V very hard, and I thought Hyper-V was going to make significant penetration into the enterprise. At the same time, some organizations were experimenting with OpenStack and KVM. Today, looking back, VMware still dominates. Hyper-V exists mainly in on-prem Azure Stack deployments, and KVM struggles without a single brand behind it.

As for no-hypervisor HCI (my idea being a combination of OpenStack with Containers and an HCI filesystem embedded in Linux for something like Oracle RAC), this has yet to take off. There is a chance we could see something like it for OpenStack.

HCI must become all-flash for virtualized workloads.

For the most part, this has become true. And the reality is, All-Flash saved HCI, which probably would not have been able to keep up with the performance requirements of virtualized workloads in its hybrid form.

HCI filesystems must be or become flash aware (WAF, etc.).

HCI filesystems have been adapted for flash, but I do not believe they have reached the point of being comparable to All Flash Arrays in reducing flash wear. They have been able to avoid the issue by using high Drive Writes Per Day (DWPD) SSDs in their caching tier to coalesce writes to low DWPD SSDs in their capacity tier. I see two problems with this approach. The first is that the use of a high DWPD SSD as a cache is a carry-over from the hybrid HCI filesystem architecture. There it provided a significant performance boost. When combined with an SSD capacity tier, it provides no performance boost, only a write wear mitigation benefit. The second issue is that high DWPD SSDs are not a high volume part for SSD manufacturers, who would rather manufacture lower DWPD, higher capacity, higher revenue SSDs. Ultimately, high DWPD SSDs may fade away like SLC and eMLC SSDs did. If that happens, what will HCI vendors do?
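To put rough numbers on the wear argument, here is a minimal sketch of the arithmetic behind a DWPD rating. A DWPD rating says how many times the full capacity of the drive can be written per day over the warranty period, so the implied total terabytes written is DWPD x capacity x 365 x warranty years. The drive sizes, ratings, warranty length, and daily write load below are hypothetical values picked only for illustration, not figures from any vendor.

```python
def rated_tbw(dwpd, capacity_tb, warranty_years=5):
    """Total terabytes written that a DWPD rating implies over the warranty."""
    return dwpd * capacity_tb * 365 * warranty_years


def years_until_rated_wear(dwpd, capacity_tb, daily_writes_tb, warranty_years=5):
    """Years until the rated endurance is used up at a steady daily write load."""
    return rated_tbw(dwpd, capacity_tb, warranty_years) / (daily_writes_tb * 365)


# Hypothetical drives and write load (assumptions, not vendor data).
cache_years = years_until_rated_wear(dwpd=10, capacity_tb=0.8, daily_writes_tb=4.0)
capacity_years = years_until_rated_wear(dwpd=1, capacity_tb=3.84, daily_writes_tb=4.0)

print(f"high DWPD cache SSD:   {cache_years:.1f} years at 4 TB/day")
print(f"low DWPD capacity SSD: {capacity_years:.1f} years at 4 TB/day")
```

The same daily write load consumes the low DWPD drive's rated endurance much sooner, which is why the caching tier ends up doing wear mitigation rather than performance work in an all-flash design.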

HCI must move to parity/erasure coding data protection and move away from mirroring/replication based data protection (RF2/RF3).

I believed this was necessary for All-Flash HCI due to the cost of flash, and the capacity of SSDs at the time. I am less sure of this now, at least as a $/GB requirement. I think parity/erasure coding will only be driven by availability requirements, and not $/GB requirements.
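The $/GB side of this trade-off is simple arithmetic, sketched below. Replication with N copies leaves 1/N of raw capacity usable, while an erasure coding layout with k data and m parity fragments leaves k/(k+m) usable and tolerates m failures. The 4+2 layout and the raw price per GB are assumptions chosen for illustration only.

```python
def replication_efficiency(copies):
    """Usable fraction of raw capacity when every block is stored `copies` times."""
    return 1 / copies


def erasure_efficiency(data_fragments, parity_fragments):
    """Usable fraction of raw capacity for a k data + m parity layout."""
    return data_fragments / (data_fragments + parity_fragments)


def effective_cost_per_gb(raw_cost_per_gb, efficiency):
    """Cost per usable GB once protection overhead is included."""
    return raw_cost_per_gb / efficiency


RAW_COST = 0.50  # hypothetical raw flash price in $/GB

schemes = {
    "RF2 (tolerates 1 failure)": replication_efficiency(2),
    "RF3 (tolerates 2 failures)": replication_efficiency(3),
    "EC 4+2 (tolerates 2 failures)": erasure_efficiency(4, 2),
}

for name, efficiency in schemes.items():
    cost = effective_cost_per_gb(RAW_COST, efficiency)
    print(f"{name}: {efficiency:.0%} usable, ${cost:.2f} per usable GB")
```

As raw flash prices fall, the absolute gap in cost per usable GB between these schemes shrinks, while the difference in how many failures each survives for a given overhead stays the same.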

HCI must support storage only nodes and compute only nodes for asymmetric scaling.

I believe this even more today. With All-Flash HCI, storage efficiencies (a.k.a., Data Reduction technologies) became critical. When you look at the Virtual Desktop (VDI) use case for HCI, deduplication means storage capacity does not grow linearly with VDI instances. In fact, it hardly grows at all. But what does grow is a need for write caching. If I invested in HCI for VDI, and deployed 200 VDI instances across 4 HCI nodes, and later decided to grow my VDI to 400 instances, I might need 4 more nodes of compute, but deduplication might mean I need only 10% more storage capacity, which I might already have on my existing nodes. I might need a caching SSD on each new node, but not 5 to 11 data drives.
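The VDI example can be worked through as a quick sketch. The instance counts, the 4 starting nodes, and the 10% capacity growth come from the scenario above. The drive count per node (5, the low end of the range mentioned), the drive size, and the capacity already in use are hypothetical values I have assumed to make the numbers concrete.

```python
import math


def nodes_for_compute(instances, instances_per_node):
    """Nodes needed to host the desktops, driven by compute alone."""
    return math.ceil(instances / instances_per_node)


def capacity_added_vs_needed(new_nodes, drives_per_node, drive_tb,
                             used_tb, growth_fraction):
    """Raw capacity that symmetric nodes bring versus what the growth requires."""
    added_tb = new_nodes * drives_per_node * drive_tb
    needed_tb = used_tb * growth_fraction
    return added_tb, needed_tb


EXISTING_NODES = 4
instances_per_node = 200 // EXISTING_NODES       # 200 desktops on 4 nodes

total_nodes = nodes_for_compute(400, instances_per_node)
new_nodes = total_nodes - EXISTING_NODES

# Assumed: 5 data drives of 1 TB per node, 10 TB in use before the expansion.
added_tb, needed_tb = capacity_added_vs_needed(
    new_nodes, drives_per_node=5, drive_tb=1.0, used_tb=10.0, growth_fraction=0.10
)

print(f"new nodes needed for compute: {new_nodes}")
print(f"capacity added by symmetric nodes: {added_tb:.1f} TB")
print(f"capacity the growth needs: {needed_tb:.1f} TB")
```

With symmetric nodes, most of the added capacity is bought only because it comes attached to the compute, which is the case for compute only nodes.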

The reverse holds true as well. If I assume a certain storage efficiency ratio, but my storage efficiency drops because I add workloads with different data types (say pre-compressed image files), today I have to add compute and hypervisor instances (and associated licenses) just to gain access to more storage capacity. If I could add a storage only node or two, it would provide flexibility. It might also offer the ability to introduce tiering between an all-SSD production tier and an NL-SAS capacity tier.

This is the end of Part 1. Over the next several parts, I will dig much deeper into these thoughts, including thoughts on what comes after HCI.