Issues in the Flash Translation Layer for Storage Arrays

Much of the current technical discussion surrounding the Flash Translation Layer (FTL) centers on two subjects.

  1. The relative merits of CPU-based FTL versus SSD-controller-based FTL for PCIe-attached flash products. The market leader in this space advocates the former, while more recent designs take the latter approach.
  2. The continuous algorithmic improvement of the FTL within SSD controllers to minimize write latency and write amplification.

There is much work to be done on both fronts with the advent of Triple Level Cell (TLC) flash and the accompanying limitations.

Storage system designers face a different class of issues when many SSD devices are grouped in an enterprise array. The interaction of application, kernel and driver behavior with some features of the FTL in SSD controllers can be detrimental to performance and longevity. An enterprise storage system must consider the FTL as a composite abstraction covering software and hardware behavior within the array’s processing complex as well as the individual SSD devices used for persistent media.

In a thinly provisioned storage system, slabs of address space from the physical disks (or stripes of the physical disks) are doled out on demand. Traditional models would suggest balancing this allocation across all SSD devices based on aggregate write history, but that may not be correct from a performance standpoint. Writing 100GB to an SSD device as a single sequential pass is not equivalent to overwriting the same 20GB address space five times. The same amount of physical writing to the flash has occurred, but the latter case leaves the media with significantly more blocks available for future writes. An allocator optimizing for performance can expect slabs allocated in the second case to see better write performance. Ideally, the write performance of each SSD device should be tracked in order to “reverse engineer” how much performance has been lost to previously written data (consumed spare blocks, incurred write amplification, etc.).
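
The tracking idea above can be sketched very simply: keep a smoothed estimate of recent write latency per device and allocate the next slab from the device that is currently writing fastest. This is an illustrative sketch, not a description of any shipping allocator; the device names, the EWMA smoothing factor, and the `SlabAllocator` API are all assumptions.

```python
class SlabAllocator:
    """Hypothetical slab allocator that prefers the SSD with the best
    recent write latency, rather than balancing on bytes written alone."""

    def __init__(self, devices, alpha=0.2):
        self.alpha = alpha                         # EWMA smoothing factor (assumed)
        self.latency = {d: None for d in devices}  # smoothed write latency, ms

    def record_write(self, device, latency_ms):
        """Fold a completed write's latency into the device's moving average."""
        prev = self.latency[device]
        self.latency[device] = (latency_ms if prev is None
                                else self.alpha * latency_ms + (1 - self.alpha) * prev)

    def pick_device(self):
        """Allocate the next slab from the lowest-latency device;
        devices with no history yet are tried first."""
        return min(self.latency,
                   key=lambda d: -1.0 if self.latency[d] is None else self.latency[d])
```

A device whose media is already crowded with live data will show rising write latency (garbage collection, fewer spare blocks), and this kind of tracker steers new slabs away from it automatically.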

It is commonly believed that write amplification can be avoided by writing sequentially, as in a log-structured layout. Experimental data shows this is not always the case, even when the application issues uniform block-size writes. Both the operating system and the HBA device driver may aggregate sequential writes, so the writes actually issued to the device become integer multiples of the basic block size. When the same address space is then overwritten sequentially without consistent alignment of those aggregated physical writes, write amplification can occur. Enforcing the application’s block size and alignment may require OS tuning and, where the vendor supports it, firmware settings on the HBA.
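
A small model makes the alignment point concrete: any flash page a write covers only partially forces a read-modify-write inside the SSD, and merged writes that drift off page boundaries create exactly those partial pages. The 16KB page size is an assumption for illustration; real internal page sizes vary by device and are rarely disclosed.

```python
FLASH_PAGE = 16384  # assumed internal flash page size in bytes (illustrative)

def partial_pages(offset, length, page=FLASH_PAGE):
    """Count flash pages that a write of `length` bytes at byte `offset`
    covers only partially; each partial page implies extra internal
    writing (read-modify-write) and hence write amplification."""
    first = offset // page
    last = (offset + length - 1) // page
    partial = 0
    for p in range(first, last + 1):
        covered = min(offset + length, (p + 1) * page) - max(offset, p * page)
        if covered < page:
            partial += 1
    return partial
```

For example, a page-aligned 16KB write touches no page partially, but the same 16KB shifted by 4KB (as can happen after OS/driver merging) partially programs two pages, even though the application believed it was writing aligned blocks.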

As smaller lithographies and triple-level-cell technology enter the market, additional performance issues arise. Today, SSD manufacturers can warranty the life of a drive (sometimes by write throttling) based on a statistical model of flash cell life and overprovisioned space. Current ECC schemes used with MLC, however, have little statistical effect on performance: the write and read performance of today’s MLC drives is largely the same in the first and last year of service, because the overhead and probability of corrections remain low. This changes dramatically with the wide ECC schemes proposed for TLC flash. Both the probability of encountering errors and the computational cost of correcting them rise sharply with wear, producing steadily degrading performance over the drive’s lifetime. Error correction can be designed to deliver a deterministic lifetime for TLC flash based on the physics and write cycles, but it cannot simultaneously deliver consistent performance over that life.
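
A back-of-the-envelope calculation shows why the degradation accelerates rather than creeping linearly. Suppose a fast decode path handles up to a couple of raw bit errors per codeword and anything beyond that takes a heavyweight correction path; the codeword size and threshold below are assumptions, chosen only to illustrate the shape of the curve as the raw bit error rate (RBER) climbs with wear.

```python
from math import comb

def p_slow_path(rber, bits=8192, t_fast=2):
    """Probability that a codeword of `bits` bits contains more than
    `t_fast` raw errors and must take the slow, wide correction path
    (binomial model; `bits` and `t_fast` are illustrative assumptions)."""
    p_fast = sum(comb(bits, k) * rber**k * (1 - rber) ** (bits - k)
                 for k in range(t_fast + 1))
    return 1 - p_fast
```

With a fresh-device RBER near 1e-6, essentially no codewords need the slow path; as wear pushes RBER toward 1e-3, nearly every read does. That step-change in per-read correction cost is the "constantly degrading performance" the paragraph above describes.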

These issues will be explored in more detail in future postings.

MLC Flash for Big Data Acceleration

Big data analysis demands bandwidth and concurrent access to stored data. Write load will depend on data ingest rates and batch processing demands. The data involved will typically be new data and updates of existing data. Indices and other metadata may be recalculated, though generally not in real time. The economics of supporting such workloads focus on the ability to cost-effectively provide bulk access for concurrent streams. If only a single stream is being processed, spinning disk is fine. However, providing highly concurrent access to the dataset requires either a widely striped caching solution or a clustered architecture with local disk (e.g., Hadoop). Because flash write lifetimes are not stressed in this environment, utilizing wide stripes of MLC for caching is the most cost-effective way to provide highly concurrent access to the dataset in a shared-storage environment.

Now, much of the SLC-versus-MLC debate centers on blocking and write performance, specifically write latency and its blocking impact on reads. With a traditional storage layout, data can be striped over only a few disks (4 data disks for RAID 5/6 stripes). This creates a high probability of read blocking even under the smallest write loads. By distributing data over very wide non-RAID stripes (up to 40 disks wide), the effect of variable write latency can be mitigated: new cache data is placed dynamically on the least-read disks, greatly reducing the impact of writes on the general read load. The wider the striping of physical disks in a caching medium, the greater the support for concurrent access and mixed read and write loads from the application. MLC is an excellent media choice, both technically and economically.
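
The "least-read disks" placement heuristic above reduces to a very small amount of code: given the current outstanding-read count per device across the wide stripe, send new cache data to the devices serving the fewest reads. This is an illustrative sketch; the counter map, the replica count, and the function name are assumptions, not a product API.

```python
import heapq

def pick_targets(read_load, copies=2):
    """From a {device: outstanding_reads} map covering a wide stripe
    (e.g., up to 40 devices), return the `copies` devices currently
    serving the fewest reads, so writes block the fewest readers."""
    return heapq.nsmallest(copies, read_load, key=read_load.get)
```

Because every device in the wide stripe is an eligible target, a device momentarily stalled by a slow flash write simply stops attracting new cache placements until its queue drains.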

By employing affordable MLC as a write-through caching layer that remains consistent with the backend storage, the effect of even multiple simultaneous flash SSD failures can be eliminated. Most traditional storage systems cannot survive multiple concurrent drive failures, and they suffer significant performance degradation while recovering (rebuilding) from even a single device failure. A cache system can continue operating through cache-media failures by simply fetching the missing data from the storage system and redistributing it to the remaining caching media. However, it is important to note that placing the cache in front of the storage controller is critical to achieving this concurrency; the storage controller lacks the horsepower necessary to sustain performance, but that is a topic for another day.
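
The recovery property described above follows directly from the write-through discipline: the cache never holds the only copy of any data, so a failed cache device can simply be dropped and its contents refetched from the backend on demand. The sketch below illustrates that property under assumed names; the class, the hash-based shard placement, and the dict-backed "backend" are all simplifications for demonstration.

```python
class WriteThroughCache:
    """Toy write-through cache: the backend is always authoritative,
    so losing a cache device costs misses, never data or a rebuild."""

    def __init__(self, backend, devices):
        self.backend = backend                  # authoritative store (dict here)
        self.shards = {d: {} for d in devices}  # one cache map per device

    def _device_for(self, key):
        live = sorted(self.shards)              # surviving devices absorb keys
        return live[hash(key) % len(live)]

    def write(self, key, value):
        self.backend[key] = value               # persist first: write-through
        self.shards[self._device_for(key)][key] = value

    def read(self, key):
        shard = self.shards[self._device_for(key)]
        if key not in shard:
            shard[key] = self.backend[key]      # miss: refetch and repopulate
        return shard[key]

    def device_failed(self, device):
        # No rebuild: discard the shard; later reads refetch from backend.
        del self.shards[device]
```

Contrast this with a RAID rebuild, where a failed device forces reconstruction of its full contents before redundancy is restored; here the "rebuild" is spread across future cache misses.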

MLC is driving the price point of flash toward that of high-performance enterprise spinning disk. Continued growth in the consumer space means that MLC will remain the most cost-effective flash technology and will benefit the most from technology scaling and packaging innovations. Lower-volume technologies such as eMLC and SLC do not share these economic drivers and will continue to be far more expensive. The ability to utilize MLC efficiently, and to adapt it to the performance and access needs of Big Data, will be hugely advantageous to customers and to the vendors who can deliver intelligent, cost-effective solutions built on MLC, such as the GridIron TurboCharger™!