Issues in the Flash Translation Layer for Storage Arrays

Much of the current technical discussion surrounding the Flash Translation Layer (FTL) centers on two subjects:

  1. The relative merits of CPU-based FTL vs. SSD-controller-based FTL for PCIe-attached flash products. The market leader in this space advocates the former, while more recent designs take the latter approach.
  2. The continuous algorithmic improvements of the FTL within SSD controllers to minimize write latency and amplification.

There is much work to be done on both fronts with the advent of Triple Level Cell (TLC) flash and the accompanying limitations.

Storage system designers face a different class of issues when many SSD devices are grouped in an enterprise array. The interaction of application, kernel and driver behavior with some features of the FTL in SSD controllers can be detrimental to performance and longevity. An enterprise storage system must consider the FTL as a composite abstraction covering software and hardware behavior within the array’s processing complex as well as the individual SSD devices used for persistent media.

In a thinly provisioned storage system, slabs of address space from the physical disks (or stripes of the physical disks) are doled out on demand. Traditional models would suggest trying to balance this allocation among all SSD devices based on aggregate write history. This may not be correct from a performance standpoint. Writing 100GB to an SSD device as a single sequential write is not equivalent to overwriting the same 20GB address space five times. The same amount of physical writing to the flash devices has occurred, but the latter case leaves the media with significantly more available blocks for future writing. A performance-based optimization can therefore expect slabs allocated from a device in the latter state to see better write performance. Ideally, the write performance of each SSD device should be tracked to “reverse engineer” the amount of performance lost due to previously written data (consuming spare blocks, incurring write amplification, etc.).
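The difference between the two cases can be made concrete with a minimal sketch. It assumes the array tracks, per device, both the total bytes written and the distinct address range touched; the class and function names, the device names, and the 1GB tracking granularity are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

GB = 10**9
EXTENT = GB  # hypothetical tracking granularity for the address footprint


@dataclass
class DeviceWriteStats:
    """Per-device bookkeeping kept by the array, not by the SSD itself."""
    name: str
    bytes_written: int = 0
    touched_extents: set = field(default_factory=set)

    def record_write(self, offset: int, length: int) -> None:
        # Aggregate write history: every byte counts, overwrites included.
        self.bytes_written += length
        # Address footprint: an extent counts once however often it is rewritten.
        first = offset // EXTENT
        last = (offset + length - 1) // EXTENT
        self.touched_extents.update(range(first, last + 1))

    @property
    def footprint_bytes(self) -> int:
        return len(self.touched_extents) * EXTENT


def pick_by_write_history(devices):
    """Traditional policy: allocate from the device with the least data written."""
    return min(devices, key=lambda d: d.bytes_written)


def pick_by_footprint(devices):
    """Alternative policy: allocate from the device with the smallest live footprint."""
    return min(devices, key=lambda d: d.footprint_bytes)


# Device A: one 100GB sequential write.
dev_a = DeviceWriteStats("ssd-a")
dev_a.record_write(0, 100 * GB)

# Device B: the same 20GB address space overwritten five times.
dev_b = DeviceWriteStats("ssd-b")
for _ in range(5):
    dev_b.record_write(0, 20 * GB)

for dev in (dev_a, dev_b):
    print(dev.name, dev.bytes_written // GB, "GB written,",
          dev.footprint_bytes // GB, "GB footprint")
print("footprint policy picks:", pick_by_footprint([dev_a, dev_b]).name)
```

Both devices report the same write history, so a policy keyed on bytes written cannot tell them apart, while the footprint policy prefers the device with more blocks still free. Footprint is only a proxy here; the measured write performance of each device remains the more direct signal.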

It is commonly believed that write amplification can be avoided by relying on sequential writing, such as a log-based structure. Experimental data shows this not to be the case, even when the application uses uniform block size writes. This can occur because both the operating system and the HBA device driver may aggregate sequential writes, causing the actual writes to be integer multiples of the basic block size. When the same address space is overwritten sequentially without alignment of the actual physical writes, write amplification can occur. Enforcing the application’s block size and alignment may require OS tuning and, when supported by the vendor, firmware settings on the HBA.
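A toy model illustrates the alignment effect. It assumes a 4KiB application block, a 16KiB flash page, and a device that programs every page a host write touches, with no buffering or coalescing inside the controller. These are simplifying assumptions for illustration, not a description of any particular FTL, and the function names are hypothetical.

```python
KIB = 1024
APP_BLOCK = 4 * KIB    # assumed application block size
FLASH_PAGE = 16 * KIB  # assumed flash page size


def programmed_bytes(writes, page_size):
    """Bytes programmed if every page touched by a write is rewritten in full."""
    total = 0
    for offset, length in writes:
        first_page = offset // page_size
        last_page = (offset + length - 1) // page_size
        total += (last_page - first_page + 1) * page_size
    return total


def write_amplification(writes, page_size):
    """Ratio of bytes programmed to bytes requested by the host."""
    requested = sum(length for _, length in writes)
    return programmed_bytes(writes, page_size) / requested


def merged_stream(total_bytes, merge_size):
    """Sequential stream as it reaches the device after OS/HBA aggregation."""
    return [(off, min(merge_size, total_bytes - off))
            for off in range(0, total_bytes, merge_size)]


REGION = 192 * KIB  # address range being overwritten sequentially

# Four application blocks merged per write: every write covers exactly one page.
aligned = merged_stream(REGION, 4 * APP_BLOCK)

# Three application blocks merged per write: still a multiple of the block
# size, but half of the writes straddle a page boundary.
misaligned = merged_stream(REGION, 3 * APP_BLOCK)

print("aligned    WA:", write_amplification(aligned, FLASH_PAGE))
print("misaligned WA:", write_amplification(misaligned, FLASH_PAGE))
```

Under these assumptions the aligned stream has a write amplification of 1.0 and the misaligned stream 2.0, even though both are strictly sequential and both are built from uniform 4KiB application writes. Real controllers buffer and coalesce writes, so measured values will differ, but the sensitivity to alignment is the same in kind.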

As smaller lithographies and triple-level technology enter the market, additional performance issues arise. Today, SSD manufacturers can warrant the life of the drive (sometimes by write throttling) based on the statistical model of flash cell life and overprovisioned space. However, the ECC schemes currently used with MLC have little statistical impact on performance. The write and read performance of today’s MLC drives is largely the same during the first year and the last year of service, as the overhead and statistical probability of corrections are relatively low. This can change dramatically with the wide ECC schemes proposed for TLC flash. Both the probability of hitting errors and the computational requirements for correcting these errors rise dramatically, resulting in constantly degrading performance over the lifetime. Error correction can be designed to deliver a deterministic lifetime for TLC flash based on the physics and write cycles, but it cannot simultaneously deliver consistent performance over that life.

These issues will be explored in more detail in future postings.