About GridCTO

Founder/CTO at GridIron Systems, Inc. Involved with the world of ever-increasing amounts of data – and how it gets refined and used as information.

The Bandwidth Imbalance between Server Memory and Server PCIe Flash

Is flash in the server better suited as:

  1. memory addition or
  2. fast local disk?

While there is excitement in the industry about using PCIe flash as memory, here are some facts to consider:

  • The memory system of a Xeon CPU runs at 10 million+ IOPS, with latencies around 0.1 µs, pushing 30+ GB/sec
  • A typical PCI-e flash card performs at 100,000+ random IOPS, with latencies around 100 µs, pushing 1.5 GB/sec

As you can see, adding a flash card to server memory means you are actually REDUCING the speed of memory in the server.  Additionally, Operating Systems don’t know how to deal with a NUMA architecture that’s non-uniform by

  1. 1:10 in access performance
  2. READ/WRITE asymmetry of 1:10 to 1:1000

Sandy Bridge class servers are ratcheting up server I/O demands further, making the bandwidth imbalance between server memory and server flash even more drastic.

So faced with the choice of a) adding flash to server memory to slow it down and confuse the OS, or b) adding flash as a very fast local store, the obvious choice is (b)!
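
Here is that gap as a tiny back-of-the-envelope Python sketch, using only the ballpark spec numbers quoted above (rough figures, not measurements from any particular CPU or card):

    # Ballpark spec figures from the bullets above (not measurements).
    memory = {"iops": 10_000_000, "latency_us": 0.1, "bw_gb_s": 30.0}
    flash = {"iops": 100_000, "latency_us": 100.0, "bw_gb_s": 1.5}

    print(f"IOPS gap      : {memory['iops'] / flash['iops']:.0f}x")              # ~100x
    print(f"Latency gap   : {flash['latency_us'] / memory['latency_us']:.0f}x")  # ~1000x
    print(f"Bandwidth gap : {memory['bw_gb_s'] / flash['bw_gb_s']:.0f}x")        # ~20x

However you slice it – IOPS, latency or raw bandwidth – the gap is one to three orders of magnitude, which is exactly why treating the card as "more memory" slows the memory system down.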

This blog post is a summary of a post in the Flash Tech Talk blog.  Read the full original post here: https://talkingflash.wordpress.com/2012/04/20/will-you-feed-me-when-im-64/

Concurrent Bandwidth – The Elephant in The Room Flash Array Vendors Wish You’d Ignore

IOPS. IOPS. IOPS.

It’s the bragging right of the flash SSD world. And vendors go to obsessive lengths to talk about it. Check out the wiki page for IOPS. Note at the bottom of the page and in the edit history of the page how the SSD makers are falling over each other to make sure that the world knows about how many IO operations per second their products can do.

And they report it differently. Let’s review.

Chip Makers

Micron, Hitachi, SanDisk and a few other companies actually make the NAND chips. Easy for them and not so debatable – a chip clearly says it can do so many reads/sec and so many writes/sec. Fun starts when people add their software magic to make SSDs.

SSD appliance makers sometimes quote the IOPS rating of the controller, while other times they simply add up the IOPS ratings of the individual SSDs. Either way, there are some impressive claims.

However, the moment you use SSDs in a shared appliance, what matters most is concurrent bandwidth, not just the raw IOPS. It does not really matter whether you call the offering an enterprise flash array or a SAN/NAS flash storage appliance or a flash memory array – a shared environment requires a LOT more concurrent bandwidth than a dedicated server attached pipe.

A Simple Metric

So, we have SSD appliances in the market with ratings of hundreds of thousands of IOPS. But what about concurrency?

Storage controller designers traditionally did not have to worry about too many concurrent hosts. After all, if all you have is storage media capable of a few hundred or a few thousand IOPS, what's the point of sharing it with multiple servers?

On the other hand, the raison d'être of SSD appliances is a huge amount of IOPS – and, attached to a network, they beg to be shared.

A single multi-core server can push bursts of 50,000 IOPS, and a mid-range server (such as a Dell R710) can easily put out burst loads of 200,000 IOPS – so a blade-center or even a pretty pedestrian collection of four such servers pushes far more. And on a SAN or NAS they are not exactly 512-byte mouselings. Consider these:

  1. Normal file-system buffer size – 4 KB – at 200,000 IOPS that is 0.8 GB/sec per server. For four mid-range servers that is 3.2 GB/sec.
  2. What if it is not just file-system work but you have MySQL running at the web layer – default 8 KB – now we're at 6.4 GB/sec with four servers at 200K IOPS each.
  3. Vendors eagerly push their SSD appliances for databases. Oracle doing table scans (data warehouse, DSS) will have a default block size of 32 KB or higher. We're talking 25 GB/sec bursts.

You think a server cannot push 6 GB/sec of IO in that last example? An Intel Sandy Bridge server IO slot is PCI-e Gen3 x8. Four 16G FC ports can push that IO easily and the memory bus has enough bandwidth to absorb it.
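
Here is the burst arithmetic above as a quick Python sketch (decimal KB and GB, rounded the same way as the numbers in the list):

    # Four mid-range servers at 200K IOPS each, at the block sizes discussed above.
    servers = 4
    iops_per_server = 200_000

    for label, block_kb in [("file-system buffers, 4 KB", 4),
                            ("MySQL default, 8 KB", 8),
                            ("Oracle table scans, 32 KB", 32)]:
        aggregate_gb_s = servers * iops_per_server * block_kb * 1_000 / 1e9
        print(f"{label}: {aggregate_gb_s:.1f} GB/sec aggregate "
              f"({aggregate_gb_s / servers:.1f} GB/sec per server)")
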
A shared SAN SSD appliance sees a very different kind of load mix than a PCI-e SSD card or a PCI-e direct-attached dedicated appliance. There is a reason there are NO TPC-H database benchmarks with SAN SSD appliances (as of March 2012). Go ahead. Check it out. There are several benchmarks floating around for direct-attached PCI-e SSD or PCI-e SSD appliances doing well on transaction processing applications. That’s single use – small IO. They hardly get more civilized than that.

Here are the performance specs of the three SAN SSD appliances, downloaded from the respective vendors' websites as of March 28, 2012. Please let us know if you find any errors or discrepancies and we'll make every effort to correct them promptly.

The last column is the Cross-section bandwidth of Storage – a simple metric obtained by dividing the connectivity bandwidth of the storage by its total capacity. The connectivity bandwidth is either the bandwidth of the network connections or the bandwidth of the flash attachment network – whichever one is the dominant part.
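
In code the metric is trivial. The appliance in the example below is made up for illustration – it is not any of the vendors in the comparison:

    # Cross-section bandwidth: connectivity bandwidth divided by total flash capacity.
    def cross_section_bw(connectivity_gb_s, capacity_tb):
        return connectivity_gb_s / capacity_tb  # GB/sec per TB

    def iops_per_tb(rated_iops, capacity_tb):
        return rated_iops / capacity_tb

    # Hypothetical box: 8 GB/sec of host connectivity, 10 TB of flash, 500K rated IOPS.
    print(cross_section_bw(8.0, 10.0))  # 0.8 GB/sec per TB
    print(iops_per_tb(500_000, 10.0))   # 50,000 IOPS per TB

Normalizing by capacity is what makes boxes of very different sizes comparable – which is exactly what the comparison below does.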

Compare these numbers with another SSD appliance metric:

The guys from the lone-star state make a great product with a proud history and happy users. It’s built like the proverbial brick outhouse and their hardware specs are top-notch.

They were #2 in IOPS/TB and #1 in cross-section bandwidth in this comparison.

The Violin specs are essentially a tie – #1 in IOPS/TB and #2 in cross-section bandwidth. It's got plenty of network connectivity (8x 8G Fibre Channel) but the MLC array is specified at a lower 2 GB/sec. It's a well-balanced design and boasts some very nitty-gritty details, built from the ground up with loving care.

The rearguard of these three is Pure Storage, and the numbers look alarmingly low initially until you look at the foundation technology of this particular vendor. The somewhat low numbers are a direct artifact of using de-dupe/compression to meet your capacity goals.

Coming soon – Server vs. Array Flash – a Suitability Analysis…

PCIe Flash – Part 1

One of the side-benefits of being around for a few decades is the joy of reminiscing about old technology and foibles of the past with friends from said era.

I ran into an old friend I had not seen in over ten years – one thing led to another and we were on the third round of Guinness and somehow the talk veered to the exciting world of flash memory.

And flash trends. Like…
…why a segment of the trade press has fallen so much in love with PCI-e attached flash memory over the last few years. Pause for a moment to consider why PCI-e attached flash is considered the next best thing in computer architecture since Apple…

A. It is PCI-e attached – therefore, it is blindingly fast. Since, everybody knows, being outside the server is “slow.”
B. It is on PCI-e – and NOT SATA (shudder – how plebeian!). Therefore, it's not cheap and common. Since everybody KNOWS SATA is slow.

Let’s look at the two types of PCI-e SSD cards:

  • Single-stage PCI-e SSD: Cards with a PCI-e attached controller directly attached to flash chips
  • Dual-stage PCI-e SSD: Cards with a SAS HBA controller attached to multiple sub-systems – each with a traditional SSD controller attached to flash chips

You can find equally happy users for each type. You can also find users who are blissfully unaware that they have a dual-stage PCI-e SSD as opposed to a single-stage one.

Hey – so what if almost 10% of that premium unleaded you are buying @ $5/Gallon for your BMW is essentially industrial corn-hooch (a few cents a gallon in its native form) – it feels like the Ultimate Driving Machine – no?
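
As an aside – if you're curious which kind is sitting in your own server, one crude way to check on a Linux box is to look at how the card enumerates on the PCI-e bus: a dual-stage card typically shows up as a SAS/SATA HBA (often an LSI part), while a single-stage card shows up as a vendor-specific mass-storage or flash controller. A rough sketch, assuming lspci is installed and treating the keyword match as nothing more than a heuristic:

    import subprocess

    # Dual-stage cards usually enumerate as a SAS/SATA HBA; single-stage cards
    # as a vendor-specific mass-storage / flash device. Heuristic only.
    output = subprocess.check_output(["lspci"]).decode()
    for line in output.splitlines():
        if any(hint in line for hint in ("SAS", "SCSI", "SATA", "Mass storage")):
            print(line)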

Single-stage PCI-e SSD

This genre essentially started with Fusion-IO’s products.  These types of cards have a controller that attaches to PCI-e on one side and directly to flash devices on the other side.

Take a look (http://www.fusionio.com/platforms/iodrive/).

It has become a true workhorse during its existence.

Another one (this one from Micron)

Again – the same lineup. PCI-e connects to a controller that connects to flash.

The next one here is from Virident – (http://www.virident.com/products/flashmax/)

This card not only has a very impressive list of specifications – it looks pretty stylish, too.

Here are some pros and cons of single-stage PCI-e SSDs:

Pros

  • Simplistic Controller Design – Most of the compute and data complexity can be left to the CPU. This is an advantage to the vendor and not necessarily to the end user.
  • Wide parallel stripe – The controller can operate over much wider channels of flash chips.

Cons

  • Server RAM based buffering – These cards use server (host) memory and host CPU cycles to do wear-leveling and as page and data buffers. The vendor argument here is that the host typically has enough RAM and CPU cycles and typically nobody misses them!
  • Flash chip changes force a controller spin – As you go from one flash generation to another, it requires a different controller.

Dual-stage PCI-e SSD

This type of PCI-e SSD usually uses a SAS controller as the first stage and then hangs multiple SSD subsystems off that SAS/SATA controller. Each SSD subsystem functionally looks like a bare SATA SSD. That’s the second stage.

LSI SAS controllers are very popular as the first stage. They have been around a long time, have great driver support and have mature interfaces up to PCI-e Gen2 x8.

What about the second-stage controller? Marvell and SandForce are the two most popular choices. Check out the nifty Sun/Oracle F20 PCI-e card. This card has a lot of working experience in Oracle environments. The Marvell controllers are visible on the SSD subsystems.

Here is another example from the good folks of OCZ -

This is a well-executed design with four SSD modules.

Below are some advantages of dual-stage designs:

  1. Good DMA and RAID Performance – For users that want to run RAID1 inside ONE single card, the first-stage controllers provide proven performance. Also, the typical SAS controllers used in this stage have a long and evolved history.
  2. Parallel Operation in Second Stage – The SSD controllers in the second stage each operate on only a small number of chips at a time and work in parallel.
  3. Better scaling – Larger number of SSD modules can be attached in the second stage, potentially providing scalable performance for large capacity modules.

Coming soon -

The myths, the stats and fables of PCI-e Flash… does it really matter if you have a single or dual-stage card?

Un-structured, structured and relational data – how big is big?

So now that we are definitely in love with big data…how big does it have to be before we really consider it big?

Well…it depends.

Something is really not that big if it’s just sitting there and you are not hauling it around.

See – when Mr. von Neumann put down the seminal architecture for programmed computers, he definitely chose sides! The ‘program‘ was the quarterback – and data played a decidedly subservient role, always at the beck and call, to be hauled and mauled as programs saw fit.

Programs ‘fetch’ data – at their leisure, at their chosen time.

Even the operative term sounds more fitting for your Corgi than for someone or something more serious!

So we have been writing code that merrily ‘fetches’ data and processes it. Works for most programs. Except when data grows. And grows. And grows…

It grows until it starts to be a real problem to just ‘fetch’ it. And it becomes a real pain to move it around. Now you have to think about –

  1. perhaps change roles and send the ‘program’ to data instead of the other way around
  2. how to be smart about moving ONLY the required amount of data

For a PC-XT with a whopping 10 MB hard drive, big data was just 10 MB. That was the entire drive! The little 8088 CPU running at 4.77 MHz on an 8-bit bus could scream at 4.77 MB/sec and could finish scanning the disk (theoretically) in about 2 seconds.

My desktop is running on an i7-2600 CPU with 4 hyperthreaded cores at 3.4 GHz. This beastie can scan my 2TB hard-drive at the rate of a little under 100GB/sec (again, theoretically) – taking 20 seconds to do the scan.

Let’s take a look at that workhorse of enterprise relational data crunching – Oracle RAC. A state-of-the-art 4-node RAC system should be able to scan in data at the rate of 4 to 5 GB/sec per node from the storage – or approaching 20 GB/sec in aggregate. At that rate a database can load at 60+ TB/hour. Throw in scheduling overhead, network latency, and error checks and you are looking at 10 to 20 TB/hour. That’s a very impressive number – giving you head-spinning bragging rights in Oracle OpenWorld data warehouse tutorial sessions…

Now consider a 50 TB data warehouse; not too extreme by Oracle standards, but now we are talking about 3 hours JUST TO LOAD THE DATABASE.

We’re not just ‘fetching’ data anymore, are we?

50TB is “big data” for Oracle RAC, even more so for single-instance Oracle installations.
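
Here is the ‘fetch’ arithmetic from the examples above in one small Python sketch (all theoretical, best-case rates, rounded the same way as in the text):

    # Theoretical scan/load times for the examples above.
    def scan_seconds(data_gb, rate_gb_per_sec):
        return data_gb / rate_gb_per_sec

    print(scan_seconds(0.01, 0.00477))        # PC-XT: 10 MB at 4.77 MB/sec, ~2 seconds
    print(scan_seconds(2_000, 100.0))         # desktop i7: 2 TB at ~100 GB/sec, 20 seconds
    print(scan_seconds(50_000, 20.0) / 3600)  # 4-node RAC: 50 TB at ~20 GB/sec, ~0.7 hours
    print(50 / 15)                            # at a real-world ~15 TB/hour, ~3.3 hours

The theoretical number looks comfortable; the 10 to 20 TB/hour you actually get is what turns a 50 TB load into a multi-hour affair.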

Even the ne plus ultra of NoSQL – Hadoop – is not used in isolation. Typically a Hadoop processing stage is followed by Hive or other structured databases – even MySQL.

So a big ‘unstructured’ data setup may just as easily feed into a big ‘structured’ data analysis stage. So how big do they typically get before the big-data characteristics start to show (difficulty in fetching the entire data set, sending the program to stationary data, etc.)? Here is my take:

Hadoop – The top dogs may sneer at anything below a petabyte, but in reality a 100 TB Hadoop/NoSQL cluster is getting big. You can’t just deal with it casually; it demands attentive care and feeding.

MySQL cluster – A 100-node cluster in the size range of 100 TB is certainly getting there.

Oracle, including RAC – 50 TB and up…especially DSS (Decision Support Systems) and warehouses. Folks at Amazon and eBay run some very impressive big data warehouses on Oracle. Then there are installations at “those who shall not be named.”

Hadoop is loved because it’s (supposedly) an open-ended framework when it comes to data size. Petabytes of data pouring in like concrete? No problem – just add more nodes as your data grows – no need to change your program, the same Java code works. But remember the story of war elephants crossing the Alps – just because Mr. H. Barca decided to do it does not mean that you should consider it easy. Tilting at a 1,000-node cluster with Hadoop is a day’s work for Google but not for a typical enterprise CIO.

We’ll explore challenges unique to big structured/relational data and big unstructured data in the coming posts…