Tuesday, January 25, 2011

Multi-core CPUs and NUMA systems

With the introduction of multi-core CPUs, performance engineers are being challenged to fine-tune systems to get maximum mileage out of them. It's definitely worth checking whether your Operating System supports the following important features.

1) Kernel thread/process affinity between the local processor and memory. NUMA stands for non-uniform memory access: the latency to fetch memory attached to a remote processor is much higher than for local memory. The processor can also spend considerable time generating and servicing TLB flushes in such situations, resulting in lower performance. Solaris has supported Memory Placement Optimization (MPO) for some time now: the kernel is aware of the system memory map and creates kernel threads close to the memory they use (see the first sketch after this list).

2) Make sure NIC and HBA drivers can dynamically bind interrupts to any processor on the system. Modern PCI Express devices support multiple in-band message-signaled interrupts (MSI/MSI-X), but the Operating System needs to take advantage of this by supporting a large number of MSI interrupt vectors (the second sketch after this list shows how an interrupt can be steered to a chosen CPU).

3) If you're running Virtual Machines, make sure your Operating System and hardware are SR-IOV capable. Most VM guests tunnel network and storage requests through the control domain (domain 0). This was OK when systems had only a few CPUs, but with multi-core systems, Operating System performance will suffer heavily without SR-IOV. With SR-IOV, PCI Express fabric errors are still handled by domain 0, while VM guests perform I/O by manipulating the hardware directly; the hypervisor typically ignores PCI config writes executed by VM guests. Watch out for a potential denial-of-service (DoS) attack when using SR-IOV: since the technology allows a VM guest to access a piece of hardware directly, an errant guest may be able to perform excessive I/O, potentially pinning the PCI Express bus.
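As an illustration of point 1, here is a minimal sketch in C that pins the calling process to a single CPU using the Linux sched_setaffinity(2) interface (Solaris offers processor_bind(2) plus MPO for the same purpose). The CPU number is only an example; on a NUMA-aware kernel with a first-touch allocation policy, memory the process touches after the bind will tend to come from the node local to that CPU.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    cpu_set_t mask;

    /* Pin the calling process to CPU 2 (illustrative choice). */
    CPU_ZERO(&mask);
    CPU_SET(2, &mask);

    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return EXIT_FAILURE;
    }

    /* Memory allocated and first touched from here on is likely to be
     * placed on the NUMA node local to CPU 2. */
    return EXIT_SUCCESS;
}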
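Point 2 can be sketched the same way. On Linux, the per-interrupt CPU affinity is exposed through /proc/irq/<N>/smp_affinity; the snippet below writes a hexadecimal CPU mask there (run as root). The IRQ number 123 and the mask are placeholders, take the real values from /proc/interrupts; other operating systems expose equivalent controls through their own interfaces.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int irq = 123;       /* placeholder IRQ number */
    const char *mask = "4";    /* hex mask: bit 2 set -> CPU 2 */
    char path[64];
    FILE *fp;

    /* Steer the interrupt by writing the CPU mask to procfs. */
    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
    fp = fopen(path, "w");
    if (fp == NULL) {
        perror("fopen");
        return EXIT_FAILURE;
    }
    fprintf(fp, "%s\n", mask);
    fclose(fp);
    return EXIT_SUCCESS;
}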

Difference between FC, SAS and SATA

I see many in the IT industry who are confused by the various storage protocols. Let me explain a bit of the history and why these protocols evolved over a period of 30 years.

All of these protocols are block based, that is, they transport and store data in increments of 512 bytes. Sometimes, depending on the media (such as CD or DVD), the basic block size can be 2048 bytes or higher, and the default block size can be altered using SCSI commands such as MODE SELECT (mode pages). During the early days of the PC (the 80's), Parallel SCSI and IDE were the most widely used storage bus interfaces. Parallel SCSI is a true bus (you can connect or daisy-chain multiple devices simultaneously), while IDE was designed for a simpler desktop environment (connect 1 or 2 hard disks or CD/DVD drives). IDE cables are much shorter than SCSI (< 1m), while SCSI can connect peripherals up to about 6m away (longer with differential cabling). Parallel SCSI and IDE/ATA are Class 1, or connection-oriented, protocols (that is, two devices need to agree before any command/data transfer can occur).

Fibre Channel was designed to overcome the following limitations of Parallel SCSI.

1) Cable length limited to about 6m
2) Only about 15 devices can be connected to the bus between the HBA (initiator) and the targets (disk, CD, etc.)
3) No arbitration fairness: in SCSI, the device with the highest ID always wins arbitration
4) Because it is a Class 1 (connection-oriented) protocol, only one pair of devices can actively communicate over the bus at a time
5) Clock skew and signal integrity issues inherent to a parallel bus

Since Parallel SCSI hard disks had high reliability, they were deployed in enterprises and data centers, while less reliable IDE disks were deployed in desktop systems. One of the major disadvantages of IDE is that it is not a true bus: a maximum of two devices can be connected per channel. However, this was enough for most desktop users.

Some of the key features of FC:

1) Practically no distance limitation between server and storage. Storage can be moved to a centralized location so that multiple systems can access it simultaneously, improving disk utilization
2) There are several classes of service (Class 1, 2, 3 & 4), though virtually all HBAs and switches support only Class 3 (connectionless, similar to UDP)
3) Before two devices can communicate, they must log in and establish credentials (similar to setting up a TCP socket). Common logins are the port login between end devices (N_Ports) and the fabric login (N_Port to F_Port). Once a login is established it persists, so devices do not need to log in again for every exchange
4) A true switched fabric, thanks to the Class 3 protocol
5) The fabric can be partitioned into multiple segments for security (zoning)

The SAS and SATA protocols were invented in the early 2000's due to the design limitations of the Parallel SCSI and IDE/ATA interfaces. SAS was designed as a replacement for Parallel SCSI, and SATA as a replacement for IDE. Both SAS and SATA use 8b/10b encoding at the physical layer.

SATA still has the limitation of connecting only one device per port. However, most HBAs have about 4-8 ports, so it is possible to connect up to 8 SATA devices to a single HBA. A typical SAS HBA also has 8 ports, so in theory it could connect only 8 SAS devices directly. However, SAS departed from SATA with the introduction of the SAS expander, which can expand the topology (similar to an Ethernet switch) to support a large number of devices. In theory, a SAS domain built from expanders can address about 64k devices, which is adequate for most enterprise deployments. SAS differs from FC as follows:

1) SAS is a connection-oriented (Class 1-style) protocol: both initiator and target need to agree (open a connection) before data transfer can occur
2) Unlike FC, SAS defines separate control and data protocols; the control (management) protocol, SMP, is used to discover and configure expanders

Though expanders look like FC switches, SAS is not a true switched fabric due to the limitations of its connection-oriented (Class 1-style) protocol. There are issues with current SAS expanders which we will discuss in the next blog post. Some of them are identified below.

1) The SAS protocol is blocking in a multi-initiator environment. This means that if two devices are communicating, other devices contending for the same path must wait before they can initiate a data transfer
2) Most SAS primitives (such as RRDY) are transported within a connection and are not packetized with source and destination addresses. This prevents a SAS expander from becoming a true switched fabric
3) There is no centralized arbitration enforcement

One of the biggest advantages of SAS is that SATA devices can be connected to a SAS HBA and/or expander. The SAS protocol is designed to tunnel the ATA protocol using STP (SATA Tunneling Protocol). This greatly accelerated the deployment of SAS, which has pretty much replaced Parallel SCSI.

I see much discussion of SAS replacing FC, but I think this may never happen due to the design limitations of SAS discussed above. Moreover, data centers don't need another fabric when FC and the emerging FCoE enjoy widespread deployment. SAS will continue to expand as a storage fabric, eventually replacing FC-AL (Arbitrated Loop) and FC disks, but the SAN protocol of choice will still be FC and/or FCoE.

Monday, January 24, 2011

8b/10b Encoding

Almost all modern communication protocols, such as PCI Express, SAS, SATA, InfiniBand, FC, etc., are based on IBM's 8b/10b encoding patent. There are numerous advantages to this encoding, such as fewer motherboard traces, high-speed serial data transfer, and DC balance. One limitation of 8b/10b encoding is its 25% bandwidth overhead (10 bits are transmitted for every 8 bits of data), which led the designers of the PCI Express 3.0 protocol to use a new encoding scheme, 128b/130b. As far as I know, the latest SAS/SATA protocols still use 8b/10b, but that may change someday to a new encoding scheme.
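A quick back-of-envelope calculation shows what the encoding overhead means in practice. The C sketch below (illustrative figures only, ignoring protocol framing) converts raw line rate into usable payload bandwidth for 8b/10b and 128b/130b links.

#include <stdio.h>

/* Effective payload bandwidth = line rate * (data bits / coded bits),
 * converted to MB/s.  Framing and protocol overheads are ignored. */
static double payload_mbs(double line_rate_gbps, double data_bits,
                          double coded_bits)
{
    return line_rate_gbps * 1000.0 * data_bits / coded_bits / 8.0;
}

int main(void)
{
    printf("SATA/SAS 6 Gb/s, 8b/10b    : %.1f MB/s\n",
           payload_mbs(6.0, 8, 10));
    printf("PCIe 3.0 8 GT/s, 128b/130b : %.1f MB/s per lane\n",
           payload_mbs(8.0, 128, 130));
    return 0;
}

Measured against the raw line rate, 8b/10b costs 20% (the 25% above is the same overhead expressed relative to the payload), while 128b/130b costs only about 1.5%.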




Friday, January 7, 2011

FC fractional bandwidth coming back?

Fibre Channel fractional bandwidth (Class 4) allows a single FC connection to be broken into multiple virtual circuits. As far as I know, there has been no real business use case for fractional bandwidth, but I guess this may change with the advent of FCoE, 6/12 Gbps SAS/SATA, and solid state disks. SSD latency is so low that a single device is capable of delivering thousands of IOPS to an FC initiator. When multiple FC initiators are connected to SSD disks, there could be a business need to provide fine-grained QoS.

Thursday, January 6, 2011

Joy of Flash/Solid State Disks

Traditional storage arrays use a combination of SAS/SATA disks to store data. However, if you look at the performance of these disks, it's really very poor. Let's compare:

SAS (Serial Attached SCSI)    ~200 IOPS
SATA (Serial ATA)             ~100-150 IOPS

Typically, you get the above IOPS at block sizes of 4-8k.
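Where do these figures come from? A random I/O on a rotating disk costs roughly one average seek plus half a rotation, so IOPS is approximately the reciprocal of that service time. The C sketch below works this out; the seek times are illustrative assumptions, not vendor specifications.

#include <stdio.h>

/* IOPS ~ 1 / (average seek + average rotational latency).
 * Rotational latency is half a revolution on average. */
static double disk_iops(double rpm, double avg_seek_ms)
{
    double rotational_ms = 60000.0 / rpm / 2.0;
    return 1000.0 / (avg_seek_ms + rotational_ms);
}

int main(void)
{
    printf("15k RPM SAS,   3.5 ms seek: ~%.0f IOPS\n", disk_iops(15000, 3.5));
    printf("7.2k RPM SATA, 8.5 ms seek: ~%.0f IOPS\n", disk_iops(7200, 8.5));
    return 0;
}

Command queueing and seek-friendly workloads push the SATA figure up toward the 100-150 range quoted above.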

But wait, storage array performance is much higher than the raw SAS/SATA IOPS. How do they achieve high performance? The answer is cache. Storage arrays use a combination of non-volatile RAM and flash to buffer data so that I/O completes quickly with minimal latency. This architecture works fine with magnetic disks, but it starts to create problems when we transition to flash/solid state disks as primary storage. Why? Flash/SSD read and write latency is so low that it no longer makes sense to cache the data in front of it. It will be interesting to see how storage vendors solve this bottleneck.