SATA supports three link power management states (SLUMBER, PARTIAL and MAX), and these are tunable via the Linux /sys file system. The specification states that a drive in the PARTIAL state must return to the active (MAX) state within about 10 us, and from SLUMBER within about 10 ms. Once you add software overhead, the actual time to bring a drive from SLUMBER back to MAX is much higher. But the power savings are considerable, and it is worth power managing the drives.
Linux Aggressive Link Power Management (ALPM) is a power-saving technique that helps the disk save power by setting a SATA link to the disk to a low-power setting during idle time. ALPM automatically sets the SATA link back to an active power state once I/O requests are queued to that link.
For example, the user can set "/sys/class/scsi_host/host*/link_power_management_policy" to min_power, medium_power or max_performance. These policies correspond to the SLUMBER, PARTIAL and MAX power states respectively.
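As a concrete illustration, here is a minimal user-space sketch (assuming root privileges and a kernel/AHCI driver that exposes the sysfs attribute above) that writes one of the three policy strings to a given host:

    /*
     * Minimal sketch: write an ALPM policy string to one SATA host via sysfs.
     * Hypothetical usage:  ./alpm 0 min_power   (must be run as root)
     */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
            char path[128];
            FILE *fp;

            if (argc != 3) {
                    fprintf(stderr, "usage: %s <host#> <min_power|medium_power|max_performance>\n",
                            argv[0]);
                    return 1;
            }
            snprintf(path, sizeof(path),
                     "/sys/class/scsi_host/host%s/link_power_management_policy", argv[1]);
            fp = fopen(path, "w");
            if (!fp) {
                    perror(path);
                    return 1;
            }
            fputs(argv[2], fp);     /* the kernel parses the policy string on write */
            fclose(fp);
            return 0;
    }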
SATA PHYs are built from CMOS digital logic, which consumes almost no power when static. The logic only burns power when gates switch (dynamic power scales roughly with capacitance, voltage squared and switching frequency), so higher link speed means more state transitions and therefore more power. The following state diagram describes the SATA power transition states in detail.
Thursday, June 23, 2011
Tuesday, June 21, 2011
Linux library "libsas.so"
The underlying storage protocols have changed vastly since the original Linux implementation. In the early 90's, SCSI and ATA were the primary mass storage protocols. As Linux evolved to catch up with modern storage interfaces like SAS, SATA and FCoE, some design changes were made in the storage stack. We will briefly discuss "libsas", the in-kernel SAS library shared by Low Level Drivers (LLDs) and the SCSI storage stack. Why do we need the libsas interface to begin with?
1) The SAS fabric is dynamic: devices can appear and disappear over time. From the user/kernel perspective, device nodes need to be created or deleted dynamically, such as mass storage nodes /dev/sdX or generic nodes /dev/sgX, which are needed to access SCSI Enclosure Services (SES) devices
2) The SAS physical layer is similar to an Ethernet interface in that various physical layer statistics (similar to netstat) can be exported to kernel/user space
3) In the early days of SAS 1.0, Expander routing tables had to be configured by an external HBA/Initiator. This has changed since SAS 2.0, as Expanders are now self-configuring
The above tasks can be implemented by each LLD in a proprietary way (some HBAs do this), or the designer of the HBA LLD can use the common SAS library to handle the dynamic SAS events outlined above.
There are many good reference texts that describe the Linux SCSI storage subsystem in great detail. For the sake of completeness, I've drawn a block diagram that depicts the Linux storage stack.
When a new HDD or Expander is added to the SAS fabric, these devices broadcast certain frames or primitives. These are received by the HBA LLD, which needs to take appropriate action based on their contents. For example, a hot-plugged HDD sends an IDENTIFY address frame identifying itself as a hard disk. When the LLD receives such a frame, a device node such as /dev/sdX needs to be created in the file system. So how is this information passed to the kernel? The following struct (libsas.h) contains various fields that are populated by the LLD, with the exception of the three function pointers identified below.
struct sas_ha_struct {
        ..
        /* The LLD calls these to notify libsas of an event;
         * libsas fills them in during registration */
        void (*notify_ha_event)(struct sas_ha_struct *, enum ha_event);
        void (*notify_port_event)(struct asd_sas_phy *, enum port_event);
        void (*notify_phy_event)(struct asd_sas_phy *, enum phy_event);

        struct asd_sas_phy **sas_phy;   /* <--- per-PHY frame_rcvd and sas_prim */
        ..
};
When the LLD registers with the kernel using sas_register_ha(struct sas_ha_struct *), the above function pointers become valid after successful registration.
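To make the flow concrete, here is a heavily simplified sketch (not a working driver) of how an LLD might populate the struct and register. Only the fields quoted above plus sas_addr, num_phys and dev are shown; my_lld_attach, my_phys and MY_NUM_PHYS are hypothetical names, and a real driver also sets up sas_port, the executor callbacks and error handling.

    /* Sketch only -- a real LLD does far more setup than this. */
    #include <scsi/libsas.h>

    #define MY_NUM_PHYS 8                           /* hypothetical HBA with 8 PHYs */

    static struct sas_ha_struct my_ha;
    static struct asd_sas_phy *my_phys[MY_NUM_PHYS];
    static u8 my_sas_addr[8];                       /* world-wide SAS address, 8 bytes */

    static int my_lld_attach(struct device *dev)
    {
            my_ha.dev      = dev;                   /* parent device, e.g. &pdev->dev */
            my_ha.sas_addr = my_sas_addr;
            my_ha.sas_phy  = my_phys;               /* per-PHY state owned by the LLD */
            my_ha.num_phys = MY_NUM_PHYS;

            /* libsas fills in the notify_* callbacks as part of registration */
            return sas_register_ha(&my_ha);
    }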
The LLD calls the appropriate (*notify_*)() function pointer so that the event can be processed by the SAS library, based on the actual events/information received from the SAS fabric. For example, if a SAS PHY goes down or is broken, the LLD calls (*notify_phy_event)() with the appropriate PHY event. The SAS library then tears down the link and removes the corresponding device node in /dev to reflect the state of the SAS fabric.
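Continuing the sketch above, a PHY-down interrupt might be forwarded to libsas like this (PHYE_LOSS_OF_SIGNAL is the libsas loss-of-signal event as I recall it; exact enum names can differ between kernel versions):

    /* Sketch only: LLD interrupt path reporting a dead PHY to libsas */
    static void my_phy_down_irq(int phy_id)
    {
            struct asd_sas_phy *phy = my_ha.sas_phy[phy_id];

            /* libsas tears down the port and removes the /dev nodes */
            my_ha.notify_phy_event(phy, PHYE_LOSS_OF_SIGNAL);
    }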
Wednesday, April 27, 2011
Hotplug
Prior to the introduction of PCI Express, hotplug of system, chassis and other components was mainly implemented using proprietary HW. This required custom HW, firmware and OS driver updates, which were quite messy and prone to errors and bugs. With the introduction of the PCI-SIG standard, however, the entire hotplug operation from HW to Operating System became quite straightforward, eliminating proprietary implementations.
On any modern system, you typically have a Root Complex and/or PCI Express switch connected to the CPU. These devices have about 40-80 x1 PCI Express lanes, which can be grouped into x4, x8 or x16 ports. Of these, only a limited number of ports can function as PCI hotplug capable slots, because PCI-E hotplug needs about 10+ external discrete signals to work properly. For obvious reasons, not all lanes or ports can support hotplug given the limited pin count of the Root Complex/switch.
Some of the important hotplug signals are PERST, PRSNT1/2, PWRFLT, MRL and BUTTON. The diagram below shows the flow of these signals with respect to hotplug devices. All PCI ports supporting hotplug need to implement the "Slot Capabilities" register space accessible to the CPU. This space lets the CPU query whether a hotplug device is present, quiesce the bus interface, determine the maximum power for the slot, and detect whether the user has initiated an operation on the hotplug device. Standard hotplug slots typically have BLUE OK2RM and FAULT visual indicators and an Attention Button to configure the device. The user presses the Attention Button and waits until the BLUE OK2RM LED is lit before the device can be removed safely. Since PCI Express is a chatty protocol, surprise removal of a hotplug device will result in a system panic.
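To illustrate what "Slot Capabilities" gives the CPU, here is a small sketch that decodes a raw Slot Capabilities dword (offset 0x14 of the PCIe capability in config space). The bit positions follow the PCIe base specification as I recall it, and the sample value is made up:

    /* Sketch: decode a raw PCI Express Slot Capabilities register value. */
    #include <stdio.h>
    #include <stdint.h>

    static void decode_slot_cap(uint32_t cap)
    {
            printf("Attention Button present : %u\n", (cap >> 0) & 1);
            printf("Power Controller present : %u\n", (cap >> 1) & 1);
            printf("MRL Sensor present       : %u\n", (cap >> 2) & 1);
            printf("Attention Indicator      : %u\n", (cap >> 3) & 1);
            printf("Power Indicator          : %u\n", (cap >> 4) & 1);
            printf("Hot-Plug Surprise        : %u\n", (cap >> 5) & 1);
            printf("Hot-Plug Capable         : %u\n", (cap >> 6) & 1);
            printf("Physical Slot Number     : %u\n", (cap >> 19) & 0x1fff);
    }

    int main(void)
    {
            decode_slot_cap(0x0040007f);    /* made-up value: hot-plug slot 8, all indicators */
            return 0;
    }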
In the next blog, I'll discuss how the Operating System (Solaris/Linux) and HW work together to configure and unconfigure hotplug devices.
Sunday, March 20, 2011
another Y2K multi core issue
There are many issues faced during kernel development as the number of CPUs keeps increasing. Recently there have been a few issues with the PIC (Programmable Interrupt Controller) handling large numbers of interrupts. In the original Intel architecture, the xAPIC was designed to handle about 255 CPUs using an 8-bit APIC ID. Since CPU counts keep increasing, x2APIC extends this to 32-bit IDs. As you can imagine, most Operating Systems designed after Y2K can handle only 8-bit CPU IDs. Enabling 32-bit x2APIC IDs in the BIOS breaks most, if not all, Operating Systems :-)
Sunday, February 6, 2011
SAS Expander
The SAS Expander was introduced in 2006 and has become an important building block of storage arrays. As the name implies, an Expander is like an Ethernet switch for expanding storage topologies. In theory, you could daisy chain multiple Expanders and build a larger storage fabric.
I've identified a few issues related to Expanders below.
Topology convergence
SAS/SATA devices and Expanders can be plugged into the fabric dynamically at any time. This generates a flood of IDENTIFY address frames and BROADCAST primitives, which cause the HBA and Expanders to issue discovery commands. If the HBA is overwhelmed with such primitives, it stops executing I/O and instead spends its time servicing primitives!
QoS
Expanders enforce simple QoS using the OPEN frame's Pathway Blocked Count and Arbitration Wait Time when multiple Initiators attempt to access a storage resource. Expanders are essentially stateless and rely on the HBA for this information, so an HBA could gain preferential access to a storage resource by manipulating these values.
Resource locking
SAS disks support the Disconnect/Reconnect mode page, which is used to limit burst data transfer, the duration of data phases, and so on. Multiple SAS Initiators can modify this page, overriding each other's values. The SATA/STP protocol does not support any such mode pages, and an STP Initiator can hold a resource using SATA affiliations. In a multi-initiator STP environment, there is no way for an Expander to enforce fairness among all Initiators: if one Initiator cannot fully utilize its allocated resource, the spare bandwidth cannot be redistributed to the other Initiators that need it. There are several algorithms for achieving max-min fairness in the use of a resource. Many of these were developed for cellular and wireless networks, but they can generally be applied to Expanders. One option is a Time Based Regulator (TBR) algorithm based on the leaky bucket scheme. TBR distributes the long-term channel occupancy time equally among the various sources: the tokens used in the algorithm represent a unit of time and are periodically generated for all sources, and the Expander could open or close connections to an Initiator/Target depending on the unit of time allocated.
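A toy sketch of that time-token idea: each Initiator periodically receives a time budget, and the Expander only grants a connection to an Initiator that still has budget left. All names and numbers below are illustrative, not from any SAS specification.

    /* Toy time-based regulator: time tokens per initiator, leaky-bucket style. */
    #include <stdio.h>

    #define NUM_INITIATORS 4
    #define TOKEN_US       500     /* time credit handed out each refill period */
    #define MAX_BUDGET_US  2000    /* bucket depth: cap on accumulated credit   */

    static int budget_us[NUM_INITIATORS];

    static void refill_tokens(void)            /* called once per refill period */
    {
            for (int i = 0; i < NUM_INITIATORS; i++) {
                    budget_us[i] += TOKEN_US;
                    if (budget_us[i] > MAX_BUDGET_US)
                            budget_us[i] = MAX_BUDGET_US;
            }
    }

    /* May initiator i hold a connection for want_us microseconds? */
    static int may_open_connection(int i, int want_us)
    {
            if (budget_us[i] < want_us)
                    return 0;               /* out of budget: must wait for refill */
            budget_us[i] -= want_us;        /* charge the channel occupancy time   */
            return 1;
    }

    int main(void)
    {
            refill_tokens();
            printf("initiator 0, 300us: %s\n", may_open_connection(0, 300) ? "yes" : "no");
            printf("initiator 0, 300us: %s\n", may_open_connection(0, 300) ? "yes" : "no");
            return 0;
    }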
Also, because SAS is a Class 1 style protocol, once an I_T_L_Q connection is established the Expander simply pushes incoming SAS/STP frames and primitives to an egress port. Primitives are 32-bit dwords without any state information such as source and destination addresses; they are essentially used for flow control and to manage connections between Initiator and Target devices. When daisy chaining multiple Expanders, typically only x4 or x8 physical links are used between them. If there are hundreds of Initiators/Targets in such a fabric, the number of active I/O transfers across devices is limited by the number of links between Expanders. HBAs and Target devices also start timing out if credit and flow control primitives are not received within 1 ms. As primitives and SSP/STP frames propagate through multiple Expanders, considerable latency is added, resulting in I/O timeouts. Unlike Ethernet, storage drivers are very sensitive to errors; once I/O starts to time out, it is almost impossible to recover without some sort of reset.
Tuesday, January 25, 2011
Multi Core CPU and NUMA systems
With the introduction of multi-core CPUs, performance engineers are being challenged to fine tune systems to get maximum mileage out of them. It is definitely worth checking whether your Operating System supports the following important features.
1) Kernel thread/process affinity between the local processor and memory. NUMA means non-uniform memory access: the latency to fetch memory attached to a remote processor is much higher. The processor can also spend considerable time generating/servicing TLB flushes in such situations, resulting in lower performance. Solaris has supported Memory Placement Optimization (MPO) for some time now: the kernel is aware of the system memory map and allocates memory for, and creates, kernel threads close to the CPU (see the Linux sketch after this list)
2) Make sure NIC and HBA drivers can dynamically bind interrupts to any processor on the system. Modern PCI Express supports multiple in-band interrupts (MSI) per device, but the Operating System needs to take advantage of this by supporting a large number of MSI interrupt vectors
3) If you're running Virtual Machines, make sure your Operating System and HW are SR-IOV capable. Most virtual guests tunnel network and storage requests through the control domain (domain 0). This was OK when systems had only a few CPUs, but with multi-core systems Operating System performance will suffer heavily without SR-IOV. PCI Express fabric errors are still handled by domain 0, while VM guests perform I/O by manipulating the HW directly; the hypervisor typically ignores PCI config writes executed by VM guests. Watch out for potential Denial of Service (DoS) attacks when using SR-IOV: because this technology allows VM guests to access a piece of HW directly, it may be possible for an errant VM guest to perform excessive I/O, potentially pinning the PCI Express bus
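Here is a minimal Linux sketch of point 1 using libnuma (compile with -lnuma); node 0 and the 1 MiB buffer size are arbitrary illustration values, and Solaris MPO achieves the same locality automatically.

    /* Pin the calling thread to a NUMA node and allocate memory from that node. */
    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
            const int node = 0;
            const size_t sz = 1 << 20;

            if (numa_available() < 0) {
                    fprintf(stderr, "NUMA not supported on this system\n");
                    return 1;
            }
            if (numa_run_on_node(node) != 0) {      /* run only on this node's CPUs */
                    perror("numa_run_on_node");
                    return 1;
            }
            char *buf = numa_alloc_onnode(sz, node); /* keep the buffer local too */
            if (!buf)
                    return 1;

            buf[0] = 1;             /* touch the memory so pages are actually placed */
            numa_free(buf, sz);
            return 0;
    }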
Difference between FC, SAS and SATA
I see many people in the IT industry who are confused by the various storage protocols. Let me explain a bit of the history and why these protocols have evolved over a period of 30 years.
All of these protocols are block based, that is, they transport/store data in increments of 512 bytes. Sometimes, depending on the media (such as CD or DVD), the basic block size can increase to 2048 bytes or higher; the default block size can be altered using SCSI commands such as MODE SELECT (mode pages). During the early days of the PC (the 80's), Parallel SCSI and IDE were the most widely used storage bus interfaces. Parallel SCSI is a true bus (you can connect or daisy chain multiple devices simultaneously), while IDE was designed for a simpler desktop environment (connect one or two hard disks or CD/DVD drives). IDE cables are much shorter than SCSI (< 1 m), while SCSI can connect peripherals up to about 6 m away (slightly more with differential cabling). Parallel SCSI and IDE/ATA are connection-oriented, Class 1 style protocols: two devices need to agree before any command or data transfer can occur.
Fibre Channel was designed to overcome the following limitations of Parallel SCSI.
1) Cable length limited to about 6 m
2) Only about 15 devices can be connected between the HBA (Initiator) and Targets (disk, CD, etc.)
3) No arbitration fairness: in SCSI, the device with the highest ID always wins arbitration
4) Due to the Class 1 style protocol, only one pair of devices can actively communicate over the bus at a time
5) Clock skew and signal integrity issues on the Parallel SCSI bus
Since Parallel SCSI hard disks were highly reliable, they were deployed in enterprises and data centers, while the less reliable IDE disks were deployed in desktop systems. One major disadvantage of IDE is that it is not a true bus: only a maximum of two devices can be connected between Initiator and Target. However, this was enough for most desktop users.
Some of the key features of FC:
1) No practical distance limitation between server and storage, so storage can be moved to a centralized location where multiple systems can access it simultaneously, improving disk utilization
2) Several classes of service are defined (Class 1, 2, 3 & 4), though virtually all HBAs and switches support only Class 3 (similar to UDP)
3) Before two devices can communicate, they must log in and establish credentials (similar to a TCP socket). Common logins are the end device login (N_Port) and the fabric login (F_Port). Once devices establish a connection it persists, with no need to re-login
4) True switched fabric, thanks to the Class 3 protocol
5) The FC fabric can be partitioned into multiple segments for security (zoning)
The SAS and SATA protocols were created in the early 2000s due to the design limitations of the Parallel SCSI and IDE/ATA interfaces. SAS was designed as a replacement for Parallel SCSI, and SATA as a replacement for IDE. Both SAS and SATA use 8b/10b encoding at the physical layer.
SATA still has the limitation of connecting only one device per HBA port. However, most HBAs have about 4-8 ports, so it is possible to connect 8 SATA devices to a single HBA. A typical SAS HBA also has 8 ports, so in theory it can directly connect only 8 SAS devices. SAS, however, differs from SATA with the introduction of the SAS Expander, which expands the topology (similar to an Ethernet switch) to support a large number of devices. In theory, a SAS Expander fabric can support about 64K devices, which is adequate for most enterprise deployments. SAS differs from FC as follows:
1) SAS is a connection-oriented (Class 1 style) protocol: the Initiator and Target must establish a connection before data transfer can occur
2) Unlike FC, SAS has both a control and a data domain; the control protocol (SMP) is used to configure Expanders
Though Expanders are like an FC switch, SAS is not a true switched fabric due to the limitations of its Class 1 style protocol. There are issues with current SAS Expanders, which we will discuss in the next blog. Some of them are identified below.
1) The SAS protocol is blocking in a multi-initiator environment: if two devices are communicating, other devices must wait before they can initiate a data transfer
2) Most SAS primitives (such as RRDY) are transported within connections and are not packetized with source and destination addresses, which prevents the SAS Expander from becoming a true switched fabric
3) No centralized arbitration enforcement
One of the biggest advantages of SAS is that SATA devices can be attached to a SAS HBA and/or Expander: the SAS protocol is designed to tunnel the ATA protocol using STP (SATA Tunneling Protocol). This greatly accelerated the deployment of SAS, which has pretty much replaced Parallel SCSI.
I see a lot of discussion about SAS replacing FC, but I think this may never happen due to the design limitations of SAS discussed above. Moreover, data centers don't need another fabric when FC and the emerging FCoE enjoy widespread deployment. SAS will continue to expand as a storage fabric, eventually replacing FC-AL (Arbitrated Loop) and FC disks, but the SAN protocol of choice will still be FC and/or FCoE.
Monday, January 24, 2011
8b/10b Encoding
Almost all modern communication protocols such as PCI Express, SAS, SATA, InfiniBand and FC are based on IBM's 8b/10b encoding patent. Encoding has numerous advantages, such as fewer motherboard traces (the clock is embedded in the data stream), high speed serial transfer and DC balance. One limitation of 8b/10b encoding is its 25% bandwidth overhead (10 bits on the wire for every 8 bits of payload), which led the designers of the PCI Express 3.0 protocol to a new encoding scheme, 128b/130b. As far as I know, the latest SAS/SATA protocols still use 8b/10b, but this may someday change to a new encoding scheme.
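A quick back-of-the-envelope comparison of the two schemes (usable rate = line rate x payload bits / encoded bits):

    /* Encoding overhead: usable payload rate for 8b/10b vs 128b/130b links. */
    #include <stdio.h>

    int main(void)
    {
            double sata3_gbps = 6.0 * 8.0   / 10.0;    /* 6 Gb/s SATA link, 8b/10b     */
            double pcie3_gbps = 8.0 * 128.0 / 130.0;   /* 8 GT/s PCIe 3.0 lane, 128b/130b */

            printf("SATA 6G usable :  %.2f Gb/s (~%.0f MB/s)\n",
                   sata3_gbps, sata3_gbps * 1000 / 8);
            printf("PCIe 3.0 usable:  %.2f Gb/s (~%.0f MB/s) per lane\n",
                   pcie3_gbps, pcie3_gbps * 1000 / 8);
            return 0;
    }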
Friday, January 7, 2011
FC fractional bandwidth coming back ?
Fibre Channel fractional bandwidth (Class 4) allows a single FC connection to be broken into multiple Virtual Circuits. As far as I know there is no real business use case for fractional bandwidth today, but I guess this may change with the advent of FCoE, 6/12 Gbps SAS/SATA and Solid State Disks. SSD latency is so low that a single device can deliver thousands of IOPS to an FC Initiator, and when multiple FC Initiators are connected to SSD storage there could be a business need for fine-grained QoS.
Thursday, January 6, 2011
Joy of Flash/Solid State Disks
Traditional storage arrays use a combination of SAS/SATA disks to store data. However, if you look at the raw performance of these disks, it is really very poor. Let's compare:
SAS (Serial Attached SCSI) ~200 IOPS
SATA (Serial ATA) ~100-150 IOPS
Typically, you get the above IOPS for block sizes of 4-8 KB.
But wait, storage array performance is much higher than raw SAS/SATA IOPS. How do they achieve high performance? The answer is cache. Storage arrays use a combination of non-volatile RAM and flash to buffer data so that I/O completes quickly with minimal latency. This architecture works fine with magnetic disks but starts to create problems when we transition to flash/Solid State Disks as primary storage. Why? Flash/SSD read/write latency is so low that it no longer makes sense to cache the data. It will be interesting to see how storage vendors solve this bottleneck.
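Some rough arithmetic behind that argument, using the ballpark spindle figures above and an assumed, purely illustrative 20,000 IOPS for an SSD:

    /* Rough random-I/O throughput: spinning disk vs. SSD (ballpark figures only). */
    #include <stdio.h>

    int main(void)
    {
            const double blk_kb   = 8.0;          /* 8 KB random I/O                   */
            const double sas_iops = 200.0;        /* per SAS disk, from the text       */
            const double ssd_iops = 20000.0;      /* assumed illustrative SSD figure   */

            printf("SAS disk: %.1f MB/s random\n", sas_iops * blk_kb / 1024.0);
            printf("SSD     : %.1f MB/s random (roughly %dx the spindle)\n",
                   ssd_iops * blk_kb / 1024.0, (int)(ssd_iops / sas_iops));
            return 0;
    }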