I've been following big data open source projects such as Hadoop for more than two years now. It's quite easy to set up a Hadoop cluster and implement a MapReduce job, provided you have programming and system administration skills. It's amazing how this technology has gained traction in the IT world in the past year or so. The following link gives an example of how Quantcast processed vast amounts of data in a few seconds.
http://www.quantcast.com/inside-quantcast/2011/11/turbo-mapreduce/
Blog of Ravi Shankar
Saturday, February 18, 2012
Medical startup
Over the past six months I have been helping a good friend of mine jump-start his telemedicine startup, discussing various ideas and implementing an Android UI framework and a TI OMAP HW based dongle to connect various medical devices via USB, Ethernet and the plain old RS232 interface. I learned a lot in those six months, and it was quite interesting to discuss the business plan while at the same time working on a functional prototype. I will summarize my experience in my next blog post.
Thursday, June 23, 2011
Disk Power Management
SATA supports three power management states (SLUMBER, PARTIAL and MAX), and these are tunable via the Linux /sys file system. The specification states that when a drive is in the PARTIAL or SLUMBER state, it needs to switch to the MAX state in about 10 us to 10 ms. If you add software overhead, the actual time to switch a drive from SLUMBER to MAX is much higher. But the power savings are considerable, and it's worth power managing the drives.
Linux Aggressive Link Power Management (ALPM) is a power-saving technique that helps the disk save power by setting a SATA link to the disk to a low-power setting during idle time. ALPM automatically sets the SATA link back to an active power state once I/O requests are queued to that link.
For example, the user can set "/sys/class/scsi_host/host*/link_power_management_policy" to min_power, medium_power or max_performance. These settings correspond to the SLUMBER, PARTIAL and MAX power states respectively.
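As a minimal sketch (assuming the sysfs attribute above exists on your kernel and that you run as root), the small C program below walks every SCSI/SATA host and writes the requested policy. The file name and policy strings come straight from the sysfs attribute quoted above; everything else is just illustration.

/* alpm_set.c - set the SATA link power management policy for all hosts.
 * Build: gcc -o alpm_set alpm_set.c
 * Usage: sudo ./alpm_set min_power   (or medium_power / max_performance)
 */
#include <glob.h>
#include <stdio.h>

int main(int argc, char **argv)
{
	const char *policy = (argc > 1) ? argv[1] : "min_power";
	glob_t g;

	if (glob("/sys/class/scsi_host/host*/link_power_management_policy",
	         0, NULL, &g) != 0) {
		fprintf(stderr, "no hosts with an ALPM policy attribute found\n");
		return 1;
	}

	for (size_t i = 0; i < g.gl_pathc; i++) {
		FILE *f = fopen(g.gl_pathv[i], "w");
		if (!f) {
			perror(g.gl_pathv[i]);
			continue;
		}
		/* Writing the policy string is all it takes; the kernel then
		 * applies PARTIAL/SLUMBER transitions automatically on idle. */
		fprintf(f, "%s\n", policy);
		fclose(f);
		printf("%s <- %s\n", g.gl_pathv[i], policy);
	}
	globfree(&g);
	return 0;
}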
The SATA PHY is based on CMOS digital logic, which uses almost no power in a static condition. However, the logic consumes power when gates switch, so higher speed means more state transitions and therefore higher power. The following state diagram describes the SATA power transition states in detail.
Tuesday, June 21, 2011
Linux library "libsas.so"
The underlying storage protocols have changed vastly since the original Linux implementation. In the early 90's, SCSI and ATA were the primary mass storage protocols. As Linux evolved to catch up with modern storage interfaces like SAS, SATA and FCoE, some design changes were made in the storage stack. We will briefly discuss "libsas", which is shared by the Low Level Drivers (LLDs) and the SCSI storage stack. Why do we need the libsas interface to begin with?
1) The SAS interface is dynamic, and devices can appear and disappear over time. From the user/kernel perspective, device nodes need to be created or deleted dynamically, such as the mass storage nodes /dev/sdX or generic nodes such as /dev/sgX, which are needed to access SCSI Enclosure Services (SES) devices.
2) The SAS physical layer is similar to an Ethernet interface in that various physical layer statistics (similar to netstat) can be exported to kernel/user space.
3) In the early days of SAS 1.0, Expander routing tables needed to be configured by an external HBA/Initiator. This has changed since SAS 2.0, as Expanders are now self-configuring.
The above tasks can be implemented by each LLD in a proprietary way (some HBAs do this), or the designer of the HBA LLD can use the common SAS library to handle the dynamic SAS events outlined above.
There are many good reference texts which outline the Linux SCSI storage subsystem in great detail. For the sake of completeness, I've drawn a block diagram which depicts the Linux storage stack.
When a new HDD or Expander is added to the SAS fabric, the device broadcasts certain frames or primitives. These are received by the HBA LLD, which needs to take appropriate action based on their contents. For example, when an HDD is hot-plugged it sends an IDENTIFY address frame identifying itself as a hard disk. When such a frame is received by the LLD, a device node such as /dev/sdX needs to be created in the file system. So how is this information passed to the kernel? The following struct from libsas.h contains various fields which are populated by the LLD, with the exception of the three function pointers identified below.
struct sas_ha_struct {
	..
	..
	/* Filled in by libsas during sas_register_ha(); the LLD calls these
	 * to report adapter, port and PHY events up to the SAS library. */
	void (*notify_ha_event)(struct sas_ha_struct *, enum ha_event);
	void (*notify_port_event)(struct asd_sas_phy *, enum port_event);
	void (*notify_phy_event)(struct asd_sas_phy *, enum phy_event);
	struct asd_sas_phy **sas_phy; /* <--- each PHY carries frame_rcvd and sas_prim */
	..
};
When the LLD registers with the kernel using "sas_register_ha(struct sas_ha_struct *)", the above function pointers become valid after a successful registration.
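As a rough sketch of that registration step (my own minimal example, not taken from any real driver; field and function names follow the libsas headers of that era and may differ between kernel versions, and a real LLD also sets up a scsi_host, a sas_domain_function_template and the per-PHY structures, all omitted here):

#include <scsi/libsas.h>

#define MY_NUM_PHYS 8                               /* hypothetical HBA with 8 PHYs */

static struct asd_sas_phy *my_phys[MY_NUM_PHYS];    /* LLD-owned PHY objects */
static struct asd_sas_port *my_ports[MY_NUM_PHYS];  /* one port slot per PHY */
static u8 my_sas_addr[SAS_ADDR_SIZE];               /* HBA world-wide name */
static struct sas_ha_struct my_sas_ha;

static int my_lld_register(struct device *dev)
{
	my_sas_ha.dev      = dev;             /* parent device of the HBA */
	my_sas_ha.sas_addr = my_sas_addr;
	my_sas_ha.sas_phy  = my_phys;
	my_sas_ha.sas_port = my_ports;
	my_sas_ha.num_phys = MY_NUM_PHYS;

	/* After this returns successfully, the notify_*() pointers above
	 * have been filled in by libsas and may be called by the LLD. */
	return sas_register_ha(&my_sas_ha);
}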
The LLD can then call the notify_*() function pointers, based on actual events/information received from the SAS fabric, and the SAS library processes them. For example, if a SAS PHY goes down or is broken, the LLD calls (*notify_phy_event)() with the appropriate PHY event. The SAS library tears down the link and removes the appropriate device nodes in the /dev file system to reflect the state of the SAS fabric.
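A hedged sketch of that path (event and type names taken from the libsas headers of this era; my_lld_phy_down() is a hypothetical LLD helper) might look like this:

#include <scsi/libsas.h>

/* Hypothetical helper called from the LLD's interrupt handler when the
 * hardware reports that PHY 'phy_index' has lost signal. */
static void my_lld_phy_down(struct sas_ha_struct *sas_ha, int phy_index)
{
	struct asd_sas_phy *phy = sas_ha->sas_phy[phy_index];

	/* Hand the event to libsas; it tears down the domain devices behind
	 * this PHY and the matching /dev/sdX nodes go away. */
	sas_ha->notify_phy_event(phy, PHYE_LOSS_OF_SIGNAL);
}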
Wednesday, April 27, 2011
Hotplug
Prior to the introduction of PCI Express, hotplug of system, chassis and other components was mainly implemented using proprietary HW. This required custom HW, firmware and OS driver updates, which were quite messy and prone to various errors and bugs. However, with the introduction of the PCI-SIG standard, the entire hotplug operation from HW to operating system is fairly straightforward, eliminating proprietary implementations.
On any modern system, you typically have a Root Complex and/or a PCI Express switch connected to the CPU. These devices have about 40-80 "x1" PCI Express lanes, which can be grouped into "x4", "x8" or "x16" links. Out of these lanes, only a limited number of ports can function as PCI hotplug capable slots, because PCI-E hotplug needs about 10+ external discrete signals to function properly. For obvious reasons, not all lanes or ports can support hotplug, given the limited number of pins on the Root Complex/switch.
Some of the important hotplug signals are PERST, PRSNT1/2, PWRFLT, MRL and BUTTON. The diagram below shows the flow of these signals with respect to hotplug devices. All PCI ports supporting hotplug need to implement the "Slot Capabilities" register space, accessible to the CPU. This space lets the CPU query whether a hotplug device is present, quiesce the bus interface, determine the maximum power for the slot, and detect whether the user has initiated any operation on the hotplug device. Standard hotplug devices typically have BLUE OK2RM and FAULT visual indicators and an Attention Button to configure the device. The user can press the Attention Button and wait until the BLUE OK2RM LED is lit before the device can be removed safely. Since PCI Express is a chatty protocol, surprise removal of a hotplug device will result in a system panic.
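To make the Slot Capabilities register concrete, here is a small user-space sketch (the device path is a hypothetical example; run it as root) that walks the PCI capability list in config space and decodes a few of the hotplug related bits:

/* slotcap.c - decode PCIe Slot Capabilities from config space.
 * Build: gcc -o slotcap slotcap.c
 * Usage: sudo ./slotcap /sys/bus/pci/devices/0000:00:1c.0/config
 */
#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv)
{
	uint8_t cfg[256];
	const char *path = (argc > 1) ? argv[1]
	                              : "/sys/bus/pci/devices/0000:00:1c.0/config";
	FILE *f = fopen(path, "rb");

	if (!f || fread(cfg, 1, sizeof(cfg), f) != sizeof(cfg)) {
		perror(path);
		return 1;
	}
	fclose(f);

	/* Walk the capability list (Status register bit 4 says it exists,
	 * offset 0x34 holds the first pointer) looking for the PCI Express
	 * capability, ID 0x10. */
	if (!(cfg[0x06] & 0x10))
		return 1;
	uint8_t off = cfg[0x34] & 0xfc;
	while (off && cfg[off] != 0x10)
		off = cfg[off + 1] & 0xfc;
	if (!off || off > sizeof(cfg) - 0x18) {
		fprintf(stderr, "no PCI Express capability\n");
		return 1;
	}

	/* The Slot Capabilities register lives at offset 0x14 of the PCIe cap. */
	uint32_t sltcap = cfg[off + 0x14] | cfg[off + 0x15] << 8 |
	                  cfg[off + 0x16] << 16 | (uint32_t)cfg[off + 0x17] << 24;

	printf("Attention Button : %s\n", (sltcap & 0x01) ? "yes" : "no");
	printf("Power Controller : %s\n", (sltcap & 0x02) ? "yes" : "no");
	printf("MRL Sensor       : %s\n", (sltcap & 0x04) ? "yes" : "no");
	printf("Hot-Plug Capable : %s\n", (sltcap & 0x40) ? "yes" : "no");
	printf("Physical Slot #  : %u\n", sltcap >> 19);
	return 0;
}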
In my next blog post, I'll discuss how the operating system (Solaris/Linux) and the HW work together to configure and unconfigure hotplug devices.
Sunday, March 20, 2011
another Y2K multi core issue
Many issues come up during kernel development as the number of CPUs keeps increasing. Recently there have been a few issues with how the PIC (Programmable Interrupt Controller) handles interrupts. In the current Intel architecture, the legacy xAPIC uses an 8-bit APIC ID and can therefore address only about 255 CPUs; since CPU counts keep increasing, the x2APIC extension supports 32-bit IDs. As you can imagine, most operating systems designed after Y2K can handle only 8-bit CPU IDs, so enhancing the BIOS to enable 32-bit x2APIC IDs breaks most if not all operating systems :-)
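As a quick illustration (a sketch using the GCC/Clang cpuid.h helpers on x86), the snippet below checks whether the processor advertises x2APIC and, if so, reads the full 32-bit x2APIC ID from CPUID leaf 0Bh; whether the firmware and OS actually enable x2APIC mode is a separate matter:

/* x2apic_check.c - report x2APIC support and the 32-bit x2APIC ID.
 * Build: gcc -o x2apic_check x2apic_check.c
 */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	/* CPUID leaf 1: ECX bit 21 advertises x2APIC support. */
	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
		fprintf(stderr, "CPUID leaf 1 not available\n");
		return 1;
	}
	if (!(ecx & (1u << 21))) {
		printf("x2APIC not supported: only 8-bit xAPIC IDs (<= 255 CPUs)\n");
		return 0;
	}

	/* CPUID leaf 0Bh (subleaf 0): EDX returns the full 32-bit x2APIC ID
	 * of the CPU this code happens to run on. */
	if (__get_cpuid_count(0x0b, 0, &eax, &ebx, &ecx, &edx))
		printf("x2APIC supported, this CPU's 32-bit x2APIC ID = %u\n", edx);
	return 0;
}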
Sunday, February 6, 2011
SAS Expander
The SAS Expander was introduced in 2006 and has become an important building block of storage arrays. As the name implies, an Expander is like an Ethernet switch for expanding storage topologies. In theory, you could daisy-chain multiple Expanders and build a larger storage fabric.
I've identified a few issues related to Expanders below.
Topology convergence
SAS/SATA devices and Expanders can be dynamically plugged into the fabric at any time. This generates a flood of IDENTIFY frames and BROADCAST primitives, which cause the HBA and Expanders to issue discovery commands. If the HBA is overwhelmed with such events, it stops executing I/O and instead spends its time servicing them!
QoS
Expanders enforce simple QoS using the Open frame's Pathway Blocked Count and Arbitration Wait Time fields when multiple Initiators attempt to access a storage resource. Expanders are essentially stateless and rely on the HBA for this information, so an HBA could gain unfair access to a storage resource by manipulating these values.
Resource locking
SAS disks support the Disconnect-Reconnect mode page, which is used to limit burst data transfers, the duration of data phases, etc. Multiple SAS Initiators could modify this page and end up overriding each other's values. The SATA/STP protocol does not support any such mode pages, and an STP Initiator can hold a resource using SATA affiliations. In a multi-initiator STP environment, there is no way for an Expander to enforce fairness among all Initiators. If an Initiator cannot fully utilize its allocated resource, the bandwidth cannot be redistributed to the other initiators in need. There are several algorithms used to achieve max-min fairness in the use of a resource; many of these were developed for cellular and wireless networks, but they can generally be applied to Expanders. One option is a Time Based Regulator (TBR) algorithm based on the leaky bucket scheme. The TBR algorithm equally distributes the long-term channel occupancy time among the various sources. The tokens used in the algorithm represent a unit of time and are periodically generated for all sources. Expanders could open/close connections to the Initiator/Target depending on the unit of time allocated.
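The paragraph above describes TBR only conceptually; the toy sketch below (my own simplification, not taken from any Expander firmware, with made-up parameters) shows the idea of periodically granted time tokens gating connection requests:

/* tbr.c - toy Time Based Regulator: each source periodically receives
 * time tokens; a connection request is granted only if the source still
 * has enough accumulated channel time. Simplified illustration only.
 */
#include <stdio.h>

#define NUM_SOURCES     4
#define TOKEN_PERIOD_US 1000  /* each source gets this much time per period */
#define BUCKET_MAX_US   4000  /* leaky-bucket cap: unused time is not hoarded */

static long bucket_us[NUM_SOURCES];

/* Called once per token period: distribute channel time equally. */
static void tbr_refill(void)
{
	for (int i = 0; i < NUM_SOURCES; i++) {
		bucket_us[i] += TOKEN_PERIOD_US;
		if (bucket_us[i] > BUCKET_MAX_US)
			bucket_us[i] = BUCKET_MAX_US;
	}
}

/* A source asks to open a connection for 'duration_us' of channel time.
 * Grant it only if its bucket covers the request; this is what bounds the
 * long-term channel occupancy of any single initiator. */
static int tbr_try_open(int src, long duration_us)
{
	if (bucket_us[src] < duration_us)
		return 0;             /* deny: out of tokens for now */
	bucket_us[src] -= duration_us;
	return 1;                     /* grant: open the connection */
}

int main(void)
{
	for (int tick = 0; tick < 3; tick++) {
		tbr_refill();
		for (int src = 0; src < NUM_SOURCES; src++)
			printf("tick %d src %d: request 1500us -> %s (left %ldus)\n",
			       tick, src,
			       tbr_try_open(src, 1500) ? "grant" : "deny",
			       bucket_us[src]);
	}
	return 0;
}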
Also, because SAS uses Class 1 style circuit-switched connections, once an I_T_L_Q connection is established the Expander simply pushes incoming SAS/STP frames and primitives to an egress port. Primitives are 32-bit dwords without any state information such as source and destination addresses; they are essentially used for flow control and for managing connections between Initiator and Target devices. When daisy-chaining multiple Expanders, typically only x4 or x8 physical links are used between them. If hundreds of Initiators/Targets exist in such a fabric, the number of active I/O transfers across devices is limited by the number of links between Expanders. HBAs and Target devices also start timing out if credit and flow control are not received within 1 ms. As primitives and SSP/STP frames propagate through multiple Expanders, considerable latency is added, resulting in I/O timeouts. Unlike Ethernet, storage drivers are very sensitive to errors: once I/O starts to time out, it's almost impossible to recover without some sort of reset.