Coding for distributed networked storage systems


Storage and back-up of data are vital in today's society. With the rapid increase of the amount of information and data in digital format, reliable storage of data poses significant challenges despite the advances in storage device technologies. Cheap (and even expensive) individual storage devices are bound to fail, and hence redundancy mechanisms are essential in order to achieve reliable storage. Coding techniques promise economical and effective redundancy mechanisms for data storage. Use of existing coding techniques (such as Reed-Solomon erasure codes) for data storage is widespread, most successfully adapted for storage in media like CDs/DVDs (as well as RAID like systems).
We argue that the nuances of distributed networked storage (e.g., failure of some storage devices call for recreation and storage of lost redundancy in new storage devices, unlike a scratch on a CD, where we only can deal with the erasures, but can not replenish the losses) call for novel coding techniques tailor-made for distributed networked storage systems. We study the full spectrum of the principles and theory of novel codes suited for networked storage systems, along with the practical aspects such as algorithms and distributed networked storage system design issues in order to harness the salient properties of these new coding techniques.


Self-repairing codes (SRC): Foundational theory


Erasure codes provide a storage efficient alternative to replication based redundancy in (networked) storage systems. They however entail high communication overhead for maintenance, when some of the encoded fragments are lost and need to be replenished. Such overheads arise from the fundamental need to recreate (or keep separately) first a copy of the whole object before any individual encoded fragment can be generated and replenished. We propose as an alternative a new family of codes to improve the maintenance process, which we call self-repairing codes (SRC), with the following salient features: (i) encoded fragments can be repaired directly from other subsets of encoded fragments without having to reconstruct first the original data, ensuring that (ii) a fragment is repaired from a fixed number of encoded fragments, the number depending only on how many encoded blocks are missing and independent of which specific blocks are missing. These properties allow for not only low communication overhead to recreate a missing fragment, but also independent reconstruction of different missing fragments in parallel, possibly in different parts of the network.

Coding: As of May 2011, there are two distinct families of codes satisfying the cardinal self-repairing properties described above, namely (i) Homomorphic Self-Repairing Codes [HSRC] and (ii) Projective Geometric Self-Repairing Codes [PSRC].

HSRC [1,4] were introduced as a first proof of existence of codes having self-repairing properties, and are based on the concept of linearized polynomials. Each node stores one piece of encoded data, and as long as half of the nodes are live, it is possible to repair each failed node by contacting only two lives nodes, which is optimal. Moreover, for a specific failed node, there are many options for choosing a pair of nodes to recreate it.

PSRC [2] are built on spreads, a notion coming from projective geometry. They require each node to store several pieces of encoded data, and besides satisfying all the properties satisfied by HSRC, they provide a stronger self-repairing property, in that, to regenerate a specific failed node, if one live node is first chosen, then it can typically be paired with several other live nodes. The PSRC family of codes has the additional promising property of storing systematic pieces of the object, which can just be concatenated to reconstruct the object without any other decoding needs.

Decoding: Work in progress, TBA ...

Applied research and practical studies with SRCs


Repair of multiple objects: The SRC code properties provide adequate insight on how repairs can be carried out for an object in isolation. In practice, many objects are stored in a storage system, and likewise failures simultaneously affect multiple objects. Thus, there will naturally be contention for shared resources (e.g., I/O, network) during the repairs, and other system design choices such as placement of the encoded objects can influence the repair process. We have carried out some preliminary experiments to benchmark repair of multiple objects under single as well as multiple storage node failures.

In-network redundancy generation: During the storage process, traditional erasure codes require a unique source node to create and upload all the redundant data to the different storage nodes. However, such a source node may have limited communication and computation capabilities, which constrain the storage process throughput. Moreover, the source node and the different storage nodes might not be able to send and receive data simultaneously - e.g., nodes might be busy in a datacenter setting, or simply be offline in a peer-to-peer setting - which can further threaten the efficacy of the overall storage process. SRCs can be leveraged to achieve "in-network" redundancy generation [6]. Such in-network redundancy generation allows storage nodes to generate new redundant data by exchanging partial information among themselves, improving the throughput of the storage process. The process is carried out asynchronously, utilizing spare bandwidth and computing resources from the storage nodes. Experiments show that such an approach can increase the throughput of the storage process significantly with respect to the classical naive storage approach.

Regenerating codes (RGC)


Dimakis et al. derived a trade-off between storage capacity and bandwidth needed for repair, computed using network coding techniques. Codes achieving this trade-off have been called regenerating codes (RGC). This analysis was generalized by Kermarrec et al., yielding a more general class of RGC, taking into account collaboration among regenerating nodes. It was shown that collaboration benefits the trade-off curve. In [3] we study Byzantine fault tolerance of regenerating codes with collaboration, and show that in the presence of nodes polluting the network with corrupted data, the trade-off curve is actually worse than without any collaboration.

Testbeds


Proprietary cluster (using HP ThinClients): In order to carry out small scale deployments and experiments, we have set up a cluster composed of 50 HP t5745 ThinClients, each with an Intel Atom N280 Processor 1.66 GHz, 1 GB DDR3 SDRAM and a solid state drive (SSD) of 2 GB. A PC with 4GB RAM is used as the master node orchestrating the cluster. The network connectivity is provided by three 24-port HP 2510 Switches (J9019B) and the whole setup has a maximum power consumption of 3.31 kW.



High performance Hadoop setup (from IBM): Details TBA ...

Publications


  1. Self-repairing Homomorphic Codes for Distributed Storage Systems
    Frédérique Oggier, Anwitaman Datta
    Infocom 2011, The 30th IEEE International Conference on Computer Communications.
  2. Self-Repairing Codes for Distributed Storage - A Projective Geometric Construction
    IEEE Information Theory Workshop (ITW) 2011, Frédérique Oggier, Anwitaman Datta
  3. Byzantine Fault Tolerance of Regenerating Codes
    Frédérique Oggier, Anwitaman Datta
    P2P 2011, The 11th IEEE International Conference on Peer-to-Peer Computing.
  4. Homomorphic Self-repairing Codes for Agile Maintenance of Distributed Storage Systems
    Frédérique Oggier, Anwitaman Datta
    Manuscript on arXiv, July 2011.
  5. An Overview of Codes Tailor-made for Networked Distributed Data Storage
    Anwitaman Datta, Frédérique Oggier
    Manuscript on arXiv, September 2011.
  6. In-Network Redundancy Generation for Opportunistic Speedup of Backup
    Lluis Pamies-Juarez, Anwitaman Datta, Frédérique Oggier
    Manuscript on arXiv, November 2011.

Talks