The volume of data being created and stored is phenomenal, and the rate of growth is accelerating. Articles such as Straub’s 2004 “The Digital Tsunami: A Perspective on Data Storage” and Kramer et al.’s 2004 “Deep scientific computing requires deep data” bring attention to the challenges of managing data growth. (Kramer, et al., 2004; Straub, 2004) The University of California at Berkeley School of Information Management and Systems study, “How Much Information? 2003,” further defines the magnitude of the data management challenge. (Lyman, et al., 2003) Managing storage demand and costs is a significant challenge.
Lyman et al. estimate that as much as 3.2 exabytes of original digital data were created from 1999 to 2000, and as much as 5.4 exabytes in 2002. (Lyman, Varian, Dunn, Strygin, & Swearingen, 2000; Lyman, et al., 2003) In contrast, they estimate that annual production of paper-based information was as much as 6.3 petabytes from 1999 to 2000 and 6.8 petabytes in 2002. (Lyman, et al., 2000; Lyman, et al., 2003) Gupta and Pegah, in their 2007 article, “A new thought paradigm: delivering cost effective and ubiquitously accessible storage with enterprise backup system via a multi-tiered storage framework,” cite an International Data Corporation study that estimates data growth at 80% per year, projecting 600 exabytes of total stored data in 2010. (Gupta & Pegah, 2007)
Gupta and Pegah reference a 2006 survey by Monosphere and conclude that storage costs are also a significant challenge:
A survey conducted by Monosphere in Dec. 2006, illustrates that capital expenditures on storage are rapidly escalating and are causing companies to delay other important IT initiatives. While 62 percent of the total responded that increased storage spending has created budgetary problems for IT, the number escalated to 87 percent – more than two thirds – when the dataset was narrowed to include only executives titled director level and above who manage more than 100 terabytes of stored data. Also among the smaller dataset, responses reveal that 36 percent had to delay other IT projects due to the increase in storage expenditures. (Gupta & Pegah, 2007)
One might expect declining disk prices to offset enterprise storage costs, since per-unit disk costs have steadily declined while capacities have steadily increased. However, the growth in new data, combined with the complexity of managing large volumes of data, offsets any savings from cheaper parts. From the 2005 article, “The Evolution of Storage Service Providers: Techniques and Challenges to Outsourcing Storage,” by Hasan et al.: “According to a Gartner group study, the cost of managing data protection and storage is 5 to 7 times the cost of hardware, and 74% of the total storage related costs.” (Hasan, Yurcik, & Myagmar, 2005)
Storage Efficiency Technologies
Management of storage costs and growth can be achieved through various combinations of policy and technology. One technological means for managing storage growth is to apply new or different technologies to the storage infrastructure upon which data services operate. By improving the efficiency of storage sub-systems, the overall information system can be improved to enable cost-effective and scalable data services. There are three major technological areas of research that focus on storage efficiency: hierarchical storage management and archiving, data de-duplication and compression, and grid storage.
Hierarchical Storage Management and Data Archiving
Hierarchical storage management (HSM) is an industry term for a system of storage management that matches data properties, such as the last time a file was accessed, to cost- and performance-differentiated tiers of storage. Gupta and Pegah define HSM as “a data storage methodology which automatically moves data between expensive and inexpensive storage media.” (Gupta & Pegah, 2007) Gupta and Pegah describe the implementation of a three-tier HSM system and summarize its benefits:
“The total cost per unit of storage in tiered system is much smaller than a single tiered system. There is also almost no loss in performance. We have also been able to increase user quotas by orders of magnitude because of the more economical storage infrastructure.” (Gupta & Pegah, 2007)
Kaczmarski et al. produced similar results with a different HSM solution and case study. (Kaczmarski, Jiang, & Pease, 2003) Kaczmarski and his colleagues found that differentiated storage tiers enabled storage policies more closely aligned with “higher level business goals.” (Kaczmarski, et al., 2003)
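The access-age-to-tier mapping at the heart of such HSM policies can be sketched in a few lines. The Python sketch below is purely illustrative; the tier names and age thresholds are hypothetical and are not taken from Gupta and Pegah’s or Kaczmarski et al.’s systems:

```python
import time

# Hypothetical tiers, ordered from most expensive/fastest to cheapest/slowest.
# Each entry pairs a tier name with the minimum age (in days since last
# access) at which a file becomes eligible for that tier.
TIERS = [
    ("fast-disk", 0),    # recently accessed data
    ("slow-disk", 30),   # untouched for a month
    ("archive", 365),    # untouched for over a year
]

def choose_tier(last_access_epoch, now=None):
    """Pick the cheapest tier whose age threshold the file has exceeded."""
    now = now if now is not None else time.time()
    age_days = (now - last_access_epoch) / 86400
    tier = TIERS[0][0]
    for name, min_age_days in TIERS:
        if age_days >= min_age_days:
            tier = name  # keep descending to cheaper tiers while eligible
    return tier
```

A migration daemon in a real HSM system would periodically evaluate such a rule for every file and move data whose current tier no longer matches the policy’s answer.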
Data archiving and HSM are complementary technologies. Data archives can serve as a tier within an HSM solution. Data archiving technology enables long-term storage of information, usually in the form of write once read many (WORM) media. Medical records that must be kept for very long periods, sometimes up to 90 years, are an example of data that is best suited for archiving. (Roussos, 2007) Research such as the archival system proposed by Quinlan and Dorward in 2002 seeks to put forth increasingly more efficient archival solutions. (Quinlan & Dorward, 2002) The media costs for a WORM system like the one proposed by Quinlan and Dorward are very low. (Quinlan & Dorward, 2002)
HSM and data archiving technologies continue to advance. As with server and network technologies, storage virtualization has the potential to improve utilization and return on investment (ROI). Faibish et al. report on research into file system optimization when using virtualized storage devices. (Faibish, Fridella, Bixby, & Gupta, 2008) By abstracting the physical storage devices using virtualization technologies, disk utilization may be increased and, therefore, ROI improved. However, as Faibish et al. report, file systems must be optimized for virtual storage devices or else performance will suffer. (Faibish, et al., 2008) Roussos describes a specific storage vendor’s focus on storage virtualization as a strategic technology, presenting storage virtualization that integrates policy-based automation as an evolutionary technology descended from and improving on HSM. (Roussos, 2007)
Data De-duplication and Compression
The fundamental concepts that enable de-duplication and compression of data have been in place since the beginning of the Information Systems discipline. D’Imperio describes “computed, content-derived address-assignment” in the 1969 paper, “Information Structures: Tools in Problem Solving.” (D’Imperio, 1969) Storer et al. define de-duplication thus:
“Deduplication identifies common sequences of bytes both within and between files (“chunks”), and only stores a single instance of each chunk regardless of the number of times it occurs. By doing so, deduplication can dramatically reduce the space needed to store a large data set.” (Storer, Greenan, Long, & Miller, 2008)
Currently available de-duplication storage systems use file-level or block-level hashing to de-duplicate data. While these are effective at improving storage efficiency, solutions under research promise further gains. Aronovich et al., in the 2009 article, “The Design of a Similarity Based Deduplication System,” extend Bobbarjung, Jagannathan, and Dubnicki’s 2006 research into similarity-based de-duplication with an analysis of a real-world implementation. (Aronovich, et al., 2009; Bobbarjung, Jagannathan, & Dubnicki, 2006) Similarity-based de-duplication uses variable-length bit masks to find more granular chunks to de-duplicate than block- and file-based approaches. Aronovich et al. report a 40:1 reduction in data size using the similarity-based technology. (Aronovich, et al., 2009)
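The single-instance-storage idea behind block-level hashing can be illustrated with a toy example. The sketch below uses fixed-size chunks keyed by SHA-256 digests; real systems, including the similarity-based approach Aronovich et al. describe, use far more sophisticated chunk selection, and the class and method names here are hypothetical:

```python
import hashlib

class DedupStore:
    """Toy block-level de-duplication: fixed-size chunks keyed by SHA-256."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}   # digest -> chunk bytes, each stored exactly once
        self.files = {}    # filename -> ordered list of chunk digests

    def put(self, name, data):
        digests = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # skip if already stored
            digests.append(digest)
        self.files[name] = digests

    def get(self, name):
        """Reassemble a file from its chunk references."""
        return b"".join(self.chunks[d] for d in self.files[name])

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())
```

Storing two identical files in such a store consumes the chunk space only once; each file keeps only a list of digest references, which is where the space reduction comes from.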
Another technique for reducing the on-disk footprint of stored data is compression. Kothiyal et al. analyzed the effectiveness of data compression in the context of energy and performance impact, but were not able to draw general conclusions. (Kothiyal, Tarasov, Sehgal, & Zadok, 2009) Lee et al. compared various data mining algorithms that can be used for database compression. (Lee, Changchien, Wang, & Shen, 2006) Lee et al. found they could achieve significant compression ratios, up to 50%, with their proposed algorithms. (Lee, et al., 2006) While not as dramatic as the results from similarity-based de-duplication, compression does reduce the physical footprint of stored data.
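The basic effect is easy to demonstrate with a general-purpose compressor such as zlib. This example is illustrative only and is unrelated to the data mining algorithms Lee et al. propose; the sample data is deliberately redundant, which is the favorable case for compression:

```python
import zlib

# Repetitive, structured data (like many database tables) compresses well;
# already-compressed or random data would not.
data = b"row,value\n" * 10_000

compressed = zlib.compress(data, level=9)
ratio = len(compressed) / len(data)
print(f"{len(data)} -> {len(compressed)} bytes ({ratio:.1%} of original)")

# Lossless compression must round-trip exactly.
assert zlib.decompress(compressed) == data
```

The trade-off, as Kothiyal et al.’s evaluation suggests, is that the CPU time and energy spent compressing and decompressing must be weighed against the storage saved.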
Grid Storage
Grid storage solutions seek to solve data management challenges by leveraging lessons learned from the implementation and operation of large commodity compute clusters. As Deng and Wang describe, the key attributes of grid storage are “scalability, heterogeneity, and interoperability.” (Deng & Wang, 2007) Huang et al.’s 2005 article, “Data Grid for Large-Scale Medical Image Archive and Analysis,” presents a case study implementation of a data grid for storing and processing medical imagery. (Huang, et al., 2005) The case study described by Huang et al. represents a successful implementation of grid storage that earned additional benefits for the company involved, because storage grid resources were also leveraged as compute resources, thereby increasing ROI. (Huang, et al., 2005)
Storage grids, cloud storage, and utility storage all represent a common approach to provisioning and management of storage resources. Eyers et al report on a proposed middleware architecture to enable storage consumers to utilize multiple cloud storage vendors in a uniform way. (Eyers, Routray, Zhang, Willcocks, & Pietzuch, 2009)
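The uniform-interface idea behind such middleware can be sketched as a common storage abstraction with interchangeable backends. All class and method names below are hypothetical illustrations, not the architecture Eyers et al. propose:

```python
from abc import ABC, abstractmethod

class StorageBackend(ABC):
    """Uniform put/get interface a middleware layer might expose to consumers."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

class InMemoryBackend(StorageBackend):
    """Stand-in for one vendor's storage service."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]

class MultiCloudRouter(StorageBackend):
    """Routes each key to one of several vendor backends, hiding the
    vendor choice behind the same common interface."""

    def __init__(self, backends):
        self.backends = backends

    def _pick(self, key):
        # Deterministic placement so reads find what writes stored.
        return self.backends[hash(key) % len(self.backends)]

    def put(self, key, data):
        self._pick(key).put(key, data)

    def get(self, key):
        return self._pick(key).get(key)
```

Because the router itself implements the same interface, a storage consumer written against `StorageBackend` is unaware of how many vendors sit behind it, which is the interoperability property Deng and Wang emphasize.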
Exponential storage growth, increasing storage costs, and static or shrinking Information Technology (IT) budgets combine to present a daunting challenge to any business leader responsible for data storage. Ongoing research in the areas of hierarchical storage management, data de-duplication, and grid storage offers technological means to improve storage efficiency and enable successful management of information. IT leaders facing these information management challenges should investigate the aforementioned technologies as a means to improve storage efficiency.
References
Aronovich, L., Asher, R., Bachmat, E., Bitner, H., Hirsch, M., & Klein, S. T. (2009). The design of a similarity based deduplication system. Paper presented at the Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference.
Bobbarjung, D. R., Jagannathan, S., & Dubnicki, C. (2006). Improving duplicate elimination in storage systems. Trans. Storage, 2(4), 424-448.
D’Imperio, M. E. (1969). Information structures: tools in problem solving. SIGMOD Rec., 1(2), 25-51.
Deng, Y., & Wang, F. (2007). Opportunities and challenges of storage grid enabled by grid service. SIGOPS Oper. Syst. Rev., 41(4), 79-82.
Eyers, D. M., Routray, R., Zhang, R., Willcocks, D., & Pietzuch, P. (2009). Towards a middleware for configuring large-scale storage infrastructures. Paper presented at the Proceedings of the 7th International Workshop on Middleware for Grids, Clouds and e-Science.
Faibish, S., Fridella, S., Bixby, P., & Gupta, U. (2008). Storage virtualization using a block-device file system. SIGOPS Oper. Syst. Rev., 42(1), 119-126.
Gupta, P., & Pegah, M. (2007). A new thought paradigm: delivering cost effective and ubiquitously accessible storage with enterprise backup system via a multi-tiered storage framework. Paper presented at the Proceedings of the 35th annual ACM SIGUCCS conference on User services.
Hasan, R., Yurcik, W., & Myagmar, S. (2005). The evolution of storage service providers: techniques and challenges to outsourcing storage. Paper presented at the Proceedings of the 2005 ACM workshop on Storage security and survivability.
Huang, H. K., Zhang, A., Liu, B., Zhou, Z., Documet, J., King, N., et al. (2005). Data grid for large-scale medical image archive and analysis. Paper presented at the Proceedings of the 13th annual ACM international conference on Multimedia.
Kaczmarski, M., Jiang, T., & Pease, D. A. (2003). Beyond backup toward storage management. IBM Systems Journal, 42(2), 1.
Kothiyal, R., Tarasov, V., Sehgal, P., & Zadok, E. (2009). Energy and performance evaluation of lossless file data compression on server systems. Paper presented at the Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference.
Kramer, W. T. C., Shoshani, A., Agarwal, D. A., Draney, B. R., Jin, G., Butler, G. F., et al. (2004). Deep scientific computing requires deep data. IBM Journal of Research & Development, 48(2), 209-232.
Lee, C.-F., Changchien, S. W., Wang, W.-T., & Shen, J.-J. (2006). A data mining approach to database compression. Information Systems Frontiers, 8(3), 14.
Lyman, P., Varian, H. R., Dunn, J., Strygin, A., & Swearingen, K. (2000). How Much Information? 2000. Berkeley, CA: University of California at Berkeley.
Lyman, P., Varian, H. R., Good, C., Good, N., Jordan, L. L., & Pal, J. (2003). How Much Information? 2003. Berkeley, CA: University of California at Berkeley.
Quinlan, S., & Dorward, S. (2002). Venti: a new approach to archival storage. Paper presented at the Proceedings of the 1st USENIX Conference on File and Storage Technologies.
Roussos, K. (2007). Storage Virtualization Gets Smart. Queue, 5(6), 38-44.
Storer, M. W., Greenan, K., Long, D. D. E., & Miller, E. L. (2008). Secure data deduplication. Paper presented at the Proceedings of the 4th ACM international workshop on Storage security and survivability.
Straub, J. (2004). The Digital Tsunami: A Perspective on Data Storage. Information Management Journal, 38(1), 42-50.