How cloud storage could catch up with big data

Erasure coding is emerging as a cost-effective way to safely maintain vast amounts of data in the cloud.

Cloud computing has managed to make the world’s already colossal appetite for data storage even more voracious.

Last year, IDC, an IT market research firm, cited public cloud-based service providers, from Amazon Web Services to YouTube, as the most significant drivers of storage consumption in the past three years. The government sector contributes as well: IDC noted that the private clouds of government and research sites compare in scope and complexity to their public cloud counterparts.

The so-called big data problem has surfaced in the past two years to rank among the primary IT challenges. Technologies such as the Apache Hadoop distributed computing framework and NoSQL databases have emerged to take on the challenge of very large — and unwieldy — datasets.

And now another technology, already at work behind the scenes, could grow in importance in the coming years. Erasure coding has been around since the 1980s, but until recently its use in storage circles has mainly been confined to single storage boxes as a way to boost reliability more efficiently.

Now erasure coding is moving into distributed storage. Its application becomes trickier here, but industry executives and storage researchers believe erasure coding — particularly in conjunction with increasingly popular techniques such as object-based storage — will play a growing role in cloud storage. Potential government adopters include Energy Department labs and other agencies with vast data stores.

Why it matters

When it comes to storage, everything is getting bigger, whether it’s an individual disk, a storage system or a cloud-based repository. Erasure coding, an error-correcting algorithm, plays a role across this range of ever-growing storage platforms.

Vendors most commonly use erasure coding to boost the resiliency and performance of their Redundant Array of Independent Disks (RAID) storage systems, said Bob Monahan, director of management information systems at DRC, a consulting and IT services firm.

But it’s the use of erasure coding as an alternative to data replication that is attracting new interest in this storage mechanism. In many traditional cases, redundancy is achieved by replicating data from primary storage devices to target arrays at the data center or an off-site location. Mirroring data in that way provides protection but also consumes lots of storage, particularly when organizations make multiple copies of data for greater redundancy. The approach becomes particularly unwieldy for organizations that deal with petabytes or more of data.

Erasure coding offers an alternative way to achieve redundancy while using less storage space, said Russ Kennedy, vice president of product strategy, marketing and customer solutions at storage vendor Cleversafe, which uses erasure codes in its object-based storage solutions.

Organizations that rely on replication might make three or four copies of data: one copy at another location, then a copy of the copy to be safe, and so on. In comparison, the overhead to make a sufficiently fault-tolerant copy with erasure coding is less than double the size of the original volume, Kennedy said.
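The arithmetic behind that comparison can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the 10-of-16 layout (data cut into 16 slices, any 10 of which suffice to rebuild it) is an assumed example configuration, not a figure Kennedy gave for the overhead.

```python
def replication_overhead(copies):
    """Raw storage consumed per byte of data when keeping `copies` full copies."""
    return float(copies)

def erasure_overhead(k, n):
    """Raw storage per byte of data when it is cut into n slices,
    any k of which are enough to rebuild it (each slice is 1/k of the data)."""
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    return n / k

# Assumed example: three full copies versus a 10-of-16 erasure code.
copies, k, n = 3, 10, 16
print("replication: %.1fx raw storage, survives %d lost copies"
      % (replication_overhead(copies), copies - 1))
print("erasure code: %.1fx raw storage, survives %d lost slices"
      % (erasure_overhead(k, n), n - k))
```

Under those assumptions, the erasure-coded layout uses 1.6 times the original volume and tolerates six lost slices, while three full copies use three times the volume and tolerate two lost copies.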

Jean-Luc Chatelain, executive vice president of strategy and technology at DataDirect Networks, said financial concerns are driving interest in erasure coding among customers who don’t want to replicate data two or three times. DataDirect takes advantage of erasure coding in its RAID system, file storage offerings and Web Object Scaler product for cloud storage.

The prospect of saving space and money hasn’t been lost on the cloud community. The major providers are on their way to adopting erasure coding, said James Plank, a professor in the Department of Electrical Engineering and Computer Science at the University of Tennessee. His research focuses on erasure codes in storage applications.

“Pretty much every cloud installation you can think of is either using erasure coding or converting to erasure coding,” he said, citing Amazon, Google and Microsoft as examples. “They are using erasure coding for fault tolerance because the disk space savings is huge.”

There’s a bandwidth benefit as well. “While the big savings today would come from reduced capacity requirements, the big win, from my standpoint, is the two- or threefold reduction in bandwidth [compared to what is] used during replication,” said Galen Shipman, group leader of the Technology Integration group at Oak Ridge National Laboratory’s National Center for Computational Sciences.

The fundamentals

Erasure coding might have implications for the nascent cloud, but the technology has been around the storage block a few times. In a storage setting, the technique encodes data into fragments from which the original data can be reconstructed.

For example, erasure coding is the underlying technology of Cleversafe’s dispersed storage method, which takes a data object (think of a file with self-describing metadata) and chunks it into segments. Each segment is encrypted, cut into 16 slices and dispersed across an organization’s network so that the slices reside on different hard drives and servers. If only 10 of the slices are accessible (because of disk failures, for instance), the original data can still be put back together, Kennedy said.
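To make the "any 10 of 16" idea concrete, here is a minimal Python sketch of one textbook way to get that property: treat the data bytes as points on a polynomial and store extra points on the same curve. It is not Cleversafe's implementation, it omits the encryption step, and the names `encode`, `decode` and `interpolate_at` are hypothetical. It works over the integers modulo 257 for readability; production systems commonly use a different finite field.

```python
P = 257  # smallest prime above 255, so every byte value fits in the field

def interpolate_at(points, x):
    """Evaluate at x the unique polynomial of degree < len(points)
    that passes through the given (x, y) points, working modulo P."""
    total = 0
    for xi, yi in points:
        num, den = 1, 1
        for xj, _ in points:
            if xj != xi:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+)
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def encode(data, n):
    """Turn k data bytes into n slices (x, value). The first k slices are the
    data itself; the rest are extra points on the same polynomial.
    Parity values may equal 256, so slices are kept as ints, not bytes."""
    k = len(data)
    if not 0 < k <= n <= P:
        raise ValueError("need 0 < k <= n <= 257")
    points = list(enumerate(data))
    return [(x, data[x] if x < k else interpolate_at(points, x))
            for x in range(n)]

def decode(slices, k):
    """Rebuild the k original bytes from any k surviving slices."""
    if len(slices) < k:
        raise ValueError("too few slices to reconstruct the data")
    points = slices[:k]
    return bytes(interpolate_at(points, x) for x in range(k))

data = b"cloud data"            # 10 bytes, so k = 10
slices = encode(data, 16)       # 16 slices in total
survivors = slices[6:]          # pretend the first six slices were lost
print(decode(survivors, 10) == data)
```

Each byte position is handled independently here, so a real segment would simply repeat the same step across its length. The point of the sketch is that losing any six slices leaves enough information to recover the original.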

Numerous experts see erasure coding paired with object-based storage as a good option for achieving more fault-tolerant repositories with petabytes and even exabytes of capacity.

The hurdles

Government clouds and data centers have yet to jump on erasure coding, apart from agencies using RAID storage devices that embed the technique.

“It is less well understood and therefore less mature in commercially available solutions,” Monahan said. “As it becomes more mature, the use cases for when it is more appropriate will drive implementation scenarios.”

Performance is another limitation. Shank Shome, a storage engineer at Agilex Technologies, said the impact of erasure coding on storage performance has yet to be fully explored. He added that reading the data back from an erasure-coded system is generally fast, but the real performance cost lies in writing the data to storage.

“If the data is generally static with very few rewrites, such as media files and archive logs, creating and distributing the data is a one-time cost,” Shome said. “If the data is very dynamic, the erasure codes have to be re-created and the resulting data blocks redistributed.”
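The write cost Shome describes can be seen even in the simplest erasure code, a single XOR parity block of the kind used in parity-based RAID. The Python sketch below is an assumed illustration, not a description of any vendor's product; the function names are hypothetical.

```python
from functools import reduce

def xor_blocks(a, b):
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def parity_of(blocks):
    """Parity block: the XOR of all data blocks."""
    return reduce(xor_blocks, blocks)

def update_block(blocks, parity, index, new_block):
    """Overwrite one data block in place and return the new parity.
    The old block has to be read back and the parity rewritten,
    so one logical write costs two reads and two writes."""
    old_block = blocks[index]
    blocks[index] = new_block
    return xor_blocks(xor_blocks(parity, old_block), new_block)

blocks = [b"aaaa", b"bbbb", b"cccc"]
parity = parity_of(blocks)

# Static data: the parity is computed once and never touched again.
# Dynamic data: every rewrite also forces a parity update.
parity = update_block(blocks, parity, 1, b"zzzz")

# The parity still allows any single lost block to be rebuilt.
rebuilt = parity_of([parity, blocks[0], blocks[2]])
print(rebuilt == blocks[1])
```

In codes with several parity slices, a change to one data block generally has to be propagated to each of those slices, which is why frequently rewritten data is the harder case.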

Erasure coding also runs into problems with high-performance computing. One complication arises when data is being written simultaneously from many sources and at a high rate, said Robert Ross, a computer scientist at DOE’s Argonne National Laboratory and a senior fellow at the University of Chicago’s Computation Institute. That activity requires a level of coordination that isn’t easy with current approaches.

In general, storage experts believe erasure coding faces the biggest obstacle in frequently accessed “hot data.” Accordingly, they believe a key initial use case lies in protecting data that has cooled enough to move to long-term storage.

Monahan said the benefits of erasure coding are “higher local availability at a lower cost and highly available dispersed archival systems that are an order of magnitude less expensive than traditional systems.”

The trick is knowing when to use replication to get data out of a system quickly and when to use erasure coding to create more economical, resilient long-term storage, Ross said.

“Both have important roles moving forward in high-performance computing,” he added.

The Oak Ridge lab is now exploring the use of erasure coding for the Oak Ridge Leadership Computing Facility. That facility already uses RAID 6 systems from DataDirect Networks. Shipman said erasure coding could play a significant role in two distributed storage systems: a Lustre parallel distributed file system and the large-scale archival High Performance Storage System, which uses replication for data integrity and resiliency.

“Erasure coding will likely emerge as a viable alternative to replication due to savings in the media and bandwidth consumed for replication,” Shipman said.

He acknowledged the computational demands of the more advanced erasure-coding techniques but said ongoing research on algorithms aims to minimize that cost.

Next steps: Updating the storage toolbox

As data storage needs continue to grow and cloud-based models introduce new options for distributed systems, agencies should constantly re-evaluate their storage strategies. Specifically, they should:

  • Monitor current storage options. Erasure coding might not be at the top of your agenda today, but if your storage growth is outpacing your budget, it probably makes sense to add the technology into the mix of current or near-term future options.
  • Assess likely use cases. Beyond data archiving, erasure coding could prove useful for maintaining and protecting large quantities of sensor-derived data. For example, Cleversafe recently signed GeoEye, a provider of high-resolution satellite imagery, as a customer.