SEPATON CTO Miklos Sandorfi Outlines Key Considerations for Meeting the Deduplication Needs of the Enterprise

Business Wire, Nov 6, 2007

MARLBOROUGH, Mass. -- The volume of data generated by most companies has grown at such an explosive rate that many data centers are running out of space, power, cooling, and storage capacity. Issues of insufficient capacity are being compounded by increasingly stringent regulatory requirements and business initiatives demanding higher service levels, longer online retention times, and higher levels of data protection. Data deduplication technology is rapidly emerging as an effective solution to significantly offset data growth and meet regulatory and business requirements.

SEPATON, Inc.'s Chief Technology Officer, Miklos Sandorfi, cautions enterprises that data deduplication approaches vary and outlines key considerations for choosing the approach that best meets the needs of large enterprises.

Know the Basic Approaches of Data Deduplication. There are two basic categories of data deduplication technology: hash-based deduplication and byte-level comparison deduplication. The hash-based approach runs incoming data through a hashing algorithm to create a hash, a small representation of the data that is treated as a unique identifier for that piece of data. It then compares the hash to previous hashes stored in a lookup table. If a match is found, the duplicate data is replaced with a pointer to the existing data. If a match is not found, the data is stored and its hash is added to the lookup table.
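The hash-based flow described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the class name `HashDedupStore`, the fixed 4096-byte chunk size, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib


class HashDedupStore:
    """Toy hash-based deduplicating store (hypothetical, for illustration)."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size   # assumed fixed-size chunking
        self.chunks = {}               # lookup table: hash -> stored chunk

    def write(self, data: bytes) -> list:
        """Store data and return the list of hashes that act as pointers."""
        pointers = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:
                # No match in the lookup table: keep the new data.
                self.chunks[digest] = chunk
            # Match or not, the backup only records a pointer.
            pointers.append(digest)
        return pointers

    def read(self, pointers) -> bytes:
        """Rebuild the original data by following the pointers."""
        return b"".join(self.chunks[p] for p in pointers)
```

Writing the same data twice adds pointers but no new chunks. Note that this sketch trusts the hash alone to decide that two chunks are identical, which is the property the byte-level approach avoids relying on.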

An alternative approach uses byte-level comparison technology. Here, pattern matching is used to find duplicate data; because actual data comparisons are made, there is no data integrity risk. Some solutions take this a step further by using built-in intelligence about the actual file content to compare data as objects (e.g., Word document to Word document or Oracle database to Oracle database) and identify potential redundancies. Unlike other technologies that use the first instance of a file as the reference copy, enterprise-class implementations use the most recent copy and replace older duplicate data with pointers. As a result, this technology eliminates the need to reconstitute new data from multiple reference points, enabling instantaneous data restoration.
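As a rough sketch of the idea, the Python below keeps the most recent backup whole and rewrites an older backup as a "recipe" of pointers into it, falling back to raw bytes where the two differ. The function names, the recipe format, and the aligned fixed-size block comparison are assumptions for illustration; real products use far more capable pattern matching that is not described here.

```python
def dedupe_older_against_latest(older: bytes, latest: bytes, block: int = 4):
    """Describe `older` in terms of `latest` using direct byte comparison.

    Returns a list of ("ptr", offset, length) or ("raw", bytes) entries.
    Only blocks at the same offset are compared (a simplifying assumption).
    """
    recipe = []
    for off in range(0, len(older), block):
        piece = older[off:off + block]
        if latest[off:off + block] == piece:      # actual bytes compared
            recipe.append(("ptr", off, len(piece)))
        else:
            recipe.append(("raw", piece))
    return recipe


def rebuild(recipe, latest: bytes) -> bytes:
    """Reconstruct the older backup from its recipe and the latest copy."""
    out = bytearray()
    for entry in recipe:
        if entry[0] == "ptr":
            _, off, length = entry
            out += latest[off:off + length]
        else:
            out += entry[1]
    return bytes(out)
```

Because the latest backup is stored intact, reading it back needs no pointer resolution; only older backups pay the reconstruction cost.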

Distinguish between Inline and Post-Processing. A key distinction between deduplication technologies is whether the deduplication process is done inline as part of the backup process or as a post-process. Deduplication performed inline requires slightly less capacity and is adequate for relatively small backup requirements. However, this method has a significant negative impact on performance and cannot complete the large backups required by enterprise organizations within typical backup windows. The alternative, post-process method completes backups at full, unimpeded performance. The deduplication process is started as soon as the backup process begins and continues in parallel with the backup in a fully integrated operation. The main benefit of this post-process method is that it can handle much larger backup volumes within a typical eight-hour backup window. In addition, because it backs up a full set of data, the post-process method enables more rigorous data integrity checking.

Choose a Solution that can Back Up and Restore Petabytes of Data. A primary consideration in choosing a backup technology for an enterprise or large enterprise is the solution's ability to handle terabytes or petabytes of data while staying within the backup window. The objective is to avoid creating dozens of separately managed "silos" of storage.

Ensure High Performance Over Time. Many solutions see a marked degradation in performance over time as data becomes more fragmented across the disk and as the database that tracks duplicate data grows. Choose a solution that delivers consistent performance no matter how long it has been in service.

Set Realistic Expectations for Capacity Reduction. Deduplication approaches and results vary widely among solutions, as does the time required to achieve maximum deduplication. The effectiveness of deduplication technology also depends heavily on the specific backup policies, the application, and the mix of data types being backed up.

Check Restore Performance. Backing up data quickly is only half the challenge. To be successful, data needs to be restored quickly and efficiently. In fact, one of the key drivers for adopting deduplication technology is the ability to keep data on disk longer in order to simplify restores and shorten restore times. Before adopting a new deduplication technology, be sure to test restore times and efficiency. Most restore requests are for data that is less than two weeks old. Solutions that use the first backup as the reference copy must recreate the most recent backup from weeks or months of pointers. In contrast, solutions that use the most recent backup as the reference copy can restore that data nearly instantaneously.
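The difference in restore effort can be illustrated with a small Python model. It is a simplification under stated assumptions: each backup is modeled as a dictionary of block number to contents, and the helper names `make_delta`, `apply_delta`, and `restore` are hypothetical. The only point is to count how many deltas must be applied to reach the most recent backup under each layout.

```python
def make_delta(src: dict, dst: dict) -> dict:
    """Blocks that must change to turn src into dst (None marks a deletion)."""
    delta = {k: v for k, v in dst.items() if src.get(k) != v}
    delta.update({k: None for k in src if k not in dst})
    return delta


def apply_delta(base: dict, delta: dict) -> dict:
    """Return a new backup state with the delta applied to base."""
    out = dict(base)
    for k, v in delta.items():
        if v is None:
            out.pop(k, None)
        else:
            out[k] = v
    return out


def restore(full_copy: dict, deltas: list, hops: int):
    """Start from the stored full copy and apply `hops` deltas in order."""
    state = dict(full_copy)
    for d in deltas[:hops]:
        state = apply_delta(state, d)
    return state, hops


# Three nightly backups of a tiny data set.
backups = [
    {0: b"a", 1: b"b"},
    {0: b"a", 1: b"B", 2: b"c"},
    {0: b"A", 1: b"B", 2: b"c"},
]

# First backup as the reference copy: deltas point forward in time.
forward = [make_delta(backups[i], backups[i + 1]) for i in range(2)]
latest_fwd, fwd_hops = restore(backups[0], forward, hops=2)

# Most recent backup as the reference copy: deltas point backward in time.
reverse = [make_delta(backups[i], backups[i - 1]) for i in (2, 1)]
latest_rev, rev_hops = restore(backups[2], reverse, hops=0)
```

Both layouts return the same latest backup, but the forward layout applies one delta per retained generation while the reverse layout applies none. The trade-off is that older backups become the ones that need reconstruction.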

Ensure Data Integrity. Enterprise deduplication requires guaranteed data integrity, and some deduplication algorithms can introduce data integrity issues. Look for solutions that guarantee data integrity. Enterprise-class solutions perform a data integrity check that compares the deduplicated data to the original data set at the byte level before any duplicate data is deleted or disk space is redeployed. This comparison needs to ensure that when deduplicated data is reconstructed, it is byte-for-byte identical to the original backup.
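A byte-level check of this kind might look like the following Python sketch. The function name `first_mismatch` and the idea of feeding it two streams of chunks are assumptions for illustration; the article does not describe how any particular product performs the comparison.

```python
from itertools import zip_longest


def first_mismatch(original_chunks, rebuilt_chunks):
    """Compare two chunked byte streams byte for byte.

    Returns the offset of the first differing byte, or None when the
    streams are identical. Chunk boundaries need not line up.
    """
    def bytes_of(chunks):
        for chunk in chunks:
            yield from chunk          # iterating bytes yields integers

    pairs = zip_longest(bytes_of(original_chunks), bytes_of(rebuilt_chunks))
    for offset, (a, b) in enumerate(pairs):
        if a != b:                    # also catches a stream ending early
            return offset
    return None


def safe_to_reclaim(original_chunks, rebuilt_chunks) -> bool:
    """Only allow duplicate data to be deleted after a clean comparison."""
    return first_mismatch(original_chunks, rebuilt_chunks) is None
```

The key design point is ordering: space is reclaimed only after the reconstructed data has been shown to match the original, so a failed check leaves the original backup untouched.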


 

Content provided in partnership with Thompson Gale