Data De-duplication

Every day, 2.5 quintillion bytes of data are created. In fact, 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, digital photos, posts to social media, videos posted online, transaction records of online purchases, and cell phone GPS signals, to name a few. This data is “Big Data”.

Big Data spans three dimensions:
  • Variety – Big Data extends beyond structured data, including unstructured data of all varieties: text, audio, video, click streams, log files, and more.
  • Velocity – Often time-sensitive, Big Data must be used as it is streaming into an enterprise in order to maximize its value to the business.
  • Volume – Big Data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information.

As Big Data, in all of these dimensions, is stored, it accumulates a lot of redundant data. Data de-duplication is an effective way to get rid of this redundancy. A de-duplication system identifies and eliminates duplicate blocks of data and hence significantly reduces physical storage requirements.


Figure: Illustration of how a typical de-duplication system functions


Calsoft has helped ISVs develop data de-duplication solutions that protect a full range of environments, from small distributed offices to the largest enterprise data centers.

Types of De-duplication:
  • File-level de-duplication – Commonly referred to as Single-Instance Storage (SIS), file-level data de-duplication compares a file to be backed up or archived with those already stored by checking its attributes against an index. If the file is unique, it is stored and the index is updated; if not, only a pointer to the existing file is stored. The result is that only one instance of a file is saved, and subsequent copies are replaced with a “stub” that points to the original file.
  • Block-level de-duplication – Block-level data de-duplication operates at the sub-file level. As its name implies, the file is typically broken down into segments (chunks or blocks) that are examined for redundancy against previously stored information.
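
The two approaches above can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular product: the class names `FileLevelStore` and `BlockLevelStore` are hypothetical, fixed-size blocks are assumed, and a SHA-256 content fingerprint stands in for the "attributes" that a real system would check against its index.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content fingerprint; SHA-256 is an assumption of this sketch."""
    return hashlib.sha256(data).hexdigest()


class FileLevelStore:
    """Single-instance storage: one stored copy per unique file content."""

    def __init__(self):
        self.index = {}  # fingerprint -> file content, stored once
        self.stubs = {}  # file name -> fingerprint (the "stub" pointer)

    def put(self, name: str, content: bytes) -> bool:
        """Store a file; return True only if new content was written."""
        key = digest(content)
        is_new = key not in self.index
        if is_new:
            self.index[key] = content
        self.stubs[name] = key
        return is_new

    def get(self, name: str) -> bytes:
        return self.index[self.stubs[name]]


class BlockLevelStore:
    """Splits files into fixed-size blocks and stores each unique block once."""

    def __init__(self, block_size: int = 4):
        self.block_size = block_size
        self.blocks = {}   # fingerprint -> block bytes
        self.recipes = {}  # file name -> ordered list of block fingerprints

    def put(self, name: str, content: bytes) -> int:
        """Store a file; return the number of new blocks written."""
        new_blocks = 0
        recipe = []
        for i in range(0, len(content), self.block_size):
            block = content[i:i + self.block_size]
            key = digest(block)
            if key not in self.blocks:
                self.blocks[key] = block
                new_blocks += 1
            recipe.append(key)
        self.recipes[name] = recipe
        return new_blocks

    def get(self, name: str) -> bytes:
        """Reassemble the file from its blocks, in recipe order."""
        return b"".join(self.blocks[k] for k in self.recipes[name])


files = FileLevelStore()
print(files.put("a.txt", b"AAAABBBBCCCC"))     # True: first copy is stored
print(files.put("copy.txt", b"AAAABBBBCCCC"))  # False: only a stub is added

blocks = BlockLevelStore(block_size=4)
print(blocks.put("v1", b"AAAABBBBCCCC"))  # 3: all three blocks are new
print(blocks.put("v2", b"AAAAXXXXCCCC"))  # 1: only the changed block is new
```

The 4-byte block size is chosen only to keep the example readable; real systems use much larger blocks, and some use variable-size chunking rather than the fixed-size split assumed here.
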

Comparison: File-Level & Block-Level Data De-duplication

| Sr. No. | File-level de-duplication | Block-level de-duplication |
|---|---|---|
| 1. | Saves the entire file a second time | Saves the changed blocks between one version of the file and the next |
| 2. | Indexes are significantly smaller, so determining duplicates takes less computational time | Indexes are larger, so determining duplicates takes more computational time |
| 3. | Backup performance is less affected by the de-duplication process | Backup performance is significantly affected by the de-duplication process |
| 4. | Requires less processing power due to the smaller index and reduced number of comparisons | Requires more processing power due to the larger index and higher number of comparisons |
| 5. | Stores unique files and pointers to existing unique files, so there is less to reassemble | Requires “reassembly” of the chunks based on the master index |
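
The trade-off in rows 1 and 2 can be made concrete with a small calculation. The sketch below is illustrative only: the helper names `file_level` and `block_level`, the 512-byte block size, and the synthetic 4096-byte file are all assumptions. It stores two versions of a file that differ by a single byte and reports how many bytes each approach keeps and how many index entries it needs.

```python
import hashlib


def file_level(versions):
    """Return (bytes stored, index entries) when whole files are the unit."""
    index = set()
    stored = 0
    for content in versions:
        key = hashlib.sha256(content).digest()
        if key not in index:
            index.add(key)
            stored += len(content)
    return stored, len(index)


def block_level(versions, block_size=512):
    """Return (bytes stored, index entries) when fixed-size blocks are the unit."""
    index = set()
    stored = 0
    for content in versions:
        for i in range(0, len(content), block_size):
            block = content[i:i + block_size]
            key = hashlib.sha256(block).digest()
            if key not in index:
                index.add(key)
                stored += len(block)
    return stored, len(index)


# Version 1: 4096 bytes built from 1024 distinct 4-byte words,
# so no two 512-byte blocks within the file are identical.
v1 = b"".join(i.to_bytes(4, "big") for i in range(1024))

# Version 2: the same file with a single byte changed.
v2 = bytearray(v1)
v2[1000] ^= 0xFF
v2 = bytes(v2)

print(file_level([v1, v2]))   # (8192, 2): the whole file is saved twice
print(block_level([v1, v2]))  # (4608, 9): eight original blocks plus one changed block
```

Block-level de-duplication keeps far fewer bytes here, but its index holds nine entries instead of two, which is the larger index and higher comparison count noted in the table.
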

Target- or Source-Based Data De-duplication
  • Target-Based Data De-duplication – Target-based de-duplication acts on the target data storage media. In this case, the client server is unmodified and not aware of any de-duplication. The de-duplication engine can be embedded in the hardware array, which can then be used as a NAS/SAN device with de-duplication capabilities. Alternatively, it can be offered as an independent software or hardware appliance that acts as an intermediary between the backup server and the storage arrays.
  • Source-Based Data De-duplication – In contrast, source-based de-duplication acts on the data at the source, before it is moved. A de-duplication-aware backup agent is installed on the server and backs up only unique data. The result is improved bandwidth and storage utilization, at the cost of additional computational load on the backup client.
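
The difference between the two placements can be sketched as follows. This is a simplified model under stated assumptions, not a real backup protocol: `BackupTarget`, `target_side_backup`, and `source_side_backup` are hypothetical names, blocks are fixed-size, and only payload bytes are counted (the fingerprints exchanged in the source-based case are ignored).

```python
import hashlib

BLOCK_SIZE = 4  # tiny block size, chosen only to keep the example readable


def fingerprint_blocks(data: bytes) -> dict:
    """Split data into fixed-size blocks and map fingerprint -> block."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    return {hashlib.sha256(b).hexdigest(): b for b in blocks}


class BackupTarget:
    """Hypothetical storage target that keeps each unique block once."""

    def __init__(self):
        self.blocks = {}

    def missing(self, fingerprints):
        """Return the fingerprints the target does not hold yet."""
        return {fp for fp in fingerprints if fp not in self.blocks}

    def store(self, new_blocks: dict):
        self.blocks.update(new_blocks)


def target_side_backup(data: bytes, target: BackupTarget) -> int:
    """Client sends everything; the target de-duplicates on arrival."""
    by_fp = fingerprint_blocks(data)  # hashing happens on the target side
    target.store({fp: by_fp[fp] for fp in target.missing(by_fp)})
    return len(data)  # payload bytes that crossed the network


def source_side_backup(data: bytes, target: BackupTarget) -> int:
    """Client hashes locally and sends only blocks the target lacks."""
    by_fp = fingerprint_blocks(data)  # hashing happens on the client
    to_send = {fp: by_fp[fp] for fp in target.missing(by_fp)}
    target.store(to_send)
    return sum(len(b) for b in to_send.values())


t1, t2 = BackupTarget(), BackupTarget()
for version in (b"AAAABBBBCCCC", b"AAAAXXXXCCCC"):
    print(target_side_backup(version, t1), source_side_backup(version, t2))
# prints "12 12" for the first version, then "12 4" for the second
```

Both targets end up holding the same four unique blocks; the source-based variant sends less over the network because the client does the fingerprinting work itself.
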


Contact Calsoft today to solve your de-duplication related challenges.