Every day, 2.5 quintillion bytes of data are created. In fact, 90% of the data in the world today has been created in the last two years. This data comes from everywhere: sensors used to gather climate information, digital photos, posts to social media, videos posted online, transaction records of online purchases, and cell phone GPS signals, to name a few. This data is “Big Data”.
Big Data spans three dimensions:
- Variety – Big Data extends beyond structured data to include unstructured data of all varieties: text, audio, video, click streams, log files, and more.
- Velocity – Often time-sensitive, Big Data must be used as it streams into an enterprise in order to maximize its value to the business.
- Volume – Big Data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information.
When Big Data is stored, it aggregates a lot of redundant data. Data de-duplication is an effective way to get rid of redundant data. A de-duplication system identifies and eliminates duplicate blocks of data and hence significantly reduces physical storage requirements.
Calsoft has helped ISVs develop data de-duplication solutions that protect a wide range of environments, from small distributed offices to the largest enterprise data centers.
Types of De-duplication:
- File-level de-duplication – Commonly referred to as Single-Instance Storage (SIS), file-level data de-duplication compares a file to be backed up or archived with those already stored by checking its attributes against an index. If the file is unique, it is stored and the index is updated; if not, only a pointer to the existing file is stored. The result is that only one instance of a file is saved, and subsequent copies are replaced with a “stub” that points to the original file.
- Block-level de-duplication – Block-level data de-duplication operates at the sub-file level. As its name implies, the file is broken down into segments (chunks or blocks) that are examined for redundancy against previously stored information.
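The two approaches above can be sketched in a few lines of Python. This is a minimal, illustrative model (not any particular product's implementation): file-level de-duplication hashes whole files, while block-level de-duplication hashes fixed-size chunks and keeps a per-file list of chunk hashes for reassembly. All names here (`file_level_dedup`, `block_level_dedup`, the sample files) are hypothetical.

```python
import hashlib

def file_level_dedup(files):
    """Store each unique file once; duplicate files become pointers (stubs)."""
    store, index = {}, {}              # hash -> content, filename -> hash
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        if digest not in store:        # unique file: store it and update the index
            store[digest] = content
        index[name] = digest           # otherwise only a pointer is recorded
    return store, index

def block_level_dedup(files, block_size=4):
    """Split each file into fixed-size blocks and store each unique block once."""
    store, index = {}, {}              # hash -> block, filename -> list of hashes
    for name, content in files.items():
        hashes = []
        for i in range(0, len(content), block_size):
            block = content[i:i + block_size]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in store:
                store[digest] = block
            hashes.append(digest)
        index[name] = hashes           # the "master index" used for reassembly
    return store, index

files = {
    "a.txt": b"AAAABBBBCCCC",
    "b.txt": b"AAAABBBBDDDD",          # shares two of three blocks with a.txt
}
f_store, f_index = file_level_dedup(files)
b_store, b_index = block_level_dedup(files)
print(len(f_store))  # 2: the files differ, so file-level stores both in full
print(len(b_store))  # 4: only the one changed block is stored a second time
```

Reassembling a file from the block store is a matter of joining its blocks in index order, e.g. `b"".join(b_store[h] for h in b_index["a.txt"])` — which is exactly the "reassembly" cost that block-level de-duplication trades for its finer-grained savings.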
Comparison: File-Level & Block-Level Data De-duplication
| Sr. No. | File-Level De-duplication | Block-Level De-duplication |
|---|---|---|
| 1. | Saves the entire file a second time if any part of it changes | Saves only the blocks that changed between one version of the file and the next |
| 2. | Indexes are significantly smaller, so determining duplicates takes less computational time | Indexes are larger, so determining duplicates takes more computational time |
| 3. | Backup performance is less affected by the de-duplication process | Backup performance is significantly affected by the de-duplication process |
| 4. | Requires less processing power due to the smaller index and fewer comparisons | Requires more processing power due to the larger index and greater number of comparisons |
| 5. | Stores unique files and pointers to existing unique files, so there is less to reassemble | Requires “reassembly” of the chunks based on the master index |
Target- or Source-Based Data De-duplication:
- Target-Based Data De-duplication – Target-based de-duplication acts on the target data storage media. In this case, the backup client and server are unmodified and unaware of any de-duplication. The de-duplication engine can be embedded in the hardware array, which can then be used as a NAS/SAN device with de-duplication capabilities. Alternatively, it can be offered as an independent software or hardware appliance that acts as an intermediary between the backup server and the storage arrays.
- Source-Based Data De-duplication – Source-based de-duplication acts on the data at the source, before it is moved. A de-duplication-aware backup agent is installed on the server, and it backs up only unique data. The result is improved bandwidth and storage utilization, but this imposes an additional computational load on the backup client.
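The source-based flow described above can be sketched as a simple exchange of hashes before any data moves. In this hypothetical Python sketch (`BackupServer` and `source_side_backup` are illustrative names, not a real API), the agent hashes blocks locally, asks the target which hashes it already holds, and transmits only the missing blocks — which is where the bandwidth savings come from.

```python
import hashlib

class BackupServer:
    """Minimal stand-in for a de-duplicating backup target."""
    def __init__(self):
        self.blocks = {}                   # hash -> block

    def known(self, digests):
        """Return the subset of digests already stored on the target."""
        return {d for d in digests if d in self.blocks}

    def put(self, digest, block):
        self.blocks[digest] = block

def source_side_backup(server, data, block_size=4):
    """De-duplication-aware agent: hash locally, send only unique blocks."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    digests = [hashlib.sha256(b).hexdigest() for b in blocks]
    have = server.known(digests)           # one round-trip of hashes, not data
    sent = 0
    for d, b in zip(digests, blocks):
        if d not in have and d not in server.blocks:
            server.put(d, b)               # only previously unseen blocks travel
            sent += 1
    return digests, sent                   # recipe for restore + blocks transmitted

server = BackupServer()
_, sent1 = source_side_backup(server, b"AAAABBBBCCCC")
_, sent2 = source_side_backup(server, b"AAAABBBBDDDD")
print(sent1, sent2)  # 3 1: the second backup ships only its one new block
```

The extra hashing on the client is the "additional computational load" the text mentions; the payoff is that the second backup here moves one block instead of three.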