Deduplication Algorithm

Introduction to Deduplication

According to Wikipedia, “Data deduplication is a specific form of compression where redundant data is eliminated, typically to improve storage utilization. In the deduplication process, duplicate data is deleted, leaving only one copy of the data to be stored. However, indexing of all data is still retained should that data ever be required. Deduplication is able to reduce the required storage capacity since only the unique data is stored.”

Methods of Deduplication:

  • File-level Deduplication
  • Block-level Deduplication

File-level deduplication watches for multiple copies of the same file, stores the first copy, and then just links the other references to the first file. Only one copy gets stored on the disk/tape archive. Ultimately, the space you save on disk relates to how many copies of the file there were in the file system.

Let’s assume a company with 1,000 employees shares a common file, say “data.txt”, which is 10 MB in size. Each employee makes the same changes and saves an identical copy, so 1,000 copies of the file land on the server. The estimated storage required to hold them all is 10 GB.

If all the files are identical, there is no point in storing all of them on the server. Save a single copy and place a pointer in each user’s folder that points to that copy. This is how the data deduplication technique saves terabytes of storage.
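The linking idea above can be sketched in a few lines of C++. The names `storeFile` and `storedCopies` are illustrative, not from any real system, and a production implementation would use a cryptographic hash (e.g. SHA-256) rather than `std::hash`:

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Maps a content hash to the path of the single stored copy.
std::map<std::size_t, std::string> storedCopies;

// Returns the path the caller should link to; the file content is
// stored only if it has not been seen before.
std::string storeFile(const std::string& path, const std::string& content)
{
    std::size_t h = std::hash<std::string>{}(content);
    auto it = storedCopies.find(h);
    if (it != storedCopies.end())
        return it->second;   // duplicate: link to the first copy
    storedCopies[h] = path;  // first copy: actually store it
    return path;
}
```

With this sketch, all 1,000 saves of an identical “data.txt” resolve to the one copy stored on the first save; only a pointer is recorded for the rest.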

Block-level Deduplication, sometimes called variable block-level deduplication, looks at the data block itself to see if another copy of this block already exists. If so, the second (and subsequent) copies are not stored on the disk/tape, but a link/pointer is created to point to the original copy.

Let’s assume we have three users, each having four data blocks. The green and gray blocks are common to all three users and are backed up once in the data center. The blue, red, and purple blocks are common to two users and are likewise backed up once. The important point is that the memory required for each block in data storage is only sizeof(block), no matter how many users reference it.

Block-level deduplication is generally more efficient than file-level deduplication: at the file level, the whole file must be stored again whenever its version changes, whereas at the block level only the changed blocks are stored, which takes comparatively less space.
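A minimal block-level sketch, assuming a tiny fixed `BLOCKSIZE` and in-memory data for illustration (a real implementation hashes each block rather than storing it verbatim, and uses a much larger block size):

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>

const std::size_t BLOCKSIZE = 4;  // tiny block size, for illustration only

// Splits data into fixed-size blocks and returns {totalBlocks, uniqueBlocks}.
std::pair<std::size_t, std::size_t> countBlocks(const std::string& data)
{
    std::set<std::string> unique;  // one entry per distinct block
    std::size_t total = 0;
    for (std::size_t off = 0; off < data.size(); off += BLOCKSIZE)
    {
        unique.insert(data.substr(off, BLOCKSIZE));
        ++total;
    }
    return {total, unique.size()};
}
```

For example, the 16-byte input `"abcdabcdabcdxyz1"` splits into four blocks (`abcd`, `abcd`, `abcd`, `xyz1`), of which only two are unique, so only two blocks need to be stored.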

Implementation of the Deduplication Algorithm

1. Start
2. Declare variables
3. Initialize variables
4. Read BLOCKSIZE bytes (e.g., 1024) from the file in one iteration
5. Repeat until EOF is reached
   5.1 Generate a hash value from strBuff[BLOCKSIZE]
   5.2 if (FirstBlock)
          Consider the node as the root element
          Increment BlockCtr
       else
          Search for the generated hash in the BST
          if (hash found)
             Create the node
             Add the node to the linked list
             Update the end link of the SLL
          else
             Add the node to the BST
             Increment BlockCtr
6. Calculate the deduplication ratio
7. Display the result for each iteration
8. End

We have a diagrammatic representation of the BST and SLL that are created while parsing the file at the block level.

Here we use two important structures.

Binary Search Tree Structure

typedef struct treeNode
{
     TCHAR  tszHash[MAX];
     int iBlockNo;
     struct Sll *structSlink;
     struct Sll *structLlink;
     struct treeNode *left;
     struct treeNode *right;
}treeNode;
  1. tszHash <- this field stores the hash generated by the CRC64 hash algorithm
  2. iBlockNo <- each block has a specific block number; repeated blocks point to that block number
  3. structSlink <- start link pointer that holds the address of the first node of the singly linked list
  4. structLlink <- end link pointer that holds the address of the last node of the singly linked list
  5. left <- pointer that stores the address of the left child
  6. right <- pointer that stores the address of the right child
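Under these definitions, `Insert` and `Find` keyed on the hash string might look like the following simplified sketch. The `ppTempStore` out-parameter used later in the article is omitted, and `MAX` is assumed to be 64 here:

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

#define MAX 64

struct Sll;  // forward declaration; the SLL structure is defined separately

typedef struct treeNode
{
    char tszHash[MAX];          // TCHAR simplified to char for this sketch
    int iBlockNo;
    struct Sll *structSlink;
    struct Sll *structLlink;
    struct treeNode *left;
    struct treeNode *right;
} treeNode;

// Inserts a hash into the BST; hashes compare lexicographically via strcmp.
treeNode* Insert(treeNode* root, const char* hash, int blockNo)
{
    if (root == NULL)
    {
        treeNode* n = (treeNode*)calloc(1, sizeof(treeNode));  // zero-initialized
        strncpy(n->tszHash, hash, MAX - 1);
        n->iBlockNo = blockNo;
        return n;
    }
    int cmp = strcmp(hash, root->tszHash);
    if (cmp < 0)      root->left  = Insert(root->left, hash, blockNo);
    else if (cmp > 0) root->right = Insert(root->right, hash, blockNo);
    return root;
}

// Returns the node holding the hash, or NULL if it is not present.
treeNode* Find(treeNode* root, const char* hash)
{
    while (root != NULL)
    {
        int cmp = strcmp(hash, root->tszHash);
        if (cmp == 0) return root;
        root = (cmp < 0) ? root->left : root->right;
    }
    return NULL;
}
```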

Single Linked List Structure

struct Sll
{
     int iRepBlockNo;
     struct Sll *pLink;
};
  1. iRepBlockNo <- stores the block number of the repeated block
  2. pLink <- pointer that holds the address of the next node
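Appending a repeated block to this list in O(1) relies on keeping both the start link and the end link current. A sketch of that append (`AppendRepeat` is an illustrative name, not from the original code):

```cpp
#include <cassert>
#include <cstdlib>

struct Sll
{
    int iRepBlockNo;
    struct Sll *pLink;
};

// Appends a repeated-block record in O(1) using head (slink) and
// tail (elink) pointers; both are assumed to start as NULL.
void AppendRepeat(struct Sll **slink, struct Sll **elink, int blockNo)
{
    struct Sll *n = (struct Sll*)malloc(sizeof(struct Sll));
    n->iRepBlockNo = blockNo;
    n->pLink = NULL;
    if (*slink == NULL)          // first repeat of this block
        *slink = *elink = n;
    else
    {
        (*elink)->pLink = n;     // link after the current tail...
        *elink = n;              // ...and advance the tail pointer
    }
}
```

Advancing the tail pointer after every append is what keeps the operation O(1); if it were left pointing at the first node, later appends would overwrite the second node instead of extending the list.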

Core Logic Loop

  1. Each loop iteration reads BLOCKSIZE bytes of data and generates a hash for that block.
  2. The Find() procedure checks whether the hash is already present in the BST.
  3. If the hash already exists:
  • Create the node
  • Add the node to the singly linked list
  • Update the slink and elink pointers of the tree node
  4. If the hash does not exist:
  • Create the node
  • Use the hash as the key for comparison
  • Add the node to the BST
  5. Calculate the deduplication percentage as a measurement, also known as the deduplication ratio.
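Step 5’s measurement can be computed as total blocks read divided by unique blocks stored. A small sketch, with illustrative function names:

```cpp
#include <cassert>

// Deduplication ratio: how many logical blocks map to one stored block.
double DedupRatio(int totalBlocks, int uniqueBlocks)
{
    return (double)totalBlocks / (double)uniqueBlocks;
}

// The same saving expressed as a percentage of storage avoided.
double SavingsPercent(int totalBlocks, int uniqueBlocks)
{
    return 100.0 * (totalBlocks - uniqueBlocks) / totalBlocks;
}
```

For instance, if a file yields 1,000 blocks of which only 400 are unique, the ratio is 2.5:1 and 60% of the storage is saved.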

Syntax

while(is.read(strBuff, BLOCKSIZE))
{
    mbstowcs(wcstring, strBuff, sizeof(strBuff));
    wstring strTempHash = HashString(wcstring).c_str();
    if(iIncCtr == 0)
    {
         root = Insert(root, strTempHash.c_str(), iBlockCtr, &ppTempStore);
    }
    else
    {
         pTemp = Find(root, strTempHash.c_str());
         if(pTemp == NULL)
         {
             root = Insert(root, strTempHash.c_str(), iBlockCtr, &ppTempStore);
         }
         else
         {
            pNwLink = (Sll*)malloc(sizeof(Sll));
            pNwLink->pLink = NULL;
            pNwLink->iRepBlockNo = iIncCtr;
            if(pTemp->structLlink == NULL && pTemp->structSlink == NULL)
            {
                 pTemp->structSlink = pNwLink;
                 pTemp->structLlink = pNwLink;
            }
            else
            {
               pTemp->structLlink->pLink = pNwLink;
               pTemp->structLlink = pNwLink;  /* advance the end link */
            }
         }
    }
    iIncCtr++;
}
  • The code has an average time complexity of O(n log n) for n blocks, since each block triggers one BST search or insert; with an unbalanced tree the worst case degrades to O(n²).

The output above shows block-level deduplication run on a 50 MB file.

Links and Abbreviations

  1. SLL <- Singly Linked List
  2. BST <- Binary Search Tree
  3. Dedup <- Deduplication

To know more, email: marketing@calsoftinc.com

 