Deduplication Algorithm

Introduction to Deduplication

According to Wikipedia, “Data deduplication is a specific form of compression where redundant data is eliminated, typically to improve storage utilization. In the deduplication process, duplicate data is deleted, leaving only one copy of the data to be stored. However, indexing of all data is still retained should that data ever be required. Deduplication is able to reduce the required storage capacity since only the unique data is stored.”

Methods of Deduplication:

  • File-level Deduplication
  • Block-level Deduplication

File-level deduplication watches for multiple copies of the same file, stores the first copy, and then just links the other references to the first file. Only one copy gets stored on the disk/tape archive. Ultimately, the space you save on disk relates to how many copies of the file there were in the file system.

Let’s assume a company with 1,000 employees shares a common file, say “data.txt”, which is 10 MB in size. Each employee makes the same changes and saves an identical copy, so 1,000 copies of the file land on the server. The estimated storage required to hold them all is 10 GB.

If all the files are identical, there is no point in storing all of them on the server. Save a single copy and place a pointer in each user’s folder that points to that copy. This is how the data deduplication technique saves terabytes of storage.
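The linking idea above can be sketched in a few lines of C++. The names `storeFile` and `storedCopies` are illustrative, not from any real system, and a production implementation would use a cryptographic hash (e.g. SHA-256) rather than `std::hash`:

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Maps a content hash to the path of the single stored copy.
std::map<std::size_t, std::string> storedCopies;

// Returns the path the caller should link to; the file content is
// stored only if it has not been seen before.
std::string storeFile(const std::string& path, const std::string& content)
{
    std::size_t h = std::hash<std::string>{}(content);
    auto it = storedCopies.find(h);
    if (it != storedCopies.end())
        return it->second;   // duplicate: link to the first copy
    storedCopies[h] = path;  // first copy: actually store it
    return path;
}
```

With this sketch, all 1,000 saves of an identical “data.txt” resolve to the one copy stored on the first save; only a pointer is recorded for the rest.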

Block-level Deduplication, sometimes called variable block-level deduplication, looks at the data block itself to see if another copy of this block already exists. If so, the second (and subsequent) copies are not stored on the disk/tape, but a link/pointer is created to point to the original copy.

Let’s assume we have three users, each having four data blocks. The green and gray blocks are common to all three users and are backed up once in the data center. The blue, red, and purple blocks are common to two users and are likewise backed up once. The important point is that the memory required for each block in data storage is only sizeof(block), no matter how many users reference it.

Block-level deduplication is generally more efficient than file-level deduplication: at the file level, the whole file must be stored again whenever its version changes, whereas at the block level only the changed blocks are stored, which takes comparatively less space.
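A minimal block-level sketch, assuming a tiny fixed `BLOCKSIZE` and in-memory data for illustration (a real implementation hashes each block rather than storing it verbatim, and uses a much larger block size):

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>

const std::size_t BLOCKSIZE = 4;  // tiny block size, for illustration only

// Splits data into fixed-size blocks and returns {totalBlocks, uniqueBlocks}.
std::pair<std::size_t, std::size_t> countBlocks(const std::string& data)
{
    std::set<std::string> unique;  // one entry per distinct block
    std::size_t total = 0;
    for (std::size_t off = 0; off < data.size(); off += BLOCKSIZE)
    {
        unique.insert(data.substr(off, BLOCKSIZE));
        ++total;
    }
    return {total, unique.size()};
}
```

For example, the 16-byte input `"abcdabcdabcdxyz1"` splits into four blocks (`abcd`, `abcd`, `abcd`, `xyz1`), of which only two are unique, so only two blocks need to be stored.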

Implementation of the Deduplication Algorithm

1. Start
2. Declare variables
3. Initialize variables
4. Read BLOCKSIZE bytes (e.g., 1024) from the file in one iteration
5. Repeat until EOF is reached
   5.1 Generate a hash value from strBuff[BLOCKSIZE]
   5.2 if (FirstBlock)
          Consider the node as the root element
          Increment BlockCtr
       else
          Search for the generated hash in the BST
          if (hash found)
             Create the node
             Add the node to the linked list
             Update the end link of the SLL
          else
             Add the node to the BST
             Increment BlockCtr
6. Calculate the deduplication ratio
7. Display the result for each iteration
8. End

We have a diagrammatic representation of the BST and SLL that are created while parsing the file at the block level.

Here we use two important structures.

Binary Search Tree Structure

typedef struct treeNode
{
     TCHAR  tszHash[MAX];
     int iBlockNo;
     struct Sll *structSlink;
     struct Sll *structLlink;
     struct treeNode *left;
     struct treeNode *right;
}treeNode;
  1. tszHash <- this field stores the hash generated by the CRC64 hash algorithm
  2. iBlockNo <- each block has a specific block number; repeated blocks point to that block number
  3. structSlink <- start link pointer that holds the address of the first node of the singly linked list
  4. structLlink <- end link pointer that holds the address of the last node of the singly linked list
  5. left <- pointer that stores the address of the left child
  6. right <- pointer that stores the address of the right child
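Under these definitions, `Insert` and `Find` keyed on the hash string might look like the following simplified sketch. The `ppTempStore` out-parameter used later in the article is omitted, and `MAX` is assumed to be 64 here:

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

#define MAX 64

struct Sll;  // forward declaration; the SLL structure is defined separately

typedef struct treeNode
{
    char tszHash[MAX];          // TCHAR simplified to char for this sketch
    int iBlockNo;
    struct Sll *structSlink;
    struct Sll *structLlink;
    struct treeNode *left;
    struct treeNode *right;
} treeNode;

// Inserts a hash into the BST; hashes compare lexicographically via strcmp.
treeNode* Insert(treeNode* root, const char* hash, int blockNo)
{
    if (root == NULL)
    {
        treeNode* n = (treeNode*)calloc(1, sizeof(treeNode));  // zero-initialized
        strncpy(n->tszHash, hash, MAX - 1);
        n->iBlockNo = blockNo;
        return n;
    }
    int cmp = strcmp(hash, root->tszHash);
    if (cmp < 0)      root->left  = Insert(root->left, hash, blockNo);
    else if (cmp > 0) root->right = Insert(root->right, hash, blockNo);
    return root;
}

// Returns the node holding the hash, or NULL if it is not present.
treeNode* Find(treeNode* root, const char* hash)
{
    while (root != NULL)
    {
        int cmp = strcmp(hash, root->tszHash);
        if (cmp == 0) return root;
        root = (cmp < 0) ? root->left : root->right;
    }
    return NULL;
}
```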

Single Linked List Structure

struct Sll
{
     int iRepBlockNo;
     struct Sll *pLink;
};
  1. iRepBlockNo <- stores the block number of the repeated block
  2. pLink <- pointer that holds the address of the next node
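Appending a repeated block to this list in O(1) relies on keeping both the start link and the end link current. A sketch of that append (`AppendRepeat` is an illustrative name, not from the original code):

```cpp
#include <cassert>
#include <cstdlib>

struct Sll
{
    int iRepBlockNo;
    struct Sll *pLink;
};

// Appends a repeated-block record in O(1) using head (slink) and
// tail (elink) pointers; both are assumed to start as NULL.
void AppendRepeat(struct Sll **slink, struct Sll **elink, int blockNo)
{
    struct Sll *n = (struct Sll*)malloc(sizeof(struct Sll));
    n->iRepBlockNo = blockNo;
    n->pLink = NULL;
    if (*slink == NULL)          // first repeat of this block
        *slink = *elink = n;
    else
    {
        (*elink)->pLink = n;     // link after the current tail...
        *elink = n;              // ...and advance the tail pointer
    }
}
```

Advancing the tail pointer after every append is what keeps the operation O(1); if it were left pointing at the first node, later appends would overwrite the second node instead of extending the list.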

Core Logic Loop

  1. Each loop iteration reads BLOCKSIZE bytes of data and generates a hash for that block.
  2. The Find() procedure checks whether the hash is already present in the BST.
  3. If the hash already exists:
  • Create the node
  • Add the node to the singly linked list
  • Update the slink and elink pointers of the tree node
  4. If the hash does not exist:
  • Create the node
  • Use the hash as the key for comparison
  • Add the node to the BST
  5. Calculate the deduplication percentage as a measurement, also known as the deduplication ratio.
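Step 5’s measurement can be computed as total blocks read divided by unique blocks stored. A small sketch, with illustrative function names:

```cpp
#include <cassert>

// Deduplication ratio: how many logical blocks map to one stored block.
double DedupRatio(int totalBlocks, int uniqueBlocks)
{
    return (double)totalBlocks / (double)uniqueBlocks;
}

// The same saving expressed as a percentage of storage avoided.
double SavingsPercent(int totalBlocks, int uniqueBlocks)
{
    return 100.0 * (totalBlocks - uniqueBlocks) / totalBlocks;
}
```

For instance, if a file yields 1,000 blocks of which only 400 are unique, the ratio is 2.5:1 and 60% of the storage is saved.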

Syntax

while(is.read(strBuff, BLOCKSIZE))
{
    mbstowcs(wcstring, strBuff, sizeof(strBuff));
    wstring strTempHash = HashString(wcstring).c_str();
    if(iIncCtr == 0)
    {
         root = Insert(root, strTempHash.c_str(), iBlockCtr, &ppTempStore);
    }
    else
    {
         pTemp = Find(root, strTempHash.c_str());
         if(pTemp == NULL)
         {
             root = Insert(root, strTempHash.c_str(), iBlockCtr, &ppTempStore);
         }
         else
         {
            pNwLink = (Sll*)malloc(sizeof(Sll));
            pNwLink->pLink = NULL;
            pNwLink->iRepBlockNo = iIncCtr;
            if(pTemp->structLlink == NULL && pTemp->structSlink == NULL)
            {
                 pTemp->structSlink = pNwLink;
                 pTemp->structLlink = pNwLink;
            }
            else
            {
               pTemp->structLlink->pLink = pNwLink;
               pTemp->structLlink = pNwLink;  /* advance the end link */
            }
         }
    }
    iIncCtr++;
}
  • The code has an average time complexity of O(n log n) for n blocks, since each block triggers one BST search or insert; with an unbalanced tree the worst case degrades to O(n²).

The output above shows block-level deduplication run on a 50 MB file.

Links and Abbreviations

  1. SLL <- Singly Linked List
  2. BST <- Binary Search Tree
  3. Dedup <- Deduplication

To know more, email: marketing@calsoftinc.com

 