De-Duplication: 20:1, 50:1, 100:1… Revolutionary technology or marketing hype?
Math is supposed to communicate concepts in a straightforward and concrete way, right? Fat chance! Just ask an economist to interpret the recent non-GAAP numbers for company X and you will likely get an earful for an hour or two about how they can make the numbers say anything they want. Well, the same is true for some of the math we're seeing in the new de-duplication technologies.
Some definitions might be helpful. De-duplication is the process of eliminating redundant files, bytes or blocks of data to ensure that only the ‘unique’ files, bytes or blocks are stored. On a file level, this might be single instance store. For example, if a user sends a PowerPoint presentation to 15 people, then single instance store would only keep one copy of the file and use pointers as a reference for everyone else to access that file. A very rough byte or block level de-duplication example might be to take that same PowerPoint file, split it up into byte or block level "chunks", store only the unique ones and then use pointers to rebuild that file when necessary. Why is this appealing? Well, you can imagine that one of those unique "chunks" from the PowerPoint presentation might also be present in hundreds of other files. You are no longer dependent on the file or even the file type, but can reference these unique chunks from within all file types. Still with me?
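To make the "chunks and pointers" idea concrete, here is a minimal Python sketch. It is an illustration only, not how any particular vendor's product works: it assumes fixed-size chunks (real products often use variable-size chunking), a made-up 4-byte chunk size so the example stays readable, and SHA-256 hashes as the "pointers". The names `dedup_store` and `rebuild` are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical and unrealistically small, for readability


def dedup_store(files, chunk_size=CHUNK_SIZE):
    """Split each file into fixed-size chunks and keep each unique chunk once.

    Returns (store, recipes):
      store   maps a chunk's SHA-256 digest to the chunk bytes
      recipes maps a file name to the ordered list of digests ("pointers")
    """
    store = {}
    recipes = {}
    for name, data in files.items():
        recipe = []
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)  # stored only the first time it is seen
            recipe.append(digest)
        recipes[name] = recipe
    return store, recipes


def rebuild(name, store, recipes):
    """Reassemble a file by following its pointers back into the chunk store."""
    return b"".join(store[digest] for digest in recipes[name])


# Two made-up "files" of different types that happen to share two chunks.
files = {
    "deck.ppt": b"AAAABBBBCCCCDDDD",
    "report.doc": b"BBBBCCCCEEEE",
}
store, recipes = dedup_store(files)
total_chunks = sum(len(recipe) for recipe in recipes.values())
print(f"{total_chunks} chunks referenced, {len(store)} unique chunks stored")
```

Notice that the shared chunks are stored once even though they come from two different file types, which is exactly the appeal described above.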
The Math:
De-Duplication hardware and software vendors are coming up with some fairly amazing numbers. They are claiming data reduction ratios of 10:1, 20:1 and even up to 100:1 or MORE! Is this possible? Is it all lies? Well, no, not exactly. :-) The numbers are possible given the right scenario.
You CAN handle the truth!:
So, what is the truth? The truth is that a lot of factors play into what kind of data reduction you will achieve. Full versus incremental backups, retention, compression, de-duplication within the backup itself, daily rate of data change; all of these factors come into play. You might see a 30:1 reduction or better. You might see 6:1. In comparison to 2:1 compression on tape, 6:1 is still an amazing reduction in data. The best possible scenario for de-duplication is in the area of backups, since this data tends to be highly redundant from one backup to the next and especially between full backups. Think about it: how much data is actually changing between those weekly full backups? 1%? 10%? Even 50% means that you are still backing up 50% redundant data. The potential gain in cost savings from reduction in backup data is very appealing to many customers. Think of the cost savings if you can store 20TB of backup data on only 1TB of disk! Think not only about the hardware costs, but data center cooling, administration costs, power consumption; it all saves money! What needs to be done is to explain de-duplication ratios so that we don't get caught up in the numbers hype.
OK, stick with me on this. If you were to do a full backup of X GB of data every night for 30 days and that data never changed, then you would end up with a de-duplication ratio of 30:1. Why? Because you sent 30 days of backup data, but 29 days of that data weren't needed or stored because it was all redundant! Still with me? Obviously if you did this for 100 days you would end up with 100:1 reduction. See, it is possible. :-) Now let me ask you, how many of you do a full backup, every night, of completely unchanging data? I have yet to meet one company that does. Yes, we all want to get 100:1 reduction, but what are the problems you are trying to solve? What if de-duplication presented a working solution for your need, but did it with a 5:1 de-duplication ratio? Would you care about the ratio or the fact that you solved your particular problem and need?
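If you want to play with the numbers yourself, here is a rough Python sketch of that arithmetic. It rests on a deliberately simple model that is my assumption, not a vendor formula: a full backup of the same size every night, a fixed fraction of the data changing each day, every changed byte being brand new, and no compression. The function name `dedup_ratio` and the `daily_change` parameter are made up for illustration.

```python
def dedup_ratio(days, daily_change):
    """Data sent divided by data stored for nightly full backups.

    days         number of nightly full backups retained
    daily_change fraction of the data that is new each day (0.0 to 1.0)

    Assumed model: the first backup is stored in full, and each later
    backup adds only its changed fraction to the store.
    """
    sent = days                                # in units of one full backup
    stored = 1 + (days - 1) * daily_change
    return sent / stored


# The unchanging-data cases from the text:
print(dedup_ratio(30, 0.0))    # 30.0
print(dedup_ratio(100, 0.0))   # 100.0

# A few more realistic change rates over 30 days:
for change in (0.01, 0.10, 0.50):
    print(f"{change:.0%} daily change: {dedup_ratio(30, change):.1f}:1")
```

Under these assumptions, even a modest daily change rate pulls the ratio well below the headline number, which is why the ratio alone tells you so little without the scenario behind it.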
Conclusion:
Know your need FIRST, get all the facts from a trusted technologist and then find the tool that meets your need. Don't get sucked in by some company that claims that their product will solve all of your problems. Some of these companies are one-trick ponies. They try to make everyone's problem look like a nail to their hammer. Remember, some of these companies said that tape backups would be gone 5 years ago and yet, tape is still going strong. Do the research, find a trusted advisor, and put together a holistic solution that meets your needs! De-duplication may be a great asset to your organization or it might be a cool technology that doesn't quite fit your needs!
Still have questions? Post them here!

