Duplicating the Deduplication


Our mission to deduplicate and archive your critical company data in a store (on disk) as efficiently as possible led to several DATASTOR patents. It may seem ironic, then, that keeping a redundant copy of that store is also important: duplicating the deduplication, as it were. Anybody who started a backup career by rotating tapes offsite has recognized the benefit in the event of a site-wide disaster.

In fact, keeping a redundant copy of your backups is a DATASTOR best practice, whether through copy task synchronization to a second disk device, rotating RDX media, or a vault on tape or cloud storage. Each approach mitigates the single point of failure that the loss of the store would represent. In all cases, you choose which archives within the store to copy, and whether to copy a date range, just the latest restore point, or all restore points to a storage device of your choice. Let’s follow the path all the way from production data (on disk) to store (on disk), then to a copy (on disk) and/or to the vault (on tape, in this case).

We looked for a way to make ongoing replication as efficient as possible. The result is to copy archives between two stores, each with its own interdependent set of archive indexes providing deep awareness of each archive’s contents. This allows the copy task to first identify the unique items within the archives that require copying, and then to copy those items with multiple threads during a second phase. Verification, with retry of problem files, is also integrated into the copy task. The outcome is a copy of a store that is consistent, or that knows when it isn’t and initiates self-healing steps.
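
The two phases described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not DATASTOR's implementation: each store is modeled as a plain dictionary keyed by the SHA-256 digest of an item, and the names `plan_copy`, `copy_item`, and `copy_archive` are hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def digest(data: bytes) -> str:
    """Content address of an item (assumption: SHA-256 of its bytes)."""
    return hashlib.sha256(data).hexdigest()


def plan_copy(archive_index, dest_store):
    """Phase 1: list the items the archive references that the destination lacks."""
    return sorted(set(archive_index) - set(dest_store))


def copy_item(key, source_store, dest_store, max_attempts=3):
    """Copy one item, verify it by re-hashing, and retry on a mismatch."""
    for _ in range(max_attempts):
        dest_store[key] = source_store[key]
        if digest(dest_store[key]) == key:
            return True
        del dest_store[key]  # discard the bad copy before retrying
    return False


def copy_archive(archive_index, source_store, dest_store, workers=4):
    """Phase 2: copy the missing items with a pool of worker threads."""
    missing = plan_copy(archive_index, dest_store)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(
            pool.map(lambda k: copy_item(k, source_store, dest_store), missing)
        )
    failed = [k for k, ok in zip(missing, results) if not ok]
    return missing, failed
```

In this sketch, running `copy_archive` a second time against the same destination plans no work, because every item the archive references is already present. That is the property that keeps ongoing replication cheap.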

The tape vaulting process mirrors the copy process, but the target is a “volume set” instead of a destination store. In addition, vaulting tackles a problem inherent to tape and cloud storage: the inefficient transfer of a large number of small files to these media. The solution: vaulting “containerizes” small files into larger segments for efficient streaming of segments to tape.
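
To make the idea of “containerizing” concrete, here is a minimal sketch that greedily groups small files into segments of a bounded size. The function name `pack_into_segments`, the segment size, and the greedy strategy are assumptions made for illustration; the actual segment format and packing policy are not described here.

```python
def pack_into_segments(files, segment_size):
    """Group (name, size) pairs into segments of at most segment_size bytes.

    A file larger than segment_size is placed in a segment of its own.
    """
    segments = []
    current, used = [], 0
    for name, size in files:
        # Close the current segment if this file would overflow it.
        if current and used + size > segment_size:
            segments.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        segments.append(current)
    return segments
```

Streaming a handful of large segments instead of thousands of small files is what lets a sequential medium such as tape run at a steady rate.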

During its first execution, vaulting transfers all of the restore point data to tape. Subsequently, only the unique data found in new restore points, identified at a sub-file level, is written to the last tape in the set associated with the vault, the other tapes being full. When a tape fills, an empty, available tape is assigned to the volume set and used. Over time the number of tapes in the set grows, and because any tape in the set may be relevant during a recovery, all of them must remain available. The software identifies which tapes are needed for a given recovery operation.
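
The append-only behavior of a volume set can be modeled with a small sketch. The class `VolumeSet`, its methods, and the notion of a "chunk" with an identifier and a size are hypothetical stand-ins chosen for illustration; the sketch also assumes every chunk fits on a single tape.

```python
class VolumeSet:
    """Toy model of a volume set that only ever appends to its last tape."""

    def __init__(self, tape_capacity):
        self.tape_capacity = tape_capacity
        self.tapes = [0]            # bytes used on each tape, in order
        self.location = {}          # chunk id -> index of the tape holding it
        self.restore_points = {}    # restore point name -> chunk ids it needs

    def vault(self, name, chunks):
        """Write only the chunks that are not already on tape."""
        for chunk_id, size in chunks.items():
            if size > self.tape_capacity:
                raise ValueError("chunk does not fit on a single tape")
            if chunk_id in self.location:
                continue  # already vaulted by an earlier restore point
            if self.tapes[-1] + size > self.tape_capacity:
                self.tapes.append(0)  # last tape is full: assign a new one
            self.tapes[-1] += size
            self.location[chunk_id] = len(self.tapes) - 1
        self.restore_points[name] = list(chunks)

    def tapes_needed(self, name):
        """Tapes that must be available to recover the given restore point."""
        return sorted({self.location[c] for c in self.restore_points[name]})
```

A short usage example: after vaulting Monday's restore point and then Tuesday's, which adds one new chunk that spills onto a second tape, recovering Tuesday requires both tapes while Monday still needs only the first.

```python
vs = VolumeSet(tape_capacity=100)
vs.vault("mon", {"c1": 60, "c2": 30})
vs.vault("tue", {"c1": 60, "c2": 30, "c3": 50})
print(vs.tapes_needed("mon"), vs.tapes_needed("tue"))
```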

Vaulting occurs at the end of a disk-to-disk-to-tape data protection scheme. Its intention is long-term retention of a growing history of archive restore points in a volume set, using a single vaulting task that preserves the deduplicated state of the archived data and appends the latest data to the last tape in the set over the course of months. The intention is not to recreate that history daily or weekly, as legacy full backup or full-plus-incremental solutions required, which might be further complicated by a Grandfather-Father-Son rotation scheme.

Because the space required to generate a restore point grows slowly from day to day, the number of tapes required with vaulting tends to grow slowly, too (the bytes written to tape match the amount stored on disk for a given restore point). Best practice is to create a new vaulting task on a quarterly or even yearly basis to generate a fresh full set of archive data.

Because the tapes in a volume set comprise a unit, those who must keep an entire volume set off site to meet SLA policies may configure a second vault with its own volume set, and schedule the two vaulting tasks to alternate.

For more information about vaulting best practices, see our Theory of operations for vaulting to tape.