I was asked for a URL for this article. It's available at Storage magazine: http://tinyurl.com/ygjlcj
(Original URL is http://searchstorage.techtarget.com/magItem/0,291266,sid35_gci1216875,00 .html , but that is probably truncated.)

---
W. Curtis Preston, Author of O'Reilly's Backup & Recovery and Using SANs and NAS
VP Data Protection
GlassHouse Technologies

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Curtis Preston
Sent: Thursday, December 07, 2006 11:18 PM
To: Veritas-bu@mailman.eng.auburn.edu
Subject: [Veritas-bu] Backups vs archives

Based on some previous posts, I'd like to throw out the following thought, and see what you folks think about it.

If you want archives, make archives. Use NetBackup's archive feature, or use Enterprise Vault (or some other actual archive product). You don't make NetBackup backups and hold on to them for 7 years. Using backups as archives can make an actual e-discovery process VERY painful.

I submit for your consideration an article that I wrote a few months ago:

A bottle of grape juice left on a shelf long enough will ferment - but no one would call it wine. Backups left on a shelf long enough will allow one to restore old data - but no one should call them archives. Like a good wine, an archive should be made for a specific purpose using an application designed to create archives. This article will start with a look at the business requirements for archiving, followed by a discussion on why backups make lousy archives, and will end with a discussion of the types of products designed to meet archive requirements.

Archives are for the logical retrieval of information. That is, they allow one to retrieve information grouped in a logical way.
The first way that archives manifest themselves is the storing of reference data, such as:

* The CAD drawings, parts lists, and other manufacturing information for a widget a company used to make
* All of the information pertaining to a former customer
* All pertinent information regarding a closed project, account, law case, etc.
* Tax returns, financial records, or other records for a particular year

Information that can be grouped in a logical way can be archived and stored in such a way that a company can retrieve it via that logical grouping. Once a case is closed, a widget is no longer produced, a tax year has passed, etc., the information pertaining to that item is just taking up space. We might need to reference it again for some reason, but we don't want it filling up our high-end storage, either. So we archive it and delete it. If we need it five years later, we search the archives for "Widget XYZ."

The second way that archives manifest themselves is in the logical storage of active data. Suppose, for example, that it was discovered that a critical safety part was taken out of the design of a particular widget. It would be important to be able to see every version of the specification, along with information about who changed it.

Also consider the now rather common practice of electronic discovery of email systems. Think about the discovery requests that can come from someone in management being accused of harassment or discrimination; a trader being accused of promising financial returns; or a company being accused of collusion with competitors. Such accusations result in electronic discovery requests that look like the following:

* All emails from employee A to employees B, C, and D for the last year.
* All emails and instant messages from all traders to all customers for the last three years that contain the words "promise," "guaranty," "vow," "assure," or "warranty."
* All emails that left a company going to domains x, y, and z, or to these email addresses

In summary, archives can contain the only copy of inactive (or reference) data, or a reference copy of active data.

Backups make lousy archives

The most common way that people archive data is to simply keep their backups for a long time. They perform a weekly or monthly full backup, and then keep that backup for anywhere from one year to fifty years, depending on their business requirements. There couldn't be a worse way to archive. There are many difficulties with using backups as archives, depending on which type of archive we're talking about.

The most common use of backups as archives is for the retrieval of reference data. Companies take one full backup per month and hold on to it for many years - indefinitely in some cases. The idea is that if someone asks for the parts for widget ABC (or some other piece of reference data), we'll just restore the appropriate files from the system where they used to reside. The first challenge with that plan is simply remembering where the files were several years ago. Can you remember the name of the fileserver or database server that you used three years ago - let alone seven years ago?

The next challenge is the number of operating system or application versions that have come and gone during that time. To restore files that were backed up from apollo five years ago, the first requirement is a system named apollo. Someone's also going to have to handle any authentication issues between the backup server and the new apollo, since it isn't the same apollo that it backed up from five years ago. Depending on the backup software and operating system in question, the new apollo may also need to be running the same version of the operating system and applications the old apollo was running five years ago. Otherwise, there may be incompatibilities in the filesystem or database that's being restored to.
Backups are also used to satisfy electronic discovery requests, and doing this can be even more challenging. Let's use the most common electronic discovery request as an example: a request for emails that match a particular pattern and were sent via an Exchange server. (The concepts below also apply to other email systems, such as Lotus Notes or SMTP, but we'll use Exchange as an example.) There are two very large challenges with using backups to satisfy such a request.

The first challenge is that it is actually impossible to retrieve all emails sent or received by a particular person. It's only possible to restore those emails that were present in the Exchange server when backups were made. If someone sent an email that the discovery request is looking for, deleted it, then cleared their Deleted Items folder, it wouldn't be on that night's backup, and thus would never show up when attempting to retrieve it weeks, months, or years later. Therefore, it's technically impossible to meet the discovery request using backups. This means that even after doing your best to satisfy the discovery request, a plaintiff may claim that you have not proven your case. (Remember that in America, the burden of proof is different in civil suits. Plaintiffs do not have to prove their case beyond a reasonable doubt. They must only provide a preponderance of evidence.)

The second challenge with using backups to satisfy an Exchange electronic discovery request is that it's quite difficult to retrieve months or years of emails using backups. Suppose, for example, a company performs a full backup of their Exchange server once a week, and for compliance reasons they hold onto these backups for seven years. If they received an electronic discovery request for emails from the last seven years, they would need to perform many restores of their entire Exchange server to satisfy the request. First they would restore their Exchange server to an alternate server using last week's backup.
(Let's not forget that an alternate server Exchange restore is not that easy to do.) Then they would run a query against Exchange to look for the emails in question, saving them to a PST file. Then they would restore their Exchange server using the backup from two weeks ago, rerun the query, and create another PST file. They'll end up restoring their entire Exchange server 364 times before they're done (seven years times 52 weeks). Of course, almost every step in this process will have to be done manually.

The real challenge here is that the scenario described above is not impossible. It will cost that company an incredible amount of time and money, but a plaintiff in a civil suit or the government doesn't care how much it costs the defendant. The only thing you need to know is that you have a court order to produce this information - regardless of how much it costs.

Backups are also an extremely inefficient way to store archives. Where an archive system will make sure that it has one or two copies of a particular version of a file, a backup system usually has no such logic. If a company is using weekly full backups as archives (or creating "archives" with their backup product but not deleting the original files), and they're storing their archives for seven years, they'll have 364 copies of many of their files on tape - even if those files have never changed. This leads to an incredible amount of media waste.

The other thing that we don't like to talk about when discussing backups as archives is the number of times a given company changes backup formats and tape formats over the years. Almost every company using backups as archives has a number of older tape and backup formats that they must continue to support for archive purposes. While older tape formats can be converted with a lot of copying, converting older backup formats is a whole different challenge.
Most people choose to hold onto both old tape formats and old backup formats and hope they never actually have to read them.

True Archiving

The most important feature of an archiving system is that the archive should contain enough metadata to be able to retrieve the information in logical ways. For example, metadata can include the author, or business unit that created an item. (An item can be any piece of archived information, such as a file, a record from a database, or an email.) Metadata might also contain the project that the item is attached to, or some other logical grouping. An email archive system would also include who sent and received an email, the subject of the email, and all other appropriate metadata. Finally, an archive system may also import the full text of the item into its database, allowing for full text searches against the archive. This can be a very useful feature, especially if multiple formats can be supported. It's very nice to be able to do a full text search against all emails, Word documents, PDF files, etc.

Another important feature of archive systems is their ability to store a pre-determined number of copies of a given archived item. The number of copies a company chooses to keep is up to them and is based on what they want to protect from. For example, if they're storing their archives on a RAID-protected system, they may choose to have one copy on disk and another on a removable medium such as optical or tape.

Archive systems manifest themselves in two ways. The first type of archiving system is the traditional, low-retrieval archive system attached to your backup software package. You can make an archive of a selected group of files and attach limited metadata to it, such as "Widget XYZ," and then have the archive system delete the files in question. The good thing is that it allows the attachment of metadata, and can reduce multiple copies in the archive by deleting files as they're archived.
The bad news is that if you want to be able to search archives via different types of metadata, such as owner, time frame, etc., you would need to create multiple archives. The main use for this type of archive is to save space by deleting files attached to projects or entities that are no longer active.

Newer archive systems realize that any given archived item might need to be retrieved for different reasons and would thus require different metadata. To support multiple different types of retrievals, it's important to store the actual archived item only once, but to store all of its metadata in a searchable database. Such a system also realizes that a given archived item might be put into the archive not to save space, but to allow it to be searched for logically. Therefore, unlike their predecessors that stored the only copies of reference data, these newer types of archives tend to store an extra copy of the data, leaving the original in place.

One of the problems discussed previously with using backups as archives is that they won't have all occurrences of a given file or message; they will have only those items that were available when the backup was made. These newer archive systems solve this problem by archiving data automatically. For example, every email that comes in or is sent out is sent to the archiving system. Every time a file is saved, a version of the file is sent to the archive system.

Another advantage of modern archive systems is their use of the single instance store and delta incremental concepts. They store only one copy of a given file or email, no matter where it came from or who it went to. (They, of course, record who it came from or who it was sent to.) If that file or email is then changed and sent/stored again, they can store only the changed bytes in the new version. This allows for incredible efficiency when storing many files or emails.
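
To make the "store the item once, keep all of its metadata searchable" idea concrete, here is a minimal Python sketch. It is an illustration only, not how any particular archive product works: the Archive class, its add/search/fetch methods, and the sender/recipient/project fields are all hypothetical names invented for this example, and the delta-incremental part is left out entirely. It only shows a single instance store keyed by a SHA-256 content hash, with one metadata record per occurrence.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Archive:
    """Toy single-instance archive (hypothetical, for illustration only)."""
    blobs: dict = field(default_factory=dict)    # content hash -> bytes, stored once
    records: list = field(default_factory=list)  # one metadata dict per occurrence

    def add(self, content: bytes, **metadata) -> str:
        digest = hashlib.sha256(content).hexdigest()
        # Keep the bytes only if this exact content has not been seen before.
        self.blobs.setdefault(digest, content)
        # Always record who/what/where, even for duplicate content.
        self.records.append({"digest": digest, **metadata})
        return digest

    def search(self, **criteria) -> list:
        """Return metadata records whose fields equal every given criterion."""
        return [r for r in self.records
                if all(r.get(k) == v for k, v in criteria.items())]

    def fetch(self, record: dict) -> bytes:
        """Retrieve the stored content that a metadata record points at."""
        return self.blobs[record["digest"]]


archive = Archive()
attachment = b"Q3 forecast spreadsheet"
# The same attachment sent to two people: two records, one stored copy.
archive.add(attachment, sender="alice", recipient="bob", project="Widget XYZ")
archive.add(attachment, sender="alice", recipient="carol", project="Widget XYZ")
archive.add(b"lunch?", sender="bob", recipient="alice", project=None)

print(len(archive.records))  # 3 metadata records
print(len(archive.blobs))    # 2 distinct pieces of content
```

A request such as "everything alice sent" then becomes archive.search(sender="alice") against the metadata, with no restore involved. A real product would keep the metadata in an actual database and would add full text indexing and delta storage on top of this.
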
As to the format issues of backups as archives, many archive systems still have those issues. Many people still store their archives on tape, and as time passes people will change their archive software. Therefore, this problem could continue to exist even in archives. See the sidebar about the use of CAS disk as an archive target.

Another secondary feature of modern archiving systems is that they can also serve as an HSM-like system, automatically deleting large, older files and emails, and invisibly replacing them with stubs that automatically retrieve the appropriate content when accessed. This is one of the big business justifications used to sell email archive software. In addition to being able to satisfy electronic discovery requests, you can save a lot of space by archiving redundant and unneeded emails and attachments. Surveys show that over 90% of typical email storage is consumed with attachments. If you can store only one copy of such an attachment across multiple email servers (and Exchange Storage Groups), and replace it with a stub, then you can save a whole lot of storage. If you add delta-block incrementals to that, you can save even more storage.

While the HSM-like features of most newer archiving programs may seem more compelling and provide more direct savings, they should be seen as a secondary reason for archiving. The primary reason for archiving should be that you've got a valid business reason for doing so - and that an actual archiving system might actually meet that business requirement.

If your company has more than one employee, it probably has a business case for archiving. And if you're using backups as archives, you could be in for a rude awakening when you get an electronic discovery request. Perhaps you should look at an email archiving product or an enterprise content management (ECM) product today.

Sidebar: Disk or tape for archiving?
This article mentioned that the archive industry may suffer the same issues as the backup industry if customers use tape as their primary storage, and occasionally switch archiving vendors. Can we do better? One idea might be to use a content addressable storage (CAS) device as the primary storage device for your archives. If the product supports a standard filesystem interface, such as NFS or CIFS, and it supports single instance storage and delta block technologies, it could solve a number of problems.

First, a disk product using single instance storage and delta block incremental technologies could actually be cheaper to operate than a tape-based system. This will also always be the case, since you really can't apply delta block technologies to tape-based systems. Therefore, the first problem we solve is disk systems being more expensive than tape systems.

Second, if the CAS device supports a filesystem interface, then migrating between storage systems should be relatively simple. With a tape-based system, we have to copy all data from the old tape format to the new tape format. With a filesystem-based system, you could simply copy data from the older device to the newer device.

Finally, you could potentially solve the format issue as well. If archive products can support discovery of existing CAS systems, you could theoretically switch archive products with no ill effects. The raw data would still be accessible via the filesystem interface, and the metadata could be imported - or the new archive system could grab the metadata from the CAS device. Your mileage will definitely vary here, but solutions to this problem do exist.

Sidebar: Turning backups into archives?

Another common question is what to do when switching away from backups as archives. What to do with all the old tapes in the old backup format(s)? The answer is the same as it is for changing backup formats.
The only thing you can do is restore the oldest versions of the data being archived, archive it, delete it, then restore the next version. It's not pretty, but it's reality. The good news is that every backup that you turn into an archive means storage savings.

---
W. Curtis Preston, Author of Backup & Recovery and Using SANs and NAS
VP Data Protection
GlassHouse Technologies

_______________________________________________
Veritas-bu maillist - Veritas-bu@mailman.eng.auburn.edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu