|Author: Udo Rader
|Creation date: 2003-08-11 Last revision: 2005-10-17
Every sysadmin's nightmare: You made a backup of important files using tar
and for whatever reason you need to restore the files - but find the tar
archive corrupted.
This happened to me once (and hopefully never again) and it took me
quite a long time to get the data back (or at least the usable
part of it).
Before we start, some assumptions to make things clear:
- tar is GNU-tar
- your archive has been bzip2 compressed as well
(although the compression type is secondary)
- you have the tar-file ready on some accessible place
(GNU-)tar itself has some options that claim to be suitable for recovering
data from damaged archives (you'll understand the sarcasm here if you read on ...).
So let's first check what the problem is:
% tar xjf the_bad_bad_backup.tar.bz2
bzip2: Data integrity error when decompressing.
Input file = (stdin), output file = (stdout)
It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.
You can use the `bzip2recover' program to *attempt* to recover
data from undamaged sections of corrupted files.
tar: 56 garbage bytes ignored at end of archive
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
Now this indicates that I should use bzip2recover "to *attempt* to
recover data from undamaged sections". Well, doesn't sound too bad, does
it?
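Before reaching for bzip2recover, the damage can be confirmed with bzip2's
test mode (the -t mentioned in the error message above; it only reads the
archive and never writes anything):

```shell
# -t tests the integrity of the compressed stream without extracting
# anything; -vv makes it verbose. A non-zero exit status confirms
# that the archive really is damaged.
bzip2 -tvv the_bad_bad_backup.tar.bz2
```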
So I used bzip2recover:
% bzip2recover the_bad_bad_backup.tar.bz2
That way at least something happened. Depending on the size of the archive,
bzip2recover produces a nice amount of small 'rec*' files (typically 900K in
size), which corresponds to the default blocksize bzip2 uses for
compression. The "nice amount of small files" however is likely to become a
"huge amount of small files" if your archive is big - like mine was.
The archive I had to deal with was more than 200MB in size, leaving me
with several hundred(!) of those "small files". But still I was optimistic
that I could retrieve the data from the small files by finding the corrupted
files. So I tried to find out which of the small files were corrupted and
which ones were good:
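A sketch of that check, assuming the pieces still carry bzip2recover's
usual rec*.bz2 names (bunzip2 strips the .bz2 suffix as it goes):

```shell
# Try to decompress every recovered piece in order; bunzip2 exits
# non-zero on a damaged piece, which tells us its number.
for f in rec*.bz2; do
    bunzip2 "$f" || echo "corrupted piece: $f"
done
```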
bunzip2 stops when it finds the first (and hopefully last) corrupted file,
which is exactly what I wanted to know. Crush, kill and destroy: no use for
a corrupt file, so I deleted it and repeated the above command plus the
deletion for all further bad files. The only important thing is to
remember the numbers of the deleted files.
So now I thought it would be easy: use tar on the bunzip'ed files. But I
was taught otherwise. Say that rec00199 was the first (and last) corrupted
file, so starting at rec00200:
% tar xf rec00200
tar: This does not look like a tar archive
tar: Skipping to next header
tar: Error exit delayed from previous errors
Headache time ... I tried the same with the >200 remaining
allegedly "fixed" files, but always got the same error. Searches on
Google and postings to some mailing lists did not provide me with any useful
results and my headache grew.
Tar claims to have a feature to scan even corrupt archives for tar headers,
but this feature has one major blemish: it only works if no bytes are
lost in the file, because tar expects headers to be 512 bytes in size and
aligned to 512-byte boundaries. If only one byte is lost in a header (or a
data block), everything after it is misaligned, this "recovery feature"
fails and becomes an annoyance.
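The misalignment is easy to demonstrate with a throwaway archive (demo.txt
and demo.tar are made up for the demonstration):

```shell
# Build a small, perfectly valid archive.
printf 'hello' > demo.txt
tar cf demo.tar demo.txt
# Now drop just the first byte: every header is off by one and
# tar's block-wise scan finds no header at all - it errors out.
tail -c +2 demo.tar | tar tf -
```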
Luck returned a couple of weeks later when I received an email from a nice
guy who had written a nice Perl script that really searches a file for
tar headers bytewise, not in the 512-byte steps of tar itself. You can
download it here.
In order to get things working, I joined the second part of the bunzip'ed
files (the ones after the bad rec00199):
% cat rec00[2-4][0-9][0-9] > good_tail.tar
The command above joins all files starting at rec00200 up to rec00499.
And now the only thing I had to do was to use the script below to find the
position of the first good tar header in good_tail.tar:
% perl find_tar_headers.pl good_tail.tar
The only thing that matters is the first line of the output: it tells us
that the first good tar header in good_tail.tar is at position 17185.
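If the Perl script is not at hand, a rough equivalent can be improvised
with GNU grep (my own sketch, not the script from the email; it skips the
checksum validation a proper header scan should do): the magic string
"ustar" sits 257 bytes into every 512-byte tar header, so the first hit
minus 257 is the start of the first good header.

```shell
# -a treats the binary file as text, -b prints the byte offset,
# -o prints one line per match. Subtract 257 from the first offset
# to get the start of the enclosing tar header.
grep -abo ustar good_tail.tar | head -n 1
```

Note that grep counts offsets from 0 while tail's -c + counts from 1, so
double-check the off-by-one before extracting.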
What remained was to extract the content starting at this position and
then untar it:
% tail -c +17185 good_tail.tar > extracted_tail.tar
% tar xf extracted_tail.tar
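The last two steps can also be combined into a single pipe, skipping the
intermediate file (assuming the same offset, of course):

```shell
# Stream everything from byte 17185 onwards straight into tar;
# no extracted_tail.tar needed.
tail -c +17185 good_tail.tar | tar xf -
```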
Happy end of the story.