So here it goes.
First of all, the following worked in my case, but may (and will probably) not work in a good many others: there's a huge number of ways in which a hard drive can fail, and this is but one. Besides, while the issues I encountered turned out to be a worst case scenario in several ways, as you will see, I was still very, very lucky in one important thing: while on the high level of files and directories and such the disk was hosed, on the lower level of 0s and 1s, I could still manage to read from it.
It all started with a complete freeze of the computer. After restarting, the computer would not boot. The BIOS seemed to scan the disks alright, but when attempting to boot, a confusing and not always consistent error message would be displayed:
GRUB Loading stage 1.5. GRUB loading, please wait... Error 17
GRUB is what is called a boot loader. A boot loader is a tiny bit of program whose purpose is to load the computer's operating system and start it. Some, like GRUB itself, are able to let you choose from several installed operating systems, typically Windows and your chosen flavour of Linux.
The GRUB manual teaches us that an error 17 means GRUB couldn't sort out its kitten with the partition you want it to boot from. This is, as you can tell, not exceedingly helpful, although it will serve as a useful clue later on.
Using a bootable Linux CD and trying to access the disk in question confirms the problem: the system does take note of the hard drive device plugged in, but essentially can't do anything with it.
Extracting the raw data
At this point, the number one priority is trying to extract as much data as can be from the device.
When you look it in the face, what a hard drive boils down to is a very, very long list of bytes. Physically, it's organized in a stack of thin magnetic disks, chunked up into heads (one magnetic head for each side of a magnetic disk), cylinders and sectors (circular and radial subdivisions of a given disk, respectively); a hard drive is able to inform the computer of its number of heads, cylinders and sectors, which is generally referred to as the geometry of the disk, noted as the triplet (H, C, S). The total number of sectors, H times C times S, multiplied by the number of bytes per sector, gives the total size of the drive. (Sectors have traditionally been 512 bytes each, although don't quote me on that for every drive -- the hard drive might advertise that too, for all I know.)
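To make that arithmetic concrete, here is a minimal Python 3 sketch. The 512 bytes per sector are an assumption (the traditional value), the function names are made up for illustration, and the geometry in the example is invented rather than that of the drive in this story.

```python
# Assumption: 512 bytes per sector, the traditional value for PC hard drives.
BYTES_PER_SECTOR = 512


def chs_capacity(heads, cylinders, sectors_per_track,
                 bytes_per_sector=BYTES_PER_SECTOR):
    """Return the drive size in bytes implied by an (H, C, S) geometry."""
    total_sectors = heads * cylinders * sectors_per_track
    return total_sectors * bytes_per_sector


def chs_to_lba(cylinder, head, sector, heads, sectors_per_track):
    """Position of a (cylinder, head, sector) address in the flat list of sectors.

    Cylinders and heads are counted from 0, sectors within a track from 1.
    """
    return (cylinder * heads + head) * sectors_per_track + (sector - 1)


# Invented geometry, for illustration only.
print(chs_capacity(heads=16, cylinders=1024, sectors_per_track=63))
print(chs_to_lba(cylinder=1, head=0, sector=1, heads=16, sectors_per_track=63))
```

The second function is the "long list of bytes" view: once the geometry is known, every (cylinder, head, sector) address maps to one position in a flat sequence of sectors.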
But, once the computer has the geometry of the drive figured out, which is done at a very low level, that of the BIOS, what the drive amounts to is a long, long, long list of bytes.
The first part of my work was thus to make it so that the BIOS would get the correct idea about the drive's geometry.
luna_the_cat suggested that I "wrap the HD up in a couple of good sealable plastic freezer bags so that it doesn't get too many condensation problems, then pop it in the freezer for half an hour to a couple of hours," and then plug it back into the computer while it's still cold. She has already recovered data off several broken disks thanks to this trick. I tried. I do not know to what extent it helped; I also fiddled with many BIOS parameters, including some having to do with hard drive geometry detection. But, in the end, I figured that at some point of the fiddling, the computer did start recognizing the drive correctly, even if it still wouldn't do anything with it; although I would be hard pressed to say what exactly made it work.
But at that point, I didn't need more: I could read from that long, long, long list of bytes, and that was all I needed to begin the recovery, thanks to one of several wonder tools that helped me along: ddrescue.
ddrescue is a small and fairly simple Linux tool that does one thing, but does it well: you point it to some file or device, and it reads from it sequentially, and saves what it could read where you tell it to. Where read errors prevent it from accessing the data, it retries as many times as it can, attempting to narrow down its reading to the closest zones to the broken area. This way, it maximizes the amount of data recovered around the broken parts. For the details of its use, I'll refer you to its documentation.
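To give an idea of the principle, here is a rough Python 3 sketch of a copy that survives read errors. It is emphatically not ddrescue's actual algorithm, only an illustration under simple assumptions: a failed read raises OSError, failing areas are split in halves until they reach a minimum size, and whatever still cannot be read is left as zeroes. The name rescue_copy and its parameters are made up for this example.

```python
import io


def rescue_copy(src, dst, total_size, block_size=64 * 1024, min_block=512):
    """Copy total_size bytes from src to dst, zero-filling what cannot be read.

    Returns the list of (offset, length) ranges that could not be read.
    """
    bad_ranges = []

    def copy_range(offset, length):
        try:
            src.seek(offset)
            data = src.read(length)
        except OSError:
            data = None
        if data is not None:
            # Pad short reads so the output keeps the same offsets as the input.
            dst.seek(offset)
            dst.write(data.ljust(length, b"\0"))
            return
        if length <= min_block:
            # Give up on this small area: leave zeroes and remember where.
            dst.seek(offset)
            dst.write(b"\0" * length)
            bad_ranges.append((offset, length))
            return
        # Split the failing area in two and try each half on its own,
        # which narrows the loss down to the smallest unreadable pieces.
        half = length // 2
        copy_range(offset, half)
        copy_range(offset + half, length - half)

    for offset in range(0, total_size, block_size):
        copy_range(offset, min(block_size, total_size - offset))
    return bad_ranges


# Tiny demonstration on an in-memory "disk" that has no read errors.
source = io.BytesIO(b"some data" * 1000)
target = io.BytesIO()
print(rescue_copy(source, target, 9000))  # an empty list: nothing was lost
```

The real tool also retries, keeps a log so that an interrupted run can resume, and reads the easy areas first; none of that is modelled here.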
But, this is where I hit the first serious snag: copying an entire 80GB worth of hard drive means that the output will be one single 80GB file. I didn't have that much space. But the hard drive that I had bought for my backups turned out to be almost sufficient; if I copied all but the last GB or so off the dead drive, it would fit. Since the dead drive was far from full, I decided to attempt it, figuring that missing 1GB out of 80 would not be much of a matter, especially as that last GB was unlikely to contain much. This turned out to be a serious mistake, as you will see.
The ddrescue process took hours and hours; I let it run over the night. When I came back the next morning, ddrescue had retried countless times over countless sectors of the hard drive, but managed to read nearly all of them in the end. Only a small chunk of about 180kB couldn't be saved at all. Out of nearly 80GB, that's pretty insignificant. ddrescue did a wonderful job.
Except that the 180kB in question were, as it turns out, in the worst possible location.
Recovering the partitions
At this point, I've got one huge file of nearly 80GB, about 180kB of which are known bad. This is a step in the right direction, but still a far cry from recovering my data. What I now need to do is dive deep into the content of that file to figure out the disk's original structure.
Thankfully, the way the contents of a hard drive are structured is well known and documented. In a working PC hard drive, you know what the first few hundred bytes are supposed to contain. To make it short: the very beginning of the hard drive is where the boot loader is located, in the form of a tiny, tiny set of machine language instructions. Then, what is called the partition table follows.
The partition table is a set of bytes that describe how a physical hard drive is sliced up into logical drives, or partitions, which are the C:, D:, E: and so on that you actually use on Windows. What those bytes amount to is a list of four entries that tell the computer, essentially, "Okay, so here you have a partition of this type, and it begins here on the drive, and ends there". The type is just a code that helps computer operating systems figure out what to do with any given partition, and might generally not be otherwise significant, if you except that a partition with the wrong type might not show up in your computer's list of drives, which would be a bummer.
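For the curious, here is a minimal Python 3 sketch of how those four entries can be decoded by hand. It assumes the classic PC layout (446 bytes of boot code, then four 16-byte entries, then the two signature bytes 0x55 0xAA) and 512-byte sectors; the function and key names are invented for this example.

```python
import struct

SECTOR_SIZE = 512        # assumption: 512-byte sectors
TABLE_OFFSET = 446       # the table follows 446 bytes of boot code
ENTRY_SIZE = 16          # four entries of 16 bytes each
SIGNATURE = b"\x55\xaa"  # last two bytes of the sector


def parse_partition_table(sector):
    """Return the non-empty partition table entries found in a 512-byte sector."""
    if len(sector) < SECTOR_SIZE or sector[510:512] != SIGNATURE:
        raise ValueError("no MBR signature found")
    entries = []
    for i in range(4):
        start = TABLE_OFFSET + i * ENTRY_SIZE
        raw = sector[start:start + ENTRY_SIZE]
        # status, CHS start, type code, CHS end, first sector, sector count
        status, _chs_start, ptype, _chs_end, first, count = struct.unpack(
            "<B3sB3sII", raw)
        if ptype == 0:
            continue  # type 0 marks an unused slot
        entries.append({
            "bootable": status == 0x80,
            "type": ptype,
            "start_byte": first * SECTOR_SIZE,
            "size_bytes": count * SECTOR_SIZE,
        })
    return entries
```

Calling it on the first 512 bytes of a disk or of a dump, read in binary mode, gives the "it begins here and is this long" information for each declared partition.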
But, here, the most attentive of you might be thinking, "Four entries? Isn't it possible to have more than four partitions on one hard drive?" And the answer is, yes.
What computers do is, they use a special type for the last partition, which they call the extended type. When reading a partition table and encountering an extended partition, the computer goes there and treats the entire partition as if it was a whole hard drive, potentially containing several partitions. So, yes, you could put a boot loader at the beginning of an extended partition; it would just never be used.
One thing of note is that in this configuration, only two partitions are declared in the partition table of the extended drive: the actual partition, that will show up in your drive list, and, if relevant, another extended partition that will contain the next drive. And so on, and so on, up to as many drives as you want.
The important thing here is: each extended partition is nested inside the previous one, with its own partition table, and this means that, to get the entire list of partitions on a hard drive, you must read the first partition table, see if it points to an extended partition, and if so, jump to that partition, read its own partition table, add the first partition there to the list of known partitions and, if there's an extended partition there as well, jump there, read the partition table there, etc, etc, etc.
Meaning that the information about what partitions exist on your hard drive is spread across the entire disk; and that if one of the partition tables somewhere in the chain is damaged, you can't find the following partition tables either.
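The chain-walking described above can be sketched as follows in Python 3. This is an illustration under assumptions, not what any particular tool does: 512-byte sectors, type codes 0x05, 0x0F and 0x85 for extended partitions, and the usual convention that a logical partition is located relative to its own table while the link to the next table is located relative to the start of the outermost extended partition. All names are made up.

```python
import io
import struct

SECTOR = 512                          # assumption: 512-byte sectors
EXTENDED_TYPES = {0x05, 0x0F, 0x85}   # codes commonly used for extended partitions


def read_entries(f, sector_number):
    """Read the four (type, first_sector, sector_count) entries at a sector."""
    f.seek(sector_number * SECTOR)
    data = f.read(SECTOR)
    if len(data) < SECTOR or data[510:512] != b"\x55\xaa":
        raise ValueError("no partition table at sector %d" % sector_number)
    entries = []
    for i in range(4):
        raw = data[446 + 16 * i:462 + 16 * i]
        _status, _chs1, ptype, _chs2, first, count = struct.unpack("<B3sB3sII", raw)
        entries.append((ptype, first, count))
    return entries


def list_partitions(f):
    """Return (type, first_sector, sector_count) for primary and logical partitions."""
    found = []
    for ptype, first, count in read_entries(f, 0):
        if ptype == 0:
            continue  # unused slot
        if ptype not in EXTENDED_TYPES:
            found.append((ptype, first, count))
            continue
        # Follow the chain of nested tables inside the extended partition.
        extended_start = first
        table_at = extended_start
        seen = set()
        while table_at not in seen:   # guard against a looping chain
            seen.add(table_at)
            logical, link = read_entries(f, table_at)[:2]
            if logical[0] != 0:
                # The logical partition is located relative to its own table.
                found.append((logical[0], table_at + logical[1], logical[2]))
            if link[0] not in EXTENDED_TYPES:
                break  # no further table: end of the chain
            # The next table is located relative to the extended partition's start.
            table_at = extended_start + link[1]
    return found
```

Note how read_entries raises as soon as one table in the chain is unreadable: everything after the damaged link is simply out of reach, which is the weakness this scheme has.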
I imagine you already see where this is headed.
But before I get there, let's first look into how I went about exploring the intricacies of the huge, nearly 80GB file that I was working with.
What I needed to do was open that file at arbitrary places, read bytes there, try to make sense of them, and based on that information, go read elsewhere, wash, lather, repeat. Thank heavens, I came across a tool that HUGELY simplified this task: Hachoir.
Doing the task manually would have amounted to trying to make sense of an arbitrary list of bytes, that is, numbers between 0 and 255 -- horribly tedious and error prone at best. So my first intuition was to program a tool that would let me describe a binary structure in terms of what bytes in what position mean what, and then use that tool to describe the structure of a master boot record, thanks to my newly acquired knowledge of those, and then apply that tool to my big file to figure out what was left of the low level structure of my drive and move one step further into the recovery of my data.
Turns out that such tools exist already, and this is exactly what Hachoir does.
Better still: Hachoir already comes with the description for a master boot record. Meaning I didn't even have to worry about writing one.
Things weren't all that rosy, though: while Hachoir looks like a damn promising tool, it's still largely undocumented, and its programming interface is still a bit awkward to handle. Thankfully, it is written in the Python programming language, which is well suited to exploratory fiddling.
Here's an example session of working with Hachoir, assuming that you unpacked the hachoir-core and hachoir-parser archives into the current directory.
Python 2.4.3 (#1, Mar 7 2007, 04:04:43)
[GCC 4.1.1 (Gentoo 4.1.1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> ## Firstly, we add the Hachoir directories to the list of places where Python will look
... ## for modules.
... import sys
>>> sys.path.append("hachoir-core-0.8.0/")
>>> sys.path.append("hachoir-parser-0.9.0/")
>>>
>>> ## Then we can import the modules we need.
... from hachoir_parser.file_system.mbr import MSDos_HardDrive
>>>
>>> ## 'mbr' is a module that contains all the functions for parsing a Master Boot Record.
... ## 'MSDos_HardDrive' is the parser that is able to understand the layout of hard drives.
... ## The odd naming and organization of Hachoir's modules took me a moment to figure out.
...
>>> ## Another difficulty with Hachoir is that its parsers only work with its own input stream
... ## classes, so you can't give it a standard Python file or stream of bytes and expect it
... ## to work. It took me a while to figure that out!
... ## Thankfully, there is a small class that helps open a file and return a Hachoir stream.
... from hachoir_core.stream.input_helper import FileInputStream
>>>
>>> ## Here, I'm going to parse one of my working hard drives; when I did it to recover my
... ## data, I parsed the file I had dumped with ddrescue.
... my_hardrive = FileInputStream( unicode( "/dev/hda" ) )
>>>
>>> ## Note how the class in question DEMANDS that the filename be explicitly converted to
... ## Unicode. I told you, Hachoir is a bit of a pain at times.
...
>>> ## Anyway. Now, I've got a Hachoir stream pointing to my hard drive, and I can feed it to
... ## the Hachoir parser that understands disks.
... parsed_harddrive = MSDos_HardDrive( my_hardrive )
>>>
>>> ## There. Now we can write a little function that uses the information in parsed_harddrive
... ## to display the layout of the information that the Hachoir parser could figure out.
... def displayTree( tree ):
...     for field in tree:
...         print field.path
...         if field.is_field_set: displayTree( field )
...
>>> displayTree( parsed_harddrive )
/mbr
/mbr/program
/mbr/header
/mbr/header/bootable
/mbr/header/start_head
/mbr/header/start_sector
/mbr/header/start_cylinder
/mbr/header/system
/mbr/header/end_head
/mbr/header/end_sector
/mbr/header/end_cylinder
/mbr/header/LBA
/mbr/header/size
/mbr/header
/mbr/header/bootable
/mbr/header/start_head
/mbr/header/start_sector
/mbr/header/start_cylinder
/mbr/header/system
/mbr/header/end_head
/mbr/header/end_sector
/mbr/header/end_cylinder
/mbr/header/LBA
/mbr/header/size
/mbr/header
/mbr/header/bootable
/mbr/header/start_head
/mbr/header/start_sector
/mbr/header/start_cylinder
/mbr/header/system
/mbr/header/end_head
/mbr/header/end_sector
/mbr/header/end_cylinder
/mbr/header/LBA
/mbr/header/size
/mbr/header
/mbr/header/bootable
/mbr/header/start_head
/mbr/header/start_sector
/mbr/header/start_cylinder
/mbr/header/system
/mbr/header/end_head
/mbr/header/end_sector
/mbr/header/end_cylinder
/mbr/header/LBA
/mbr/header/size
/mbr/signature
/padding
/partition
/partition
/partition
/partition
>>>
>>> ## But you can also query it about some particular bits of data:
... print ( parsed_harddrive["/mbr/header/end_head"],
...         parsed_harddrive["/mbr/header/end_cylinder"],
...         parsed_harddrive["/mbr/header/end_sector"] )
(15, 193, 63)
>>>
>>>
>>> ## In my case, the information was incomplete. I could only see the first partition,
... ## and the headers hinted at an extended partition that the Hachoir MBR parser couldn't
... ## unravel. This is where things got tricky.
...
>>>
Note that you could use Hachoir in the exact same manner to extract interesting information from a wide variety of formats: the length of a video file, the number of pages in a PDF, etc...
Anyway. There I was, with interesting, but incomplete information about my hard drive.
Because, you guessed it -- from what I can tell, the tiny 180kB that went missing happened to be right across the beginning of my extended partition. And Hachoir could thus not read that extended partition's MBR to tell me just where the actual sub-partition that contained all my important files was.
This is why GRUB barfed too: in order to load up the whole partition table, it needed to read the disk's MBR, and then that of the extended partition... which it couldn't obtain. Hence the confusing Error 17.
Finding the partition
Thankfully, by then, thanks to Hachoir, I knew ROUGHLY where the partition started: some place after the end of the last known primary partition (which happened to be my Windows one, about which I didn't give much of a damn). It turns out it was all I needed.
Let's backtrack to the list of information that Hachoir could extract from my MBR. Did you spot this one?
/mbr/signature
A signature is a series of predetermined, 'magic' bytes, set at a predetermined position in a header, that you can read to make sure this is actually a header of the sort you think it is. Unfortunately, the signature of an MBR is very short (two bytes) and thus occurs randomly in many, many places of a hard drive that have nothing to do with the MBR.
But. It is now time to delve into just what was on my partition.
Partitions, just like disks, really boil down to a long, long list of bytes. Now, of course, what you want to use is a complete tree of directories that contain files that have owners and access rights and all. Somehow, the system must organize that long stream of bytes in such a way that it can show you those files and directories.
That organization is called a filesystem.
There are many sorts of filesystems out there. Old versions of DOS and Windows used one called FAT, which was remarkably inefficient, but was damn simple and is thus still used in such devices as digital cameras and USB keys. Newer versions of Windows use one called NTFS, which is actually pretty damn good, stores files efficiently and all.
The Linux world, itself, has so many sorts of filesystems it's not even fun. (Or maybe it is, depending on where you stand, really.) The one I use is called ReiserFS, and it, too, is pretty badass.
In the same way that MBRs do, a filesystem gives structure to what amounts to a stream of bytes by imposing that a series of bytes in a precise position near the beginning of the partition contains structural information: another sort of header, actually much more complicated than the kind that makes up an MBR.
Thankfully, it, too, is well documented. To a computer guy, anyway.
If you went and clicked the above link, perhaps some magic word caught your attention: 'magic', precisely. The ReiserFS header also has a signature. And it is actually pretty long -- 12 bytes -- and has no chance of occurring randomly on a drive.
Well, almost no chance. Since I'm an open source guy, I've got the source code for my system lying around -- including the source code for ReiserFS. Meaning that I do, actually, have files that contain that signature. But that's only a few files, a few places in the partition, and I figured it was worth attempting.
So I wrote a small Python script that opened my big file roughly at the end of the last primary partition, and then searched, byte by byte, for the ReiserFS signature. It found half a dozen of those.
Based on that, since that signature occurs at a definite position in a header that is itself at a definite position on the partition... I could deduce where the partition started. I had only half a dozen contenders. Thankfully -- and not illogically -- the first one turned out to be correct.
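A sketch of what such a search might look like in Python 3 follows. It is not the original script, which is not reproduced in this post. Two things in it are assumptions to check against the ReiserFS documentation before relying on them: the magic string b"ReIsEr2Fs", and its position 64 KiB plus 52 bytes after the start of the partition. The function names are invented.

```python
def find_signature(f, magic, start=0, chunk_size=1 << 20):
    """Yield the offset of every occurrence of magic in file f, from start on."""
    overlap = len(magic) - 1
    position = start
    f.seek(start)
    tail = b""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        window = tail + chunk
        base = position - len(tail)   # file offset of window[0]
        index = window.find(magic)
        while index != -1:
            yield base + index
            index = window.find(magic, index + 1)
        # Keep the last few bytes so a match straddling two chunks is not missed.
        tail = window[-overlap:] if overlap else b""
        position += len(chunk)


# Assumptions: the magic string and its distance from the start of the partition.
REISERFS_MAGIC = b"ReIsEr2Fs"
MAGIC_OFFSET_IN_PARTITION = 64 * 1024 + 52


def candidate_partition_starts(f, search_from):
    """Turn every signature hit into the partition start it would imply."""
    return [hit - MAGIC_OFFSET_IN_PARTITION
            for hit in find_signature(f, REISERFS_MAGIC, search_from)
            if hit >= MAGIC_OFFSET_IN_PARTITION]
```

Reading in large chunks rather than literally byte by byte matters on a file of this size; the small overlap between chunks is there so that a signature cut in two by a chunk boundary is still found.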
So what I did then is truncate my big recovery file at that position. (Well, in truth, I copied it starting from that position all the way to the end with the Linux utility called dd, but that amounts to the same.)
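For illustration, the same copy can be sketched in a few lines of Python 3; with dd, the equivalent is done with its skip= operand. The function name and its arguments are made up for this example.

```python
import shutil


def copy_from_offset(source_path, target_path, start, chunk_size=1 << 20):
    """Copy everything from byte offset start to the end of source into target."""
    with open(source_path, "rb") as source, open(target_path, "wb") as target:
        source.seek(start)
        # copyfileobj reads and writes chunk by chunk, so the whole file
        # never has to fit in memory.
        shutil.copyfileobj(source, target, chunk_size)
```

Here start would be the partition start deduced from the signature search.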
And thus I ended up with a slightly smaller file, roughly 55GB, that, I hoped, contained all my data.
Recovering the files
One wonky great thing about Linux is that files and devices work the same externally -- or, in other words, devices can be accessed, read and written to as if they were files. This, in particular, means that I could tell the system: "This file there is to be considered as a partition -- now attach it to my system!"
With an unbroken partition, this actually works just fine. In my case, though, no luck.
ReiserFS ships with a set of tools to repair broken ReiserFS filesystems. I tried to throw them at it, but they would barf and abort with somewhat confusing error messages, that I'll admit I don't remember.
So, back to the drawing board. Or, as the case may be, Hachoir. For Hachoir, that damn thing, also comes with a parser that understands the ReiserFS structure.
Let's keep it short. *g* The ReiserFS header, among other information, points to what is called the root node: let's take a shortcut and say it's the place of the disk that corresponds to the '/' directory -- or c:\\ to a Windows user -- and from where all the structure of your drive follows.
And in this case, it pointed to a position in my 55GB partition file that was somewhere around the 56th GB. Remember the last GB that I had to omit for space reasons, way back at the beginning of this post? Turned out it did contain important information. Critical information. With a header pointing to a root node that was outside the partition, even the ReiserFS recovery tools could do nothing.
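That check can be reproduced by hand. The Python 3 sketch below reads the root block number and the block size from the header and compares the resulting position with the size of the image. The offsets it uses (header at 64 KiB into the partition, root block number at bytes 8 to 11, block size at bytes 44 to 45, both little-endian) are my understanding of the ReiserFS 3.x layout and should be treated as assumptions; Hachoir's parser does this decoding for you.

```python
import struct

SUPERBLOCK_OFFSET = 64 * 1024   # assumption: where the header sits in the partition


def root_node_position(f):
    """Return (byte offset of the root node, size of the image) for file f."""
    f.seek(SUPERBLOCK_OFFSET)
    header = f.read(46)
    if len(header) < 46:
        raise ValueError("image too short to contain the header")
    # Assumed layout: root block number at bytes 8-11, block size at bytes 44-45.
    (root_block,) = struct.unpack_from("<I", header, 8)
    (block_size,) = struct.unpack_from("<H", header, 44)
    f.seek(0, 2)                # jump to the end to learn the image size
    image_size = f.tell()
    return root_block * block_size, image_size
```

If the first value is larger than the second, the root node lies beyond the end of the image, which is the situation described above.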
And of course, to make it more fun, by then, the broken hard drive was dead beyond even the powers of the freezer, so I couldn't recover that missing chunk of data from it.
... So in the end I just copied 1GB worth of zeroes to the end of my partition file.
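In Python 3, padding a file with zeroes could be sketched like this; the function name is invented, and the zeroes are written in chunks so that a whole gigabyte is never held in memory at once.

```python
def append_zeroes(path, count, chunk_size=1 << 20):
    """Append count zero bytes to the end of the file at path."""
    zeroes = b"\0" * chunk_size
    with open(path, "ab") as f:
        remaining = count
        while remaining > 0:
            n = min(remaining, chunk_size)
            f.write(zeroes[:n])
            remaining -= n
```

For the case above, count would be roughly 1024 ** 3.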
Surprisingly, it worked. The root node was screwed, but it was there, and that was all the recovery tools needed to start working.
Those of ReiserFS, in particular, have a particularly dangerous option called --rebuild-tree. What it does is disregard the information contained in the root node and its children, and scan the entire freaking partition, byte by byte, for whatever looks like files as defined by ReiserFS; and from that fragmentary data, it reconstructs the whole tree of directories and files. Some do get orphaned; they get put in a special directory called lost+found/.
This option is extremely dangerous, and is often brought up as a criticism against ReiserFS; in truth, it WILL likely screw up a partition that doesn't need it.
If you know what you are doing, though, and know you need it...
Well. Let us just say that after a few hours, I could cleanly attach my partition file to the system, without a hitch. Nearly all my files were there.
A good number of files and directories did end up in lost+found/, though. Thankfully, here too, ReiserFS is smart, and stores them under names such as nnnnn_mmmmm, where nnnnn and mmmmm are, respectively, the number of the block of the file's directory on the partition, and that of the file proper. Meaning that all the files in lost+found/ that belong to the same directory begin with the same number nnnnn, even if the name of the directory that corresponds to it was lost.
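Sorting through such a directory is easier once the entries are grouped by that first number. A minimal Python 3 sketch, assuming the names really are two runs of digits separated by an underscore, and using made-up names in the example:

```python
import re
from collections import defaultdict

# Assumption: names look like nnnnn_mmmmm, two runs of digits and an underscore.
NAME_PATTERN = re.compile(r"^(\d+)_(\d+)$")


def group_by_directory(names):
    """Group lost+found entries by the number of the directory they came from."""
    groups = defaultdict(list)
    for name in names:
        match = NAME_PATTERN.match(name)
        if match:
            groups[match.group(1)].append(name)
    return dict(groups)


# Made-up names, for illustration only.
print(group_by_directory(["1234_5678", "1234_5700", "4321_99", "notes.txt"]))
```

On the real thing, the list of names would come from something like os.listdir("lost+found").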
Since the contents of those files were all right, I could simply open them, and based on the content, deduce what sort of directory they were from, and whether I needed them recovered. Only a few were worth recovering, and I could figure out how to name them based on their contents.
And with that, my recovery was done. I copied the important files to the new system, and I'm keeping the rest around for a while in case I discover I forgot something. But, essentially, despite all that went wrong, I managed to save everything that mattered, and damn am I glad, and a little proud of myself too. :)
... Okay, so that turned out to be a bit longer than I had planned. I hope you still enjoyed the explanations, though! Thank you for reading. :)
*: In my defense, when I started, I thought it would be short.