Fixing Bad Drive Sectors on Linux

There’s a weird behavior with modern SATA drives: they will fail a sector on read, but rewrite it properly when asked to, and SMART still reports a perfectly healthy drive after the rewrite. Conventional wisdom is that such a situation means that all spare sectors on a track have been used up, but drive firmware appears to be more complex now, and that’s no longer necessarily so. Still, the default behavior makes life difficult for the admin. And who really knows what happens inside that black-box firmware on the drives?

In this kind of situation, you may see, for instance, a RAID drive member fail out on resync, and I/O errors for bad sectors in the kernel log.  Re-adding the drive will just cause it to fail out again.  But if you rewrite the sector, then you can re-add the drive without it failing out, and everything seems to go swimmingly. (See the sample syslog section in the attached script file for an example of this drive havoc.)

Do you still trust that drive? I think that’s a pretty hard call. The drive manufacturer isn’t going to take a warranty claim on a drive that reports 100% OK via SMART. Drives are so dense these days that we constantly depend on error correction just to make them run. So why isn’t this re-mapping always automatic? It has something to do with read timeouts, but I can’t see why this behavior is the right choice for any situation. Re-map and report via SMART, IMHO. This behavior is what I notice on Hitachi drives; others may well vary, because it’s up to each drive’s firmware to define the behavior.

In the meantime, commercial drive utilities such as SpinRite will take advantage of this anomaly and rewrite a drive’s sectors for repair or maintenance, and do a good job on the whole drive. The attached script does not replace that. It merely rewrites the sectors that are already causing problems, with no preventative capability. Fortunately, a weekly RAID resync will help catch these sectors, hopefully before disaster strikes. You do have all your data on RAID, don’t you? You are running weekly resyncs, right? You do get the results of those in your inbox, no? Consider what would happen if the same sector on both drives in a mirror died at the same time (however unlikely), and whether you’re willing to take that chance by using this approach.
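For the weekly check mentioned above, here is a minimal sketch of asking the Linux md driver to scrub an array through sysfs. It assumes a software RAID (md) array and must run as root; the function names and the `md0` array name are illustrative and not taken from the attached script. Many distros already ship a scheduled job that does the same thing, so check for one before adding your own.

```python
from pathlib import Path


def request_check(md_name, sysfs_root="/sys/block"):
    """Ask the md driver to start a read-only consistency check.

    Writing "check" to sync_action makes md read every member, which is
    what surfaces unreadable sectors. Requires root on a real system.
    """
    action = Path(sysfs_root) / md_name / "md" / "sync_action"
    action.write_text("check\n")
    return action


def mismatch_count(md_name, sysfs_root="/sys/block"):
    """Return the mismatch counter md reports after a check."""
    counter = Path(sysfs_root) / md_name / "md" / "mismatch_cnt"
    return int(counter.read_text().strip())
```

Calling `request_check("md0")` from a weekly cron job, and mailing yourself the kernel log plus `mismatch_count("md0")` afterwards, is one way to get those results into your inbox.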

Thank you for reading the above, so that you understand the problem domain here. It’s up to you whether you want to proceed with the attached script and trust the drive. The operation is potentially very destructive, and should never be run on a drive that is an active member of an array or (heaven forbid) mounted singly. So it’s up to you to look at the dmesg line the script finds, make sure it’s parsed right (different distros may format dmesg differently), make sure the rewrite commands look good, copy and paste the hdparm command lines, and make sure that they succeed, all before re-adding your drive to your RAID array. Sysadmins have a job at least until hardware becomes more reliable.
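To make the parse-then-review workflow concrete, here is a rough sketch in Python. It is not the attached script. It assumes kernel messages of the form `I/O error, dev sdb, sector 12345`, which is an assumption about your kernel and distro, and it only builds command strings for you to inspect; it never runs hdparm. The names `bad_sectors` and `hdparm_commands` are made up for illustration. It also assumes the sector number the kernel logs is the same LBA that hdparm expects, which you should verify on drives whose logical sectors are not 512 bytes.

```python
import re

# Assumed message shape; adjust for your kernel/distro. Only whole-disk
# names like "sdb" match, so partition-relative sector numbers are skipped.
IO_ERROR_RE = re.compile(r"I/O error, dev (?P<dev>[a-z]+), sector (?P<sector>\d+)")


def bad_sectors(log_lines):
    """Return sorted, de-duplicated (device, sector) pairs found in the log."""
    found = set()
    for line in log_lines:
        match = IO_ERROR_RE.search(line)
        if match:
            found.add((match.group("dev"), int(match.group("sector"))))
    return sorted(found)


def hdparm_commands(dev, sector):
    """Build (but do not run) the commands to confirm, rewrite, and re-read.

    --write-sector overwrites the sector with zeros, destroying its contents.
    """
    path = f"/dev/{dev}"
    return [
        f"hdparm --read-sector {sector} {path}",
        f"hdparm --yes-i-know-what-i-am-doing --write-sector {sector} {path}",
        f"hdparm --read-sector {sector} {path}",
    ]
```

Feed `bad_sectors` the lines of your dmesg output, print the result of `hdparm_commands` for each pair, and read every line before pasting anything into a root shell. The first read should fail if the sector really is bad, and the final read should succeed if the rewrite took.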

Script: rewrite_drive_sectors