Jedi QuestMaster wrote:
https://www.howtogeek.com/173463/bad-sectors-explained-why-hard-drives-get-bad-sectors-and-what-you-can-do-about-it/
Thanks! Whoever wrote that is out of their mind; there is no such thing as a "soft" bad sector, and it certainly has no correlation to power loss. They're using "soft" and "hard" terms because those actually come from memory (RAM), where there really are "soft" and "hard" errors (soft = temporary, where subsequent attempts work fine, i.e. a flaky bit; hard = permanent, where all attempts fail, i.e. reliably busted). But based on their weird description, I think they're trying to describe what I call "suspect" LBAs/sectors. I'll try to explain, with some background.
The data contained in an actual sector (either 512 bytes or 4 KBytes) -- which an LBA correlates with, just not always 1:1 due to remapping that can happen -- does contain ECC data alongside the raw data written. The ECC implementations tend to be of varying algorithms (depends on vendor, drive age, blah blah), usually Hamming type or Reed-Solomon. Going off of memory here, but there's something like 64 bytes of ECC data per 512-byte sector, and 128 for 4K sectors. Someone much smarter than me and more familiar with the math can probably figure out the exact amounts, but again going off of memory, I think with 512-byte sectors ECC can correct up to something like 24 bits of data, and with 4K it can correct twice that amount. The correlating ECC values are written at the same time as the raw data. I think there's a paper released by Toshiba somewhere with details; Google around a bit.
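To make the principle concrete: here's a toy Hamming(7,4) encoder/decoder in Python. This is NOT what any real drive uses (real sector ECC is far larger and usually Reed-Solomon or LDPC-based), it just demonstrates the idea of parity data written alongside raw data that lets the read path detect *and* correct a flipped bit:

```python
# Toy Hamming(7,4): 4 data bits + 3 parity bits -> any single flipped bit
# in the 7-bit codeword can be located and corrected on read. Drive ECC
# applies the same principle at a much larger scale per sector.

def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4      # parity over codeword positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4      # parity over codeword positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4      # parity over codeword positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Return (corrected data bits, 1-based position of flipped bit, or 0)."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # points at the bad bit position
    if syndrome:
        c = list(c)
        c[syndrome - 1] ^= 1          # flip it back
    return [c[2], c[4], c[5], c[6]], syndrome

data = [1, 0, 1, 1]
cw = hamming74_encode(data)
cw[5] ^= 1                            # simulate a flaky bit on the medium
decoded, pos = hamming74_decode(cw)
# decoded matches the original data even though a codeword bit was corrupt
```

Note the trade-off visible even in this toy: 3 parity bits per 4 data bits buys you single-bit correction; real per-sector ECC spends its (proportionally much smaller) overhead to correct multi-bit bursts.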
Is it possible that one could end up writing only a portion of raw data and/or ECC, thus in effect, having a "partially written" sector? Sure. ECC is used on an LBA read operation (and there are some vendor-specific ATA commands that can actually do a read that bypasses the ECC region). If the (effective) checksum calculation doesn't match what's in the ECC region, depending on the number of bit errors, the drive can auto-correct this. I can't speak for all drives, but this is tracked in SMART attribute 1 (Raw_Read_Error_Rate).
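If you want to watch attribute 1 (and the others discussed below) on your own drives, smartmontools is the usual tool. A minimal sketch of pulling attributes out of `smartctl -A /dev/sdX`-style output -- the sample text here is made up for illustration, and remember raw values are vendor-specific, so don't read too much into the numbers themselves:

```python
# Sketch: extract SMART attributes from smartctl -A style output
# (smartmontools). SAMPLE is fabricated for illustration; real raw values
# and which attributes exist at all vary per vendor/model/firmware.

SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       12
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
"""

def parse_smart_attributes(text):
    """Map attribute ID -> (name, raw value) from smartctl -A style output."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0].isdigit():
            # column 10 is RAW_VALUE in the default -A table layout
            attrs[int(fields[0])] = (fields[1], int(fields[9]))
    return attrs

attrs = parse_smart_attributes(SAMPLE)
# attrs[197] is the count of "suspect" LBAs discussed below
```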
Is it possible there are too many errors, such that ECC cannot correct them? Again, sure. However, in this case the sector *is not* marked as "bad" -- in the fashion that you think -- but rather is marked (what I call) "suspect". Drive firmware has all sorts of retry algorithms and heuristics to minimise this situation, because what it ultimately means is data loss. Once a sector is marked "suspect", the data there normally can't be obtained. Anyway, the algorithms and heuristics vary TREMENDOUSLY across vendors, drive models, and firmware versions. It's the most common variance point there is.
Anyway, in this situation, the LBA that maps to that physical sector will return an I/O error when read. These types of LBAs are tracked in SMART attribute 197 (Current_Pending_Sector). The drive re-evaluates whether the physical sector is usable (i.e. was this some sporadic error, or is it reproducible?) when the LBA is written to. I've heard rumours that some MHDDs actually do this in the background when the drive is idle for long periods of time, but I've never actually seen such in practise -- I've only seen it happen when the OS issues a write command to said LBA. If the re-evaluation done during a write fails, the drive marks that physical sector as bad (SMART attributes 5 and 196 get incremented), the LBA is remapped to a spare sector, and the drive writes the data there from then on. If the re-evaluation done during a write is *successful*, the sector is marked usable -- i.e. no remapping happens -- and attribute 197 is decremented.
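That read/write flow can be sketched as a tiny state machine. To be clear, this is my simplified model of the behaviour described above, not any vendor's actual firmware logic (which, again, varies tremendously):

```python
# Toy model of the suspect/reallocate flow: an uncorrectable read marks a
# sector "suspect" (attr 197 goes up); the re-evaluation on the next write
# either clears it (197 goes down, no remap) or reallocates it to a spare
# (197 goes down, attr 5 goes up). Simplified -- real firmware differs.

class Sector:
    def __init__(self):
        self.suspect = False
        self.remapped = False

class Drive:
    def __init__(self, num_sectors):
        self.sectors = [Sector() for _ in range(num_sectors)]
        self.pending = 0        # SMART 197 Current_Pending_Sector
        self.reallocated = 0    # SMART 5 Reallocated_Sector_Ct

    def read(self, lba, media_ok):
        s = self.sectors[lba]
        if s.remapped or media_ok:
            return "data"
        if not s.suspect:       # too many bit errors for ECC: flag it
            s.suspect = True
            self.pending += 1
        raise IOError(f"uncorrectable read at LBA {lba}")

    def write(self, lba, reverify_ok):
        s = self.sectors[lba]
        if s.suspect:           # re-evaluate the physical sector now
            s.suspect = False
            self.pending -= 1
            if not reverify_ok:  # still bad: remap LBA to a spare sector
                s.remapped = True
                self.reallocated += 1

drive = Drive(8)
try:
    drive.read(3, media_ok=False)    # ECC can't fix it -> I/O error, 197 up
except IOError:
    pass
drive.write(3, reverify_ok=False)    # re-check fails -> sector reallocated
# subsequent reads of LBA 3 now hit the spare sector and succeed
```

The `media_ok`/`reverify_ok` flags stand in for the physical outcome of the operation; in a real drive that's decided by the retry/heuristic machinery, not by the host.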
SMART attribute 198 (Offline_Uncorrectable) is also sometimes involved in severe situations like the above -- for both reads and writes -- but it more commonly indicates an actual mechanical problem with the drive, not "a problem with a sector" per se.
Also: there's a lot of stuff that goes on under the hood, such as some drives storing a kind of "sector history" in general-purpose log regions on the drive (space dedicated to these purposes is limited, but can vary anywhere between 1 and 100 sectors or so). This helps the firmware with its analysis decisions. MHDD firmware today is *INSANELY* complicated -- sadly, long gone are the days of the "do what the OS says and stay out of the way" approach; instead, vendors keep adding all this extra stuff. Sometimes that stuff is great and increases reliability; other times it's a huge problem and gets in the way of what should be a pretty obvious true-bad-sector situation. Tricky cases are like the one I dealt with, where a drive was functional for 6-7 seconds after powering on, but then went totally catatonic; all those brains in the firmware can certainly have or result in bugs, case in point. :P
BTW: don't confuse the above type of ECC with that used in RAM (i.e. cache) -- which some MHDDs and SSDs use (many do not however) -- or with some ECC algorithms used by certain manufacturers that allow for auto-detection and correction of certain kinds of data "local" to the drive (i.e. data written to the cache by the OS was fine, but when read from the cache to write to a physical sector, the data was corrupt -- this is usually the sign of bad cache/RAM on the MHDD itself). One company that has historically done this is, believe it or not, HP -- at least in their "enterprise" drives, dunno about consumer stuff. This is tracked in SMART attribute 184 (End-to-End_Error). HP published a paper on it at one point -- they call it "SMART IV" (again, Google around to find the paper).
So, there is no such thing as a "soft bad sector". Whoever wrote that article tried to explain the above probably without having familiarity with actual MHDD operation. This type of crap has become commonplace on the Internet these days. I
rant about it here (bottom of page, "An opinionated footnote" -- the rest of the article has nothing to do with what we're discussing here).
I swear, I've done write-ups like this one so many times in my life that I feel like I'm a broken record. What I can never get, however, is confirmation of behaviour from actual MHDD engineers/vendors -- nobody talks about this stuff because of NDAs, intellectual property, blah blah blah. Some of the people who know for sure work in the data recovery industry (and were previously employees at MHDD vendor companies). It's really too bad, because there's a lot of bullshit information out there these days. Sigh, "enthusiasts" and gamers... :/ So how do I know all this? From working with drives and the ATA protocol for a while, combined with doing freelance/amateur data recovery (and no, I don't do hardware-level stuff, i.e. I don't work in clean rooms).
smartmontools' Wiki actually has a write-up on how some of this works in one fashion or another, if you're interested. Most of the information dates from circa 2003-2005, though, so things today are more complicated. The document also quickly crosses into filesystem territory, which is obviously relevant but makes understanding drive behaviour more complicated. And remember: SCSI is *completely* different from ATA in several regards (especially bad sector management) -- everything above is in regards to ATA drives.
https://www.smartmontools.org/wiki/BadBlockHowto