Use ECC everywhere, check your chips
Recently, Linus Torvalds was in the news blaming Intel for the lack of ECC memory everywhere. I also recently built my first new workstation in over 5 years. My workstation is literally used for work, my livelihood, so reliability was my top priority. All of my parts were selected for reliability and compatibility with Linux (my workstation runs Ubuntu LTS). One of my specific goals with the Ryzen 3rd gen platform was to use ECC memory. ECC memory is readily available and the extra cost is typically reasonable. However, Ryzen only supports unbuffered ECC DIMMs, not the readily available registered ECC DIMMs. At the time of my build (February 2020), 3200MHz ECC UDIMMs were not available anywhere… at any cost. I conceded to the situation and bought reputable non-ECC DIMMs: Corsair Vengeance LPX. Big mistake. This is my story.
I’ve had cognitive dissonance about ECC memory for a long time. I used to write software for supercomputers, so I was acutely aware of the fallibility of memory modules. Single jobs ran over hundreds to thousands of physical nodes, each with >=128GiB of ECC RAM. Memory failure was a common enough problem that we specifically designed for early detection, fault tolerance, and self-healing in the event of memory (among other hardware) failures. But I had never experienced bad memory on my own PCs, and I’ve been building my own PCs for over 20 years. None of my friends who build PCs had ever confirmed bad memory either. I know memory modules fail in supercomputers and servers, but they don’t fail in consumer PCs because… reasons? The reality is that I had never confirmed bad memory before. If you’re not using ECC or thoroughly testing your modules, there is no telling whether the odd crash or corruption here and there was actually caused by memory errors. Memory problems in consumer PCs are probably far more common than we know.
The Unraveling
My new workstation build was rock solid for the first couple of months. I was absolutely thrilled. Every once in a while Chromium crashed. That was a known issue with the snap packaging on Ubuntu: because I don’t reboot for weeks or months, or even close Chromium for that matter, Snap would eventually force-update the Chromium binaries while it was running and it would crash. Fast forward a month and my root btrfs file system detected metadata corruption and forced itself read-only. If you don’t reboot in this situation all kinds of things start crashing. Then it happened again in 3 weeks, then 2 weeks. I occasionally started getting crashes in Chromium when there was no Snap update. Then the Rust compiler started randomly segfaulting; rerunning it with no changes would build successfully. Panic was really starting to set in. This workstation is critical to my day job and the issues were starting to grate on my productivity. Which of a million things could be wrong? Was it software, hardware, or firmware?
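For anyone hitting something similar: btrfs keeps per-device error counters, and the kernel log records the forced read-only flip. A quick look at both is roughly this (nothing here is specific to my setup):

```sh
# Per-device error counters (write/read/flush/corruption/generation errors).
sudo btrfs device stats /

# Kernel messages around any checksum failures or the read-only transition.
sudo dmesg | grep -iE 'btrfs|corrupt'
```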
I’ve used btrfs on my NAS (with ECC memory) since 2015, and a bit longer on my workstation. I’ve had great success with it. In my NAS I use it in a RAID-1 (mirror) configuration with snapshot send/receive to send backups to additional disks. My workstation sends snapshot backups to the NAS multiple times per day. I’ve never lost data with btrfs. The file system detecting corruption and going read-only stood out to me as the most obvious ‘this is really bad’ thing, but I wasn’t really ready to blame btrfs. I had seen multiple rants from Louis Rossmann about failing Samsung 970 EVOs, and I was using the Samsung 970 EVO Plus. badblocks didn’t detect problems, though the NVMe drive self-test would hang indefinitely. While btrfs scrubs were still coming back clean, the last day’s worth of btrfs sends to my NAS had failed. btrfs send fails upon checksum errors, which is good in that it won’t send corrupted data to your backups. But why were scrubs (which also check the checksums) passing? I ordered a Crucial P5 SSD to replace the Samsung, despite low confidence that this was the issue.
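The backup workflow is the standard btrfs snapshot send/receive pattern; here is a minimal sketch, with hypothetical subvolume, snapshot, and NAS paths:

```sh
# Take a read-only snapshot of the subvolume being backed up.
btrfs subvolume snapshot -r /home /snapshots/home-2021-02-01

# Send it incrementally (relative to the previous snapshot the NAS already
# holds) and receive it on the NAS over SSH.
btrfs send -p /snapshots/home-2021-01-31 /snapshots/home-2021-02-01 \
  | ssh nas "sudo btrfs receive /backups/workstation"
```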
I wasn’t able to btrfs send from the Samsung to the Crucial due to the corruption being detected. Unfortunately btrfs doesn’t tell you which files are corrupt. There is a trick though: you can cat all files on the system to /dev/null until you find the files that cause cat to error out on read.
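A minimal sketch of that trick; the find flags are just one way to scope it, and cat reports an error for every file it cannot read:

```sh
# Read every regular file on this filesystem and throw the bytes away; any
# file whose checksum verification fails shows up as "Input/output error".
sudo find / -xdev -type f -exec cat {} + > /dev/null
```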
The corrupt files were not important, so I deleted them and was then able to send my subvolumes over to the new disk. I ran a scrub on the Crucial and everything looked fine. Success! After some reconfiguration of /etc/fstab and reinstalling GRUB on the new drive, I was able to reboot into the Crucial like nothing ever happened. The first regular send to my NAS worked, and I felt relief. However, the next morning I came in to an alert that another send had failed. Dread! I scrubbed immediately and it detected no corruption. Now I’m ready to blame btrfs! btrfs send is detecting corruption but btrfs scrub is not. At some point through all this I had upgraded from the GA 5.4 kernel to the HWE 5.8 kernel. Maybe it’s a 5.8 issue? I booted off a live USB drive using the 5.4 kernel and ran btrfs check.
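For context, those two tools work differently: scrub verifies data and metadata checksums on the mounted filesystem, while check walks the metadata of an unmounted filesystem. Roughly (the device name is a placeholder):

```sh
# Online: scrub the mounted filesystem in the foreground and report errors.
sudo btrfs scrub start -B /

# Offline (e.g. from the live USB, filesystem unmounted): read-only check.
sudo btrfs check --readonly /dev/nvme0n1p2
```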
So much corruption! On both the Samsung and the Crucial. I ran hashdeep to MD5-sum all the files on a backup snapshot on the NAS and compared them against a snapshot that was still on the Samsung. Everything matched! ಠ_ಠ
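A sketch of one way to do that comparison with hashdeep (the snapshot paths are hypothetical):

```sh
# Build comparable MD5 manifests of both copies of the snapshot; running from
# inside each snapshot (with -l for relative paths) keeps the recorded paths
# identical. hashdeep's header lines are stripped so diff only sees entries.
(cd /backups/workstation/home-2021-02-01 && hashdeep -c md5 -r -l .) | grep -v '^[#%]' | sort > /tmp/nas.md5
(cd /snapshots/home-2021-02-01 && hashdeep -c md5 -r -l .) | grep -v '^[#%]' | sort > /tmp/samsung.md5

# Any output here means a file's contents diverged between the two copies.
diff /tmp/nas.md5 /tmp/samsung.md5
```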
So btrfs send and btrfs check detect corruption, but hashdeep and btrfs scrub do not. Burn it all down, btrfs sucks. Since I had all my data off the Samsung, I did the usually risky move and ran btrfs check --repair --init-csum-tree. If the files are good and the checksums are bad, let’s just make new checksums, WCGW! It barfed errors onto the screen for 10 minutes and the file system was rendered completely corrupt. Cool.
Memtest86
I turned my attention back to the Crucial. I had previously run btrfs check on it and remembered seeing >5 checksum errors. At this point I was contemplating my life choices. Maybe I should buy a Dell (dude), run Windows, and back up to a WD My Book with the included proprietary software. I ran btrfs check again, just grasping at straws. There were <5 checksum errors. ಠ_ಠ Wait a minute. I ran btrfs check probably 5 more times. Each time I got a different number of checksum errors. One time I got no errors! While I thought it was conceivable that an SSD could malfunction in such a way that it returns random corruption, that was too improbable for 2 completely different SSDs. It had to be something upstream of the SSDs. Given all of my good experience with btrfs, and having tried 2 different kernel versions (1 being an LTS kernel), it also seemed improbable that something was wrong with the software. The thought of ordering new memory, a new motherboard, and a new CPU?? to swap in trial-and-error style was just dreadful. The memory seemed like the most likely culprit; I’ve had a bad motherboard before and the symptoms are typically more catastrophic. I’ve known of memtest86 forever, but I had never actually used it. I threw it on a USB drive and lo and behold…
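Getting MemTest86 onto a USB drive is just writing the image it ships with; the file name and target device below are assumptions, so double-check both before running it:

```sh
# Write the MemTest86 USB image to a flash drive. /dev/sdX is a placeholder --
# writing to the wrong device will destroy its contents.
unzip memtest86-usb.zip
sudo dd if=memtest86-usb.img of=/dev/sdX bs=4M status=progress conv=fsync
```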
NEMIX?
With the benefit of hindsight, the memory being the culprit seems somewhat obvious. However, up until this point it had just been a random assortment of seemingly unrelated problems, and even this memtest result was not a smoking gun. There could still have been a memory controller issue (in the CPU) or an issue with the motherboard; only swapping parts would tell. I started with the memory, and this time I was determined to use only ECC memory. I searched all the typical channels and found a single vendor in the US with 3200MHz ECC UDIMMs available to order in January 2021: NEMIX. There were few mentions of this vendor anywhere in the wild, so I ordered one pair as a test run. A couple of weeks later Linus Tech Tips made a video about ECC on Ryzen and they, not so coincidentally, bought the exact same DIMMs. Kingston actually had 32GiB DIMMs for sale direct on their website, but I was concerned about their compatibility with Ryzen, which supports 64GiB total. At the time of this writing (February 2021) they have 16GiB (and not 32GiB) DIMMs available for direct order. I would have purchased those if they had been available. If you’re reading this on a new Ryzen system without ECC, go get them now!
The first pair of 16GiB DIMMs worked perfectly and passed a full multi-hour memtest run. They identified as Kingston via SPD and used Micron DRAM chips. It’s interesting that Micron’s own Crucial DIMMs with these chips are out of stock everywhere. I ordered a second kit for a total of four 16GiB DIMMs. This second kit would not POST, even on its own. It also used Micron DRAM chips, but despite the NEMIX label saying 3200MHz, the DRAM chips on these DIMMs were rated for 2666MHz.
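You can see what a module claims about itself without pulling it out of the machine; dmidecode dumps the firmware's view of each DIMM (manufacturer, part number, rated speed), which is one quick way to compare against the label:

```sh
# List each installed module's reported manufacturer, part number, and speed.
sudo dmidecode --type memory | grep -E 'Manufacturer|Part Number|Speed'
```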
You can look up the exact DRAM chip part number from the code printed on the chip (the codes for my three kits are in the table below), and the part number can be further decoded using the datasheet. Were the DIMMs honestly mislabeled, or is there some funny business going on? Had they POSTed and just run at the lower speed, I’d believe someone simply slapped the wrong label on the DIMM. But if there were funny business going on, one would think they’d make some effort to hide the chip identification. In any event, NEMIX replaced the second kit with no fuss. The third kit came with 3200MHz modules and an SPD that identified as PANRAM.
Kit | Code | Micron Part | Speed Rating
---|---|---|---
First | D9WFL | MT40A1G8SA-062E:E | 3200MHz
Second | D9VHP | MT40A1G8SA-075:H | 2666MHz
Third | Z9ZDQ | MT40A1G8JC-062E ES:P | 3200MHz
There isn’t much information about NEMIX out there. I’m not sure how they come into possession of the rarest DRAM chips in the world with a random assortment of integrators’ SPD EEPROMs. They seem to be legit, but it’s still a good idea to QC your purchase with MemTest86 and check the chips!
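And once ECC DIMMs are installed, it’s worth confirming the platform is actually using them. On Linux, the EDAC subsystem is one way to check; the exact module and output vary by CPU and kernel, so treat this as a rough sketch:

```sh
# If ECC is active, an EDAC memory controller driver (amd64_edac on Ryzen)
# should be loaded and at least one mc* directory should exist.
sudo dmesg | grep -i edac
ls /sys/devices/system/edac/mc/
```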