MESSAGE
Date:    2017-01-19
From:    Rick Moen
Subject: [Hangout-NYLXS] RAM and RAM-testing
----- Forwarded message from Rick Moen -----
Date: Thu, 19 Jan 2017 13:02:03 -0800
From: Rick Moen
To: Ruben Safir
Subject: Re: (forw) [conspire] Old hardware, ridiculously old hardware: free RAM for you
Quoting Ruben Safir (ruben-at-mrbrklyn.com):
> On 01/19/2017 02:46 AM, Rick Moen wrote:
> > through memtest86.  Overnight, the machine hard-froze in that
> > RAM-checker, so hard that even the keyboard's NumLock key didn't
> > toggle the LED.
>
> Oh god.  It is painful to even think about running that test.  I don't
> think there is a memtest86 for 64 bit systems.  Suse stopped including
> it on disks because of some licensing issues with a new version.
I vaguely recall that there were at the time, and perhaps still are, two different forks of the memtest86 codebase (maybe memtest86 and memtest86+), and that there were licensing issues with one but not the other. The bigger problem, though, in my opinion, is that it's just not good enough: it finds gross errors but very often completely misses damage bad enough to cause NMIs, segfaults, system freezes, and spontaneous reboots. By contrast, iterative parallel kernel compiles ('make -j N', for a value of N large enough to exercise all of one's RAM) are a completely reliable test -- in the sense that if you run them overnight and there's no system freeze or short uptime, you know you do _not_ have bad RAM.
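A rough sketch of what such an overnight run can look like -- my commands, not Rick's, and the 512 MB-per-job figure is an assumption you would tune for your compiler and kernel tree:

```shell
# Size 'make -j N' so that N parallel compile jobs together touch all of RAM.
# Assumption (mine): each gcc job peaks somewhere near 512 MB; tune to taste.
mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
jobs=$(( mem_kb / (512 * 1024) + 1 ))
echo "suggested: make -j$jobs"

# Overnight loop (run inside a kernel source tree); a freeze, spontaneous
# reboot, or internal-compiler-error here points at flaky hardware:
#   while make clean && make -j"$jobs"; do :; done
```

The point of the loop is repetition: a single clean build proves little, but hours of them keep every bank of RAM under churn.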
If you _did_ have a system freeze during the night, or a short uptime, you now have a separate problem: isolating the fault to the particular hardware component responsible. It could be a single RAM stick, or multiple RAM sticks. It could be a motherboard circuitry fault, RAM sticks not properly seated in their sockets, or corroded pins. It could even, in theory, be a marginal system PSU, a dodgy expansion board, a bad RAM socket, or such a board not properly connecting in its socket. Isolating the root cause requires logic and part removal/substitution. For example, I found the two bad 512MB ECC sticks by moving sticks to different sockets (which ruled out bad sockets), and by running with some sticks removed until I had a combination with which iterative parallel kernel compiles could proceed without freezes or spontaneous reboots.
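Before pulling any hardware, one quick (and merely illustrative) software-side first step is to grep the kernel log for the failure signatures mentioned above:

```shell
# Triage: look for machine checks, NMIs, and segfaults in the kernel log.
# (dmesg may need root on systems with kernel.dmesg_restrict=1, hence the
# 2>/dev/null and the fallback message.)
out=$(dmesg 2>/dev/null | grep -iE 'mce|nmi|machine check|segfault|hardware error')
if [ -n "$out" ]; then
    echo "$out"
else
    echo "kernel log looks clean (or unreadable as this user)"
fi
```

A clean log doesn't exonerate the RAM, of course -- a hard freeze usually happens too fast to be logged -- but repeated MCE lines naming a memory address are a strong hint.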
It surprised me that there were _two_ bad sticks of RAM. Ordinarily, you would distrust a coincidence like that. However, I remembered that I got them for free because they had been data-center pulls: someone had probably yanked them from production servers because both were suspect, and it turns out the suspicion was well-founded. Fortunately, I still had the pair of 128MB ECC PC100 SDRAM sticks, and bought a substitute pair of 512MB ECC PC100 SDRAM sticks immediately after this testing. Throwing away what was at the time valuable RAM was painful, but not as painful as an unstable server.
As it turns out, I really should have remembered, back in 2006, to look in the ECC event log in the Intel L440GX+ 'Lancewood' motherboard's BIOS: IIRC, it would have told me much more quickly which stick in which RAM socket was producing errors. Part of my point in my 2017 (current) posting is that looking in the Supermicro's BIOS error log _did_ pinpoint the dodgy 4GB RAM stick -- after I remembered the lessons of ten years ago.
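On Linux there's also a software path to the same per-slot information: the kernel's EDAC subsystem exports corrected/uncorrected ECC error counters under sysfs. A sketch, with the caveat that the csrow layout shown is the older EDAC interface (newer kernels expose dimm*/ directories instead), and that nothing appears at all unless an EDAC driver has loaded for your memory controller:

```shell
# Print per-csrow corrected (CE) and uncorrected (UE) ECC error counts,
# if the EDAC driver for this memory controller is loaded.
edac=/sys/devices/system/edac/mc
if [ -d "$edac" ]; then
    grep -H . "$edac"/mc*/csrow*/[cu]e_count 2>/dev/null \
        || echo "EDAC present but no csrow counters (maybe dimm*/ layout)"
else
    echo "no EDAC driver loaded"
fi
```

A steadily climbing ce_count on one csrow is the sysfs equivalent of the BIOS log fingering one socket.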
All of these things are particularly important to understand for those of us who accept donated hardware -- because as often as not, the hardware we receive was set aside because someone suspected there might be a problem but lacked the time to diagnose which part was causing it and what to do about it.
----- End forwarded message -----
_______________________________________________
hangout mailing list
hangout-at-nylxs.com
http://www.nylxs.com/