Discussion:
Logs and dumps for kernel panics to collect and analyze?
Ant
2010-03-06 07:12:30 UTC
Hello.

Is /var/log/syslog the only place where Linux keeps records of kernel
(v2.6.30 and v2.6.32) panics? dmesg and /var/log/messages don't seem
to show anything about the crashes, unless I am misreading them. I am
trying to figure out a rare, random kernel panic on my old
Debian box.

I know it's not X because I exited it, logged out of bash, went to the
full-screen text console's login screen (I boot Debian into text
mode, log into bash, and use the startx command to start X), and saw a
bunch of data (e.g., memory addresses and codes) on my screen from the
kernel crash. However, the dump was too long and my computer was frozen
with two blinking PS/2 keyboard lights (Caps Lock and Scroll Lock), so I
couldn't scroll up or copy and paste.

I poked around in my Debian install and on the Web. I read that kernel
panic errors/data can be found in /var/log/syslog (dmesg didn't show me
anything related to kernel panics that I could find), like:

# cat /var/log/syslog
...
Mar 4 23:12:07 foobar smartd[2647]: Device: /dev/hda, SMART Usage
Attribute: 194 Temperature_Celsius changed from 30 to 31
...
Mar 5 15:11:31 foobar smartd[2610]: Device: /dev/hda, SMART Prefailure
Attribute: 1 Raw_Read_Error_Rate changed from 58 to 59
Mar 5 15:11:31 foobar smartd[2610]: Device: /dev/hda, SMART Usage
Attribute: 195 Hardware_ECC_Recovered changed from 58 to 59
Mar 5 15:15:01 foobar /USR/SBIN/CRON[8815]: (root) CMD (command -v
debian-sa1 > /dev/null && debian-sa1 1 1)
Mar 5 15:17:01 foobar /USR/SBIN/CRON[11199]: (root) CMD ( cd / &&
run-parts --report /etc/cron.hourly)
Mar 5 15:25:01 foobar /USR/SBIN/CRON[20721]: (root) CMD (command -v
debian-sa1 > /dev/null && debian-sa1 1 1)
Mar 5 15:35:01 foobar /USR/SBIN/CRON[32588]: (root) CMD (command -v
debian-sa1 > /dev/null && debian-sa1 1 1)
Mar 5 15:45:01 foobar /USR/SBIN/CRON[12129]: (root) CMD (command -v
debian-sa1 > /dev/null && debian-sa1 1 1)
Mar 5 15:55:01 foobar /USR/SBIN/CRON[23947]: (root) CMD (command -v
debian-sa1 > /dev/null && debian-sa1 1 1)
< rebooted my crashed PC from its kernel panic >
Mar 5 21:05:19 foobar syslogd 1.5.0#5: restart.
...

I couldn't find anything similar before an earlier crash, like this (so I
don't think smartctl with /dev/hda is it?):
...
Mar 5 05:17:01 foobar /USR/SBIN/CRON[26833]: (root) CMD ( cd / &&
run-parts --report /etc/cron.hourly)
Mar 5 05:25:01 foobar /USR/SBIN/CRON[29514]: (root) CMD (command -v
debian-sa1 > /dev/null && debian-sa1 1 1)
Mar 5 05:35:01 foobar /USR/SBIN/CRON[372]: (root) CMD (command -v
debian-sa1 > /dev/null && debian-sa1 1 1)
Mar 5 05:45:01 foobar /USR/SBIN/CRON[3772]: (root) CMD (command -v
debian-sa1 > /dev/null && debian-sa1 1 1)
Mar 5 05:55:01 foobar /USR/SBIN/CRON[7160]: (root) CMD (command -v
debian-sa1 > /dev/null && debian-sa1 1 1)
Mar 5 06:41:19 foobar syslogd 1.5.0#5: restart.
...
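
One way to see how long the box was silent before each syslogd restart is
to measure the gap between the restart line and the last entry before it.
Below is a minimal Python sketch of that idea. The function names and the
year argument are assumptions for illustration (classic syslog timestamps
carry no year), and a restart line by itself does not prove a crash, since
syslogd can also be restarted without a reboot (for example by log
rotation).

```python
import re
from datetime import datetime

# Matches a classic syslog timestamp such as "Mar 5 15:55:01" at line start.
STAMP = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s")


def parse_stamp(line, year):
    """Return the line's timestamp, or None for wrapped continuation lines.

    Month names are parsed with %b, which assumes an English/C locale.
    """
    match = STAMP.match(line)
    if match is None:
        return None
    return datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")


def gaps_before_restart(lines, year=2010):
    """List (gap, last_line_before, restart_line) for each syslogd restart."""
    results = []
    prev_time = None
    prev_line = None
    for line in lines:
        when = parse_stamp(line, year)
        if when is None:
            continue  # no timestamp: part of a wrapped entry
        is_restart = "syslogd" in line and "restart" in line
        if is_restart and prev_time is not None:
            results.append((when - prev_time, prev_line.rstrip(), line.rstrip()))
        prev_time, prev_line = when, line
    return results
```

A long gap (hours with no cron entries) fits a frozen box better than a
gap of a few minutes, but that is only a hint, not proof.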

I saw LKCD (http://lkcd.sourceforge.net/ and
http://sourceforge.net/projects/lkcd/files/), but it seems to be
outdated? I also couldn't find a Debian package for it, so I don't know
if I should even try it to get more data.

And yes, I already tried memtest86+ v4.00, and it reported no errors
after six hours of its default tests. I will try it again later just
in case.

Thank you in advance. :)
--
"If I find one beer can in that car, it's over!" --Red; "And no donuts
either! Ants!" --Kitty from That '70s Show pilot
/\___/\
/ /\ /\ \ Phil./Ant @ http://antfarm.ma.cx (Personal Web Site)
| |o o| | Ant's Quality Foraged Links: http://aqfl.net
\ _ / Nuke ANT from e-mail address: ***@earthlink.netANT
( ) or ***@zimage.com
Ant is currently not listening to any songs on his home computer.
Mark Hobley
2010-03-06 10:08:02 UTC
Post by Ant
Is /var/log/syslog the only place where Linux keeps records of kernel
(v2.6.30 and v2.6.32) panics? dmesg and /var/log/messages don't seem
to show anything about the crashes, unless I am misreading them. I am
trying to figure out a rare, random kernel panic on my old
Debian box.
Is it a panic, or is it a crash?
Post by Ant
I know it's not X because I exited it, logged out of bash, went to the
full-screen text console's login screen (I boot Debian into text
mode, log into bash, and use the startx command to start X), and saw a
bunch of data (e.g., memory addresses and codes) on my screen from the
kernel crash.
What type of computer are you using? What type of CPU does it have?
There are some issues with invalid instructions being embedded into the
kernel, which cause a similar problem on IA32-compatible machines. (A 486
build contains non-IA32 instructions.)

Write down what you see, and lose the rest. The information does not get
logged.

Mark.
--
Mark Hobley
Linux User: #370818 http://markhobley.yi.org/
Ant
2010-03-06 15:33:12 UTC
Post by Mark Hobley
Post by Ant
Is /var/log/syslog the only place where Linux keeps records of kernel
(v2.6.30 and v2.6.32) panics? dmesg and /var/log/messages don't seem
to show anything about the crashes, unless I am misreading them. I am
trying to figure out a rare, random kernel panic on my old
Debian box.
Is it a panic, or is it a crash?
A panic, since my box froze, required a reboot, and had flashing keyboard
lights.
Post by Mark Hobley
Post by Ant
I know it's not X because I exited it, logged out of bash, went to the
full-screen text console's login screen (I boot Debian into text
mode, log into bash, and use the startx command to start X), and saw a
bunch of data (e.g., memory addresses and codes) on my screen from the
kernel crash.
What type of computer are you using? What type of CPU does it have?
There are some issues with invalid instructions being embedded into the
kernel, which cause a similar problem on IA32-compatible machines. (A 486
build contains non-IA32 instructions.)
http://alpha.zimage.com/~ant/antfarm/about/computers.txt for my
secondary computer specifications. I am currently using 2.6.32-trunk-686
#1 SMP Sun Jan 10 06:32:16 UTC 2010 i686 GNU/Linux. I was using 2.6.30
too and it had the same problem.
Post by Mark Hobley
Write down what you see, and lose the rest. The information does not get
logged.
Darn. That's a lot of stuff to write down.
--
"To the ant, a few drops of dew is a flood." --Iranian
/\___/\
/ /\ /\ \ Phil./Ant @ http://antfarm.ma.cx (Personal Web Site)
| |o o| | Ant's Quality Foraged Links: http://aqfl.net
\ _ / Nuke ANT from e-mail address: ***@earthlink.netANT
( ) or ***@zimage.com
Ant is currently not listening to any songs on his home computer.
Ant
2010-03-06 17:05:12 UTC
Uh oh. I just discovered mcelog and something new and scary from it in
/var/log/syslog:

Mar 6 01:19:37 foobar kernel: [15299.988025] Machine check events logged
Mar 6 01:42:07 foobar kernel: [16649.989021] Machine check events logged
Mar 6 02:05:19 foobar -- MARK --
Mar 6 02:19:37 foobar kernel: [18899.989024] Machine check events logged
Mar 6 02:37:07 foobar kernel: [19949.988027] Machine check events logged
Mar 6 03:05:19 foobar -- MARK --
Mar 6 03:24:37 foobar kernel: [22799.989023] Machine check events logged
Mar 6 03:45:19 foobar -- MARK --
Mar 6 04:05:19 foobar -- MARK --
Mar 6 04:25:19 foobar -- MARK --
Mar 6 04:45:19 foobar -- MARK --
Mar 6 05:02:07 foobar kernel: [28649.989023] Machine check events logged
Mar 6 05:25:19 foobar -- MARK --
Mar 6 05:45:19 foobar -- MARK --
Mar 6 06:05:19 foobar -- MARK --
Mar 6 06:24:37 foobar kernel: [33599.989027] Machine check events logged
Mar 6 06:33:13 foobar syslogd 1.5.0#5: restart.
Mar 6 06:45:19 foobar -- MARK --
Mar 6 07:05:19 foobar -- MARK --
Mar 6 07:25:19 foobar -- MARK --
Mar 6 07:45:19 foobar -- MARK --
Mar 6 08:05:19 foobar -- MARK --
Mar 6 08:17:07 foobar kernel: [40349.989022] Machine check events logged
Mar 6 08:24:37 foobar kernel: [40799.988036] Machine check events logged
Mar 6 08:45:19 foobar -- MARK --
Mar 6 08:52:09 foobar mcelog: HARDWARE ERROR. This is *NOT* a software
problem!
Mar 6 08:52:09 foobar mcelog: Please contact your hardware vendor
Mar 6 08:52:09 foobar mcelog: MCE 0
Mar 6 08:52:09 foobar mcelog: CPU 1 1 instruction cache
Mar 6 08:52:09 foobar mcelog: ADDR c11b6ff0
Mar 6 08:52:09 foobar mcelog: TIME 1267894329 Sat Mar 6 08:52:09 2010
Mar 6 08:52:09 foobar mcelog: TLB parity error in virtual array
Mar 6 08:52:09 foobar mcelog: TLB error 'instruction transaction,
level 1'
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
Mar 6 08:52:09 foobar mcelog: CPUID Vendor AMD Family 15 Model 43
Mar 6 08:52:09 foobar mcelog: HARDWARE ERROR. This is *NOT* a software
problem!
Mar 6 08:52:09 foobar mcelog: Please contact your hardware vendor
Mar 6 08:52:09 foobar mcelog: MCE 1
Mar 6 08:52:09 foobar mcelog: CPU 1 1 instruction cache
Mar 6 08:52:09 foobar mcelog: ADDR c11b6ff0
Mar 6 08:52:09 foobar mcelog: TIME 1267894329 Sat Mar 6 08:52:09 2010
Mar 6 08:52:09 foobar mcelog: TLB parity error in virtual array
Mar 6 08:52:09 foobar mcelog: TLB error 'instruction transaction,
level 1'
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
Mar 6 08:52:09 foobar mcelog: CPUID Vendor AMD Family 15 Model 43
Mar 6 08:52:09 foobar mcelog: HARDWARE ERROR. This is *NOT* a software
problem!
Mar 6 08:52:09 foobar mcelog: Please contact your hardware vendor
Mar 6 08:52:09 foobar mcelog: MCE 2
Mar 6 08:52:09 foobar mcelog: CPU 1 1 instruction cache
Mar 6 08:52:09 foobar mcelog: ADDR c11b6ff0
Mar 6 08:52:09 foobar mcelog: TIME 1267894329 Sat Mar 6 08:52:09 2010
Mar 6 08:52:09 foobar mcelog: TLB parity error in virtual array
Mar 6 08:52:09 foobar mcelog: TLB error 'instruction transaction,
level 1'
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
Mar 6 08:52:09 foobar mcelog: CPUID Vendor AMD Family 15 Model 43
Mar 6 08:52:09 foobar mcelog: HARDWARE ERROR. This is *NOT* a software
problem!
Mar 6 08:52:09 foobar mcelog: Please contact your hardware vendor
Mar 6 08:52:09 foobar mcelog: MCE 3
Mar 6 08:52:09 foobar mcelog: CPU 1 1 instruction cache
Mar 6 08:52:09 foobar mcelog: ADDR c11b6ff0
Mar 6 08:52:09 foobar mcelog: TIME 1267894329 Sat Mar 6 08:52:09 2010
Mar 6 08:52:09 foobar mcelog: TLB parity error in virtual array
Mar 6 08:52:09 foobar mcelog: TLB error 'instruction transaction,
level 1'
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
Mar 6 08:52:09 foobar mcelog: CPUID Vendor AMD Family 15 Model 43
Mar 6 08:52:09 foobar mcelog: HARDWARE ERROR. This is *NOT* a software
problem!
Mar 6 08:52:09 foobar mcelog: Please contact your hardware vendor
Mar 6 08:52:09 foobar mcelog: MCE 4
Mar 6 08:52:09 foobar mcelog: CPU 1 1 instruction cache
Mar 6 08:52:09 foobar mcelog: ADDR c11b6ff0
Mar 6 08:52:09 foobar mcelog: TIME 1267894329 Sat Mar 6 08:52:09 2010
Mar 6 08:52:09 foobar mcelog: TLB parity error in virtual array
Mar 6 08:52:09 foobar mcelog: TLB error 'instruction transaction,
level 1'
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
Mar 6 08:52:09 foobar mcelog: CPUID Vendor AMD Family 15 Model 43
Mar 6 08:52:09 foobar mcelog: HARDWARE ERROR. This is *NOT* a software
problem!
Mar 6 08:52:09 foobar mcelog: Please contact your hardware vendor
Mar 6 08:52:09 foobar mcelog: MCE 5
Mar 6 08:52:09 foobar mcelog: CPU 1 1 instruction cache
Mar 6 08:52:09 foobar mcelog: ADDR c11b6ff0
Mar 6 08:52:09 foobar mcelog: TIME 1267894329 Sat Mar 6 08:52:09 2010
Mar 6 08:52:09 foobar mcelog: TLB parity error in virtual array
Mar 6 08:52:09 foobar mcelog: TLB error 'instruction transaction,
level 1'
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
Mar 6 08:52:09 foobar mcelog: CPUID Vendor AMD Family 15 Model 43
Mar 6 08:52:09 foobar mcelog: HARDWARE ERROR. This is *NOT* a software
problem!
Mar 6 08:52:09 foobar mcelog: Please contact your hardware vendor
Mar 6 08:52:09 foobar mcelog: MCE 6
Mar 6 08:52:09 foobar mcelog: CPU 1 1 instruction cache
Mar 6 08:52:09 foobar mcelog: ADDR c11b6ff0
Mar 6 08:52:09 foobar mcelog: TIME 1267894329 Sat Mar 6 08:52:09 2010
Mar 6 08:52:09 foobar mcelog: TLB parity error in virtual array
Mar 6 08:52:09 foobar mcelog: TLB error 'instruction transaction,
level 1'
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
Mar 6 08:52:09 foobar mcelog: CPUID Vendor AMD Family 15 Model 43
Mar 6 08:52:09 foobar mcelog: HARDWARE ERROR. This is *NOT* a software
problem!
Mar 6 08:52:09 foobar mcelog: Please contact your hardware vendor
Mar 6 08:52:09 foobar mcelog: MCE 7
Mar 6 08:52:09 foobar mcelog: CPU 1 1 instruction cache
Mar 6 08:52:09 foobar mcelog: ADDR c11b6ff0
Mar 6 08:52:09 foobar mcelog: TIME 1267894329 Sat Mar 6 08:52:09 2010
Mar 6 08:52:09 foobar mcelog: TLB parity error in virtual array
Mar 6 08:52:09 foobar mcelog: TLB error 'instruction transaction,
level 1'
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
Mar 6 08:52:09 foobar mcelog: CPUID Vendor AMD Family 15 Model 43
Mar 6 08:52:09 foobar mcelog: HARDWARE ERROR. This is *NOT* a software
problem!
Mar 6 08:52:09 foobar mcelog: Please contact your hardware vendor
Mar 6 08:52:09 foobar mcelog: MCE 8
Mar 6 08:52:09 foobar mcelog: CPU 1 1 instruction cache
Mar 6 08:52:09 foobar mcelog: ADDR c11b6ff0
Mar 6 08:52:09 foobar mcelog: TIME 1267894329 Sat Mar 6 08:52:09 2010
Mar 6 08:52:09 foobar mcelog: TLB parity error in virtual array
Mar 6 08:52:09 foobar mcelog: TLB error 'instruction transaction,
level 1'
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
Mar 6 08:52:09 foobar mcelog: CPUID Vendor AMD Family 15 Model 43


What does that mean? Dying CPU (had it since 12/24/2006)? Maybe that's
why memtest86+ didn't find any problems last week.
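
Since the mcelog records above repeat, a quick way to check whether they
all describe the same fault is to group them by address and status. This
is a rough Python sketch with made-up function names; it only looks at the
"MCE n", ADDR and STATUS lines as they appear in the excerpt above and
ignores everything else.

```python
import re
from collections import Counter

# A new record starts at "mcelog: MCE <n>"; ADDR and STATUS follow it.
RECORD_START = re.compile(r"mcelog: MCE \d+")
FIELD = re.compile(r"mcelog: (ADDR|STATUS) (\S+)")


def summarize(lines):
    """Count how often each (ADDR, STATUS) pair occurs across MCE records."""
    records = []
    current = None
    for line in lines:
        if RECORD_START.search(line):
            current = {}
            records.append(current)
        elif current is not None:
            match = FIELD.search(line)
            if match:
                current[match.group(1)] = match.group(2)
    return Counter((r.get("ADDR"), r.get("STATUS")) for r in records)
```

Fed the excerpt above, this should report a single pair (ADDR c11b6ff0,
STATUS 9400000000010011) seen nine times, i.e. MCE 0 through MCE 8 differ
only in their index.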
--
"What is it going to be like in eternity with God? Frankly, the capacity
of our brains cannot handle the wonder and greatness of heaven. It would
be like trying to describe the Internet to an ant." --Rick Warren's
book, The Purpose Driven Life
/\___/\
/ /\ /\ \ Phil./Ant @ http://antfarm.ma.cx (Personal Web Site)
| |o o| | Ant's Quality Foraged Links: http://aqfl.net
\ _ / Nuke ANT from e-mail address: ***@earthlink.netANT
( ) or ***@zimage.com
Ant is currently not listening to any songs on his home computer.
Darren Salt
2010-03-06 22:13:12 UTC
I demand that Ant may or may not have written...
Post by Ant
Uh oh. I just discovered mcelog and something new and scary in its
[snip]
Post by Ant
Mar 6 08:24:37 foobar kernel: [40799.988036] Machine check events logged
Mar 6 08:45:19 foobar -- MARK --
Mar 6 08:52:09 foobar mcelog: HARDWARE ERROR. This is *NOT* a software
problem!
Mar 6 08:52:09 foobar mcelog: Please contact your hardware vendor
Mar 6 08:52:09 foobar mcelog: MCE 0
Mar 6 08:52:09 foobar mcelog: CPU 1 1 instruction cache
Mar 6 08:52:09 foobar mcelog: ADDR c11b6ff0
Mar 6 08:52:09 foobar mcelog: TIME 1267894329 Sat Mar 6 08:52:09 2010
Mar 6 08:52:09 foobar mcelog: TLB parity error in virtual array
Mar 6 08:52:09 foobar mcelog: TLB error 'instruction transaction,
level 1'
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
Mar 6 08:52:09 foobar mcelog: CPUID Vendor AMD Family 15 Model 43
[snip duplicate entries]

Ouch.
Post by Ant
What does that mean? Dying CPU (had it since 12/24/2006)?
12/12/2007? ;-)

(Hint: use ISO8601 date formats or use month names. Broken-endian dates can
all too easily cause error; fortunately, that one's unambiguous.)

Anyway, it does look like a fault in that CPU. I'd certainly be considering
replacing it, though due to your earlier mention of kernel panics, I wouldn't
rule out board problems either; are there any visible signs of hardware
problems (leaky/bulging capacitors etc.)? Checking the PSU is probably also
worthwhile.

(http://en.wikipedia.org/wiki/Translation_lookaside_buffer describes the
affected area of the CPU.)
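
For what it's worth, the STATUS value in that log can be picked apart by
hand. The sketch below is Python and assumes the usual x86 machine-check
status layout (flag bits 63 down to 57, and a TLB error code of the form
0000 0000 0001 TTLL in the low 16 bits); the function and table names are
made up, and the layout should be checked against AMD's documentation for
this CPU family before relying on it.

```python
# Assumed x86 machine-check status flag bits (verify against vendor docs).
FLAGS = {
    63: "VAL",    # record is valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was NOT corrected by hardware
    60: "EN",     # reporting enabled for this error
    59: "MISCV",  # MISC register holds valid data
    58: "ADDRV",  # ADDR register holds a valid address
    57: "PCC",    # processor context may be corrupt
}
TRANSACTION = {0: "instruction", 1: "data", 2: "generic"}
LEVEL = {0: "level 0", 1: "level 1", 2: "level 2", 3: "generic level"}


def decode_status(hex_text):
    """Decode the flag bits and, if present, the TLB error code."""
    value = int(hex_text, 16)
    flags = [name for bit, name in FLAGS.items() if (value >> bit) & 1]
    code = value & 0xFFFF
    tlb = None
    if (code & 0xFFF0) == 0x0010:  # 0000 0000 0001 TTLL marks a TLB error
        tlb = (TRANSACTION.get((code >> 2) & 3, "reserved"), LEVEL[code & 3])
    return {"flags": flags, "error_code": code, "tlb": tlb}
```

If the layout assumption holds, decode_status("9400000000010011") gives
the flags VAL, EN and ADDRV with UC clear, plus an instruction-side,
level 1 TLB error, which agrees with mcelog's own text. UC being clear
would mean the hardware reported the error as corrected, though a stream
of them still points at failing hardware.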
Post by Ant
Maybe that's why memtest86+ didn't find any problems last week.
That doesn't seem to be relevant.
(And that one hasn't happened yet.)
Post by Ant
Post by Ant
Is /var/log/syslog the only place where Linux keeps records of kernel
(v2.6.30 and v2.6.32) panics? dmesg and /var/log/messages don't seem to
show anything about the crashes, unless I am misreading them. I am trying
to figure out a rare, random kernel panic on my old Debian box.
http://www.mjmwired.net/kernel/Documentation/networking/netconsole.txt

That needs a second computer, but it will at least allow most panics to be
captured. (Exceptions include hard hangs, where there may be no panic which
can be reported, and problems which affect the network interface over which
the log is being sent.)
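
On the second computer, anything that listens for UDP datagrams will do as
the receiving end. Here is a minimal Python sketch of such a listener; the
function names are made up, and port 6666 is assumed because it is, as far
as I know, netconsole's default target port (the linked document has the
exact module parameter syntax for the sending side).

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket for incoming netconsole messages (port assumed)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_console(sock, max_messages, timeout=5.0):
    """Collect up to max_messages datagrams, stopping early on a timeout."""
    sock.settimeout(timeout)
    messages = []
    while len(messages) < max_messages:
        try:
            data, _sender = sock.recvfrom(4096)
        except socket.timeout:
            break  # nothing arrived within the timeout
        # Kernel text is normally ASCII; replace anything undecodable.
        messages.append(data.decode("utf-8", errors="replace"))
    return messages
```

In practice the returned text would be appended to a file on the second
machine, so that whatever the dying kernel manages to send survives the
reboot.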

[snip]
--
| Darren Salt | linux at youmustbejoking | nr. Ashington, | Doon
| using Debian GNU/Linux | or ds ,demon,co,uk | Northumberland | Army
| + They're after you...

I'd like to, but I'm going to count the bristles in my toothbrush.
Ant
2010-03-07 08:13:24 UTC
Post by Darren Salt
Post by Ant
Uh oh. I just discovered mcelog and something new and scary in its
[snip]
Post by Ant
Mar 6 08:24:37 foobar kernel: [40799.988036] Machine check events logged
Mar 6 08:45:19 foobar -- MARK --
Mar 6 08:52:09 foobar mcelog: HARDWARE ERROR. This is *NOT* a software
problem!
Mar 6 08:52:09 foobar mcelog: Please contact your hardware vendor
Mar 6 08:52:09 foobar mcelog: MCE 0
Mar 6 08:52:09 foobar mcelog: CPU 1 1 instruction cache
Mar 6 08:52:09 foobar mcelog: ADDR c11b6ff0
Mar 6 08:52:09 foobar mcelog: TIME 1267894329 Sat Mar 6 08:52:09 2010
Mar 6 08:52:09 foobar mcelog: TLB parity error in virtual array
Mar 6 08:52:09 foobar mcelog: TLB error 'instruction transaction,
level 1'
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
Mar 6 08:52:09 foobar mcelog: CPUID Vendor AMD Family 15 Model 43
[snip duplicate entries]
Ouch.
:(
Post by Darren Salt
Post by Ant
What does that mean? Dying CPU (had it since 12/24/2006)?
12/12/2007? ;-)
Eh?
Post by Darren Salt
(Hint: use ISO8601 date formats or use month names. Broken-endian dates can
all too easily cause error; fortunately, that one's unambiguous.)
I don't get it. :(
Post by Darren Salt
Anyway, it does look like a fault in that CPU. I'd certainly be considering
replacing it, though due to your earlier mention of kernel panics, I wouldn't
rule out board problems either; are there any visible signs of hardware
problems (leaky/bulging capacitors etc.)? Checking the PSU is probably also
worthwhile.
Hmmm, I just swapped my PSU because the old one (FSP650-80GLC PSU (650
watts) from 5/14/2007) died in 12/2009. I recall that a few days before,
something smelled like it was burning, but I couldn't figure out where it
came from since I had two desktops. I guess it was the PSU that went poof!

At the same time, my EVGA GeForce 8800 GT video card had to be RMA'ed
because it no longer worked: even with the new PSU, the box wouldn't boot
at all. After getting a refurbished video card back from the RMA, my
box was fine for a bit and then got kernel panics once in a while. Then
they slowly seemed to become more frequent. One day in February, I ran
memtest86+ v4.00 for about five hours and found lots of errors. My friend
and I narrowed it down to a 512 MB RAM module and were left with 2.5 GB
(still plenty for an old Linux workstation!). Oh, and we didn't see
anything burned, busted, etc.

It sounds like that PSU failure damaged a lot of my hardware. Argh! I
don't have the time and resources to build another box (guess I could do
a clean install with it too :P). :(
Post by Darren Salt
(http://en.wikipedia.org/wiki/Translation_lookaside_buffer describes the
affected area of the CPU.)
Hmm, I wonder if that 512 MB RAM that memtest86 detected as having errors
wasn't bad?
Post by Darren Salt
Post by Ant
Maybe that's why memtest86+ didn't find any problems last week.
That doesn't seem to be relevant.
Why do you say that? I am going to run it again soon to double check.
Post by Darren Salt
http://www.mjmwired.net/kernel/Documentation/networking/netconsole.txt
That needs a second computer, but it will at least allow most panics to be
captured. (Exceptions include hard hangs, where there may be no panic which
can be reported, and problems which affect the network interface over which
the log is being sent.)
Interesting. I wish Linux's kernel panics would log to a file, like
Windows' memory dumps from blue screens, so I could use a debugger to see
what is in the dumps.
--
"Left right left right we're army ants. We swarm we fight. We have no
home. We roam. We race. You're lucky if we miss your place." --Douglas
Florian (The Army Ants Poem)
/\___/\
/ /\ /\ \ Phil./Ant @ http://antfarm.ma.cx (Personal Web Site)
| |o o| | Ant's Quality Foraged Links: http://aqfl.net
\ _ / Nuke ANT from e-mail address: ***@earthlink.netANT
( ) or ***@zimage.com
Ant is currently not listening to any songs on his home computer.
Darren Salt
2010-03-07 13:52:51 UTC
I demand that Ant may or may not have written...
Post by Ant
Post by Darren Salt
Post by Ant
Uh oh. I just discovered mcelog and something new and scary in its
[snip]
Post by Ant
Mar 6 08:52:09 foobar mcelog: HARDWARE ERROR. This is *NOT* a software
problem!
Mar 6 08:52:09 foobar mcelog: Please contact your hardware vendor
Mar 6 08:52:09 foobar mcelog: MCE 0
Mar 6 08:52:09 foobar mcelog: CPU 1 1 instruction cache
Mar 6 08:52:09 foobar mcelog: ADDR c11b6ff0
Mar 6 08:52:09 foobar mcelog: TIME 1267894329 Sat Mar 6 08:52:09 2010
Mar 6 08:52:09 foobar mcelog: TLB parity error in virtual array
Mar 6 08:52:09 foobar mcelog: TLB error 'instruction transaction,
level 1'
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
Mar 6 08:52:09 foobar mcelog: CPUID Vendor AMD Family 15 Model 43
[snip duplicate entries]
Ouch.
:(
Post by Darren Salt
Post by Ant
What does that mean? Dying CPU (had it since 12/24/2006)?
12/12/2007? ;-)
Eh?
Normalisation in progress. ;-)
Post by Ant
Post by Darren Salt
(Hint: use ISO8601 date formats or use month names. Broken-endian dates
can all too easily cause error; fortunately, that one's unambiguous.)
I don't get it. :(
Well... today is 7/3/2010 or 3/7/2010, according to locale; it is better
represented as 2010-03-07.
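
To make the ambiguity concrete, here is a small Python sketch (the helper
names are made up) that tries both the day-first and the month-first
reading of a slash-separated date and prints each valid one in ISO 8601
form.

```python
from datetime import datetime


def to_iso(text, day_first):
    """Parse d/m/Y or m/d/Y text and return it as YYYY-MM-DD."""
    fmt = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    return datetime.strptime(text, fmt).date().isoformat()


def readings(text):
    """Return every valid ISO 8601 interpretation of an ambiguous date."""
    found = set()
    for day_first in (True, False):
        try:
            found.add(to_iso(text, day_first))
        except ValueError:
            pass  # e.g. there is no month 24, so that reading is dropped
    return sorted(found)
```

readings("3/7/2010") yields two candidates, 2010-03-07 and 2010-07-03,
while readings("12/24/2006") yields only 2006-12-24, which is why that
one was unambiguous.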
Post by Ant
Post by Darren Salt
Anyway, it does look like a fault in that CPU. I'd certainly be
considering replacing it, though due to your earlier mention of kernel
panics, I wouldn't rule out board problems either; are there any visible
signs of hardware problems (leaky/bulging capacitors etc.)? Checking the
PSU is probably also worthwhile.
Hmmm, I just swapped my PSU because the old one (FSP650-80GLC PSU (650
watts) from 5/14/2007) died in 12/2009. I recall that a few days before,
something smelled like it was burning, but I couldn't figure out where it
came from since I had two desktops. I guess it was the PSU that went poof!
I've had that happen once here. Advice given was to replace the whole lot
because of possible damage to components, and I can see where that's coming
from: brief over-voltage or over-current. (Would anybody who knows more about
your typical switched-mode PSU care to comment?)
Post by Ant
At the same time, my EVGA GeForce 8800 GT video card had to be RMA'ed
because it no longer worked: even with the new PSU, the box wouldn't boot
at all.
Dead card, due to The Way of the Exploding PSU?
Post by Ant
After getting a refurbished video card back from the RMA, my box was fine
for a bit and then got kernel panics once in a while. Then they slowly
seemed to become more frequent. One day in February, I ran memtest86+
v4.00 for about five hours and found lots of errors. My friend and I
narrowed it down to a 512 MB RAM module
I've seen bad RAM before. On visual inspection, it looks exactly like good
RAM.
Post by Ant
and were left with 2.5 GB (still plenty for an old
Linux workstation!). Oh, and we didn't see anything burned, busted, etc.
That's the thing. It might not *look* damaged...
Post by Ant
It sounds like that PSU failure damaged a lot of my hardware. Argh! I
don't have the time and resources to build another box
Yet you have the time to respond here. ;-)
Post by Ant
(guess I could do a clean install with it too :P). :(
Hmm...
Post by Ant
Post by Darren Salt
(http://en.wikipedia.org/wiki/Translation_lookaside_buffer describes the
affected area of the CPU.)
Hmm, I wonder if that 512 MB RAM that memtest86 detected as having errors
wasn't bad?
Chances are that memtest86 was right. (I can see how bad memory might cause
incorrect TLB entries, but not parity errors.)
Post by Ant
Post by Darren Salt
Post by Ant
Maybe that's why memtest86+ didn't find any problems last week.
That doesn't seem to be relevant.
Why do you say that? I am going to run it again soon to double check.
It's testing the memory, and (probably) isn't making use of logical
addressing. If it isn't, then it's not going to be making use of the TLB, so
it's not going to cause MCEs. (Or perhaps they *were* happening, but
memtest86+ was ignoring them.)
Post by Ant
Post by Darren Salt
http://www.mjmwired.net/kernel/Documentation/networking/netconsole.txt
That needs a second computer, but it will at least allow most panics to be
captured. (Exceptions include hard hangs, where there may be no panic
which can be reported, and problems which affect the network interface
over which the log is being sent.)
Interesting. I wish Linux's kernel panics would log to a file, like
Windows' memory dumps from blue screens, so I could use a debugger to see
what is in the dumps.
Logging to a file isn't an option (at this point, things are probably too far
gone for this to be practical); but they could, perhaps, be stored in some
non-volatile memory. (You'd need at least 16K for this, ideally 64K or more;
and I don't think that there's enough in your typical PC RTC.)
--
| Darren Salt | linux at youmustbejoking | nr. Ashington, | Doon
| using Debian GNU/Linux | or ds ,demon,co,uk | Northumberland | Army
| + This comment has been censored.

Would ye both eat your cake and have your cake?
Ant
2010-03-07 16:00:52 UTC
Post by Darren Salt
Post by Ant
Post by Darren Salt
(Hint: use ISO8601 date formats or use month names. Broken-endian dates
can all too easily cause error; fortunately, that one's unambiguous.)
I don't get it. :(
Well... today is 7/3/2010 or 3/7/2010, according to locale; it is better
represented as 2010-03-07.
OH! Bah, I am an American. :P
Post by Darren Salt
Post by Ant
Post by Darren Salt
Anyway, it does look like a fault in that CPU. I'd certainly be
considering replacing it, though due to your earlier mention of kernel
panics, I wouldn't rule out board problems either; are there any visible
signs of hardware problems (leaky/bulging capacitors etc.)? Checking the
PSU is probably also worthwhile.
Hmmm, I just swapped my PSU because the old one (FSP650-80GLC PSU (650
watts) from 5/14/2007) died in 12/2009. I recall that a few days before,
something smelled like it was burning, but I couldn't figure out where it
came from since I had two desktops. I guess it was the PSU that went poof!
I've had that happen once here. Advice given was to replace the whole lot
because of possible damage to components, and I can see where that's coming
from: brief over-voltage or over-current. (Would anybody who knows more about
your typical switched-mode PSU care to comment?)
:( It sounds common, I guess. I ran memtest86+ v4.00 overnight for over
five hours. It completed two passes and was almost done with the third
one, on test 8. I guess the RAM is still OK!
Post by Darren Salt
Post by Ant
At the same time, my EVGA GeForce 8800 GT video card had to be RMA'ed
because it no longer worked: even with the new PSU, the box wouldn't boot
at all.
Dead card, due to The Way of the Exploding PSU?
I guess so, since it stopped working right after the PSU died and was
replaced with a new one. Or a coincidence?
Post by Darren Salt
Post by Ant
After getting a refurbished video card back from the RMA, my box was fine
for a bit and then got kernel panics once in a while. Then they slowly
seemed to become more frequent. One day in February, I ran memtest86+
v4.00 for about five hours and found lots of errors. My friend and I
narrowed it down to a 512 MB RAM module
I've seen bad RAM before. On visual inspection, it looks exactly like good
RAM.
Yeah. It's old too (four years I think)!
Post by Darren Salt
Post by Ant
and were left with 2.5 GB (still plenty for an old
Linux workstation!). Oh, and we didn't see anything burned, busted, etc.
That's the thing. It might not *look* damaged...
Right, but you asked if there was any physical damage we could see. :P
Post by Darren Salt
Post by Ant
It sounds like that PSU failure damaged a lot of my hardware. Argh! I
don't have the time and resources to build another box
Yet you have the time to respond here. ;-)
That's faster. Sometimes I do it from work too. :P
Post by Darren Salt
Post by Ant
Hmm, I wonder if that 512 MB RAM that memtest86 detected as having errors
wasn't bad?
Chances are that memtest86 was right. (I can see how bad memory might cause
incorrect TLB entries, but not parity errors.)
So parity errors come from the CPU only? I am not an expert in the
hardware area.
Post by Darren Salt
Post by Ant
Post by Darren Salt
Post by Ant
Maybe that's why memtest86+ didn't find any problems last week.
That doesn't seem to be relevant.
Why do you say that? I am going to run it again soon to double check.
It's testing the memory, and (probably) isn't making use of logical
addressing. If it isn't, then it's not going to be making use of the TLB, so
it's not going to cause MCEs. (Or perhaps they *were* happening, but
memtest86+ was ignoring them.)
So how can I test this with another bootable tool like memtest86+?
Post by Darren Salt
Post by Ant
Interesting. I wish Linux's kernel panics would log to a file, like
Windows' memory dumps from blue screens, so I could use a debugger to see
what is in the dumps.
Logging to a file isn't an option (at this point, things are probably too far
gone for this to be practical); but they could, perhaps, be stored in some
non-volatile memory. (You'd need at least 16K for this, ideally 64K or more;
and I don't think that there's enough in your typical PC RTC.)
Bummer. I am surprised Linux doesn't do this, but MS does with its
NT-based Windows.
--
"To the gods I am an ant, but to the ants, I am a god." --unknown
/\___/\
/ /\ /\ \ Phil./Ant @ http://antfarm.ma.cx (Personal Web Site)
| |o o| | Ant's Quality Foraged Links: http://aqfl.net
\ _ / Nuke ANT from e-mail address: ***@earthlink.netANT
( ) or ***@zimage.com
Ant is currently not listening to any songs on his home computer.
Darren Salt
2010-03-08 03:08:44 UTC
I demand that Ant may or may not have written...
[snip]
Post by Ant
Post by Darren Salt
Post by Ant
Post by Darren Salt
Anyway, it does look like a fault in that CPU. I'd certainly be
considering replacing it, though due to your earlier mention of kernel
panics, I wouldn't rule out board problems either; are there any visible
signs of hardware problems (leaky/bulging capacitors etc.)? Checking the
PSU is probably also worthwhile.
Hmmm, I just swapped my PSU because the old one (FSP650-80GLC PSU (650
watts) from 5/14/2007) died in 12/2009. I recall that a few days before,
something smelled like it was burning, but