[ltp] Java Zombies

Henrique de Moraes Holschuh linux-thinkpad@linux-thinkpad.org
Wed, 4 Oct 2006 17:42:54 -0300


On Wed, 04 Oct 2006, Martin Lorenz wrote:
> > Your kernel is hosed, and filesystems might be too.  This is very very bad,
> 
> is this what you mean by 'hosed'?
> http://www.catb.org/jargon/html/H/hosed.html

The mainstream sense of it, yes:  Hosed as in "broken", "defective",
"bleeding all over the place", "painfully puking its innards on the floor",
etc :-)

> you make me shiver!

It is just standard recovery procedure when you get mysterious tasks stuck
in the kernel.  You need to test the memory for a while, because bad memory
is a typical cause of filesystem corruption, and you need to check the
filesystems with a kernel that is not borken to make sure they are fine.

I don't know about reiserfs, but if it is ext3, the chances of data loss are
minimal, unless it *was* bad memory, in which case all bets are off.

> and except for the java zombies and a loss of acpi events after suspend that
> happened with an older kernel but not anymore with this one I don't
> experience errors that make me think of hardware defects

Well, test the memory.  If it is ok, the chances of a hardware defect are
low.

> > > [ 2134.493000] thinkpad_ec: thinkpad_ec_request_row: bad end STR3: (0x11:0x00)->0x80
> > 
> > Not Good!  Remove thinkpad_ec for now on this machine.  We can work on that
> > angle later.
> 
> what does it try to tell me?

That it is having trouble talking to the EC, which is *not* a good thing,
and depending on why it is happening, it can easily lead to hard lockups of
the ThinkPad.  It should not get a 0x80 back from the EC LPC3 status
register after a transaction, so it is telling you something is very amiss.

Make very sure you have not somehow mixed an unpatched HDAPS with tp_smapi,
and that you are not mixing different versions of the hdaps, tp_smapi and
thinkpad_ec modules.
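
One rough way to look for mixed module versions is to read what each loaded
module reports under /sys/module.  The sketch below is only an illustration:
it assumes the three modules export a version string there (a module only
does so if it declares one) and that modules built from the same tp_smapi
release report the same string.  The function names are made up for this
example.

```python
from pathlib import Path

# Module names taken from the message above.
MODULES = ("hdaps", "tp_smapi", "thinkpad_ec")


def read_module_versions(modules=MODULES, sysfs_root="/sys/module"):
    """Return {module: version string, or None if not loaded / no version exported}."""
    versions = {}
    for name in modules:
        version_file = Path(sysfs_root) / name / "version"
        try:
            versions[name] = version_file.read_text().strip()
        except OSError:
            # Module not loaded, or it does not export a version attribute.
            versions[name] = None
    return versions


def versions_consistent(versions):
    """True when every module that reports a version reports the same one."""
    reported = {v for v in versions.values() if v is not None}
    return len(reported) <= 1


if __name__ == "__main__":
    found = read_module_versions()
    for name, version in found.items():
        print(f"{name}: {version or 'not loaded or no version exported'}")
    if not versions_consistent(found):
        print("WARNING: the modules report different versions")
```

A module that shows up as "no version exported" is not flagged by
versions_consistent(), so a loaded hdaps without a version string still
deserves a closer look by hand.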

> I will go and bake a naked kernel now, but please give me some hints, what
> makes you think it is THAT bad?

Processes stuck in an unkillable state in Linux are usually caused by either
filesystem corruption (which can make the kernel oops and leave a thread
locked in D state) or kernel bugs.  It is *not* certain at this point where
the problem lies, which is why I suggested you do a filesystem check to make
sure your data is safe.
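
To see which tasks are affected, you can scan /proc for processes in D
(uninterruptible sleep) or Z (zombie) state.  This is a minimal sketch, not a
diagnostic tool: the function names are invented for the example, and it only
looks at each process's main thread, not at the per-thread entries under
/proc/<pid>/task.

```python
import os


def parse_stat(text):
    """Parse the contents of /proc/<pid>/stat into (pid, command, state)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the last ')' rather than on whitespace.
    open_paren = text.index("(")
    close_paren = text.rindex(")")
    pid = int(text[:open_paren])
    command = text[open_paren + 1:close_paren]
    state = text[close_paren + 1:].split()[0]
    return pid, command, state


def find_stuck_tasks(proc_root="/proc", states=("D", "Z")):
    """List (pid, command, state) for processes whose state is in `states`."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as stat_file:
                pid, command, state = parse_stat(stat_file.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were looking
        if state in states:
            stuck.append((pid, command, state))
    return sorted(stuck)


if __name__ == "__main__":
    for pid, command, state in find_stuck_tasks():
        print(f"{pid:>7} {state} {command}")
```

A task that is briefly in D state while waiting for disk I/O is normal; the
ones to worry about are those that stay there across repeated runs and
ignore kill -9.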

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh