[ltp] Re: [ibm-acpi-devel] Recently identified ThinkPad mysteriouses

Thomas Renninger linux-thinkpad@linux-thinkpad.org
Thu, 22 Nov 2007 16:49:42 +0100


On Thu, 2007-11-22 at 12:53 -0200, Henrique de Moraes Holschuh wrote:
> On Thu, 22 Nov 2007, Thomas Renninger wrote:
> > I'd like to point you to some things I found out on ThinkPads the last
> > weeks:
> > 
> >  - IBM T41p shuts down, powersave, Temperature state changed to critical
> >    https://bugzilla.novell.com/show_bug.cgi?id=333043
> > 
> >    This affects a lot of machines (T41(p), T42(p), T43(p), R40).
> >    I expect the real culprit is a confused EC (one machine had the
> >    problem, the other one not).
> >    Anyway, it seems ACPI notifies had higher priority (or did not get
> >    scheduled away) in former kernels. Therefore the BIOS could still
> >    avoid a critical thermal shutdown by lowering the CPU frequency
> >    through the _PPC interface on older kernels, but something seems
> >    to have changed there...
> 
> 
> Argh.  Well, thinkpad-acpi fortunately has absolutely nothing to add to the
> picture (unless the mysterious HKEY 0x6011/0x6012 events are actually
> thermal warning events).  I kept well away from the standard thermal ACPI
> interface, since there didn't seem to be a reason to muck with it.  Now I am
> not so sure anymore.
> 
> What I do know: the T41 and T42 BIOSes and EC firmware *are the same*, down
> to the bit level.  So if there is a difference between the T41 and T42
> behaviour, it is very likely either a hardware fault, or one of the two is
> NOT using the latest BIOS.  If they are indeed using the latest BIOS, it is
> a BIOS bug (the BIOS does know if it is in a T40, T41 or T42, and could act
> in a different way) in its SMI management routines.  The only hope is to beg
> Lenovo for a fix.
I described that wrong: "one machine had the problem, the other one not".
It's exactly the same model. In fact, exactly the same machine had
temperature problems and then it worked again...
I expect an EC firmware bug surviving a reboot (just guessing).
Something really odd is going on on these machines, but it looks very
much as if the EC is already exporting too-high values (the AML
code at this place is quite simple), starting at over 40 degrees after
a cold boot.


> Thermal problems with the T4x are common, and if they developed after the
> machine was used for a few years (as opposed to a bad manufacturing issue
> that it had since day one), they are usually user-serviceable.  Remove the
> entire heatsink assembly, and very carefully and very thoroughly replace
> any already existing or missing thermal glue with very high grade thermal
> cooling paste, the type you'd use for serious overclocking (Arctic Silver 5
> or better).  Of course, check to make sure the fan is working properly.
> This is no milkrun, and it takes a few hours and a lot of patience if one
> wants to do it perfectly.
Yep, thanks. I also fixed one (a long while ago) by using a vacuum
cleaner :)

> Unless you are in a really hot place (35C or above), in my experience a T4x
> notebook with the entire heatsink system working at top condition does NOT
> reach critical temperatures, even while working at full CPU load.  But it is
> really easy for a T4x to have their heatsink system at far below the top
> condition :(
> 
> There is one thing I can help with.  ThinkPad ACPI knows how to enable a
> "ludicrous speed" fan mode, aka "disengaged" mode or "full speed" mode,
> which is typically at least twice as fast as the highest fan mode the EC
> likes to trigger, even at thermal emergencies.  Give me a trigger inside the
> kernel, and I will kick it on during thermal emergencies, regardless of
> fan_control status.   Userspace already has control over this if
> fan_control=1 is specified on module load (just disable PWM through the
> hwmon interface, and it will kick the fan into full speed mode).
Yes, it's a write to the EC's fan register, I know that, not a real
option...
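
For reference, the userspace route described above (disabling PWM through
the hwmon interface) could be sketched roughly as below. This is an
illustration only, not thinkpad-acpi code: the sysfs path is an assumption
and differs between kernel versions, the helper names set_full_speed and
restore_mode are made up, and the write needs root and fan_control=1.
Per the thinkpad-acpi documentation, writing 0 to pwm1_enable takes PWM
offline, which puts the fan into full-speed mode.

```python
from pathlib import Path

# Assumed location of the thinkpad-acpi hwmon attribute; newer kernels
# place it under .../thinkpad_hwmon/hwmon/hwmonN/ instead.
ASSUMED_PWM_ENABLE = Path("/sys/devices/platform/thinkpad_hwmon/pwm1_enable")

PWM_OFFLINE = "0"  # PWM disabled -> fan runs in full-speed ("disengaged") mode


def set_full_speed(pwm_enable: Path = ASSUMED_PWM_ENABLE) -> str:
    """Disable PWM and return the previous mode so it can be restored."""
    previous = pwm_enable.read_text().strip()
    pwm_enable.write_text(PWM_OFFLINE)
    return previous


def restore_mode(previous: str, pwm_enable: Path = ASSUMED_PWM_ENABLE) -> None:
    """Write back the mode that set_full_speed() returned."""
    pwm_enable.write_text(previous)
```

Keeping the previous mode around matters because the EC's automatic mode
should be restored once the emergency is over.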

> But really, if you have a need of the full speed fan mode regularly, it
> either means your ThinkPad is in need of repair to the heatsink system, or
> it means you are in such a hot climate you'd better be playing at the beach
> instead of using a laptop.

Thanks. I am finished with this one anyway..., I just wanted to let you
know about the outcome. I really expect a HW failure, meaning an EC
firmware failure (maybe triggered by latest kernels, but I doubt it as I
also got several bug reports about this over quite a long timeframe).
But an ACPI notifier change might bring this back to current kernels.
Several people reported 10.2 working, 10.3 not. It seems that if too
many processes are running, it takes too long until the CPU frequency
gets lowered through a CPU ACPI event.
I could imagine some people on this bug (yeah, too many people with
different problems, still...) are seeing exactly this:
Arghhh, I can't find it anymore :) Got dozens of mails..., IIRC it was
an Ubuntu critical thermal shutdown bug on launchpad.net...

It's worked around in SL 10.3 by an ugly blacklist now, lowering the
passive trip points and enabling thermal polling for them...
I wonder if this could get a mainline solution if they really hit some
strange EC firmware bug on Linux... Having all ACPI notify processes
run with highest prio (accessing slow HW..) might not be a good idea...
I'll watch this for a while...
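
To make the workaround idea concrete: with polling, the temperature is
checked on a timer and compared against a (lowered) passive trip point,
instead of relying on an ACPI notify arriving in time. A minimal sketch of
that logic follows. It is not the actual SL 10.3 patch; the function name,
its parameters, and the two callbacks are all hypothetical.

```python
import time
from typing import Callable


def poll_passive_trip(
    read_temp_c: Callable[[], float],
    throttle: Callable[[], None],
    passive_trip_c: float,
    interval_s: float,
    rounds: int,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll the temperature `rounds` times and throttle whenever it is at
    or above the passive trip point. Returns how often throttling ran."""
    throttled = 0
    for _ in range(rounds):
        # Hypothetical callbacks: read_temp_c() returns degrees Celsius,
        # throttle() lowers the CPU frequency by whatever means is used.
        if read_temp_c() >= passive_trip_c:
            throttle()
            throttled += 1
        sleep(interval_s)
    return throttled
```

Lowering passive_trip_c simply makes the check fire earlier, which buys
time when the frequency change itself is slow to take effect.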

> >  - Latest Lenovo ThinkPads do not like ACPI EC writes for brightness
> >    switching (s2ram broken)
> 
> That's because you must not do EC writes to change brightness on these boxes
> :-)  Latest thinkpad-acpi in ibm-acpi.sf.net should get this right already,
> and use only the UCMS ACPI method to modify brightness levels on Lenovo
> boxes.
> 
> >    I've been told it only happens when brightness is set above 7.
> >    Don't know, but it seems to be the EC writes > 7 that keep the
> >    machine from waking up from s2ram anymore.
> 
> Well, latest X.org really, really dislikes anyone messing with the backlight
> brightness behind its back, so make sure it is not related to that problem
> as well.
> 
> I am starting to talk to the X.org people to see how I can be told by the X
> server that it is active, and completely switch thinkpad-acpi to "userspace
> is in control" backlight mode...
> 
> >  - Weird USB - EHCI IRQ problems on very latest Lenovo models
> > 
> >    Unhandled IRQ messages. Looks like an IRQ (from camera?) gets routed
> >    to EHCI pin also? Don't know the details and I am also not very
> >    familiar with this..., Oliver might be able to point you to bug
> >    reports, AFAIK there also exist kernel.org bugs.
> >    This is not solved yet, AFAIK.
> >    If anyone knows more about this problem, that would be great...
> 
> Indeed.  And if we do get precise enough descriptions of the issue, it is
> something that we should try to find a way to forward to Lenovo for a fix.
> 
> >    Attaching a USB device breaks USB and throws a calltrace
> >    https://bugzilla.novell.com/show_bug.cgi?id=325601
> > 
> >    This seems to be a problem very specific to recent Lenovo
> >    ThinkPads -> it only happens on the latest ThinkPads, but goes
> >    back to at least kernel 2.6.16 and up to latest mainline... I hope
> >    to get some feedback from Lenovo ThinkPad users also seeing this.
> 
> Same comment as above.

AFAIK Lenovo is involved... any additional input, findings, etc.,  would
be great...

Apropos..., you probably know that already: Lenovo offers a new BIOS for
the latest T61 and other recent models with a lot of worthwhile fixes.
Especially the AHCI fix speeds up the machines a lot, I've heard...

   Thomas