[ltp] Re: [ibm-acpi-devel] Recently identified ThinkPad mysteriouses

Henrique de Moraes Holschuh linux-thinkpad@linux-thinkpad.org
Thu, 22 Nov 2007 21:06:00 -0200


On Thu, 22 Nov 2007, Thomas Renninger wrote:
> I described that wrong: "one machine had the problem, the other one not"
> It's exactly the same model. In fact exactly the same machine had
> temperature problems and then it worked again...
> I expect an EC firmware bug surviving a reboot (just guessing),

I know of an easy-to-trigger EC firmware bug, and I can tell you for sure
that on an IBM ThinkPad affected by it, it will survive anything but a power
down.  Not that you will have much choice: it will send the box out to lunch
not long after you hit it.

> something really odd is going on on these machines, but it looks very
> much as if the EC data is already exporting too high values (the AML
> code at this place is quite simple), starting with over 40 degrees after
> a cold boot.

Might be a hardware problem in the A/D converter for one or more of the
thermal sensors.  Cold solders can cause severe trouble in some A/D designs,
and cold solders are a known endemic problem of the thinkpad populations :-)

Were it an EC firmware bug, it would be quite widespread...

> Thanks. I am finished with this one anyway..., I just want to let you know
> about the outcome. I really expect HW, means EC firmware failure (maybe
> triggered by latest kernels, but I doubt it as I also got several bug
> reports about these over quite a long timeframe).

Well, I have never seen anything like it on the T43p with the latest BIOS,
and the EC itself seems to switch the fan to faster modes as the temperature
rises, without any help from the BIOS or AML.

So we have a bug in the fallback safety net, for when the cooling is simply
not enough (i.e. defective) and the temperature rises too fast (high dT/dt)
for the ACPI THM drivers to respond.

I could easily add a monitoring watchdog for this in thinkpad-acpi, and kick
in the emergency fan level (level 7) at 75C, and the full-speed fan level at
80C, plus issue a cpufreq override (if such a thing is possible).  The
optimum poll rate for the thermal sensors in an emergency situation on the
T4x line is 0.5 Hz.  Would you consider this a worthwhile addition to
thinkpad-acpi?
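
To make the proposal concrete, here is a rough userspace sketch of the
decision logic only; it is not thinkpad-acpi code.  The 75C/80C thresholds
and the 0.5 Hz poll rate are the ones above.  The function names, the shape
of the temperature line, the -128 "no sensor" marker and the fan command
strings are assumptions made for illustration, and the cpufreq override is
left out since I do not know if it is possible.

```python
import time
from typing import Callable, List, Optional

EMERGENCY_C = 75        # from the proposal: switch to fan level 7
FULL_SPEED_C = 80       # from the proposal: switch to the full-speed level
POLL_INTERVAL_S = 2.0   # 0.5 Hz poll rate from the proposal
NO_SENSOR = -128        # assumed marker for an absent sensor


def parse_temps(line: str) -> List[int]:
    """Parse an assumed 'temperatures: 47 43 -128 ...' line into readings."""
    _, _, values = line.partition(":")
    return [int(v) for v in values.split() if int(v) != NO_SENSOR]


def fan_command(temps: List[int]) -> Optional[str]:
    """Return the fan command to issue, or None to leave the fan alone.

    The command strings are hypothetical stand-ins for whatever the
    driver would really do internally.
    """
    if not temps:
        return None
    hottest = max(temps)
    if hottest >= FULL_SPEED_C:
        return "level full-speed"
    if hottest >= EMERGENCY_C:
        return "level 7"
    return None


def watchdog(read_line: Callable[[], str],
             set_fan: Callable[[str], None],
             cycles: int,
             sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll the sensors `cycles` times and override the fan when needed."""
    for _ in range(cycles):
        command = fan_command(parse_temps(read_line()))
        if command is not None:
            set_fan(command)
        sleep(POLL_INTERVAL_S)
```

The reader and writer are passed in as callables so the thresholds can be
exercised without touching real hardware; a real implementation would live
in the driver and would also have to decide when to hand fan control back.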

I doubt it would help much on the new Lenovos with built-in CPU thermal
sensors (Intel Core and Core2) though, since thinkpad-acpi doesn't access
them, but it would do the trick for IBM ThinkPads.

> Several people reported 10.2 working, 10.3 not. It seems that if too
> many processes are running, it takes too long until CPU frequency gets
> lowered through a CPU ACPI event.

The thermal control should run at low RT priority, at the very least.  If it
is not running with that much priority, we should fix it.

> It's workarounded in SL 10.3 by an ugly blacklist now, lowering passive
> trip points and enable thermal polling for them...

That is a valid workaround, as far as I am concerned.

> Apropos..., you probably know that already: Lenovo offers a new BIOS for
> latest T61 and other latest models with a lot of worthwhile fixes. Especially
> the AHCI fix speeds up the machines a lot I've heard...

I am sorely tempted to have thinkpad-acpi issue severe warnings if the box
is running a known-troublesome outdated BIOS, especially on the Lenovo
boxes.

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh