[ltp] Z61p GPU Thermal Issue

Brian D. Ropers-Huilman linux-thinkpad@linux-thinkpad.org
Fri, 31 Aug 2007 08:44:24 -0500


I have been fighting a graphics problem on my Z61p since I received it
at work back in January. Very shortly after starting X with any
version of the fglrx driver I've tried (Kubuntu-provided or built from
source), my system will randomly, yet gracefully, shut down. I tracked
these shutdowns to a critical thermal event, with ACPI reporting a
temperature of 128 C! I then started watching temperatures via ACPI
(watch acpi -t), but ACPI was only looking at CPU temperatures. With
the thinkpad_acpi module loaded I also have access to the
/proc/acpi/ibm/thermal file, which I finally took a look at, and it
includes the GPU temperature. This is where my problem lies.
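
For anyone who wants to watch the same readings, here is a minimal
Python sketch that parses that file. It assumes the file holds a line
of the form "temperatures:" followed by whitespace-separated integers
in degrees C, and that unpopulated sensor slots read -128. Which slot
is the GPU differs between models, so the sketch just prints every
active slot. The function names are my own, not part of any tool.

```python
from pathlib import Path

THERMAL_FILE = Path("/proc/acpi/ibm/thermal")
UNUSED = -128  # assumption: value reported by sensor slots that are not populated


def parse_thermal(text):
    """Return the list of integer readings from a 'temperatures:' line."""
    for line in text.splitlines():
        label, sep, values = line.partition(":")
        if sep and label.strip() == "temperatures":
            return [int(v) for v in values.split()]
    raise ValueError("no 'temperatures:' line found")


def active_sensors(readings):
    """Map sensor index -> reading, skipping unused slots."""
    return {i: t for i, t in enumerate(readings) if t != UNUSED}


if __name__ == "__main__":
    readings = parse_thermal(THERMAL_FILE.read_text())
    for index, temp in active_sensors(readings).items():
        print(f"sensor {index}: {temp} C")
```

Running it under watch gives a rough live view of all sensors,
not just the CPU ones that acpi -t shows.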

Using vesa, I hover between 115 C and 120 C, which is awfully hot to
start with in my opinion. I can "stress" the GPU just by grabbing a window
and "wiggling" it around, causing the temperature to rise 2-3 C. If I
watch a video at full screen, even with the vesa driver, I eventually
hit the critical 128 C (usually within 30-40 minutes) and the system
will gracefully shut itself down. I reach this point with the fglrx
driver just by starting a desktop environment, let alone really doing
anything graphics intensive.

Does anybody have any ideas on why the GPU runs so hot or on how to
mitigate this problem? I would love to run fglrx, allowing me to
finally run this machine in its native wide-screen mode and,
admittedly, to do some beryl eye-candy goodness.

The Z61p has an ATI MOBILITY FireGL V5200 (M56GL, ID 0x71C4). I've run
vesa since I had the box (even though it still gives me problems),
I've tried many versions of fglrx (provided by Kubuntu and compiled
from source), as well as the new avivo driver (from a git tree after
I'd bumped my box to a Kubuntu v7.10 tribe 3 release, which included
the needed X server v1.3) and nothing really works.

I am very open to suggestions. I'm starting to go down the path of the
performance management tools in the newer fglrx drivers (throttling
the clock speed down to keep the GPU at a lower temperature). Does
anyone have any experience with this?
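
To make the idea concrete, what I have in mind is roughly a
hysteresis loop: drop to a lower clock state once the GPU passes an
upper threshold, and go back to full speed only after it has cooled
below a lower one. The sketch below is only that decision logic in
Python. The thresholds and the state names are made up for
illustration, and actually switching the clock would have to go
through whatever the fglrx tools provide, which I have not worked out.

```python
# Hypothetical thresholds in degrees C, picked only to sit below the
# 128 C shutdown point seen in the logs.
THROTTLE_ABOVE = 120
RESTORE_BELOW = 110


def next_state(current_state, gpu_temp):
    """Return 'low' or 'full' for the GPU clock, with hysteresis.

    The state names are placeholders, not real driver power states.
    """
    if gpu_temp >= THROTTLE_ABOVE:
        return "low"
    if gpu_temp <= RESTORE_BELOW:
        return "full"
    # Between the thresholds: keep whatever state we are already in,
    # so the clock does not flap on every small temperature change.
    return current_state


if __name__ == "__main__":
    state = "full"
    for temp in [115, 121, 118, 112, 109]:  # made-up sample readings
        state = next_state(state, temp)
        print(temp, state)
```

Given that I already idle at 115 C to 120 C with vesa, real thresholds
would need tuning before this did anything useful.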

Just as an FYI, here's a capture of what I see:

bropers@isohel:~/fglrx/8.33.6$ uptime
 14:23:21 up 34 min,  7 users,  load average: 0.23, 0.15, 0.09
bropers@isohel:~/fglrx/8.33.6$ acpi -V
    Battery 1: charging, 98%, 00:17:25 until charged
    Thermal 1: ok, 49.0 degrees C
    Thermal 2: ok, 47.0 degrees C
 AC Adapter 1: on-line
bropers@isohel:~/fglrx/8.33.6$
Message from syslogd@isohel at Wed Feb 14 14:23:26 2007 ...
isohel kernel: [17181628.648000] Critical temperature reached (128 C),
shutting down.

NOTE: that's a whopping 5 seconds between seeing normal CPU
temperatures under a very light CPU load and hitting 128 C on the GPU,
with ACPI shutting the system down.

Thanks, very, very, very much, in advance, if someone has any ideas.

-- 
Brian D. Ropers-Huilman