[ltp] Random rfkill toggling on an X201s

Henrique de Moraes Holschuh linux-thinkpad@linux-thinkpad.org
Sun, 26 Jun 2011 15:51:50 -0300


On Sun, 26 Jun 2011, Nathaniel Smith wrote:
> There is definitely an ACPI event being generated -- acpid says:
>   acpid: procfs received event "ibm/hotkey HKEY 00000080 00007000"
> which IIUC is the rfkill slider on the side (in fact it gets two of these
> events, presumably one for turning on, and one for turning off).

Ok.  Now, the kernel can't cause these events to happen on purpose, so either
the hardware kill line that the EC is paying attention to got toggled twice,
OR the ACPI DSDT is generating the event twice for some reason (such as bugs
in the EC, in the ACPI DSDT, or in the kernel handling of ACPI GPEs, i.e.
ACPI interrupts).

Please send me the full acpidump for your thinkpad, and the dmidecode
information.  Please cross out UUIDs and serial numbers before you send
them.

> Not sure if you can do anything useful with it, but I'll attach the
> output of acpidump anyway. Just in case you collect them or something
> :-)

Ah, it is always useful, but I'd prefer if you send it and the dmidecode
info in a single email, so that I won't lose it later :p

> This doesn't seem to be consistent -- I had one this morning with 7
> seconds between the "disable" and the "enable". And a number more with
> other random amounts of time (usually between 10s and 100s of msec).

This is certainly weird.
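
To put numbers on those gaps, a small script can pull the 7000 events out of
the acpid log and report the time between consecutive ones.  This is only a
sketch: it assumes you already have each log line paired with a timestamp in
seconds (how you get that depends on your syslog setup), and the names
HKEY_RE and rfkill_event_gaps are made up for illustration.

```python
import re

# Matches the hotkey event text as it appears in the acpid line quoted above.
HKEY_RE = re.compile(r"ibm/hotkey HKEY ([0-9a-fA-F]{8}) ([0-9a-fA-F]{8})")

# Event code seen in the quoted log line (00007000).
RFKILL_EVENT = 0x7000


def rfkill_event_gaps(entries):
    """Return the gaps, in seconds, between consecutive 0x7000 events.

    entries: iterable of (timestamp_seconds, log_line) pairs.  This input
    shape is an assumption; adapt it to however your logs carry timestamps.
    """
    times = []
    for timestamp, line in entries:
        match = HKEY_RE.search(line)
        if match and int(match.group(2), 16) == RFKILL_EVENT:
            times.append(timestamp)
    # Pair each event with the next one and take the difference.
    return [later - earlier for earlier, later in zip(times, times[1:])]
```

A gap of a few hundred msec and a gap of several seconds in the same log
would both show up in the returned list, which makes the inconsistency easy
to see at a glance.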

> > If you have CONFIG_RFKILL_INPUT set, try to give the rfkill module the
> > parameter "master_switch_mode=0", and check for any behaviour change...
> 
> With rfkill.master_switch_mode=0 on the kernel command line, then yes,
> behavior is different.

Ok.  What happens when you set it to 0 is that the rfkill core won't try to
soft-unlock radios when the hardware lock is lifted (you'd need to
soft-unlock them by yourself).  This will remove noise from the logs.

> After this has happened, I have:
> ~$ rfkill list
> 2: phy0: Wireless LAN
> 	Soft blocked: yes
> 	Hard blocked: no
> 3: tpacpi_bluetooth_sw: Bluetooth
> 	Soft blocked: yes
> 	Hard blocked: no

Which is correct.

> So the rfkill core does seem to know that the switch has been toggled
> off, because the hard blocks are gone. If I then run 'sudo rfkill
> unblock 2', then the wireless card starts functioning normally (and
> again, there are no messages from thinkpad_acpi in the logs).

Which is also correct.
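
If you want to check for that state automatically, something like the
following could parse the "rfkill list" output shown above and pick out the
radios that are soft blocked but no longer hard blocked, i.e. the ones you
have to unblock by hand with master_switch_mode=0.  It is a rough sketch
that assumes the output layout is exactly as in your paste; the function
names are hypothetical.

```python
def parse_rfkill_list(text):
    """Parse `rfkill list` output (layout as quoted above) into dicts."""
    devices = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line[0].isdigit():
            # Header line, e.g. "2: phy0: Wireless LAN"
            parts = [p.strip() for p in line.split(":", 2)]
            if len(parts) != 3:
                continue
            current = {"index": int(parts[0]), "name": parts[1], "type": parts[2]}
            devices.append(current)
        elif current is not None and ":" in line:
            # Attribute line, e.g. "Soft blocked: yes"
            key, value = [p.strip() for p in line.split(":", 1)]
            current[key.lower().replace(" ", "_")] = (value == "yes")
    return devices


def needs_manual_unblock(devices):
    """Indices of radios that are soft blocked while the hard block is gone."""
    return [
        d["index"]
        for d in devices
        if d.get("soft_blocked") and not d.get("hard_blocked")
    ]
```

Each index it returns is one you would pass to "rfkill unblock", as you did
with 2 above.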

> ...Huh. And just now, it looks like it toggled on and off *twice*
> within 400 ms. And now I have a stack trace in my logs:
> rfkill_set_block -> iwl_mac_remove_interface -> wifi card's firmware
> crashed. I'll attach the log in case it's useful... notice that
> thinkpad_acpi doesn't get a word in until the very end.

To me it looks like something is messing with your WiFi card's hardware kill
line, and it might be interfering with the EC too.  I do not think the EC
has an _output_ wired to that pin on the WiFi card; it is more likely that
the switch toggles a line to which both the WiFi card and the EC have
high-impedance input pins connected.  And it will only work right as long as
that high-impedance rule is _not_ violated by a short, etc.

> So what I notice here is first, from looking at iwl-agn.c, the
> "RF_KILL bit toggled" message is based on the driver noticing a
> register change on the card, rather than a notification from the
> rfkill core, and second, the "HW:Kill SW:On" I *think* means that the
> firmware sent the driver a note saying that it was disabled
> specifically because of the hardware kill switch, as opposed to
> software. So this also suggests that either the wifi card is
> wired in to the rf kill line directly or that the EC is playing games
> with us?

Yes, it looks like that indeed.  The two 7000 events you mentioned imply
that either the EC or the ACPI subsystem (not thinkpad-acpi, but rather the
Lenovo DSDT code or the ACPI GPE handling) got involved, so it is likely
not a problem inside the rfkill core.

> Is there anything further to be done on debugging, or should I start
> poking at the hardware?

At this moment, poking at the hardware is the more likely path to track down
the problem, yes.

However, whatever you do, don't do anything that would void your warranty.
If it is a problem in the planar card, you _WILL_ want that warranty to get
a replacement :-)

For starters, just remove the wifi driver with rmmod, it might be
instructive to see what happens, and whether the issue hits with the wifi
card supposedly disabled...

> You mentioned it was possible it was the wifi card that was bad and
> dragging the line down. Does this mean that a useful way to check that
> would be to pull the wifi card entirely, and then see if the rfkill
> switch continues to toggle by itself?

That is a very valid test.  And so is checking the integrity of the
mini-pci/mini-pcie connector that card is plugged into, cleaning the card
contacts and reseating it.

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh