What do you do when your embedded Linux system suddenly starts breaking in mysterious ways? I was faced with this problem “recently”, and this is a compilation of my notes from chasing down a rather strange bug.
I worked for a customer who has a small ARM system booting from NAND flash and serving a few peripherals over TCP/IP, RS232/485 etc. The system also runs a small web server to allow easy configuration.
The system was developed by people no longer involved in the project, and I was hired mainly to do bug fixing and minor adjustments. During this work I stumbled on a problem so interesting that I think it’s worth a blurb.
Occasionally, after booting, the system would not show the login page when accessed via the web interface. Instead the user was greeted with an error message when trying to log in.
At first I thought this was just a simple CGI script that bugged out for some reason. Man, was I wrong.
Some initial investigation indicated that the login script for some reason produced “illegal” output, making the web server throw up. But when running the CGI on the command line, the first surprise showed up:
/root # export QUERY_STRING="xxx-deleted-xxxx"
/root # python /usr/lib/cgi-bin/login.py
ILL
/root #
The script halted with an illegal instruction! That is not something you see every day with a Python script. My first suspicion was that the pre-compilation of modules, which is done during boot-up of the system, had for some reason failed, so I tried running it manually. Everything worked.
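For reference, re-running the byte-compilation by hand amounts to something like the sketch below, using the standard compileall module. The directory paths are my guesses for this particular system, not taken from its boot scripts.

# Re-byte-compile the Python library and the CGI scripts by hand.
# Directory paths are assumptions for this particular system.
import compileall

compileall.compile_dir('/usr/lib/python2.4')
compileall.compile_dir('/usr/lib/cgi-bin')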
Manually starting Python also worked and produced a working prompt, but launching the CGI would not. I started reducing the content of the script, trying to deduce what made it crash. It turned out that importing urllib or the cgi module made the interpreter crash with an illegal instruction, whilst other imports worked nicely.
/root # python -c "import cgi"
ILL
/root # python -c "import os"
/root #
Rerunning the import with verbose output:
/root # python -v -c "import cgi"
.
.
# /usr/lib/python2.4/socket.pyc matches /usr/lib/python2.4/socket.py
import socket # precompiled from /usr/lib/python2.4/socket.pyc
dlopen("/usr/lib/python2.4/lib-dynload/_socket.so", 2);
ILL
/root #
So it was when Python tried to dlopen a native module that everything went wrong. Further testing confirmed this: loading a module with native code crashed the interpreter, whilst pure Python modules loaded nicely.
On a side note, this work was very time consuming because the system behaved nicely most of the time. In my setup I could sometimes reboot the system hundreds of times before the problem showed up. Thank goodness for Selenium, which helped me automate this. I wrote a short Python script that would try to log in to the system and, if successful, reboot it through the web interface and try again. (If you are into serious web development and haven’t checked Selenium out I seriously recommend doing so 😉 )
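As an illustration, here is a minimal sketch of that reboot loop written against today’s Selenium WebDriver bindings (the original script used whatever Selenium offered at the time). The target address matches the one appearing later in this article, but the URLs, form field names and the error string are made up for this example.

# Minimal sketch of the login/reboot loop; error handling omitted.
# URLs, element names and the error string are hypothetical.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

TARGET = "http://192.168.10.98"

driver = webdriver.Firefox()
boot = 0
while True:
    boot += 1
    driver.get(TARGET + "/cgi-bin/login.py")
    driver.find_element(By.NAME, "username").send_keys("admin")
    driver.find_element(By.NAME, "password").send_keys("secret")
    driver.find_element(By.NAME, "login").click()

    if "Internal Server Error" in driver.page_source:
        print("Boot %d: login failed, stopping so I can investigate" % boot)
        break

    print("Boot %d: login OK, rebooting" % boot)
    driver.get(TARGET + "/cgi-bin/reboot.py")  # hypothetical reboot page
    time.sleep(120)                            # give the unit time to come back up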
Another strange behavior of the system was that when the problem occurred it was permanent for that boot, i.e. the bug persisted until a reboot. Something in the boot process seemed to trigger it under rare circumstances.
Armed with the knowledge that something around dlopen might trigger the bug, I went on to write a short C program which loaded a shared library, resolved a symbol to a function in the same library and called it. This was to isolate whether Python itself was responsible for the error. To my big surprise this program, too, terminated with an illegal instruction! The problem was not located in Python but at a lower level.
I further reduced the C program, removing everything but the dlopen and dlclose calls. The problem still persisted. Thus it seemed like the problem might be located within the libc, uClibc, which implements dlopen.
This was a big problem: since this was an embedded system there were no debug symbols available at this level. All libraries were stripped to save space on the target, and the project-provided toolchain did not ship any binaries with debug symbols either.
So I had to make the best of the situation. I recompiled the libc on my host computer with debug symbols enabled and started a remote debug session against my test application running on the target. (To be able to run the remote debugging on the target system I loaded a USB key with the target debug tools and mounted it on the target.)
Then on the target I ran:
/ # /mnt/cg/gdbserver 192.168.0.0:2222 /mnt/dlopentest
Process /mnt/dlopentest created; pid = 529
Listening on port 2222
and then connecting to this from my desktop:
arm-none-linux-uclibcgnueabi-gdb dlopentest
.
.
(gdb) set solib-search-path /home/tor/build/uclibc/testinstall/lib
(gdb) target remote 192.168.10.98:2222
Remote debugging using 192.168.10.98:2222
The key here is set solib-search-path, which tells gdb to load my debug versions of the libraries instead of the toolchain-provided ones. (Note that this is most likely not optimal, since these are not the exact libraries running on the target.)
Running gdb on the test application while loading different libraries gave some strange findings. First off, data appeared garbled in gdb when entering library code, libdl and ld.so:
Program received signal SIGILL, Illegal instruction.
0x4000f494 in dlopen (libname=0xbee5ed88 "h���\b \001", flag=74072)
at ldso/libdl/libdl.c:434
434
Not being sure whether this was due to my separately compiled libraries or to the code itself was a setback. But I had also noted that during some runs I got a segmentation fault instead of the usual illegal instruction, and when I got the segmentation fault it occurred before main was even entered.
Program received signal SIGSEGV, Segmentation fault.
0x40004df0 in check_match (sym=0x40013304, strtab=0x40013504 "",
undef_name=0x40026b0d "error_message_count", type_class=0)
at ./ldso/include/dl-string.h:78
In that case the crash happened during the initial shared-library loading performed by ld.so, not during dlopen as in the other situation. ld.so was clearly not even able to load libdl.
At this point I started to question libdl.so and what was going on inside it. I mounted the target file system over ssh onto my desktop, started examining the library, and to my big surprise got:
arm-none-linux-uclibcgnueabi-objdump -a libdl.so.0
BFD: libdl.so.0: invalid string offset 602752145 >= 133 for section `.shstrtab'
BFD: libdl.so.0: don't know how to handle section `' [0x 400c040]
arm-none-linux-uclibcgnueabi-objdump: libdl.so.0: File format not recognized
Objdump was unable to parse the library! Using readelf to confirm this:
There are 17 section headers, starting at offset 0x20b0:
Section Headers:
[Nr] Name Type Addr Off Size ES Flg Lk Inf Al
[ 0] NULL 00000000 000000 000000 00 0 0 0
[ 1] .hash HASH 00000134 000134 000140 04 A 2 0 4
[ 2] .dynsym DYNSYM 00000274 000274 000290 10 A 3 3 4
...
[13] <corrupt> <unknown>: 400c 00009f98 001f98 000068 04 WA 0 0 4
[14] .bss NOBITS 0000a000 002000 000004 00 WA 0 0 1
[15] .ARM.attributes ARM_ATTRIBUTES 00000000 002000 000028 00 0 0 1
[16] .shstrtab STRTAB 00000000 002028 000085 00 0 0 1
Something was clearly wrong with libdl. Using dhex, a binary diff program, I could also confirm that there actually were differences between the corrupt libdl.so and a working copy.
This would explain a lot of the behavior I was experiencing: the all-or-nothing nature, where a boot either worked completely or not at all.
When the illegal instruction happened, the library I was trying to load turned out to be corrupted; when I instead got the segmentation fault, it was libdl.so.0 itself that was corrupt.
Clearly something must be corrupting the image after it has been loaded: the complete image is checksummed by the bootloader, U-Boot, before the kernel is started. Furthermore, most of the time the boot seemed to complete without problems.
One further observation was that when the customer built the images the system seemed to fail more often. To examine this a bit closer I built a virtual machine with almost the same specs as the customer’s workstation, recompiled the image in it and started testing.
Already on the second round the system failed. This was very repeatable: approximately 50% of the boots would fail on Python.
Investigating this further remarkably showed that it was no longer a problem with shared library loading. All my previous tests would pass, but the login script would not work, terminating with a segmentation fault.
Some serious debugging showed that loading the string module in Python would crash the interpreter. Digging into that module revealed that things exploded when the module called the built-in map function. I could reduce the code triggering the problem to:
/ # python -c "map(abs,[1])"
SEGV
The plot thickened. Since the earlier problems were caused by corrupt files, that was most likely the cause here as well. To analyze the situation I gathered md5 checksums of all files in the directories bin, etc, home, lib, root, sbin, usr and var and compared these sums with ones calculated on the same files on the build machine.
This “quickly” revealed that libpython2.4 was the bad guy this time. Running my dlopentest on this library gave the same result: SEGV.
So there seemed to be a generic file corruption problem. The question was why it was so isolated to Python and the libraries used by Python. To verify the extent of the problem I wrote a small Python script that used pySerial, pexpect and GNOME’s GVFS.
The script would use the console on the serial port to mount the USB memory containing the debug tools and an sftp server, launch the sftp server and then mount the target file system on my desktop computer. It would then run md5sum calculations on all files, store the result in a file, unmount everything and reboot the system to start all over again.
The output of this operation was a pile of files, each containing the md5sums from one run. I then compared these sums against the reference checksums from the build machine.
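A rough sketch of what that harness looked like follows; it assumes pexpect driving a pySerial port and GVFS mounted through the gvfs-mount command line tool. The serial device, shell prompt, sftp server path and the GVFS mount path are all assumptions for the sake of illustration, not taken from the actual setup.

# Rough sketch of the checksum harness; error handling omitted.
# Serial device, prompt, sftp server path and GVFS path are hypothetical.
import hashlib, os, subprocess, time
import serial
from pexpect import fdpexpect

TARGET = "192.168.10.98"
GVFS_ROOT = os.path.expanduser("~/.gvfs/sftp on %s" % TARGET)  # hypothetical mount point

ser = serial.Serial("/dev/ttyUSB0", 115200, timeout=5)
console = fdpexpect.fdspawn(ser.fileno())

for run in range(100):
    console.sendline("")                       # wake the serial console
    console.expect("# ")
    console.sendline("mount /dev/sda1 /mnt")   # USB key with debug tools
    console.expect("# ")
    console.sendline("/mnt/sftp-server &")     # hypothetical sftp server binary
    console.expect("# ")

    subprocess.call(["gvfs-mount", "sftp://root@%s/" % TARGET])

    with open("md5-run-%03d.txt" % run, "w") as out:
        for top in ("bin", "etc", "home", "lib", "root", "sbin", "usr", "var"):
            for dirpath, _, files in os.walk(os.path.join(GVFS_ROOT, top)):
                for name in files:
                    path = os.path.join(dirpath, name)
                    with open(path, "rb") as f:
                        digest = hashlib.md5(f.read()).hexdigest()
                    out.write("%s  %s\n" % (digest, path))

    subprocess.call(["gvfs-mount", "-u", "sftp://root@%s/" % TARGET])
    console.sendline("reboot")
    time.sleep(120)                            # give the unit time to come back up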
To my very big surprise, none of these runs showed any corrupted files; they all gave the same checksums. Thus rebooting the system via the command line most likely did not trigger the problem, yet rebooting via the web browser immediately caused problems.
The conclusion, then, was that the image got corrupted when the system was interacted with over the network, but not when it was operated via the serial port. This suggested that the problem most likely lived in the lower levels of the system.
This was also confirmed in parallel by one of the other members of the development team, who narrowed the problem down to the Ethernet controller.
Analyzing the bootloader, U-Boot, revealed that someone from the original development team had re-enabled the Ethernet controller just before the system was about to launch the Linux kernel. A comment indicated NFS boot as the reason for doing so, even though this should not be needed: the Linux kernel reinitializes the network controller during its own boot, before doing any NFS operations.
Leaving the Ethernet controller running during the Linux boot was indeed the culprit. The controller is set up to automatically transfer Ethernet frames into buffers in RAM as they arrive on the wire, but once control is handed over to Linux that memory is reserved for other use; in this case it matched the location where the initramfs was loaded.
So whenever network traffic addressed to the unit arrived in the window between the start of the kernel and the point where Linux had reinitialized the network controller, the file system image got corrupted: parts of it were overwritten by the arriving frames.
Lesson learned: never trust the lower levels of the system when experiencing intermittent, hard-to-track errors. I wasted a lot of debugging time here by starting from the top of the stack and working my way down.