Monday, May 8, 2023

Beta Testing NetBSD 10; the Intermittent Result Set

After running sets of tests on the NetBSD 10.0 beta release, I had opened about two dozen problem reports of varying severity and thought it time to take stock. Some have been closed, many have identified solutions, and a few are head-scratchers.

Full test cycle on an ARM SoC

I set up a cron job to run the /usr/tests/ suite on a Pi 02W. Each run took a few hours to complete, which allowed a restart every 8 hours. The slice shown is roughly 8 hours and includes CPU temperature, interrupts, and memory. Sorry, I cut off the scale here.

This image covers 48 hours and shows CPU core temperature (Celsius) during the automated test cycle.

The temperature ranges from 45 to 50 with no tests running, then peaks at 60 at points.

Reviewing the test results (failures, in other words) revealed a few issues worth opening tickets for, as did reading up on the test algorithms and reference pages. What intrigued me most were the tests that did not pass or fail consistently. A typical cause is running out of memory while creating sample data of variable dimensions. One side effect, for example:

[ 61059.604145] UVM: pid 12801 (h_libarchive), uid 0 killed: out of swap
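
Messages like this are easy to miss in a long dmesg log. A minimal sketch of pulling out the fields, assuming every such message follows the layout of the one line quoted above (the pattern and the parse_uvm_kill name are mine, not part of NetBSD or ATF):

```python
import re

# Pattern modeled on the single kernel message quoted above.
# Assumption: other UVM kill messages use the same layout.
UVM_KILL = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+UVM: pid (?P<pid>\d+) "
    r"\((?P<name>[^)]+)\), uid (?P<uid>\d+) killed: (?P<reason>.+)"
)


def parse_uvm_kill(line):
    """Return a dict describing a UVM kill message, or None if no match."""
    m = UVM_KILL.search(line)
    if m is None:
        return None
    return {
        "uptime": float(m.group("uptime")),
        "pid": int(m.group("pid")),
        "name": m.group("name"),
        "uid": int(m.group("uid")),
        "reason": m.group("reason").strip(),
    }


sample = "[ 61059.604145] UVM: pid 12801 (h_libarchive), uid 0 killed: out of swap"
info = parse_uvm_kill(sample)
print(info["name"], info["reason"])  # prints: h_libarchive out of swap
```

Matching the process name against the test programs under /usr/tests would then show which test triggered the kill.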

The memory pressure is sometimes visible in vmstat output:

 0 0     7960 330868    0   0   0    0    0    0  0 8519   18  36  0 0 100
 0 0     7960 330868    0   0   0    0    0    0  0 8521   18  38  0  0 100
Mon May  1 23:24:01 UTC 2023
 procs    memory      page                       disk faults      cpu
 r b      avm    fre  flt  re  pi   po   fr   sr l0   in   sy  cs us sy id
 1 0     1968 337656   47   0   0    0    4    4  1 8503  184  67  0  1 99
 1 0    10828 328948 7245   0   0    0    0    0 99 9117 5113 989  7 13 80
 2 0    19112 320688 2576   0   0    0    0    0 42 9392 4107 1979 12 16 72
 0 0     8568 331328 1534   0   0    0    0    0 89 8919 1788 949  1  5 95
 0 0     7896 332000  201   0   0    0    0    0  0 8534  254  44  0  0 100
 0 0     7896 332000    0   0   0    0    0    0  0 8527   18  36  0  0 100
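
To find the low point of free memory in a capture like this, the rows can be scanned mechanically. A rough sketch, assuming the 17-column layout shown in the header above and reading columns by position, since the "sy" name appears twice (the data_rows helper is hypothetical):

```python
FRE = 3  # position of the "fre" column in the header above
FLT = 4  # position of the "flt" column


def data_rows(text):
    """Yield vmstat sample rows as lists of ints, skipping headers and timestamps."""
    for line in text.splitlines():
        fields = line.split()
        # Assumption: a sample row is exactly 17 all-numeric fields.
        if len(fields) == 17 and all(f.isdigit() for f in fields):
            yield [int(f) for f in fields]


sample = """
Mon May  1 23:24:01 UTC 2023
 procs    memory      page                       disk faults      cpu
 r b      avm    fre  flt  re  pi   po   fr   sr l0   in   sy  cs us sy id
 1 0     1968 337656   47   0   0    0    4    4  1 8503  184  67  0  1 99
 1 0    10828 328948 7245   0   0    0    0    0 99 9117 5113 989  7 13 80
 2 0    19112 320688 2576   0   0    0    0    0 42 9392 4107 1979 12 16 72
 0 0     8568 331328 1534   0   0    0    0    0 89 8919 1788 949  1  5 95
 0 0     7896 332000  201   0   0    0    0    0  0 8534  254  44  0  0 100
 0 0     7896 332000    0   0   0    0    0    0  0 8527   18  36  0  0 100
"""

rows = list(data_rows(sample))
lowest = min(rows, key=lambda r: r[FRE])
print(lowest[FRE], lowest[FLT])  # prints: 320688 2576
```

In this short capture, free memory bottoms out in the third sample and recovers within a few intervals, which is consistent with a test allocating and then releasing its sample data.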


Given that those results are not indicative of system faults, I looked at the other test case failures and sorted them by status (fixed, workaround, and so on). Only 3 of any consequence have a root cause I don't know. Two of them are wifi-related, and the third is a failure to compile profiling data into an executable on a 32-bit Pi 02W.

One of the intermittent failures (57291) has corresponding out-of-memory messages. While it's a low-priority issue because the tests would pass with more storage space, there is probably a way to avoid runaway space requests and still have useful tests.

Tickets with unknown root cause:

  1. misc/57303 [serious/medium]: ATF unit test usr.sbin/tcpdump/t_tcpdump fails when wireless active on amd64
  2. toolchain/57321 [non-critical/medium]: ATF test case usr.bin/cc/t_hello:hello_profile fails on RPI02W/evbarm only
  3. bin/57366 [serious/medium]: Automated test usr.sbin/tcpdump/t_tcpdump:promiscuous fails on Rpi3 with wifi active

Tickets with known cause/workaround:

  • kern/57185 [non-critical/medium]: Python build fails on 10_BETA due to no entropy on Atom CPU system
  • misc/57286 [serious/medium]: Unit test fs/tmpfs/t_vnode_leak fails in ATF Tests suite
  • misc/57291 [serious/medium]: Unit test for lib/libc/regex/t_exhaust fails in ATF Tests suite with signal 9
  • lib/57314 [serious/medium]: ATF unit tests fail on 3 of 7 cases in program lib/libc/c063/t_utimensat on evbarm/Rpi 02W
  • kern/57320 [serious/medium]: ATF test case kernel/t_magic_symlinks:machine_arch fails on RPI02W/evbarm only [?]
  • lib/57331 [serious/medium]: Automated unit test lib/libc/net/t_servent:servent fails on amd64 only
  • misc/57361 [non-critical/medium]: Automated test t_archive fails 2 test cases on an Rpi3

Tickets fixed:

  • misc/57284 [serious/medium]: Unit test for envstat fails in ATF Tests suite on one machine
  • kern/57319 [serious/medium]: ATF test case kernel/t_magic_symlinks fails as non-root instead of showing expected fail message

Tickets with intermittent pass/fail results:

  • misc/57291 [serious/medium]: Unit test for lib/libc/regex/t_exhaust fails in ATF Tests suite with signal 9
  • kern/57345 [serious/medium]: Automated test kernel/kqueue/t_empty fails intermittently on an amd64 machine
  • toolchain/57351 [serious/medium]: Automated test usr.bin/c++/t_tsan_vptr_race:vptr_race fails intermittently on an amd64 machine
  • kern/57371 [serious/medium]: Automated test fs/vfs/t_vnops:nfs_rename_reg_nodir fails intermittently on Rpi3 and Rpi4
  • kern/57385 [serious/medium]: Automated test case for puffs file system fails intermittently on different architectures
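
Sorting tests into this bucket comes down to comparing outcomes across repeated runs. A minimal sketch of that comparison, assuming each run has already been reduced to a mapping of test name to pass/fail (the classify function, the test names, and the outcomes below are invented for illustration and are not my actual results):

```python
from collections import defaultdict


def classify(runs):
    """Label each test 'pass', 'fail', or 'intermittent' across all runs.

    runs is a list of dicts mapping test name -> True (passed) or False (failed).
    """
    outcomes = defaultdict(set)
    for run in runs:
        for name, passed in run.items():
            outcomes[name].add(passed)

    labels = {}
    for name, seen in outcomes.items():
        if seen == {True}:
            labels[name] = "pass"
        elif seen == {False}:
            labels[name] = "fail"
        else:
            # Both True and False were observed for this test.
            labels[name] = "intermittent"
    return labels


# Hypothetical outcomes from three runs of the suite.
runs = [
    {"suite/t_alpha": True, "suite/t_beta": False, "suite/t_gamma": True},
    {"suite/t_alpha": True, "suite/t_beta": False, "suite/t_gamma": False},
    {"suite/t_alpha": True, "suite/t_beta": False, "suite/t_gamma": True},
]

for name, label in sorted(classify(runs).items()):
    print(name, label)
```

A test that fails in every run is at least reproducible; the ones labeled intermittent need more runs, or a look at system state during the run, before the cause is clear.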


Documentation tickets:

  • misc/57318 [non-critical/low]: Minor typos in an automated test case - atf/tools/atf-run_test
  • misc/57332 [non-critical/low]: Replace 'http' with 'https' on netbsd.org links found in man pages
  • misc/57343 [non-critical/low]: Typo in automated test rumpkern/t_vm.c  ('this' should say 'thus')
  • misc/57344 [non-critical/low]: Wiki page for evbarm port missing rpi4 mention
  • misc/57347 [non-critical/low]: Several man pages have obsolete file location references under /usr/share/doc
  • misc/57397 [non-critical/low]: Minor comment typos in t_vnops.c test program
