Friday, February 18, 2022

Building out an intra-home data aggregator

After I put a couple of Raspberry Pi environment monitoring devices into my shopping cart, and then onto my pile of unconnected devices, I looked around for a central data aggregator that would be a step up from individual cron jobs and RRD repositories. The open source platform Zabbix looked interesting, as online posts mentioned it for pulling data from remote devices. Around the end of January, as the weather pushed me toward indoor projects, I started figuring out the moving parts.

My first thought was that I'd like to run the database and system on NetBSD, but after looking at the state of Zabbix server and agent version availability, I found FreeBSD has the advantage of better coverage. I thought I could put the server processes on one node and the database itself on a different node, since I already had working PostgreSQL systems. Trying to activate NetBSD packages on the ARM Pi systems ended up in a battle between MySQL and MariaDB, meaning the package wouldn't use PostgreSQL as I wanted.

The package version of Zabbix for FreeBSD is also pre-configured for MySQL, despite the configuration file indicating that a few tweaks could alter the database connection. I tried several permutations of ports and other values before coming to the realization that if I wanted a different database I'd need to build the application from source. But running both a binary package repository and one built from source on one node can be problematic, leading me to start from scratch with a fresh FreeBSD system.

After a few feints, I was able to get FreeBSD 13 set up on an SSD connected through USB to a Pi 4. Turns out, that was the easy part. I tried to find the least common denominator of underlying libraries to minimize the time spent watching auto-configure and make run through huge software stacks. I was expecting there might be several days' worth of churn ahead but was pleasantly surprised as I watched the component parts get laid down.

Somewhere around Groundhog Day, I had the Zabbix server working, along with at least one agent, and proceeded to deploy the front end. Fortunately, I had a working httpd server on another FreeBSD Pi, so adding the PHP code was a minor hassle.

In a prior life, I worked with enterprise-scale software/hardware/application monitoring software, from BMC Patrol/Enterprise Manager, to HP OpenView, to Computer Associates tools, and finally, to the now-tainted SolarWinds. Much of the design of Zabbix, as well as the agent technologies, looked pretty familiar. Once the web front end was running, it looked very clean and modern, with menus and paths that looked straightforward.

[

 43331:20220204:013743.375 cannot send list of active checks to "x.x.x.x": host [Name] not found

]

(04 February 2022)

Alas, not everything was as simple as it appeared. The learning curve was not too steep, though it involved translating a few terms into understandable chunks. Like "not supported" as a state for a monitoring element. On the surface, that would mean to me that the combination was just not going to work, as opposed to a condition where an element went offline or was unreachable for some reason. There were more subtleties under the surface as I'd learn by trial and error.

The first hurdle was the nomenclature of server and agent, where you could put a label on something that matched a DNS entry, or didn't match. I knew that once I started making configuration choices (like short name or long name), I'd probably be stuck with that decision once the beast took on a life of its own.

[

 85706:20220205:173436.157 resuming Zabbix agent checks on host "sample": connection restored

]

(05 February 2022)


Finally, with database, server, agents, and the web front-end connected and working properly, it was time to examine the contents and see what hath been wrought.



Total disk space on "/" and Free disk space on "/" are shown on the graphic above, with 115 and 85 GB, roughly, or 100% and 74%. But wait, that red pie slice doesn't occupy 100% of a circle, it's only 26% (more or less). A graphic that "works" but is wrong.



This is good. Gold, even.












(07 February 2022)

Here, I noticed a large network stream to/from one device, which was apparently running an audio program that no one was listening to. It happens. So, the Zabbix charts revealed useful information within a couple days.

Back to the Raspberry Pi tuning tweaks. I found several examples of adding monitoring to Zabbix, and started with 2 of them. One has a bash script with over a dozen metrics included, and the other has 4 metrics contained within one add-on.

I learned the basic add-on set includes an XML or other defined method of setting up configuration, and a set of commands, usually shell scripts (but could be other languages if wrapped correctly).


Derived metrics

Obviously, with chips made these days, temperatures would be reported in Celsius, not Fahrenheit, but as an American, I'm more conversant with the latter. So it would make sense to calculate a derived value inside the monitoring suite, as good practice for learning how to build and deploy future readings from wherever (home thermostat/outside weather/noise levels). Let's see: nine-fifths plus thirty-two, in words, works out to this in configure-speak:

(32+((9/5)*last(//raspberrypi.sh[temperature])))*1000

Wait, where did the 1000 come from?
Er, turns out one Raspberry Pi interface reports temperature as an integer in thousandths of a degree Celsius, while another reports degrees with a single decimal, like this:

 $ cat /sys/class/thermal/thermal_zone0/temp
41856


$ vcgencmd measure_temp

temp=41.9'C

This measurement is 41 degrees, plus a fraction. Depending on how the data are pulled, the decimal points might get shifted around.
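To keep the decimal point straight, here is a minimal Python sketch of the same conversion done outside Zabbix. The function names are my own (hypothetical), and it assumes the sysfs value is in thousandths of a degree Celsius and that vcgencmd prints the `temp=41.9'C` form shown above. It is not the Zabbix calculated item itself, just a way to check what the result should look like.

```python
def celsius_to_fahrenheit(celsius):
    """Nine-fifths of the Celsius value, plus thirty-two."""
    return celsius * 9.0 / 5.0 + 32.0


def millidegrees_to_fahrenheit(raw):
    """Convert a sysfs reading (thousandths of a degree C) to Fahrenheit."""
    return celsius_to_fahrenheit(raw / 1000.0)


def vcgencmd_to_fahrenheit(text):
    """Convert a string such as "temp=41.9'C" to Fahrenheit."""
    value = text.strip().split("=", 1)[1].rstrip("'C")
    return celsius_to_fahrenheit(float(value))


if __name__ == "__main__":
    # Both readings from above land a little over 107 degrees Fahrenheit.
    print(round(millidegrees_to_fahrenheit(41856), 1))
    print(round(vcgencmd_to_fahrenheit("temp=41.9'C"), 1))
```

If the derived value comes out a thousand times too large or too small, the raw reading was probably scaled once too often, or not at all, somewhere along the way.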

This chart shows the derived "Betriebstemperatur" in Fahrenheit. Errors in getting the formula correct caused the first several numbers to be incorrect, rather than simply missing. I expect the impact of this will diminish over time. I couldn't find a quick way to purge old data (yet).












(08 February 2022)

ZBX_NOTSUPPORTED: Invalid item key format.
ZBX_NOTSUPPORTED: Unsupported item key.

In checking out one metric, I noticed the above 2 messages look similar, particularly the identical all-uppercase intro, but invalid is not unsupported. The former looks more fixable on the surface; looking for the root cause would settle it for sure.

Now it's the 11th, after a week or so of building, deploying, configuring, troubleshooting, tuning, and rebuilding. I think this was worth it just to show the capability of a $100 Raspberry Pi + SSD combo with FreeBSD.

Issues and Fixes


[3.] Add zabbix user to video group
   $ sudo usermod -a -G video zabbix

This is necessary based on the default command permissions. Adjust based on user preferences.


In earlier Raspberry Pi versions, apparently the command to interrogate internal counters and more (vcgencmd) was installed under the directory /opt/vc/bin/. The "opt" directory is one of those UNIX relics like "/usr/local/" where custom software might be installed outside the base release. But, as happens, that location became obsolete when the newer versions put vcgencmd into /usr/bin, which would be in a typical PATH search. With the nature of some Google searches leading to older code based on hit counts, you might be trying to run something that isn't there, with the resulting obscure side effects.

[
 85684:20220205:013853.811 item "pi.net:rpi.cpuVoltage" became not supported: Value of type "string" is not suitable for value type "Numeric (float)". Value "sh: 1: /opt/vc/bin/vcgencmd: not found"
]

So, I saw 2 obvious ways to fix this. First, alter the script to the correct path; second, put a link into the old location pointing to the new location. I chose the latter as having fewer steps, though purists may prefer to alter the source.

[
pi@pi:/opt/vc/bin $ sudo ln -s /usr/bin/vcgencmd vcgencmd
]
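For the purists, the first option could look something like the sketch below: search the PATH first, then fall back to the old directory. This is Python rather than the shell of the add-on script, and the helper name `find_vcgencmd` is hypothetical; only the two locations come from the notes above.

```python
import os
import shutil

# Old install location, taken from the "not found" log message above.
LEGACY_DIR = "/opt/vc/bin"


def find_vcgencmd(name="vcgencmd", legacy_dir=LEGACY_DIR):
    """Return the full path to the command, or None if it is not installed."""
    found = shutil.which(name)  # newer releases: somewhere on the PATH
    if found:
        return found
    candidate = os.path.join(legacy_dir, name)  # older releases
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    return None


if __name__ == "__main__":
    print(find_vcgencmd() or "vcgencmd not found in PATH or " + LEGACY_DIR)
```

Returning None instead of an error string matters here, since a stray "not found" message is exactly what ended up being stored as a metric value.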

More issues and fixes


[
zabbix_agentd [7518]: cannot create locks: cannot create semaphore set: [28] No space left on device
]

This bug stumped me for a little while. On another node that was already running PostgreSQL before adding a Zabbix agent, I was getting errors showing "no space left on device", despite having a nearly empty 500GB SSD. If I stopped the database, the agent would launch. But both would not run at the same time (on NetBSD 9.x).

Locks and semaphores are another obscure UNIX facility, going back to the early AT&T System V releases. Fortunately, I was experienced with configuring shared memory for large Oracle database deployments, so even though that was decades ago, the seeds are still there. Looking at the error message, it's unclear which parameter might be the limiting one, as several settings have very similar names mentioning semaphores and shared memory.

SHMMNI  Maximum number of shared memory segments system-wide 
SEMMNI  Maximum number of semaphore identifiers (i.e., sets)  
SEMMNS  Maximum number of semaphores system-wide  

The ipcs command will show the current state.

$ ipcs -a
IPC status from <running system> as of Mon Feb  7 02:04:06 2022


And the sysctl command will show kernel and other settings on BSD.

Before:

kern.ipc.semmni = 10
kern.ipc.semmns = 60
kern.ipc.semmnu = 30

After:

kern.ipc.semmni = 100
kern.ipc.semmns = 600
kern.ipc.semmnu = 300


I could have tried to optimize these settings one by one using small increments, but knowing that these default values date back decades to much less capable systems, I increased each of them by a factor of 10. I speculated that any wasted resources would be minimal, and was rewarded by both processes starting and running without errors.
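The factor-of-ten bump is simple enough to do by hand, but as a small sketch, here is how the "Before" listing could be parsed and scaled in Python. The function name is hypothetical, and the `name=value` output form is my assumption about what a sysctl configuration file expects; check the sysctl manual page on the target system before using it.

```python
# The "Before" values, copied from the sysctl output above.
BEFORE = """\
kern.ipc.semmni = 10
kern.ipc.semmns = 60
kern.ipc.semmnu = 30
"""


def scale_sysctl(text, factor=10):
    """Parse 'name = value' lines and return {name: value * factor}."""
    scaled = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank or malformed lines
        name, value = line.split("=", 1)
        scaled[name.strip()] = int(value.strip()) * factor
    return scaled


if __name__ == "__main__":
    for name, value in scale_sysctl(BEFORE).items():
        print(f"{name}={value}")
```

Run against the listing above, this yields the same 100/600/300 values shown under "After".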

ALSO: swap


pid 92246 (c++), jid 0, uid 0, was killed: out of swap space

I had tried to build an X Windows program after getting Zabbix working, but the compile failed with obscure "internal errors". Later I found the more succinct root cause of the failure: out of swap space. Wow, also an old timey issue on virtual memory systems from the 1980s like DEC VMS.

On a Pi Zero 2 W:

$ swapon --show
NAME      TYPE SIZE  USED PRIO
/var/swap file 100M 97.9M   -2

then, later:

$ sudo swapon
NAME           TYPE  SIZE USED PRIO
/var/swapfile2 file 1024M   0B   -2
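To put a number on how close to the edge that first swap file was, here is a rough Python sketch that turns the `swapon --show` listing into a percent-used figure. The function names are hypothetical, and it assumes the single-letter suffixes (B, K, M, G) are binary units, which is an assumption on my part.

```python
# First listing from the Pi Zero 2 W, copied from above.
SAMPLE = """\
NAME      TYPE SIZE  USED PRIO
/var/swap file 100M 97.9M   -2
"""

# Assumed binary multipliers for the size suffixes.
UNITS = {"B": 1, "K": 1024, "M": 1024**2, "G": 1024**3}


def to_bytes(text):
    """Convert a size such as '97.9M' or '0B' to a byte count."""
    return float(text[:-1]) * UNITS[text[-1]]


def swap_usage(output):
    """Return {swap area name: percent used} for each row of the listing."""
    usage = {}
    for line in output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        name, _kind, size, used = fields[:4]
        usage[name] = 100.0 * to_bytes(used) / to_bytes(size)
    return usage


if __name__ == "__main__":
    for name, percent in swap_usage(SAMPLE).items():
        print(f"{name}: {percent:.1f}% used")
```

With the first listing, the 100M file comes out at nearly 98% used, which fits the out-of-swap kills.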

For FreeBSD/NetBSD, swap metrics are tricky with Zabbix. I used the FreeBSD template to connect to NetBSD nodes, as Zabbix only includes templates for FreeBSD and OpenBSD (pity). The FreeBSD nodes reported swap issues different from the Linux conditions noted above, while the NetBSD swap metrics failed, most likely due to syntax Babel amongst the BSD descendants. Though not directly Zabbix related, I wanted to address the out-of-the-box swap configuration, at least to have a learning experience, with the added risk factor of wiping an entire installation with an errant format command.

At first, it didn't appear that FreeBSD supported swap files, only swap devices. And as I didn't want to go back and try to repartition a running system, I was leaning toward adding a USB memory dongle for swap when I dug deeper into the manual pages. My initial surmise was incorrect: I could build a swap file (if I wanted) in a manner very similar to the Linux steps, which makes future build errors less likely.
Initially, no swap on a FreeBSD Pi build:

[]# dd if=/dev/zero of=/var/swapfile2 bs=1024k count=16384
[]# chmod 0600 /var/swapfile2


/etc/rc.conf: 26 lines, 511 characters
swapfile="/var/swapfile2"   # Set to name of swapfile if desired.

[]# mdconfig -a -t vnode -f /var/swapfile2 -u 0 && swapon /dev/md0
[]#

Swap: 16G Total, 16G Free

warning: total configured swap (4194304 pages) exceeds maximum recommended amount (3928456 pages).
warning: increase kern.maxswzone or reduce amount of swap.
[]$

OK, I overdid it, but now I have the classic swap at > 2 times physical memory, ha!
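How much did I overdo it? The arithmetic can be worked out from the numbers in the warning alone. This is a small Python sketch; the function name is hypothetical, and the page size is derived from the warning rather than assumed.

```python
MIB = 1024 * 1024

swap_bytes = 16384 * MIB      # dd bs=1024k count=16384
configured_pages = 4194304    # "total configured swap" in the warning
recommended_pages = 3928456   # "maximum recommended amount" in the warning


def max_dd_count(recommended, page_size, block=MIB):
    """Largest whole number of dd blocks that stays under the page limit."""
    return recommended * page_size // block


# Bytes per page implied by the warning: file size divided by page count.
page_size = swap_bytes // configured_pages

if __name__ == "__main__":
    print("page size:", page_size)
    print("largest count at bs=1024k:", max_dd_count(recommended_pages, page_size))
```

This works out to 4096-byte pages and a count of 15345, so a swap file of just under 15 GB would have stayed inside the recommended limit without touching kern.maxswzone.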


The top command reports:

Swap: 16G Total, 11M Used, 16G Free

and then, with a big compile running:

Swap: 16G Total, 4770M Used, 11G Free, 29% Inuse, 18M In, 2124K Out








And: CPU Throttling


Initially, this metric failed with an obscure error, and Zabbix then stopped taking readings by declaring the item "not supported". Hitting the link changes the item to disabled; hitting it again enables the readings to be tried again. Of course, if the underlying glitch isn't fixed, the result is again not supported/out of service on the next cycle.

Problem: find(/Raspberry Pi/rpi.cpuThrottled,,"iregexp","\\b(0x0)\\b")=0
Recovery: find(/Raspberry Pi/rpi.cpuThrottled,,"iregexp","\\b(0x0)\\b")=1


Here, the root cause was self-inflicted. Somehow, in pulling down the configuration and transferring it into the Zabbix server, some kind of code page shift occurred, adding bogus text into the trigger definitions. Right out of the box, this failed with little fanfare. I researched the supplied functions and suspected the fault lay within.

Once I found the configuration details and could save them as above, I edited the parameters to more cogently reflect the expected function output. The output is normally "0x0", meaning zero, and higher hex values have specific meanings. Since the first pattern match always failed, the CPU always, and incorrectly, appeared throttled.
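As a sanity check outside Zabbix, the sketch below tries the same "is it 0x0?" pattern in Python and decodes the higher hex values. The bit meanings are how I understand the Raspberry Pi documentation for the throttled status, so treat that table as an assumption; the function names are hypothetical, and Python's regular expressions may not behave identically to the engine Zabbix uses.

```python
import re

# Assumed bit meanings for the throttled value; verify against the Pi docs.
FLAGS = {
    0: "under-voltage detected",
    1: "ARM frequency capped",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred",
    17: "ARM frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}

# Same idea as the trigger pattern: the literal 0x0 as a whole word.
OK_PATTERN = re.compile(r"\b0x0\b", re.IGNORECASE)


def is_ok(value):
    """True when the reading is exactly 0x0 (no flags set)."""
    return OK_PATTERN.search(value) is not None


def decode(value):
    """List the flag descriptions set in a reading such as '0x50005'."""
    match = re.search(r"0x[0-9a-fA-F]+", value)
    if match is None:
        raise ValueError(f"no hex value in {value!r}")
    bits = int(match.group(0), 16)
    return [text for bit, text in FLAGS.items() if bits & (1 << bit)]


if __name__ == "__main__":
    for reading in ("throttled=0x0", "throttled=0x50005"):
        print(reading, is_ok(reading), decode(reading))
```

A pattern that never matches "0x0" behaves like the broken trigger: every reading looks like a problem, whatever the hardware is doing.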

New values:


 

Disk I/O


min(/pi/vfs.dev.read.await[mmcblk0],15m) > {$VFS.DEV.READ.AWAIT.WARN:"mmcblk0"} or
min(/pi/vfs.dev.write.await[mmcblk0],15m) > {$VFS.DEV.WRITE.AWAIT.WARN:"mmcblk0"}

I ended up setting the write time to 50ms to avoid pesky un-addressable errors; a future workaround might be to set up storage device classes so that SSD and SD cards are treated individually in terms of known capabilities.

"Linux block devices by Zabbix agent" is the template that contains macros where the above thresholds can be edited.
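The min-over-15-minutes form of the trigger means a single slow sample won't fire it; every sample in the window has to be over the threshold. Here is a tiny Python sketch of that logic. The sample values and the 20ms read threshold are hypothetical; only the 50ms write threshold comes from my setup.

```python
def await_problem(read_ms, write_ms, read_warn, write_warn):
    """True when the smallest sample in either window exceeds its threshold."""
    return min(read_ms) > read_warn or min(write_ms) > write_warn


if __name__ == "__main__":
    # Hypothetical samples in milliseconds, one per poll across the window.
    reads = [4.0, 6.5, 3.2]
    writes = [80.0, 12.0, 95.0]

    # One fast write (12.0) keeps the minimum under 50, so no problem is raised.
    print(await_problem(reads, writes, read_warn=20.0, write_warn=50.0))

    # With every write sample over 50, the trigger condition becomes true.
    print(await_problem(reads, [80.0, 61.0, 95.0], read_warn=20.0, write_warn=50.0))
```

That also shows why raising the threshold quiets an SD card: a slow card sits above a low limit for the whole window, not just for a spike.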




SNMP

Gah. I tried it out and Zabbix can go down that net walk with the best. I didn't find anything spectacular or dismal to report on, so this section will be brief.











These temperature values are reported without decimals, so the graph jumps in digital hops rather than spreading around in an analog sweep. I wasn't that interested in finding out why system temperature was reported and not CPU. There were other temperature values that I'd investigate further once more trends age in place, as it were.


Agent "2"


./configure --enable-agent2

Go errors

go: downloading github.com/mattn/go-sqlite3 v1.14.8
package zabbix.com/cmd/zabbix_agent2
        imports zabbix.com/plugins: build constraints exclude all Go files in /usr/local/src/zabbix/zabbix-5.4.9/src/go/plugins
*** Error code 1

Stop.
make[3]: stopped in /usr/local/src/zabbix/zabbix-5.4.9/src/go
*** Error code 1


Works on Windows 32/64.



Last thoughts

After letting things settle for a few days, I must say I admire the completeness of the installation, the vision, and the big things that cover the small mistakes. Without reading too many manuals, I was able to set up a variety of monitoring collections that help more than they hinder. The controls to display charts are intuitive, quick, and quite legible. I even found that zooming in on a time segment was as easy as highlighting a period with the pointer (though this action didn't work on an Android device for me).

This chart includes values from 2 sensors on one Raspberry Pi HAT, which report in degrees Centigrade. Given the sensors are close enough to be affected by heat from the Pi itself, I added a calculation to return a value as close to ambient as I could manage, using a reference thermometer and readings over a period of time. Since the sensor base readings differ, no wonder the results also don't coincide. But they are within one or two degrees (Fahrenheit) and produce useful information. ("Hey, close the door, do you live in a barn?")
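One simple way to do that kind of correction is a constant offset: average the gap between the sensor and the reference thermometer over a set of paired readings, then subtract it from later readings. The Python sketch below shows only that idea; the readings are made up for illustration, the function names are hypothetical, and a real sensor may need something better than a fixed offset.

```python
from statistics import mean


def mean_offset(pairs):
    """Average of (sensor - reference) over paired readings."""
    return mean(sensor - reference for sensor, reference in pairs)


def to_ambient(sensor_reading, offset):
    """Remove the estimated self-heating offset from a raw sensor reading."""
    return sensor_reading - offset


if __name__ == "__main__":
    # Made-up (sensor, reference) pairs in degrees Celsius.
    pairs = [(25.0, 22.0), (26.0, 23.0), (24.5, 21.5)]
    offset = mean_offset(pairs)
    print("offset:", offset)
    print("corrected:", to_ambient(27.0, offset))
```

Each sensor gets its own offset, which is one reason two sensors on the same board can still disagree by a degree or so after correction.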






