
Measuring propagation using FT8


One obvious thing that you can do after putting up an amateur radio antenna is to operate a bit on FT8, to see how the propagation goes. Just transmit on all bands and see how far you get.

E.g. this map on pskreporter.info with 10W on my EFHW:

10W EFHW propagation

You can also use the reverse beacon network with morse code:

Reverse beacon network for M0THC

But that’s just a few samples. What about more statistical data? And propagation over time? I don’t have access to the raw data from pskreporter.info, and even if I did I can’t just set up an automatic beacon tx round the clock every day without requesting a Notice of Variation.

I may do that some day, but it’s a project for another time.

For this post what I want to know is if my antenna setup is better for 20m or 40m. Subjectively it seems like more is trickling in on 40m. And when they say that 40m is better “at night”, what time exactly do they mean?

For passive listening my data will, of course, be heavily skewed by when people are awake and active. But that means it’s skewed towards representing “if I call CQ, how likely is it that I’ll get a reply”, which is also good.

I merge data across multiple days, and compare 20m on one day to 40m on another. So this is not pure science. I’m not claiming proper research here. I’m just playing with data that is very relevant for me (my antenna actually collected it), and in any case interesting.

I wrote a tool that takes signal reports and stuffs them into a json file, and configured wsjtx to send reports to it. Then I made some scripts that turn these into the graphs you’ll see below.

Results

ft8-decodes-day

So when they say 20m is a day band… yeah, at least for me. That straight line at “1” will for many time periods actually be “at most 1”, due to a graphing artefact.

ft8-decodes-distday

But the average distance picks up! This is probably because it’s the best propagation time of day for the US, and the US is far away. So what does come through has a better chance of being transatlantic.

Very interesting.

Propagation should, as I understand it, be mostly bidirectional. So it seems that about 22:00 UTC on 20m is the best time for that. But for contacts anywhere it seems 40m at about 19:00-20:00 UTC is best. If I were to guess, it’s because that’s the evening, when Europeans are active, and 40m, with its shorter reach (for me), at least covers all of Europe.

Future work

There’s more I want to do. Like plot a heat map over the world. And create animations. But this will have to do for today.


More FT8 propagation


Last month I graphed the distance to remote stations as a function of time of day.

Today I plotted the gridsquare locations on a world map:

Grid squares heard

Ignore the top right one. That’s “RR73”, and not a real grid square. The rest should be accurate.

More that can be done (more interesting with more data than I can get, though):

  • also take into account the received signal strength
  • …and number of unique callsigns per grid square
  • create animations over time

If I had access to the data from pskreporter I could even, instead of using just a callsign as input data, use a grid square as input.

So for example I could create an animation to show what the propagation was over the last week from any given gridsquare, and generate them on-demand.

Like last time, the scripts are pretty hacky proofs of concept. But they work.

The uselessness of bash


The way I write automation for personal projects nowadays seems to follow a common pattern:

  1. A command line, that’s getting a bit long
  2. A bash script
  3. Rewrite in Go

Occasionally I add a step between 2 and 3 where I write it in Python, but it’s generally not actually gaining me anything. Python’s concurrency primitives are pretty bad, and it’s pretty wasteful.

Maybe there’s an actually good scripting language somewhere.

I should remember that writing a bash script (step 2) seems to almost never be worth it. If it’s so complicated that it doesn’t fit on one line, then it’ll become complicated enough to not work with bash.

There are two main things that don’t work well. Maybe there are good solutions to these problems, but I’ve not found them.

1. Concurrency

There are no good primitives; basically only xargs -P and &. It’s annoying when you have an embarrassingly parallelizable problem where you want to run exactly nproc jobs in parallel.

Especially error handling becomes terrible here.
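
For what it’s worth, the xargs -P version tends to look something like this (a sketch; the file names are just an example):

# Compress every .csv file, running one gzip per CPU.
printf '%s\0' *.csv | xargs -0 -P "$(nproc)" -n 1 gzip -9
# If any single job fails, xargs only tells you at the very end, via the
# hard-to-interpret exit code 123, and the other jobs keep running meanwhile.
echo $?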

2. Error handling

You can handle errors in bash scripts in various ways:

  1. || operator. E.g. gzip -9 < a > a.gz || (echo "handling error…")
  2. set -e at the top of the script. Actually you should always have this at the top of your scripts. An ignored failure return code is almost always a bug.

But this doesn’t handle all failures. E.g. if you have a pipeline, and anything but the last command fails:

(false | gzip -c > /dev/null) && echo yes

This is because by default the result of the whole pipeline is just the result of the last command.

You can fix this by running set -o pipefail. This makes the exit code of the pipeline be the value of the last (rightmost) command to exit with a non-zero status, or zero if all commands exit successfully.
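
To illustrate, the same pipeline as above, before and after turning it on:

$ (false | gzip -c > /dev/null) && echo yes
yes
$ set -o pipefail
$ (false | gzip -c > /dev/null) && echo yes
$ echo $?
1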

But even this is not good enough. A remaining problem is that a downstream command in a pipeline has no way to know if an upstream command failed.

Here’s an example command I tried to run:

gsutil cat gs://example/test.data \
       | sort -S300M \
       | gsutil cp - gs://example/test.data.sorted

If sort fails, for example because it runs out of space for temporary files, then all the second gsutil sees is that its input closes, and as far as it knows the data generation is now successfully completed, and it finalizes the upload.

So I want the second gsutil to be killed if anything earlier in the pipeline fails.

Yes, I could probably do an initial upload to a temporary file, marked with STANDARD storage class, and a TTL set to automatically delete, and then if the pipeline succeeds I can rename, set storage class, and change TTL.

But I shouldn’t have to! If only foo | bar killed bar (in a race-condition-safe way) when foo failed, then this wouldn’t be a problem.

I could do this in bash, with something like:

SELF=$$
mkfifo pipe1 pipe2
gsutil cp - gs://example/test.data.sorted < pipe2 &
CMD3=$!
(
  sort -S300M || (
    kill $CMD3
    # TODO: wait for $CMD3 to exit, to avoid race condition
  )
) < pipe1 > pipe2 &
CMD2=$!
(
  gsutil cat gs://example/test.data || (
    kill $CMD2
    # TODO: wait for $CMD2 to exit, to avoid race condition
    kill $SELF
  )
) > pipe1

Ugh.

So I end up writing something like this:

package main

import (
	"context"
	"flag"
	"io"
	"os"
	"os/exec"
	"path"

	log "github.com/sirupsen/logrus"
)

var (
	indir  = flag.String("indir", "", "Input bucket and directory on GCS. E.g. for `gs://a/b/` say `a/b`.")
	outdir = flag.String("outdir", "", "Output bucket and directory on GCS. E.g. for `gs://a/b/` say `a/b`.")
)

func main() {
	flag.Parse()

	if *indir == "" || *outdir == "" || flag.NArg() == 0 {
		log.Fatalf("Need -indir, -outdir, and one arg")
	}

	fn := flag.Arg(0)

	inPath := "gs://" + path.Join(*indir, fn)
	outPath := "gs://" + path.Join(*outdir, fn)

	log.Infof("Running %q -> %q…", inPath, outPath)

	pipes := [][]string{
		{"gsutil", "cat", inPath},
		{"zcat"},
		{"awk", "-F,", `{print (15*int(($1%86400)/15)) "," $0}`},
		{"sort", "-t,", "-k", "1,2", "-n"}, // -S300M
		{"gzip", "-9"},
	}
	ctx := context.Background()
	storeCtx, cancelStore := context.WithCancel(ctx)

	var lastPipe io.Reader
	for _, args := range pipes {
		args := args
		cmd := exec.CommandContext(ctx, args[0], args[1:]...)
		if lastPipe != nil {
			cmd.Stdin = lastPipe
		}
		r, w := io.Pipe()
		cmd.Stdout = w
		cmd.Stderr = os.Stderr
		lastPipe = r

		go func() {
			defer func() {
				if err := w.Close(); err != nil {
					cancelStore()
					log.Fatalf("Closing for %q failed: %v", args[0])
				}
			}()
			if err := cmd.Run(); err != nil {
				log.Errorf("Failed to run %q: %v", args[0], err)
				cancelStore()
			}
		}()
	}

	store := exec.CommandContext(storeCtx, "gsutil", "cp", "-", outPath)
	store.Stdin = lastPipe
	store.Stderr = os.Stderr
	store.Stdout = os.Stdout
	if err := store.Run(); err != nil {
		log.Fatalf("Failed to store results: %v", err)
	}
}

Future work

It’s not nice to hard code the commands to run. So maybe the pipeline should be defined in JSON or something, and it’ll all be generic.

TODO :/

AX.25 in user space


The Linux kernel AX.25 implementation (and user space) is pretty poor. I’ve encountered many problems. E.g.:

  • you can’t read() and write() from the same socket at the same time

  • DGRAM receiving just plain doesn’t work.

  • CRC settings default such that at least all my radios (and direwolf) drop the first two packets sent. (fix with kissparms -p radio -c 1)

  • Setting CRC mode resets all other settings.

  • If you send 10 REJs in a row, Linux will resend the entire outstanding window 10 times, likely completely jamming your radio.

  • Even though data frames contain an “ack” field, if Linux sends a whole bunch of data it’ll still send a needless RR frame to ack.

  • On 64bit Raspberry Pi OS some setsockopt flags don’t take effect at all (e.g. setting AX25_EXTSEQ), and other obviously correct ones are treated as invalid (e.g. AX25_WINDOW can’t be set to any value at all).

  • I also get kernel null pointer dereferences on 32bit Raspberry Pi OS when testing AX.25. Not exactly comforting.

  • Other OSs don’t have AX.25 socket support. E.g. OpenBSD. And it’s not obvious to me that this is best solved in kernel space.

  • It doesn’t seem clear to anyone how the AX.25 stack in the kernel is supposed to work. E.g. should axparms -assoc be an enforcing ACL? It’s not, but is it supposed to be?

  • I’ve also seen suggestions that AX.25 should be ripped out of the Linux kernel. Don’t know how plausible that is though.

In any case it seems to me that, had AX.25 been designed today, it would not have been integrated into the Linux kernel.

There’s no need for good CPU performance with these low speeds (kbps), and nowadays one should expect a bugfix to a kernel implementation to take several years before users will receive it. Long gone are the days when you can expect users to “just recompile the kernel”.

I’ve sent a patch or two, but they don’t appear to get much interest. Nor does anyone seem to have a vision of how it should work.

You could probably create a good implementation in user space and get it onto users’ machines before fixes to any of the above problems trickle down through distribution kernel updates.

So I’m implementing AX.25 in user space

Here’s the code.

The way I think makes most sense is to split out the different functions into different binaries. There’s in my opinion no reason one tool should be doing soundmodem, IGate, and radio repeater (as direwolf does). It makes it harder to set up one audio-based TNC next to another radio that has a built-in TNC.

By splitting the functionality into user-controllable pieces, it’ll be easier to build one layer at a time.

The exposed API is gRPC, thus making it language agnostic and with clear API boundaries. And being a network protocol it’ll make it easier to have amateur radio tools on one machine utilize radios all over the world by connecting instead of routing.

E.g. you can run direwolf on a raspberry pi up in your attic, but on it only run the daemon that interfaces KISS with the gRPC API. That’s easier than setting up AX.25 routing between the two machines, or being forced to use higher protocols.

ax25spyd does part of this, but still relies on the kernel implementation.

It’s not done

At this stage it’s almost on a proof-of-concept level. The code is ugly, and the packet scheduler is just the first implementation I could think of that should work.

But more importantly the bigger picture is unclear. I have some ideas about a higher level routing protocol, using internet and radio links as appropriate. But I don’t yet have much experience with NET/ROM or ROSE, and I don’t want to start designing too much without first fully understanding what’s out there.

But luckily that doesn’t prevent me from implementing AX.25 as the layer 2 data link. NET/ROM et al. could be implemented on top of it.

But does it work?

Yeah. Here’s a video showing a simple connection to an axsh server:

The packet decoder runs on the right hand side, but only decodes the received packets. That’s another TODO, but at least you can see that packets are decoded.

(Most of the packet decoding effort is actually to decode the many types of APRS reports, and it’s also not complete)

What’s next?

  • Server-side sockets for SEQPACKET.
  • Finish SEQPACKET implementation, with SABME, windows, etc…
  • Complete APRS decode.
  • LD_PRELOAD library to overload socket() calls.

Unifi docker upgrade


This post is mostly a note to self for when I need to upgrade next time.

Because of the recent bug in log4j, which also affected the Unifi controller, I decided to finally upgrade the controller software.

Some background: There are a few different ways to run the controller. You can use “the cloud”, run it yourself on some PC or raspberry pi, or you can buy their appliance.

I run it myself, because I already have a raspberry pi 4 running, which is cheaper than the appliance, and gives me control of my data and works during an ISP outage.

I thought it’d be a good opportunity to play with docker, too.

How to upgrade

Turns out I’d saved the command I used to create the original docker image. Good thing too, because it seems that upgrading is basically delete the old, install the new.

  1. Take a backup from the UI.
  2. Stop the old instance (docker stop <old-name-here>).
  3. Take a backup of the state directory.
  4. Make sure the old instance doesn’t restart (docker update --restart=no <old-name-here>).
  5. Create a new instance with the same state directory.
  6. Wait a long time (at least on a Raspberry Pi), 15 minutes or more, for the controller to finish starting up.
  7. Try to log in. Hopefully it works now.
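
For me, steps 2-4 came down to roughly this (the old container name is an example; the state directory matches the -v mount in the create script below):

docker stop unifi-2021-07-01
tar -C /home/pi/unifi -czf ~/unifi-state-backup-$(date +%F).tar.gz controller
docker update --restart=no unifi-2021-07-01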

Creating an instance:

#!/bin/sh
VER=6.5.54

exec docker run -d \
  --name=unifi-2021-12-14 \
  -e PUID=1000 \
  -e PGID=1000 \
  -e MEM_LIMIT=1024M `#optional` \
  -p 3478:3478/udp \
  -p 10001:10001/udp \
  -p 8080:8080 \
  -p 8443:8443 \
  -p 1900:1900/udp `#optional` \
  -p 8843:8843 `#optional` \
  -p 8880:8880 `#optional` \
  -p 6789:6789 `#optional` \
  -p 5514:5514 `#optional` \
  -v /home/pi/unifi/controller:/config:rw \
  --restart unless-stopped \
  ghcr.io/linuxserver/unifi-controller:$VER

Linux sound devices are a mess


It started with a pretty simple requirement: I just want to know which sound card is which.

Background about the setup

I sometimes play around with amateur radios. Very often I connect them to computers to play around. E.g. JS8Call, FT8, SSTV, AX.25, and some other things.

This normally works very well. I just connect radio control over a serial port, and the audio using a cheap USB audio dongle. Sometimes the radio has USB support and delivers both a serial control port and an audio interface over the same cable.

The problem

So what if I connect two radios at the same time? How do I know which sound card, and which serial port, is which?

Both serial ports (/dev/ttyUSB<n>) and audio device numbers and names depend on the order that the devices were detected, or plugged in, which is not stable.

The fix for serial ports

Serial ports are relatively easy. You just tell udev to create some consistent symlinks based on the serial number of the USB device.

For example here’s the setup for a raspberry pi that sees various radios at various times (with some serial numbers obscured) added as /etc/udev/rules.d/99-myserial.rules.

ACTION=="add", KERNEL=="ttyUSB*", SUBSYSTEMS=="usb", ATTRS{idVendor}=="10c4", ATTRS{idProduct}=="ea60", ATTRS{serial}=="IC-7300 0301XXXX", SYMLINK+="usb/ic7300"
ACTION=="add", KERNEL=="ttyUSB*", SUBSYSTEMS=="usb", ATTRS{idVendor}=="10c4", ATTRS{idProduct}=="ea60", ATTRS{serial}=="IC-9700 1300XXXX A", SYMLINK+="usb/ic9700-A"
ACTION=="add", KERNEL=="ttyUSB*", SUBSYSTEMS=="usb", ATTRS{idVendor}=="10c4", ATTRS{idProduct}=="ea60", ATTRS{serial}=="IC-9700 1300XXXX B", SYMLINK+="usb/ic9700-B"
ACTION=="add", KERNEL=="ttyUSB*", SUBSYSTEMS=="usb", ATTRS{idVendor}=="0403", ATTRS{idProduct}=="6001", ATTRS{serial}=="AK06XXXX", SYMLINK+="usb/kx2"
ACTION=="add", KERNEL=="ttyACM*", SUBSYSTEMS=="usb", ATTRS{idVendor}=="1546", ATTRS{idProduct}=="01a7", SYMLINK+="usb/gps"

But now on to the tricky part: audio device names.

Audio device names

ALSA

Linux has a long history of audio APIs. Like OSS, ALSA, PulseAudio, Jack, and now PipeWire, to name a few.

It’s a big mess, and they don’t seem to have any consistent name at all. Not across unplug/replug, and not between each other.

From what I can gather this is the situation:

When detected, every audio device shows up in /dev/snd/ as “a few” devices, with sequential numbers in them. E.g. plugging the first radio into my raspberry pi creates /dev/snd/controlC2, /dev/snd/pcmC2D0c, and /dev/snd/pcmC2D0p. These are the devices used for mixer, capture, and playback, respectively.

Those are ALSA cards. They can be listed in a friendlier manner with aplay -l or arecord -l, and will be numbered the same way. The /dev/snd/pcmC2D0p device (on the playback side) shows up like this:

$ aplay -l
[…]
card 2: CODEC [USB Audio CODEC], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0

But “card 2” is not a playback device. No that’s just a card.

aplay -l is good for a human to check what sound cards are there for playback, but is not the name used for playback itself.

The device names used for playback include some more settings. I believe this is so that an application can just open a named playback device and get, say, 7.1 surround sound, as opposed to another name for 2.1.

To see all of these devices you can run aplay -L. These names are not particularly helpful. For example:

$ aplay -L | grep -A 2 sysdefault
sysdefault:CARD=b1
    bcm2835 HDMI 1, bcm2835 HDMI 1
    Default Audio Device
--
sysdefault:CARD=Headphones
    bcm2835 Headphones, bcm2835 Headphones
    Default Audio Device
--
sysdefault:CARD=CODEC
    USB Audio CODEC, USB Audio
    Default Audio Device

I’ve found that the plughw: devices tend to work best. It’s not the only syntax that can be used, though: plughw:2,0 also works for ALSA card 2, device 0.

You can create new card “aliases” using symlinks:

$ cd /dev/snd
$ sudo ln -s pcmC2D0p pcmC11D0p
$ sudo ln -s controlC2 controlC11
$ aplay -l
[…]
card 2: CODEC [USB Audio CODEC], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 11: CODEC [USB Audio CODEC], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0

But it doesn’t create uniquely named playback devices:

$ aplay -L | grep plughw
plughw:CARD=b1,DEV=0
plughw:CARD=Headphones,DEV=0
plughw:CARD=CODEC,DEV=0
plughw:CARD=CODEC,DEV=0

For programs that allow us to provide the playback device as a string (e.g. direwolf, gnuradio) we can reference this new alias uniquely as plughw:11,0.
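
A quick sanity check that the alias actually plays (assuming you have some wav file lying around, like the alsa-utils test sound):

$ aplay -D plughw:11,0 /usr/share/sounds/alsa/Front_Center.wav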

It’s still not consistently named, though. If you plug in two identical USB sound cards (or radios have the same chip built in), then (in my case) the first will be named CODEC, and the second CODEC_1.

If both are plugged in / get power at the same time (which is the case for me, since they share a power supply), then which is which is randomized by a race condition.

PulseAudio

On top of ALSA PulseAudio adds another layer. PulseAudio is one of the buggiest parts of a modern Linux system, and regularly causes breakage, or consumes whole modern CPU cores to do what was easily done on quarter century old hardware without breaking a sweat.

Luckily PipeWire is replacing PulseAudio (and Jack), while remaining API-compatible, so at least that part should improve over the next few years as this pile is phased out.

But it also doesn’t change much else. PipeWire is still a layer on top of ALSA, and has the same problems I’m trying to fix in this article.

Anyway. PulseAudio takes the ALSA card and exposes it as a PulseAudio source & sink. The card configuration can be dumped with:

$ pacmd dump | grep 'device_id="2"'
load-module module-alsa-card device_id="2" name="usb-Burr-Brown_from_TI_USB_Audio_CODEC-00" card_name="alsa_card.usb-Burr-Brown_from_TI_USB_Audio_CODEC-00" namereg_fail=false tsched=no fixed_latency_range=no ignore_dB=no deferred_volume=yes use_ucm=yes card_properties="module-udev-detect.discovered=1"
$ pacmd dump | grep usb-Burr-Brown_from_TI_USB_Audio_CODEC-00
load-module module-alsa-card device_id="2" name="usb-Burr-Brown_from_TI_USB_Audio_CODEC-00" card_name="alsa_card.usb-Burr-Brown_from_TI_USB_Audio_CODEC-00" namereg_fail=false tsched=no fixed_latency_range=no ignore_dB=no deferred_volume=yes use_ucm=yes card_properties="module-udev-detect.discovered=1"
[…]
set-default-sink alsa_output.usb-Burr-Brown_from_TI_USB_Audio_CODEC-00.analog-stereo
set-default-source alsa_input.usb-Burr-Brown_from_TI_USB_Audio_CODEC-00.analog-stereo

Here we see the full PulseAudio name for the stereo input and output for the card. So yay, another name.

So which devices are actually used by apps?

Some apps (e.g. wsjtx, js8call) seem to list both ALSA devices and PulseAudio devices, whereas others (e.g. qsstv) list only ALSA devices, even though they let you choose between playback directly via ALSA or via PulseAudio.

Ugh. Presumably there are also apps out there that’ll only allow using the PulseAudio name. But it turns out I can get away without a consistent PulseAudio name, for now.

So how do we name them consistently?

With no serial number to make udev decisions on, there are only two options left:

  1. Make the decision based on which USB port it’s connected to, and always plug into the same port.
  2. If anything else is plugged into that port, and that other thing has a serial number, then you can write a script that snoops this information and feeds the relevant setting back into udev.

I went with (1), but it should not be hard to use the udev PROGRAM and %c option to shell out to a script that does (2). For now I’m leaving it as an exercise for the reader.

Once our code can find the devices and tell them apart, the second question is what we do with that. Which names do we want to be consistent?

Do we want ALSA card number to be consistent?

Then we’ll need a SYMLINK+=snd/%c setting in udev.

Consistent ALSA card number allows consistent plughw:11,0.

There’s some incorrect documentation out there. For example these instructions say that you can set a NAME based on your script. But nope, can’t do that. Symlink yes, but name no.

E.g. in /etc/udev/rules.d/99-myrules.rules this should work:

KERNEL=="controlC[0-9]*", DRIVERS=="usb", PROGRAM="/usr/local/bin/alsa_name.py %k", SYMLINK+="snd/%c"
KERNEL=="hwC[D0-9]*", DRIVERS=="usb", PROGRAM="/usr/local/bin/alsa_name.py %k", SYMLINK+="snd/%c"
KERNEL=="midiC[D0-9]*", DRIVERS=="usb", PROGRAM="/usr/local/bin/alsa_name.py %k", SYMLINK+="snd/%c"
KERNEL=="pcmC[D0-9cp]*", DRIVERS=="usb", PROGRAM="/usr/local/bin/alsa_name.py %k", SYMLINK+="snd/%c"

You can then have your script check all sorts of things coming in as environment variables, and adjust the name to a new number, thus creating the symlinks that were created manually above.

A simple script only checking the USB port path can look like:

#!/bin/bash
NAME="$1"
if echo "$DEVPATH" | grep -q 1-1.4; then
  NAME="$(echo "$NAME" | sed -r 's/(.*)C([0-9]+)(.*)/\1C11\3/')"
fi
if echo "$DEVPATH" | grep -q 1-1.3; then
  NAME="$(echo "$NAME" | sed -r 's/(.*)C([0-9]+)(.*)/\1C12\3/')"
fi
exec echo "snd/$NAME"

What’s left as an exercise to the reader here is to compare $DEVPATH with anything else coming from the same device. That way when you connect a radio that presents both a serial port and an audio card, you can take the serial number from the serial port and use it to decide what to name (number) the audio card.

After adding/changing rules, you shouldn’t need to run anything since udev is supposed to watch for new rules, but sometimes you do need to run sudo udevadm control --reload-rules. Then you can run sudo udevadm trigger to supposedly apply the config.

But really what you need to do is unplug and re-plug the USB cable, since running these triggers doesn’t actually work right, and can leave things in a broken state.

If you did it right then you should now see your card duplicated under a second number when you run aplay -l, just like the manual symlinks above.

Do we want consistent PulseAudio devices?

Once you have a consistent ALSA card number, you can create consistent PulseAudio playback/capture devices (aka sink/source to PulseAudio, because why even have common terminology?) from ALSA devices by running various pacmd commands, like:

N=11
DEV="radio-7300"
pacmd load-module module-alsa-card \
    device_id="${N}" name="${DEV}" \
    card_name="alsa_card.platform-${DEV}_audio" \
    namereg_fail=false tsched=no fixed_latency_range=no \
    ignore_dB=no deferred_volume=yes use_ucm=yes \
    card_properties="module-udev-detect.discovered=1"
pacmd suspend-sink alsa_output.${DEV}.analog-stereo no
pacmd suspend-source alsa_input.${DEV}.analog-stereo no

Do we want QT dropdowns to contain our ALSA device?

Since some applications (e.g. qsstv) don’t list PulseAudio source/sinks, we may have to create a consistent ALSA device too.

Neither consistent ALSA number nor PulseAudio name will help you, since the dropdowns in qsstv, js8call, and wsjtx don’t allow you to enter whatever you want. They want ALSA devices in the form plughw:CARD=CODEC,DEV=0. By ID (CODEC), not by number (11).

In udev you can not only set ALSA symlinks, but also other attributes. That both audio cards have the id CODEC is a problem, so let’s override it based on where each is plugged in.

So in a fantastic display of burying the lede, this (and only this) is what I put in my udev rules to get consistent ALSA names that also show up in the QT UI dropdowns:

SUBSYSTEM=="sound",KERNELS=="1-1.4.4:1.0",ATTR{id}="CODEC_7300"
SUBSYSTEM=="sound",KERNELS=="1-1.3.4:1.0",ATTR{id}="CODEC_9700"

You can get the KERNELS path using udevadm info -ap /sys/class/sound/controlC2 | grep KERNELS.

After you’ve done that you should see:

$ aplay -l
[…]
card 2: CODEC_9700 [USB Audio CODEC], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 3: CODEC_7300 [USB Audio CODEC], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
$ aplay -L | grep plughw.*CODEC
plughw:CARD=CODEC_9700,DEV=0
plughw:CARD=CODEC_7300,DEV=0

I still don’t have a consistent PulseAudio source/sink pair, but so far I’ve not needed it. If that changes I can create it by setting a consistent ALSA number and then creating the PulseAudio device as described above.

Further work not covered

I believe that you can create ALSA devices similarly to how you create PulseAudio ones, using .asoundrc or /etc/alsa/conf.d/. You may have to generate configs at plug-in time using udev, though.

But I got what I wanted, so I’m out. It’s a god damn mess.

Virtual audio cables


This is another post about the mess that is Linux audio. To follow along you may want to read the previous one first.

The goal this time

This time I want to create a virtual audio cable. That is, I want one application to be able to select a “speaker”, which then another application can use as a “microphone”.

The reason for this is that I want to use GNURadio to decode multiple channels at the same time, and route the audio from the channels differently. Specifically my goal is to use my Icom 7300 in IF mode (which gives me 12kHz of audio bandwidth) tuned to both the FT8 and JS8 HF frequencies, and then let wsjtx listen on a virtual sound card carrying FT8, and JS8Call listen to a virtual sound card carrying JS8.

You can also use it to have an SDR capture the 40, 15, and 20 meter bands, and create “audio cards” for every single digital mode in all these bands.

If you have a fast CPU you can capture every single one of the HF bands and do this. Though you might be limited by the dynamic range of your SDR.

Creating virtual cables

We could use modprobe snd_aloop to create loopback ALSA devices in the kernel. But I’ve found that to be counterintuitive, buggy, and incompatible (not every application supports the idea of subdevices). It also requires root, obviously. So this is best solved in user space, since it turns out it’s actually possible to do so.

Another way to say this is that any time you want to do anything with audio under Linux, you have to carefully navigate the minefield, and not waste time on the many haunted graveyards. If you stick to my path you’ll do fine. Hopefully.
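
For reference, the kernel-module route mentioned above looks like this; it’s one of those graveyards, and I’m not using it below:

$ sudo modprobe snd-aloop
$ aplay -l     # a new "Loopback" card appears, with playback and capture subdevices
$ arecord -l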

My setup is that I’m using ALSA (because this is Linux), and PipeWire (because it’s the future and PulseAudio is the sucky past).

GNU Radio needs access to the audio devices for my use case, and it only supports ALSA. Sigh.

In the last post I said that PipeWire (like PulseAudio) builds on top of ALSA. This is only partly true. As I said at the end of that post, ALSA can also create virtual sinks and sources on top of the devices.

So in an awesome layering violation we’ll have PipeWire create the virtual cable, and then ask ALSA to create a source and sink for it.

GNURadio will use the ALSA device, and the other applications will use the PipeWire/PulseAudio devices, because they can.

Create virtual cables

pw-loopback -m '[ FL FR ]' --capture-props='media.class=Audio/Sink node.name=ft8_sink' &
pw-loopback -m '[ FL FR ]' --capture-props='media.class=Audio/Sink node.name=js8_sink' &

These loopback commands need to continue running. They are the “drivers” for the two loopback cables. But they are merely user-space “drivers” talking to PipeWire, so no root permissions are needed.

The next step is to create ALSA devices for these four endpoints (two virtual cables with two ends each). That can be done by adding the following to ~/.asoundrc:

pcm.ft8_monitor {
    type pulse
    device ft8_sink.monitor
}
ctl.ft8_monitor {
    type pulse
    device ft8_sink.monitor
}
pcm.ft8_sink {
    type pulse
    device ft8_sink
}
ctl.ft8_sink {
    type pulse
    device ft8_sink
}
pcm.js8_monitor {
    type pulse
    device js8_sink.monitor
}
ctl.js8_monitor {
    type pulse
    device js8_sink.monitor
}
pcm.js8_sink {
    type pulse
    device js8_sink
}
ctl.js8_sink {
    type pulse
    device js8_sink
}

Now you should be able to play audio into ft8_sink and capture it from ft8_monitor, and the same for JS8.
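
You can verify a cable with plain ALSA tools before involving GNURadio. A sketch, with each command in its own terminal:

$ speaker-test -D ft8_sink -c 2 -t sine        # play a test tone into one end
$ arecord -D ft8_monitor -f cd -d 5 test.wav   # record 5 seconds from the other end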

GNURadio flowgraph
GNURadio demo running

When running this test you’ll probably hear the output through your default speakers. You may want to go to the “Output Devices” tab of pavucontrol and mute the two loopbacks. They should still produce sound into the virtual sources, just not play through the normal speakers.

Devices created, now let’s use them

The Icom 7300 can, over USB, either send audio as you would hear it normally (AF), or be set to send the IF data. The benefit of the latter is that you get about 12kHz of bandwidth instead of just 4.

This mode can be configured in Set->Connectors->ACC/USB Output Select. Then for some reason you need to set the radio in FM mode. I believe this is because in other modes there’s a filter before the IF part that severely reduces the available spectrum.

Because it’s the IF that is output, before the FM decoder, it’ll still work for all these modes that would otherwise need SSB.

So if we tune to 7.078 we’ll actually monitor both FT8 and JS8 at the same time (also FT4 at 7.080MHz).

All we have to do is take the input signal and produce two outputs, frequency translated and filtered appropriately.

Frequency translated
Frequency translated flowgraph

And these outputs are now available to both JS8Call and WSJTX to listen to at the same time, on different virtual audio cards, frequency translated as if they got the signal directly from the radio.

Well, that was easy.

  • https://wiki.gnuradio.org/index.php/ALSAPulseAudio

Raspberry Pi bluetooth console


Sometimes you want to connect to a Raspberry Pi console over Bluetooth. Likely because you screwed something up with the network or firewall settings.

You could plug in a screen and keyboard, but that’s a hassle. And maybe you didn’t prepare the Pi to force the monitor to be on even if it’s not connected at boot. Then it just doesn’t work.

Even more of a hassle is to plug in a serial console cable into the GPIO pins.

But modern Raspberry Pi’s have bluetooth. So let’s use that!

Setting up the service on the raspberry pi

Create /etc/systemd/system/bluetooth-console.service with this content:

[Unit]
Description=Bluetooth console
After=bluetooth.service
Requires=bluetooth.service

[Service]
ExecStart=/usr/bin/rfcomm watch hci0 1 getty rfcomm0 115200 vt100
Restart=always
RestartSec=10
StartLimitIntervalSec=0

[Install]
WantedBy=multi-user.target

This sets up a console on bluetooth channel 1 with a login prompt. But it doesn’t work yet. Apparently setting After, Requires, and even Requisite doesn’t prevent systemd from running this before setting up bluetooth (timestamps in the logs don’t lie). Hence the restart stuff.

I also tried setting ExecStartPre / ExecStartPost here to enable Bluetooth discoverability, but something else in the boot process seems to turn it back off again.

So I went with a terribly ugly solution and added this to /etc/rc.local:

(
  while true; do
    sleep 20
    /bin/hciconfig hci0 piscan
  done
) &

If it works it’s not stupid?

Connecting from a laptop or something

Use the bluetooth address from hcitool dev hci0 on the raspberry pi, and bind channel 1 to rfcomm0.

rfcomm bind rfcomm0 XX:YY:ZZ:AA:BB:CC 1

This will make your laptop connect to the raspberry pi on demand. Which means it’ll take a few seconds once you actually try to connect. Alternatively you can use rfcomm connect, which connects immediately, but then continues running in the foreground.
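
The foreground variant looks like this (same address and channel as above):

rfcomm connect rfcomm0 XX:YY:ZZ:AA:BB:CC 1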

Then you can run minicom or screen /dev/rfcomm0 115200 and be connected to a terminal.

Security

This currently has no PIN to connect over bluetooth. So anyone could just connect and start bruteforcing the password.

So I recommend setting up one-time passwords such as OTPW.

If you don’t trust bluetooth security then you should also assume that anyone can see your connection. And potentially hijack it after you log in. So log out as soon as you’re done. And if the connection mysteriously seems to “hang”, then pull the power to the device. Though it’s far from perfect.

You may also be subject to a MitM attack. This clearly needs more work.

So maybe it would be better to run a network, IP over bluetooth, and use SSH.

But adding more dependencies like that could backfire. What if the thing I broke was SSH, and that’s the reason I need the console? Then I still need to drag out console cables and such.

TODO

I should update this post when I bother setting up a bluetooth PIN. Hell, I don’t even know if not using a PIN means the bluetooth encryption is essentially plaintext.

Also I need to experiment more with making sure it’s not pairable and discoverable 24/7.

Telling bluetoothctl pairable no and discoverable no seems to do it, but this will have to be all for now.

I’ll post this even though it’s not done, since I don’t know when I’ll next get to it.

Update

Jesus christ this takes 35% CPU on a raspberry pi because rfcomm watch checks if the child process has exited once every 200 nanoseconds.

And it’s been that way since 2006.

This really shows how underused Bluetooth is. This bug has not been found for 16 years, presumably because nobody bothers with bluetooth because it’s so often full of compatibility problems anyway.

For now I’ve built my own fixed version of rfcomm.


SSH over bluetooth


Yesterday I set up a simple serial console over bluetooth as a backup console.

Today I’m running SSH over bluetooth. Raw SSH, no IP. I only use IP on the two ends to talk to the SSH client and server. It doesn’t actually go over the bluetooth.

This fixes the security problems with the previous solution. As long as you make sure to check the host key signature it’ll be perfectly secure.

No need for one-time passwords. You can even use SSH pubkey auth.

Connect to the system SSH

Server:

rfcomm watch hci0 2 socat TCP:127.0.0.1:22 file:/proc/self/fd/6,b115200,raw,echo=0

Client:

sudo rfcomm bind rfcomm2 AA:BB:CC:XX:YY:ZZ 2
ssh -oProxyCommand="socat - file:/dev/rfcomm2,b115200,raw,echo=0" dummy-hostname

I’m actually replacing rfcomm & socat with my own much simpler tool, so that I can do:

ssh -oProxyCommand="sshbthelper AA:BB:CC:XX:YY:ZZ 2" dummy-hostname

without needing root to create /dev/rfcomm2, plus some other improvements. I’ll open source it “soon” and link to it from here.

I’m also simplifying the server side to just be socat exec:./btlisten tcp:127.0.0.1:22.

Stay tuned.

A backup SSH

If you’re messing around with an OpenSSH config then it may be a good idea to set up a minimal config on another port. Maybe port 23. Not like that port is used for anything else anymore.
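
A minimal sketch of that, assuming the default host keys already exist; the config path is just an example:

# As root:
cat > /etc/ssh/sshd_config_backup << EOF
Port 23
PasswordAuthentication no
EOF
/usr/sbin/sshd -f /etc/ssh/sshd_config_backup

If you want it to survive reboots, wrap it in a small systemd unit like the ones above.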

SSH over bluetooth - cleanly


In my previous two posts I set up a login prompt on a bluetooth serial port and then switched to running SSH on it.

I explicitly did not set up an IP network over bluetooth as I want to minimize the number of configurations (e.g. IP address) and increase the chance of it working when needed.

E.g. firewall misconfiguration or Linux’s various “clever” network managers that tend to wipe out network interface configs would have more of a shared fate with the primary access method (SSH over normal network).

This post is about how to accomplish this more properly.

The problems now being solved are:

  • It wasn’t entirely reliable. The rfcomm tool is pretty buggy.

  • There was no authentication of the Bluetooth channel. Not as much a problem when doing SSH, but if there are passwords then there could be a man-in-the-middle attack.

  • The server side had to remain discoverable forever. So anyone who scans for nearby bluetooth devices would see your servers, and would be able to connect, possibly brute forcing passwords. Not as much of a problem if running SSH with password authentication turned off, but why broadcast the name of a server if you don’t have to?

So here we’ll instead explicitly pair with a PIN just once, and then turn off discoverability.

Step 0: Install my helpers on both server and client

git clone https://github.com/ThomasHabets/bthelper
cd bthelper
./bootstrap.sh && ./configure && make && make install

1. On the server: get the MAC

This will be needed by the client to pair and to connect.

$ hcitool dev
Devices:
        hci0     AA:BB:CC:DD:EE:FF

2. On the server: Start a PIN agent

By default, it seems, pairing will skip the PIN (or just use a well-known one like 0000).

At first I tried activating an agent inside bluetoothctl, but from what I can tell it’s only “reactive”, in that it’ll ask the user only if it’s needed. This means that if both sides use this agent then they will “just work” without bothering with a PIN.

So we need an agent that goes “no, really. You need a PIN”.

$ sudo apt install bluez-tools
$ sudo bt-agent

Let this run in the terminal during pairing, as this agent will make up a PIN and will print it to stdout, which you’ll need during pairing.

3. On the server: Turn off built-in agent and enable discoverability

[bluetooth]# agent off
[bluetooth]# discoverable yes

This makes the server show up in scans, and is required (as far as I can tell) in order to pair.

Note that the device and its name are now visible to anyone within range.

4. On the client: pair

sudo bluetoothctl
[bluetooth]# scan on
[bluetooth]# pair AA:BB:CC:DD:EE:FF
[CHG] Device AA:BB:CC:DD:EE:FF Connected: yes
Request passkey
[agent] Enter passkey (number in 0-999999): <enter number from server's agent here>
[…]
[CHG] Device AA:BB:CC:DD:EE:FF Paired: yes

Now the two devices are paired and share a link key. You can inspect it on both sides in /var/lib/bluetooth/<my mac address>/<peer mac address>/info as LinkKey.
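
E.g. on the server (the two MAC addresses in the path are the adapter’s and the peer’s):

$ sudo sh -c 'grep -A1 LinkKey /var/lib/bluetooth/*/*/info'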

When the link key is set up there’s no longer any need for the server to be discoverable.

Now you can turn scanning back off

[bluetooth]# scan off

5. On the server: turn discover mode back off, and agent back on

[bluetooth]# discoverable no
[bluetooth]# agent on

6. On the server: kill the bt-agent

Control-C will do it.

7. On the server: Set up SSH service on an RFCOMM channel

Here I randomly pick 3. It’s possible to choose channel 0, and then register it with the “UUID to channel name” lookup service that is SDP.

But here I want minimal complexity and dependency, so I’m skipping that.

# cat > /etc/systemd/system/bluetooth-ssh.service
[Unit]
Description=Bluetooth SSH
After=bluetooth.service
Requires=bluetooth.service

[Service]
ExecStart=/usr/local/bin/bt-listener -c 3 -t 127.0.0.1:22
Restart=always
RestartSec=10
StartLimitIntervalSec=0

[Install]
WantedBy=multi-user.target

This can also be done with rfcomm watch, but I found it to be very unstable and CPU-heavy (~35% CPU on raspberry pi when idling). It also needlessly allocates a /dev/rfcomm<N> device (requiring running as root). And it looks like rfcomm in bluez is deprecated anyway.

It can also be done with socat exec:"/usr/local/bin/bt-listener -c 3" tcp:127.0.0.1:22, but that doesn’t allow for concurrent connections, and requires waiting RestartSec seconds before a new connection can be started after the previous one.

8. On the client: Configure an SSH alias

$ cat >> ~/.ssh/config
Host servernamehere-console
     ProxyCommand /usr/local/bin/bt-connecter AA:BB:CC:DD:EE:FF 3

Now you should be able to ssh servernamehere-console.

Go programs are not portable


A while ago I was asked why I wrote Sim in C++ instead of Go. I stumbled upon my answer again and realized it could be a blog post.

So here’s what I wrote then. I think I stand by it still, and I don’t think the situation has improved.

Why not write portable system tools in Go

My previous experience with “low level” things in Go (being very careful about which syscalls are used, and in which order) has included some frustrations, especially with portability. E.g. different definitions of syscall.Select between BSDs and Linux, making me have to use reflection at some points (e.g. see this Go bug).

And to work around those things Go unfortunately uses the antipattern of (essentially) #ifdef __OpenBSD__, which we’ve known for decades is vastly inferior to checking for specific capabilities.

To me the Go proverb “Syscall must always be guarded with build tags” essentially means “Go is not an option for any program that needs to be portable and may at some point in the future require the syscalls package”. And since this tool is meant to be portable, and calls what would be syscall.Setresuid, I’m essentially told by the proverb to choose between proper system capability checking, or Go. And Go loses.

It’s a shame that, in my opinion, Go chose to go with a solution that’s known to be bad. It feels to me like “there’s no obvious amazing solution so let’s not try to solve it at all”.

In other words I’ve not seen a good Go solution to the problems autotools solves laid out in this blog post.

I’m planning to add a web UI though. Likely some of that will be in Go.

Ugh, except there might be a problem with that plan, because I need to pass creds over a socket, and Go only has support for that in Linux:

$ grep Getsockopt.*red ~/go/src/golang.org/x/sys/unix/*.go
[…]/syscall_linux.go:func GetsockoptUcred(fd, level, opt int) (*Ucred, error) {

So how should Go make this better?

I don’t know what would be best for Go, here. I know what really works well is the autoconf capability way. And what really works poorly is “if OpenBSD do this, if FreeBSD do that, …”.

I’m not interested in whether FreeBSD has pledge() or not (and if so, since what version of FreeBSD?). Maybe one day Linux will get support. How long do I wait before I drop support for older Linux setups?

I’ve not looked much into how to write these portable packages, but how do you write them? Do you assume that Linux has getrandom(), and other OSs don’t? Can the Go way have build tags about “OpenBSD 5.9 or newer”? What if an API gets deprecated or removed? What if Solaris switches to POSIX getpwnam_r()?

I don’t want to know which OSs have and not have openpty(). I want to support OSs that have openpty(), and ones that don’t.

The Go way maybe works well if both of these hold:

  1. You can mandate that users install updates. Anything older than X you say you just plain don’t support. This can work great in corporate environments with managed machines.

  2. You only support architectures that you personally will actually run on. Again this will work well in corporate environments.

It’s not like select() is not portable, even though it’s in section 2 of the unix manual (i.e. a syscall). I’ve been able to maintain open source software that works on many OSs, including Mac, without ever having owned a Mac, for decades.

And I have confidence (from experience) that the changes I make will actually work on all architectures. With OS-name based checking I have to actually try to compile on all those and confirm correct behaviour.

I no longer have access to an IRIX machine, but I bet you arping still works just fine on it.

To say that even POSIX is platform-specific and revert to “is OS” instead of “has feature” is to throw the baby out with the bathwater, and give up on portability.

Most differences can be avoided. E.g. the value of the timeout after calling select(). And the ones that can’t be avoided that way you can almost always just check at build time which behaviour you have.

I also don’t know how to call “pledge()” with Go’s model. Call a general “drop some of the privileges”? Call a “pledge_if_you_can()” and check return values? (and differentiating at runtime between “failed” and “not available on this system”?)

But going back to what would be most useful for me for Go:

Not having written such implementation stuff, it’s hard to criticise. And I’ve not done so because of the feeling that “it’s not going to be portable anyway”, because of the rest of Go. And I’ve been assuming that I won’t be able to do “HAVE_x” anyway.

Maybe if build tags were things like “HAVE_PLEDGE”, instead of “is OpenBSD”. Combined with causing all transitive dependent packages to fail compile if they use Is <OS name here> as a build tag. And of course some equivalent of ./configure that finds all the HAVE_x stuff.

In other words: I’ve not written these portable packages because it doesn’t look possible to write them to be truly portable in Go, because of these “is” not “have”.

For “is” vs “have” the one exception, I’d say, is Windows. Sometimes just everything is so different that it makes sense to have two copies of code achieving the same goal, because almost every line would be either the Windows way, or the not-Windows way.

Localisation isn't translation


If you only have your app in English then you’ll still be understood[1] by the new market whose official language isn’t English.

If you show farenheit (a word I can’t even spell), then 96% of the world cannot understand your app. At all.

For most of the west I would argue that translation doesn’t even matter at all, but you cannot have your app start your weeks on Sunday, you cannot show fahrenheit, or feet, or furlongs, or cubits or whatever US-only units exist. And you cannot use MM/DD/YY.

NONE of these things are tied to language. Most users of English don’t want any of this US-only failure to communicate.

[1] While most of the world doesn’t speak English fluently, they may know words. And they can look up words. You cannot “look up” understanding fahrenheit or US-only date formats.

AX.25 over D-Star


Setting up AX.25 over 1200bps was easy enough. For 9600 I got kernel panics on the raspberry pi, so I wrote my own AX.25 stack.

But I also want to try to run AX.25 over D-Star. Why? Because then I can use radios not capable of 9600 AX.25, and because it’s fun.

It seems that radios (at least the two I’ve been working with) expose the D-Star data channel as a byte stream coming over a serial connection. Unlike working with a TNC you don’t have to talk KISS to turn the byte stream into packets, and vice versa.

IC9700 setup

The first hurdle to overcome, because we want to send binary data, is to escape the XON/XOFF flow control characters that the IC9700 mandates. Otherwise we won’t be able to send 0x13 or 0x11. Other bytes seem to go through just fine.

So I wrote a wrapper for that, taking /dev/ttyUSB1 on one side, and turning it into (e.g.) /dev/pts/20 for use with kissattach.

$ ./dsax /dev/ttyUSB1
/dev/pts/20
$ kissattach /dev/pts/20 radio
$ kissparms -p radio -c 2     # See below

Set Menu>Set>DV/DD Set>DV Data TX to Auto, for “automatic PTT”. As soon as there’s even one byte written to the serial port, it’ll start to transmit. Also you probably want to turn on Menu>Set>Dv/DD Set>DV Fast Data>Fast Data, to increase speed from (according to the manual) about 950bps to 3840bps.

Kenwood TH-D74 setup

D74 does not use XON/XOFF (Operating Tips manual section 3.3), but because we escaped/unescaped the characters on the IC9700 side, we need to do the same on the D74 side.

D-Star data is delivered on Bluetooth channel 2, so a serial device can be created using rfcomm bind /dev/rfcomm0 $D74_MAC_ADDR 2.
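
So the D74 side ends up looking much like the IC9700 setup above, just with the Bluetooth serial device instead of /dev/ttyUSB1 (a sketch; the pts number will differ):

$ rfcomm bind /dev/rfcomm0 $D74_MAC_ADDR 2
$ ./dsax /dev/rfcomm0
/dev/pts/21
$ kissattach /dev/pts/21 radio
$ kissparms -p radio -c 2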

On KISS

Note that I’m actually speaking KISS between the two kernels. There’s no real TNC, just a byte stream.

Collision avoidance

In my testing it seems that both radios are clever enough not to start transmitting in the middle of receiving, and wait nicely for their turn.

Bug: No flow control actually used by IC9700

The IC9700 does not seem to actually use its XON/XOFF to slow down the data it’s being sent. If I set the serial port to 9600 it’ll lose a lot of data. If I set it to 1200 it’ll actually still lose a lot of data (even when in Fast Data mode).

It seems to have a really shallow buffer, too. If I set manual PTT mode (Menu>Set>DV/DD Set>DV Data TX to PTT) and write a bunch to the serial port, then trigger PTT, only the last 430-474 bytes or so seem to actually be sent.

$ stty -F /dev/ttyUSB1 raw -echo ixon ixoff 9600
$ cat dsax.cc > /dev/ttyUSB1          # Takes a while
$  # press PTT now

In fact I had to turn on CRC because parts of frames were being dropped. kissparms -p radio -c 2 (on both sides)

According to Operating Manual 3.2.2 the D74’s transmit and receive buffer sizes are 3kB and 4kB, respectively. That’s for KISS mode, so I’m just assuming the same for D-Star data.

Bug: The D74 turns off the rx light and stops beeping if tx while rx

If the D74 is receiving, and you send it data, then it’ll turn off the receive light for the remainder of the reception. It will also turn off the speaker, which until then went beep once a second while data was being transferred.

Bug: The D74 does not send data that was queued during a receive

So while it doesn’t interfere with what’s in progress, it does drop packets on collision. This is a bug I have to work around in my XON/XOFF wrapper.

Until that’s implemented expect lots of dropped packets.

Conclusion

So is this a good idea? Maybe. If you want to do AX.25 over a D-Star reflector then yes. If you want data access over a D-Star repeater, then this is where the data channel is.

Of course, it may be better in that case to implement a protocol (maybe AX.25) over AX.25 UI frames and use the standard AX.25 infrastructure for your point to point.

D-Star does support higher speeds (3840bps, and even 128kbps at 1.2GHz), so it could be worth playing with.

But I wouldn’t do it with the IC9700. The bug where it loses data (presumably due to not actually sending XON/XOFF) makes it about half the speed it should be, because of the retransmits.

Everything was fine between two D74s. So as soon as I work around the bug where it won’t queue a send if it’s currently receiving, it seems pretty cool.

The firmware on the radios was the latest: 1.31 for the IC9700, and 1.11 for the D74.

Future

Since this is how pictures are sent over D-Star, I’d like to be able to decode the image format.

seccomp — Unsafe at any speed


I’ll just assert that there’s no way to use seccomp() correctly. Just like how there’s no way to use gets() correctly, causing it to eventually be removed from the C and C++ standards.

seccomp, briefly

seccomp allows you to filter syscalls with a ruleset.

The obvious thing is to filter anything your program isn’t supposed to be doing. If it doesn’t do file IO, don’t let it open files. If it’s not supposed to execute anything, don’t let it do that.

But whether you use a whitelist (e.g. only allow working with already open file descriptors), or a blacklist (e.g. don’t allow it to open these files), it’s fundamentally flawed.

1. Syscalls change. Sometimes without even recompiling

open() in your code actually becomes the openat syscall. Maybe. At least today. At least on my machine, today.

select() actually becomes pselect6. At least on Fridays.

If you upgrade libc or distribute a binary to other systems, this may start to fail.

2. Surprising syscalls

Calling printf() will call the syscall newfstatat, a syscall hard to even parse into words. But only the first time you call it! So after your first printf() you can block newfstatat.

Maybe this will all work just fine, normally. But then an unrelated bug happens, and your tool tries to log it, but can’t because newfstatat is blocked. So you get no logs.

So it’s not just what you call, but highly dependent on what order you call things when dropping privileges.

In my example it worked fine when I ran with verbose mode turned on, but not with it off. That’s because in verbose mode I called printf() before dropping privs.
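
You can see this for yourself with strace; the exact set of syscalls differs between libc versions and architectures, which is exactly the problem:

$ strace -c ./yourprogram        # per-syscall summary of what actually gets called
$ strace -f -o trace.log ./yourprogram
$ grep -c newfstatat trace.log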

3. (hinting at the solution): There’s no grouping

I would say that the most common thing everyone wants to do is this: After everything’s set up, don’t allow anything done by the process to interact with anything else, except via already open file descriptors.

That’s almost true. Getting the current time, and memory allocation, is probably also safe.

(But the original binary on/off seccomp() blocked even those)

But there’s no way to express this. In order to actually interact with open network sockets in the most minimal of ways I’d need at least:

  • pselect6
  • select
  • poll
  • ppoll
  • write
  • pwrite64
  • writev
  • pwritev
  • read
  • pread64
  • readv
  • preadv
  • close
  • sendfile
  • sendto
  • sendmsg
  • sendmmsg
  • recvfrom
  • recvmsg
  • recvmmsg

And that’s just for the most trivial of examples where you have some unsafe code (e.g. a parser) that takes input on one fd and gives output on another. For example if you implement an oracle that takes an X.509 certificate (famously tricky to parse) and a hostname, and returns if it’s valid or not.

And what’s worse: This is completely dynamic and depends on the architecture. It can change from execution to execution, or millisecond to millisecond. This is just not part of the ABI.

There’s nothing stopping libc from changing to implementing read() as a special case of readv(). select() could be implemented in terms of poll(), tomorrow.

There are 300+ syscalls, and will likely grow. Do you know which ones are “just read or write from the sockets”?

So I don’t think the seccomp(2) manpage is realistic when it says:

It is strongly recommended to use an allow-list approach whenever
possible because such an approach is more robust and simple.  A
deny-list will have to be updated whenever a potentially dangerous
system call is added

Good luck with that.

The solution

OpenBSD clearly got this right. Don’t list syscalls. Who cares if it’s poll() or select()?

For example arping has this code to prevent it doing anything bad at all.

Go on, think about it. Even with full control of the process, what could you possibly do after it runs pledge("stdio", "")? Print profanities to the user? Exit with the wrong exit code? Yeah, but that’s about it.

But seccomp() allows more restrictions. In arping I added seccomp restrictions so that it can only write to stdout and stderr, not read. But so what? I may have to eat my hat on this, but being able to read from stdout doesn’t sound like it’ll cause a security problem.

pledge(), and unveil(), are clearly the right solution here.

But what about Linux?

Maybe one day Landlock will be the thing. But considering the previous nightmare with many generations of Linux solutions getting it wrong I’m not holding my breath.

For now I guess unshare() is the way to go. But even that’s tricky (and doesn’t block as much). I’m planning a follow-up post about how to drop access to the outside world using available tools.

Dropping privileges


If you’re writing a tool that takes untrusted input, and you should treat almost all input as untrusted, then it’s a good idea to add a layer of defense against bugs in your code.

What good is a buffer overflow, if the process is fully sandboxed?

This applies to both processes running as root, and as normal users. Though there are some differences.

Standard POSIX

In POSIX you can only sandbox if you are root. The filesystem can be hidden with chroot(), and you can then change user to be non-root using setuid() and setgid().

There have been ways to break out of a chroot() jail, but if you make sure to drop root privileges then chroot() is pretty effective at preventing opening new files and running any new programs.

But which directory? Ideally you want it to be:

  • read-only by the process (after dropping root)
  • empty
  • not shared by any other process that might write to it

The best way to ensure this is probably to create a temporary directory yourself, owned by root.

This is pretty tricky to do, though:

// Return 0 on success.
int do_chroot()
{
  const char* tmpdir = getenv("TMPDIR");
  if (tmpdir == NULL) {
    tmpdir = "/tmp";
  }
  char jail[PATH_MAX];
  if (0 > snprintf(jail, PATH_MAX, "%s/jail-XXXXXX", tmpdir)) {
    // If truncated then mkdtemp() will complain.
    perror("snprintf()");
    return 1;
  }
  if (mkdtemp(jail) == NULL) {
    perror("mkdtemp()");
    return 1;
  }
  if (chdir(jail)) {
    perror("chdir()");
    return 1;
  }
  // Caveat: Deleting the current working directory and then chrooting into it
  // may not be portable. If it's not, then skip the rmdir step and just leak
  // the directory.
  if (rmdir(jail)) {
    perror("rmdir()");
  }
  if (chroot(".")) {
    perror("chroot()");
    return 1;
  }
  return 0;
}

Ok, now the filesystem is gone. Well, do make sure that you don’t have any file descriptors open to any directories, or openat() could be used to open a file outside the chroot.

The second POSIX step is to drop root privileges.

First we need to drop any group IDs (GID), since we can only do that while running as user ID (UID) root.

Then we drop UID from root.

// Return 0 on success.
int drop_uid(uid_t uid, gid_t gid)
{
  if (setgroups(0, NULL)) {
    perror("setgroups(0, NULL)");
    return 1;
  }
  if (setgid(gid)) {
    perror("setgid()");
    return 1;
  }
  if (setuid(uid)) {
    perror("setuid()");
    return 1;
  }
  return 0;
}

What UID to use? Well, first of all you need to resolve the user before you do your chroot, since the mapping exists in /etc/passwd and /etc/group. So that’s some boilerplate getpwnam().

Ideally you should run every binary as a separate user, since otherwise they can send signals (kill()) to each other. That’s not always feasible though, so maybe the least bad option is nobody/nogroup.

Except if you need to access the network, then on Android you need to use the group inet.
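
That boilerplate looks roughly like this (a sketch; resolve_user is just my name for the helper), and it has to run before the chroot, since /etc/passwd won’t be reachable afterwards:

#include <pwd.h>
#include <stdio.h>
#include <sys/types.h>

// Return 0 on success, filling in the UID and GID for the given user.
int resolve_user(const char* name, uid_t* uid, gid_t* gid)
{
  const struct passwd* pw = getpwnam(name);
  if (pw == NULL) {
    fprintf(stderr, "getpwnam(%s): no such user\n", name);
    return 1;
  }
  *uid = pw->pw_uid;
  *gid = pw->pw_gid;
  return 0;
}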

Well, that was easy, right?

OpenBSD

OpenBSD has pledge() and unveil(), which work even on unprivileged users. So if running as root you should first drop your root privileges, and then call ‘pledge()’ and ‘unveil()’ to only list what’s needed from here.

// Return 0 on success
int openbsd_drop()
{
  if (unveil("/", "")) {
    perror("unveil()");
    return 1;
  }
  if (pledge("stdio", "")) {
    perror("pledge()");
    return 1;
  }
  return 0;
}

For belts and suspenders you can chroot, drop root privs, unveil, and then pledge. In that order. For non-root processes just unveil and pledge.

OpenBSD clearly has the best solution, here. You can even skip unveil() in this case, since pledge("stdio", "") doesn’t allow opening any new files.

One single call to sandbox to a very common post-init setup. Nice.

Linux

As I said in a previous post, seccomp is basically unusable.

But there’s two other ways to restrict a process’s ability to affect the outside world:

  1. Capabilities
  2. unshare()

Linux — Capabilities

Capabilities are a fine grained way to give “root powers” to processes.

A process running as user root can have all its special powers taken away, by stripping its capabilities. But that still leaves it as a normal user, and a normal user that owns some key files, like /etc/shadow and… uh… the root directory. So basically the whole filesystem.

So the root user is powerful even without any capabilities.

This is why it’s not enough to merely drop all capabilities.

Capabilities can also be granted to non-root processes. E.g. you can remove the suid bit and grant /bin/ping just the CAP_NET_RAW capability instead. In fact that’s the case on my system:

$ getcap /bin/ping
/bin/ping cap_net_raw=ep

A full compromise of ping can only lead to sniffing my traffic. Bad, but full root access is worse.

If you can avoid having your tool run as root in the first place, that’s strictly better. But still don’t forget to drop that capability as soon as you no longer need it. ping does this:

$ sudo strace -esocket,capset ping 8.8.8.8
capset({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=0, permitted=1<<CAP_NET_ADMIN|1<<CAP_NET_RAW, inheritable=0}) = 0
capset({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=1<<CAP_NET_RAW, permitted=1<<CAP_NET_ADMIN|1<<CAP_NET_RAW, inheritable=0}) = 0
socket(AF_INET, SOCK_RAW, IPPROTO_ICMP) = 3
socket(AF_INET6, SOCK_RAW, IPPROTO_ICMPV6) = 4
capset({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=0, permitted=1<<CAP_NET_ADMIN|1<<CAP_NET_RAW, inheritable=0}) = 0
socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 5
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=120 time=9.48 ms

Uhm… hmm… actually no that’s not right. It drops the capabilities from the effective set, but surely it should drop them from the permitted set too?

Gah, it does for IPv6, but not IPv4:

$ sudo strace -esocket,capset ping -6 ns1.google.com
capset({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=0, permitted=1<<CAP_NET_ADMIN|1<<CAP_NET_RAW, inheritable=0}) = 0
socket(AF_INET6, SOCK_RAW, IPPROTO_ICMPV6) = 3
PING ns1.google.com(ns1.google.com (2001:4860:4802:32::a)) 56 data bytes
capset({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=0, permitted=0, inheritable=0}) = 0
socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 4
64 bytes from ns1.google.com (2001:4860:4802:32::a): icmp_seq=1 ttl=110 time=14.2 ms

Looks like it’s been broken since 2012. I’ve sent a pull request.

I’m actually not convinced it’s a good idea to replace root with just CAP_NET_RAW in this case. If there’s a security hole in ping then my normal user gets compromised. If ping also had the CAP_SETUID capability then it could limit its blast radius to the nobody user instead.

Compromising my user account is worse than sniffing my traffic, since basically all traffic is encrypted nowadays.

As it is now a bug in ping can lead to a complete account takeover, and system ownage.

I’ve filed a bug requesting some thoughts on this.

Linux — unshare()

unshare() creates a new universe that can never be joined back to the old one. Instead of dropping root privileges, you can create a new namespace where even root can’t affect anything important. And then you can drop privileges inside even that universe.

It’s a bit tricky to use, though. And there are some gotchas. Yes, trickier than chroot() + setuid().

E.g. if you create a new namespace as root then you may be under the misapprehension that you have no way to touch the outside world, but that’s not the case.

real$ sudo unshare --user
new$ id
uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)
new$ touch /only-root-can-break-your-heart && echo success
success
new$ ls -l /only-root-can-break-your-heart
-rw-r----- 1 nobody nogroup 0 Mar 11 15:49 /only-root-can-break-your-heart
new$ exit
real$ ls -l /only-root-can-break-your-heart
-rw-r----- 1 root root 0 Mar 11 15:49 /only-root-can-break-your-heart

So you need to drop privileges before creating the new namespace, too.

Linux — Combining them all

Let’s say you start off with root, and you want to:

  • chroot
  • drop capabilities
  • unshare
  • change uid

The restrictions are:

  • unshare(CLONE_NEWUSER) before chroot, because CLONE_NEWUSER is not allowed in chrooted environment.
  • chroot() before drop_uid(), because you can’t chroot() as non-root
  • drop_uid() before unshare(CLONE_NEWUSER), because the new user namespace still maps back to the real root user.

Oh… and now it looks like we have a circular dependency.

But not really. What you can do is run chroot() after unshare(CLONE_NEWUSER), because while you aren’t real root, you have all the capabilities inside your new domain:

$ capsh --decode=$(unshare --user --map-root-user awk '/^CapEff/ {print $2}' /proc/self/status)
0x000001ffffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,cap_perfmon,cap_bpf,cap_checkpoint_restore

--map-root-user can be a bit confusing here. If your program doesn’t do the extra work this option does then you may be fooled into thinking all the capabilities are lost by default:

$ capsh --decode=$(unshare --user awk '/^CapEff/ {print $2}' /proc/self/status)
0x0000000000000000=

but unfortunately that’s not the case. The capabilities are there until you exec() something (in this case awk), because capabilities are not by default part of the inherited set here.

To illustrate:

$ cat not_gone.c
#define _GNU_SOURCE
#include <err.h>
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main()
{
  if (unshare(CLONE_NEWUSER)) {
      err(EXIT_FAILURE, "unshare(CLONE_NEWUSER): %s", strerror(errno));
  }
  printf("--- Actual permissions ---\n");
  FILE* f = fopen("/proc/self/status", "r");
  if (!f) {
    err(EXIT_FAILURE, "fopen(/proc/self/status): %s", strerror(errno));
  }
  char* line = NULL;
  size_t len = 0;
  while (getline(&line, &len, f) != -1) {
    if (!strncmp(line, "Cap", 3)) {
      printf("%s", line);
    }
  }
  fclose(f);
  printf("--- Post-exec permissions ---\n");
  execlp("grep", "grep", "^Cap", "/proc/self/status", NULL);
  err(EXIT_FAILURE, "execlp: %s", strerror(errno));
}
$ ./not_gone
--- Actual permissions ---
CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
--- Post-exec permissions ---
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000

This is usually what a user wants, so it makes sense that unshare() would work this way. This allows some setup before executing the ultimate command, and optionally adding to the inherited set of permissions.

In our case though we’re just dropping privileges, and we are the ultimate command (there will be no exec()) so we need them gone now, not merely prevent inheritance.

This is a second gotcha, because it’s easy to be fooled into thinking all the capabilities are gone already, just because getuid() returns nonzero.

Other namespaces

Once all needed network sockets have been opened we can drop support for creating new ones. And we can detach from other namespaces too.
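
Something like this (a sketch of my own; it assumes we’re already root inside our new user namespace, so no capabilities are needed from the parent):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

// Detach from the network, IPC, and UTS namespaces too. After this
// (aside from already open sockets) there's no network at all, not
// even loopback.
int drop_other_namespaces(void)
{
  if (unshare(CLONE_NEWNET | CLONE_NEWIPC | CLONE_NEWUTS)) {
    perror("unshare(CLONE_NEWNET|CLONE_NEWIPC|CLONE_NEWUTS)");
    return 1;
  }
  return 0;
}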

Making the chroot dir read only

There appear to be two ways to create a "safe" working directory.

As far as I can tell, after deleting the current directory that we’re chrooted into, it’s impossible to create any new files or directories in it. It’ll fail with ENOENT.

High level

dir = mkdtemp()
chdir(dir)
rmdir(dir)
setuid_setgid(nobody, nogroup);  // This will fail if not root / suid, but that's fine.
drop_capabilities();      // Capabilities in parent namespace.
unshare(CLONE_NEWUSER|CLONE_NEWNS);
chroot(".");
drop_capabilities();      // Capabilities in new namespace.

The downside here is that if the chroot() fails inside the new namespace, but would have succeeded if we’d just gone the POSIX way, then it’s too late to go back and try again.
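
The drop_capabilities() step is left abstract above. One way to do it is with libcap (a sketch of my own, linked with -lcap, not necessarily what the real code does). Note that it clears the effective, permitted, and inheritable sets but leaves the bounding set alone, which matches the CapBnd you’ll see further down:

#include <stdio.h>
#include <sys/capability.h>

// Return 0 on success.
int drop_capabilities(void)
{
  // cap_init() returns a capability state with all flags cleared.
  cap_t empty = cap_init();
  if (empty == NULL) {
    perror("cap_init()");
    return 1;
  }
  // Apply it: effective, permitted and inheritable all become empty.
  if (cap_set_proc(empty)) {
    perror("cap_set_proc()");
    cap_free(empty);
    return 1;
  }
  cap_free(empty);
  return 0;
}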

An alternative way

For the superuser case (where setuid() succeeds) the new root file system is empty, deleted, and owned by a user other than the currently running one.

But in the non-superuser case the new file system is all inside the normal user’s UID. User namespaces merely map their UIDs to real UIDs, they don’t create new ones. Maybe there’s then something they can do to create files in there, and possibly fill up the disk.

I don’t think so, but let’s explore another trick: A cloned mount namespace, with a read-only filesystem. Good luck creating files in a read-only file system.

Unfortunately while it’s possible to rmdir the current working directory, it’s not possible to rmdir a directory that’s a mount point. So here we’d leak the temporary directory.

Unless we don’t create one.

We can mount “over” an existing directory, and use that. Then we won’t leak anything. It can probably be any directory, except the root.

setuid_setgid(nobody, nogroup);  // This will fail if not root / suid, but that's fine.
drop_capabilities();      // Capabilities in parent namespace.
unshare(CLONE_NEWUSER|CLONE_NEWNS);
mount(tmpfs read only on /tmp);
chdir("/tmp");
chroot("/tmp");
drop_capabilities();      // Capabilities in new namespace.
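
The mount step would be something like this (a sketch; the flags and size are my guesses here):

#include <stdio.h>
#include <sys/mount.h>

// Mount a tiny read-only tmpfs over /tmp inside the new mount
// namespace. Nothing outside the namespace sees this mount.
int mount_readonly_tmp(void)
{
  if (mount("tmpfs", "/tmp", "tmpfs",
            MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC, "size=4k")) {
    perror("mount(tmpfs)");
    return 1;
  }
  return 0;
}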

I’ve implemented both, and they work. I’m undecided on which is best.

Actual code

It’s over 100 lines, so I’m stuffing it into a project on github, but here’s how it looks when it’s running:

Inspecting it when running as root / suid:

$ ps auxww | grep drop
root      115398  0.6  0.0  17308  6748 pts/13   S+   16:13   0:00 sudo ./drop
nobody    115399  0.0  0.0   2480  1632 pts/13   S+   16:13   0:00 ./drop
$ grep Cap /proc/115399/status
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
$ ls -l /proc/115399/{cwd,root}
lrwxrwxrwx 1 root root 0 Mar 11 16:15 /proc/115399/cwd -> '/tmp/jail-W54mvW (deleted)'
lrwxrwxrwx 1 root root 0 Mar 11 16:15 /proc/115399/root -> '/tmp/jail-W54mvW (deleted)'

When running as normal user:

$ ps auxww | grep drop
thomas    115615  0.0  0.0   2480  1724 pts/13   S+   16:16   0:00 ./drop
$ grep Cap /proc/115615/status
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
$ ls -l /proc/115615/{cwd,root}
lrwxrwxrwx 1 thomas thomas 0 Mar 11 16:17 /proc/115615/cwd -> '/tmp/jail-G3RXdm (deleted)'
lrwxrwxrwx 1 thomas thomas 0 Mar 11 16:17 /proc/115615/root -> '/tmp/jail-G3RXdm (deleted)'

You can also verify that the network namespaces are different:

$ ls -l /proc/{self,115615}/ns/net
lrwxrwxrwx 1 thomas thomas       0 Mar 11 16:17 /proc/115615/ns/net -> 'net:[4026532208]'
lrwxrwxrwx 1 root   root         0 Mar 11 16:17 /proc/self/ns/net -> 'net:[4026532008]'

This means that (aside from any already open sockets) this process cannot use the network. It doesn’t have any network interfaces. Not even loopback.

To be root or not to be root

With user and mount namespaces you’d think that it doesn’t matter if you’re root or not. You can drop privs equally well anyway.

But really what Linux needs is a setuid_ephemeral(), callable by a nonprivileged user, that sets UID and GID to a one-time ephemeral value. That way normal file system, semaphore, and signal management takes care of the ACLs. And all tooling can be isolated from each other.

setuid() to nobody/nogroup is better than nothing, but it would be better if every such process could get a unique one.

What attack surfaces are still exposed?

Lots still, probably. The process can still kill other processes running as the same user.

pledge() is just so much better than this.

Manpages


Another way MPLS breaks traceroute


I recently got fiber to my house. Yay! So after getting hooked up I started measuring that everything looked sane and performant.

I encountered two issues. Normal people would not notice or be bothered by either of them. But I’m not normal people.

I’m still working on one of the issues (and may not be able to disclose the details anyway, as the root cause may be confidential), so today’s issue is traceroute.

In summary: A bad MPLS config can break traceroute outside of the MPLS network.

What’s wrong with this picture?

$ traceroute -q 1 seattle.gov
traceroute to seattle.gov (156.74.251.21), 30 hops max, 60 byte packets
 1  192.168.x.x (192.168.x.x)  0.302 ms     <-- my router
 2  194.6.x.x.g.network (194.6.x.x)  3.347 ms
 3  10.102.3.45 (10.102.3.45)  3.391 ms
 4  10.102.2.29 (10.102.2.29)  2.841 ms
 5  10.102.2.25 (10.102.2.25)  2.321 ms
 6  10.102.1.0 (10.102.1.0)  3.454 ms
 7  10.200.200.4 (10.200.200.4)  2.342 ms
 8  be2878.ccr21.cle04.atlas.cogentco.com (154.54.26.129)  78.086 ms
 9  be2717.ccr41.ord01.atlas.cogentco.com (154.54.6.221)  137.346 ms
10  be2831.ccr21.mci01.atlas.cogentco.com (154.54.42.165)  97.062 ms
11  be3036.ccr22.den01.atlas.cogentco.com (154.54.31.89)  108.071 ms
12  be3037.ccr21.slc01.atlas.cogentco.com (154.54.41.145)  118.264 ms
13  be3284.ccr22.sea02.atlas.cogentco.com (154.54.44.73)  137.982 ms
14  be2895.rcr21.sea03.atlas.cogentco.com (154.54.83.170)  139.721 ms
15  te0-0-2-3.nr11.b022860-0.sea03.atlas.cogentco.com (154.24.22.134)  139.110 ms
16  38.122.90.122 (38.122.90.122)  139.571 ms^C

I mean besides my ISP running infrastructure on RFC1918 addresses.

The problem is between hop 7 and 8. I just don’t believe it. There’s no way my ISP has a connection to a router 78ms away. cle04, that sounds like the airport code for Cleveland, Ohio. In any case it’s clearly in the US, and my ISP is not transatlantic.

Clearly there are missing router hops. This traceroute is a lie.

Here’s another lying traceroute:

$ sudo tcptraceroute -q 1 netnod.se
Running:
        traceroute -T -O info -q 1 netnod.se
traceroute to netnod.se (192.71.80.67), 30 hops max, 60 byte packets
[…]
 3  10.102.3.45 (10.102.3.45)  9.998 ms
 4  10.102.2.29 (10.102.2.29)  9.981 ms
 5  10.102.2.25 (10.102.2.25)  9.968 ms
 6  10.102.1.0 (10.102.1.0)  6.423 ms
 7  10.200.200.4 (10.200.200.4)  6.029 ms
 8  et48.ro1-stb.netnod.se (81.19.110.46)  40.572 ms
 9  www.netnod.fi (192.71.80.67) <syn,ack>  148.590 ms

No way my ISP peers with Netnod in Sweden directly. Not a chance.

And to get confirmation on that, let’s see who 81.19.110.46 actually is.

$ whois 81.19.110.46
[…]
role:           NTT America IP Addressing
address:        8005 S Chester ST
address:        Suite 200
address:        Englewood, CO 80112
address:        United States
phone:          +1 303 645-1900
remarks:        Abuse/UCE: abuse@ntt.net
remarks:        Network: noc@ntt.net
remarks:        Security issues: security@ntt.net

NTT. That’s way more plausible. NTT is a huge network provider, and I fully believe that my ISP would use them as transit to get to Sweden.

So clearly the packets are going from me, to my ISP, then to NTT, and then to Netnod.

But why am I not seeing the router hops inside NTT?

Brief summary of traceroute

If memory serves then traceroute was not designed into the internet protocols, but more like discovered.

You simply send a packet to the destination, but with a hop limit (poorly named "Time" to live, TTL. It’s not time, it’s hops) set to one. If the destination is more than one hop away, then you’ll get an ICMP TTL Exceeded message back, sent by the last router you managed to reach. And then you send another packet with TTL set to 2, and repeat until you get a response from the destination itself.

More info in RFC1393. That traceroute IP option never really happened.
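
The core of a traceroute implementation is just a setsockopt() per probe. A sketch of the sending side (UDP-style probes; send_probe is my name for it, and reading the ICMP TTL Exceeded replies is left out):

#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

// Send one UDP probe towards dst with the given TTL. The ICMP TTL
// Exceeded reply (if any) is read elsewhere, on a raw socket.
int send_probe(int sock, const struct sockaddr_in* dst, int ttl)
{
  if (setsockopt(sock, IPPROTO_IP, IP_TTL, &ttl, sizeof(ttl))) {
    perror("setsockopt(IP_TTL)");
    return 1;
  }
  const char payload[] = "probe";
  if (0 > sendto(sock, payload, sizeof(payload), 0,
                 (const struct sockaddr*)dst, sizeof(*dst))) {
    perror("sendto()");
    return 1;
  }
  return 0;
}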

So why would hops be missing?

Sometimes you don’t get an ICMP TTL Exceeded. Often because it was filtered, but it could also be because routers don’t have much CPU (compared to how much they can route "in hardware"), and may use control plane policing to limit the number of administrative packets like this that the router sends.

But those don’t manifest as a suspicious set of missing hops. They show up as hops with an address of * in the traceroute.

In other words traceroute knows that nobody at all responded when it sent with TTL=5, so on line 5 there will always be a result.

So that’s not it.

TTL modification

The only way a hop can be missing is if a router on the path increases the TTL of a packet.

This should never happen. (famous last words)

You can do this in Linux. For example with:

iptables -t mangle -A PREROUTING -i eth0 -j TTL --ttl-set 64

But, uh, don’t do that at home. If you get a routing loop you could create immortal packets zooming around taking network bandwidth and CPU forever, possibly crashing all your routers and poisoning your cat.

Is my ISP doing this? Maybe. It’s not my primary guess, though. Mostly because it’s actually hard to do accidentally, and too dangerous to do on purpose.

MPLS

Let me explain! No, there is too much. Let me sum up.

MPLS encapsulates IP, sends it over the MPLS network, and then decapsulates it in order to send it back out on the other side.

Traceroute by default works fine through an MPLS network, because:

  • On encapsulation it copies the TTL from the IP packet into the MPLS TTL field.

  • For every hop in the MPLS network it decrements the MPLS TTL, but these hops do not look at the encapsulated IP TTL at all.

  • On decapsulation it takes the remaining MPLS TTL and uses it to overwrite the IP TTL, and sends the IP packet on its way.

So the net effect is that the TTL is decremented as if there were no tunneling.

An example TTL of a packet as it goes hop by hop may be:

5 4 3(ip=4) 2(ip=4) 1

A raw number here means IP TTL, and 3(ip=4) means an MPLS packet with TTL 3, encapsulating an IP packet with TTL 4.

MPLS with network hiding

Ugh, I hate this feature so much. Please never use this feature.

You need to understand network hiding to see what the actual problem is, but I want to stress that it is NOT the case that NTT uses network hiding.

It sounds like it would be, but it’s not.

Network hiding changes the way MPLS encap and decap works with regards to TTL. With network hiding the behaviour changes to this:

  • On encap it ignores the IP TTL, and instead uses a high value (like 255) for MPLS TTL.

  • MPLS hops decrement this MPLS TTL value as normal.

  • On decap it sends the IP packet out AS-IS, without changing the TTL.

This means that it’s impossible for users outside the MPLS network to cause the TTL to expire inside the MPLS network. Therefore those hops will never send an ICMP TTL Exceeded, and there will be a suspicious “long hop” in the traceroute, covering the entire MPLS part. Kind of what I’m seeing.

An example TTL of a packet is now:

5 4 255(ip=4) 254(ip=4) 253(ip=4) 3 2 1

The TTL inside the MPLS network is no longer controlled by traceroute.

But that’s not it!

I have done many traceroutes. Some go via NTT. Some are direct peerings with my ISP. And they all show this weird gap. And I know for a fact that some of these networks do NOT use network hiding. In fact some don’t even use MPLS!

The only common factor here is my ISP.

Could it be that my ISP has a firewall rule that bumps up the TTL, before it leaves their network?

Theoretically yes, but… why? And how? This is a dangerous feature, and I would not expect normal routers to even have support for this.

An alternative theory: Misconfiguration

So how do you increase the TTL without increasing the TTL?

For network hiding you need to configure both the encap router and the decap router (ingress and egress) to use the other behaviour.

What if you configure only one side?

In this case it looks like my ISP only did this on the egress side. This means that:

  • On encap the IP TTL is copied to the MPLS TTL, and forwarded.

  • MPLS hops decrement the MPLS TTL. Hops inside my ISP still show up because my traceroute controls the TTL.

  • On decap the remaining MPLS TTL is thrown away, and the original IP TTL from encap is reused!

This means that for as many MPLS hops as we took inside my ISP, counting down the MPLS TTL, we will count them down AGAIN, before TTL expires.

Because of my ISP’s configuration there’s no way to have the TTL expire for a few hops AFTER my ISP!

The TTL life is now:

5 4 3(ip=4) 2(ip=4) 1(ip=4) 3 2 1

You cannot set the TTL so that it expires right after the MPLS network. It either expires inside it, or it needs to count down again for a few hops.

And in the Netnod example above it removes all but the last NTT hop.

Can I reproduce this configuration error?

GNS3 is an amazing tool for this. It may not be able to simulate state-of-the-art routers, but we don’t need to simulate 400Gbps links; a Cisco 7200 can do most things, including MPLS.

netmap

The IP address plan in the setup is, in short, that the last octet is the index of the router. So R6 has addresses of the form x.x.x.6.

Traceroutes in both directions when correctly configured:

R1#trace  8.8.8.8 so lo0
Type escape sequence to abort.
Tracing the route to 8.8.8.8
VRF info: (vrf in name/id, vrf out name/id)
  1 12.0.0.2 8 msec 20 msec 12 msec
  2 20.0.23.3 [MPLS: Labels 18/21 Exp 0] 88 msec 124 msec 124 msec
  3 20.0.39.9 [MPLS: Labels 22/21 Exp 0] 92 msec 132 msec 120 msec
  4 20.0.90.10 [MPLS: Labels 16/21 Exp 0] 116 msec 84 msec 84 msec
  5 20.0.40.4 [MPLS: Label 21 Exp 0] 84 msec 76 msec 84 msec
  6 23.0.0.5 80 msec 92 msec 92 msec
  7 30.0.56.6 [MPLS: Labels 17/16 Exp 0] 176 msec 184 msec 188 msec
  8 30.0.67.7 [MPLS: Label 16 Exp 0] 144 msec 164 msec 140 msec
  9 34.0.0.8 192 msec 176 msec 172 msec

(the double MPLS labels are there because I used send-label on the iBGP sessions in order to maximize the MPLS on the wire otherwise lost to penultimate hop popping. You don’t need to know what any of that means)

R8#traceroute 1.1.1.1 so lo0
Type escape sequence to abort.
Tracing the route to 1.1.1.1
VRF info: (vrf in name/id, vrf out name/id)
  1 34.0.0.7 12 msec 16 msec 16 msec
  2 30.0.67.6 [MPLS: Labels 16/25 Exp 0] 80 msec 80 msec 48 msec
  3 30.0.56.5 [MPLS: Label 25 Exp 0] 60 msec 48 msec 44 msec
  4 23.0.0.4 40 msec 52 msec 104 msec
  5 20.0.40.10 [MPLS: Label 23 Exp 0] 144 msec 184 msec 180 msec
  6 20.0.90.9 [MPLS: Label 20 Exp 0] 176 msec 180 msec 164 msec
  7 20.0.39.3 [MPLS: Label 17 Exp 0] 188 msec 184 msec 180 msec
  8 20.0.23.2 [MPLS: Label 16 Exp 0] 148 msec 120 msec 152 msec
  9 12.0.0.1 140 msec 164 msec 148 msec

But changing just one thing, adding “no mpls ip propagate-ttl forward” to R4 (in my ISP), suddenly all of NTT (not my ISP!) disappears from the traceroute:

R1#trace  8.8.8.8 so lo0
Type escape sequence to abort.
Tracing the route to 8.8.8.8
VRF info: (vrf in name/id, vrf out name/id)
  1 12.0.0.2 12 msec 20 msec 16 msec
  2 20.0.23.3 [MPLS: Labels 18/21 Exp 0] 84 msec 132 msec 112 msec
  3 20.0.39.9 [MPLS: Labels 22/21 Exp 0] 124 msec 92 msec 104 msec
  4 20.0.90.10 [MPLS: Labels 16/21 Exp 0] 128 msec 120 msec 92 msec
  5 20.0.40.4 [MPLS: Label 21 Exp 0] 96 msec 96 msec 104 msec
  6 34.0.0.8 168 msec 140 msec 168 msec

Traceroute in the other direction when propagate-ttl is disabled on R4 makes NTT perfectly visible, but hides some of my ISP, as expected:

R8#traceroute 1.1.1.1 so lo0
Type escape sequence to abort.
Tracing the route to 1.1.1.1
VRF info: (vrf in name/id, vrf out name/id)
  1 34.0.0.7 16 msec 8 msec 8 msec
  2 30.0.67.6 [MPLS: Labels 16/25 Exp 0] 40 msec 60 msec 40 msec
  3 30.0.56.5 [MPLS: Label 25 Exp 0] 60 msec 48 msec 56 msec
  4 23.0.0.4 80 msec 72 msec 80 msec
  5 20.0.23.2 [MPLS: Label 16 Exp 0] 148 msec 160 msec 164 msec
  6 12.0.0.1 144 msec 168 msec 144 msec

So this experiment confirms that the theory explains what I’m seeing.

Side track: What if only encap has it turned on?

The quick reader may wonder: if only the ingress router has TTL propagation off, then perhaps the TTL will get bumped from 4 or so all the way up to 250+. That is, like so:

5 4 255(ip=4) 254(ip=4) 253(ip=4) 252 251

That would make sense, per my description above about decap. It’s supposed to take the MPLS TTL and overwrite the IP TTL.

But no, it does not.

This makes sense to me, and is the obviously correct way for a router that does TTL propagation to behave. To take MIN(MPLS TTL, IP TTL).

This is because if the decap router propagates TTL, then it should assume that the encap router did too. And if the encap router does, then the MPLS TTL will always be lower than the IP TTL.

If it sees an MPLS TTL greater than the IP TTL, then it knows that there is a mismatch, and it does the safe thing to avoid immortal packets and router crashes / network collapse.

This “do the right thing” can only be done when TTL propagation is turned on (the default). When turned off there’s no way for the router to know which TTL is correct, so it has to use the IP TTL in all cases.

So is that what my ISP is doing?

Maybe. It fits. I’ve emailed their support, and am awaiting a reply.

I swear life would be so much easier for everyone involved if any ISP I get would just give me admin access to their network.

Of course the first thing I’d do would be to set up IPv6…

Java — A fractal of bad experiments


The title of this post is clearly a reference to the classic article PHP: a fractal of bad design. I’m not saying Java is as bad as that, but that it has its own problems.

Do note that this post is mostly opinion.

And I’m not saying any language is perfect, so I’m not inviting “but what about C++’s so-and-so?”.

What I mean by "bad experiments" is that I don’t think the decisions the creators of Java made were bad given the information they had at the time, but that with the benefit of hindsight they have proven to be ideas and experiments that turned out badly.

Ok, one more disclaimer: In some parts here I’m not being precise. I feel like I have to say that I know that, to try to reduce the anger from Java fans being upset about me critiquing their language.

Don’t identify with a language. You are not your tool.

Too much OOP

A lot of Java’s problems come from the fact that it’s too object oriented. It behaves as if everything is axiomatically an object.

No free-standing functions allowed. So code is full of public static functions, in classes with no non-static methods at all.

Object.class is an object, so it can be passed in as an object, to create the ugliest source of runtime type error crashes I’ve ever seen.

Nothing like waiting three hours for a pipeline to finish running, only for it to fail at a final step because of a type error, in what was supposed to be a statically typed language.

Too much heap use

The language doesn’t allow for objects allocated outside the heap. Everything is just an object where the programmer is not supposed to care about where it lives.

Not only is this a problem for readers of the code, but it also makes writing garbage collectors much harder.

Java may have expected that a “sufficiently smart garbage collector” would solve this. It has turned out that the garbage collector needs help from the language to do a good job.

Go does this much better. It does escape analysis on local variables, thus reducing heap use. It also composes objects into its structs, so that one object with 10 (non-pointer) subobjects becomes just one object in memory, not 11.

Anyone who’s ever needed to run a production service written in Java can attest to how much care and feeding the GC needs. These problems are not inherent to a GC, but ultimately come from the design of the Java language.

So it’s not that Go doesn’t have as advanced GC as Java, it’s that it doesn’t even need it.

When is the file opened?

A long time ago now I made a small tool in Java that would take a file, and upload it to a server.

The part that would be easiest, or so I thought, would be to simply read the file. Something like:

File file = new File(filePath);
FileInputStream fileToUpload = new FileInputStream(file);
byte[] buffer = new byte[size];
int read = fileToUpload.read(buffer);
byte[] bytesRead = Arrays.copyOf(buffer, read);

Now, clearly this will throw an exception if the file doesn’t exist (oh, I’ll get to that, believe me). But where?

Which line throws an exception?

Honestly I don’t remember anymore, but I do remember that it wasn’t the one I first thought.

And I remember at the time showing this code to more experienced Java programmers, and they all got it wrong too.

You could call me a terrible Java programmer. And everyone I asked was too. But you can’t deny that this is about as simple a question as you can get about error handling, and it says something about the language if this many people get it wrong.

Terrible error messages

Once upon a time this issue affected C++. GCC has gotten much better with this over the years. If Java was ever good at it, then it sure isn’t now.

Like with the other problems I see where the good intentions came from.

Someone looked at C++ error messages, specifically involving std::string and how there are huge basic_string<…> expansions everywhere, and decided: wouldn’t it be nice if that template expansion were just an appendix?

Does it really help, though? I’ve had single character errors produce 20-30 lines of this:

C#3 extends Foo<Pair<K#3,V#3>> declared in method <K#3,V#3,C#3>create(Multimap<K#3,V#3>,FooFactory<Pair<K#3,V#3>,C#3>) 
T#2 extends Object declared in method <T#2,C#4>create(TObject<? extends Fubar<T#2>>,FooFactory<T#2,C#4>)      
C#4 extends Foo<T#2> declared in method <T#2,C#4>create(TObject<? extends Fubar<T#2>>,FooFactory<T#2,C#4>)

How is that helpful? How did it manage to be less readable than C++ error messages from 20 years ago?

Virtual machine bytecode

In an interview Gosling has said that the great idea for a Java bytecode VM came from creating an interpreter for some Pascal pcode.

Basically they had some Pascal code that needed to run on another machine, so they decided to interpret the intermediate format, instead of recompiling.

I know that recompiling isn’t as easy as it should be. C & C++ code needs to not depend on size of pointers, endianness, and various other architecture-specific things, in order to be source code portable.

C++ needs to be source code portable to be actually portable.

Java assumed that being binary portable matters. It does not.

The property of “write once, run anywhere” (WORA) does not require a deliverable that is bytecode, and in any case “write once, run anywhere” does not mean “compile once, run anywhere”.

For a simple example of this see Go. It has a cross compiler built in, so writing once and running anywhere just means that you have to create a for-loop to build all the binaries.

WORA would have made sense if Java had won the browser extension war, but it hasn’t. Javascript clearly won. It’s over. And if it does get replaced then it won’t be by Java, but maybe something like webassembly.

WORA isn’t even true. I have jar files from 20 years ago that just don’t run anymore. Others do. But it seems about as hit and miss as my 20 year old C++ code. At least the C++ code that needed a fix to work was always broken; it was just that the compiler became pickier.

Java saw the pipeline of source→IR→machine code and decided to make the public interface not the IR, but the machine code.

This doesn’t make sense to me. Under what circumstances is it inconvenient to port a compiler backend to a platform, but not to port the JRE?

Why waste transistors in a SIM card to run Java bytecode, when it could run anything you compile for it?

Java bytecode is pretty much IR. Fine, but why are you executing your IR? Or if you’re not, why is your runtime environment including a compiler backend?

This decision doesn’t make any sense on the machines we actually ended up executing code on.

So you could do all what Gosling mentions in the interview, without the drawback of not having an actual executable binary.

UTF-16

Born too late to not handle unicode at all. Born too early to know that UTF-8 is the obviously right choice.

UTF-16 is terrible. It takes up twice as much space as UTF-8 for ASCII-range text, yet is not fixed width, so it also doesn’t get the benefit of constant-time counting of code points like UTF-32 does.

RAM has gotten bigger, but taking up twice as much CPU cache will never not have downsides.

And of course UTF-16 means having to deal with the hell of byte order marks (BOMs).

Fixed memory pool size

Java isn’t the only language that does this. I remember some Lisp implementation that just allocated a few gigs of RAM to work in, and relied on page faulting to death if there was no physical memory to back it.

Incidentally this doesn’t work on OpenBSD, or on Linux if overcommit is turned off. You just can’t run any program written in this language if vm.overcommit_memory=2.

Because Java is a virtual machine it needs a certain amount of memory. It simply grabs this as a huge chunk, and tells the OS to stay out of it.

Sure, on that level it’s similar to sbrk or mmaping anonymous pages, and having libc allocate objects in there. But libc does that on demand. You don’t have to tell libc how big you want your heap to be. Why would you? That would be madness. That’s the computer’s job.

But if you’ve not had to deal with -Xms, -Xmx and other options in Java, then you’ve not run a real Java service.

So you have to tweak the GC and the memory allocator. It plays very poorly with the OS’s memory management. Great.

Even though (per the previous reference) the compaction possibilities enabled by taking over this much memory are basically just a patch for a fundamental flaw in the language: the fact that it creates fragmented memory in the first place.

Exceptions for non-exceptional cases

Java throws more exceptions than I can count. Real production environments actually graph exceptions per second.

Exceptions are expensive, and should not be thrown for flow control. Yet they are.

This is not a problem with the language, per se, but with its standard library. And the standard library sets the style for the whole language.

C++ has exceptions. But its standard library doesn’t use them for simple errors. This has led to code generally not using exceptions for normal errors.

I say generally, because it’s not rare for C++ code to overuse exceptions. This is a case of C++ making it too easy to shoot yourself in the foot.

Go has exceptions too. But not only does the standard library not really use them, they’re also very crippled, so nobody else wants to use them. Go discourages this feature by making it bad.

This has led to even less use of exceptions in Go.

C++ in a way also discourages overuse of exceptions, by not having a finally keyword. That’s because finally is a code smell of a language.

C++ has RAII, so there is a natural and MUCH safer method of cleaning up, using shared code for the normal and exception case.

Go has defer, which is a poor man’s RAII. (very poor man’s, as it doesn’t even run at end of scope, but end of function, which makes no sense and causes endless ugly code and bugs).

In all these three languages you need to write exception-safe code (yes, even in Go, and code WILL be buggy if you don’t), but finally is just the worst way possible to handle both code paths.

Conclusion

These are the concrete reasons I can think of right now. But I’m sure I’ll think of more eventually.

I have learned many languages over 30 years. Basic, C, C++, Pascal, PHP, Prolog, Perl, Erlang, Fortran, Go, and more. Naturally there are languages that I like less or like more. But Java is the only language that I hate.

No way to parse integers in C


There are a few ways to attempt to parse a string into a number in the C standard library. They are ALL broken.

Leaving aside the wide character versions, and staying with long (skipping int, long long or intmax_t, these variants all having the same problem) there are three ways I can think of:

  1. atol()
  2. strtol() / strtoul()
  3. sscanf()

They are all broken.

What is the correct behavior, anyway?

I’ll start by claiming a common sense “I know it when I see it”. The number that I see in the string with my eyeballs must be the numerical value stored in the appropriate data type. “123” must be turned into the number 123.

Another criterion is that the WHOLE number must be parsed. It is not OK to stop at the first sign of trouble, and return whatever maybe is right. "123timmy" is not a number, nor is the empty string.

Failing to provide the above must be an error. Or at least as the user of the parser I must have the option to know if it happened.

First up: atol()

| Input                            | Output   |
|----------------------------------|----------|
| 123timmy                         | 123      |
| 99999999999999999999999999999999 | LONG_MAX |
| timmy                            | 0        |
| empty string                     | 0        |
| ""                               | 0        |

No. All wrong. And no way for the caller to know anything happened.

For the LONG_MAX overflow case the manpage is unclear if it’s supposed to do that or return as many nines as it can, but empirically on Linux this is what it does.

POSIX says “if the value cannot be represented, the behavior is undefined” (I think they mean unspecified).

Great. How am I supposed to know if the value can be represented if there is no way to check for errors? So if you pass a string to atol() then you’re basically getting a random value, with a bias towards being right most of the time.

I can kinda forgive atol(). It’s from a simpler time, a time when gets() seemed like a good idea. gets() famously cannot be used correctly.

Neither can atol().

Next one: strtol()

I’ll now contradict the title of this post. strtol() can actually be used correctly. strtoul() cannot, but if you’re fine with signed types only, then this’ll actually work.

But only carefully. The manpage has example code, but in function form it’s:

bool parse_long(const char* in, long* out)
{
  // Detect empty string.
  if (!*in) {
    fprintf(stderr, "empty string\n");
    return false;
  }

  // Parse number.
  char* endp = NULL;  // This will point to end of string.
  errno = 0;          // Pre-set errno to 0.
  *out  = strtol(in, &endp, 0);

  // Range errors are delivered as errno.
  // I.e. on amd64 Linux it needs to be between -2^63 and 2^63-1.
  if (errno) {
    fprintf(stderr, "error parsing: %s\n", strerror(errno));
    return false;
  }

  // Check for garbage at the end of the string.
  if (*endp) {
    fprintf(stderr, "incomplete parsing\n");
    return false;
  }
  return true;
}

It’s a matter of the API here if it’s OK to clobber *out in the error case, but that’s a minor detail.

Yay, signed numbers are parsable!

How about strtoull()?

Unlike its sibling, this function cannot be used correctly.

The strtoul() function returns either the result of the conversion or, if  there
was  a  leading  minus sign, the negation of the result of the conversion repre‐
sented as an unsigned value

Example outputs on amd64 Linux:

| Input raw              | Input       | Output raw            | Output |
|------------------------|-------------|-----------------------|--------|
| -1                     | -1          | 18446744073709551615  | 2^64-1 |
| -9223372036854775808   | -2^63       | 9223372036854775808   | 2^63   |
| -9223372036854775809   | -2^63-1     | 9223372036854775807   | 2^63-1 |
| ""                     | just spaces | Error: endp not null  |        |
| -18446744073709551614  | -2^64+2     | 2                     | 2      |
| -18446744073709551615  | -2^64+1     | 1                     | 1      |
| -18446744073709551616  | -2^64       | Error ERANGE          |        |

Phew, finally an error is reported.

This is in no way useful. Or I should say: Maybe there are use cases where this is useful, but it’s absolutely not a function that returns the number I asked for.

The title in the Linux manpage is convert a string to an unsigned long integer. It does that. Technically it converts it into an unsigned long integer. Not the obviously correct one, but it indeed returns an unsigned long.

It’s interesting to note that a non-empty input of just spaces is detectable as an error. It’s obviously the right thing to do, but it’s not clear that this is intentional.

So check your implementation: If passed an input of all isspace() characters, is this correctly detected as an error?

If not then strtol() is probably broken too.

Maybe sscanf()?

A bit less code needed, which is nice:

bool parse_ulong(const char* in, unsigned long* out)
{
  char ch; // Probe for trailing data.
  int len;
  if (1 != sscanf(in, "%lu%n%c", out, &len, &ch)) {
    fprintf(stderr, "Failed to parse\n");
    return false;
  }

  // This never triggered, so seems sscanf() doesn't stop
  // parsing on overflow. So it's safe to skip the length check.
  if (len != (int)strlen(in)) {
    fprintf(stderr, "Did not parse full string\n");
    return false;
  }
  return true;
}

| Input raw              | Input       | Output raw            | Output |
|------------------------|-------------|-----------------------|--------|
| ""                     | just spaces | Failed to parse       |        |
| -1                     | -1          | 18446744073709551615  | 2^64-1 |
| -9223372036854775808   | -2^63       | 9223372036854775808   | 2^63   |
| -9223372036854775809   | -2^63-1     | 9223372036854775807   | 2^63-1 |
| -18446744073709551614  | -2^64+2     | 2                     | 2      |
| -18446744073709551615  | -2^64+1     | 1                     | 1      |
| -18446744073709551616  | -2^64       | 18446744073709551615  | 2^64-1 |

As we can see here this is of course nonsense (except the first one). The last one is extra fun. You’d expect from the two before it that it would be 0, or at least an even number. But no.

That last number is simply “out of range”, and that’s reported as ULONG_MAX.

But you cannot know this. Getting ULONG_MAX as your value could be any one of:

  1. The input was exactly that value.
  2. The input was -1.
  3. The input is out of range, either greater than ULONG_MAX, or less than negative ULONG_MAX plus one.

There is no way to detect the difference between these.

So sscanf() is out, too.

Why does this matter?

Garbage in, garbage out, right? Why does it matter that someone might give you -18446744073709551615 knowing you’ll parse it as 1?

Maybe it’s a funny little trick, like ping 0.

First of all it matters because it’s wrong. That is not, in fact, the number provided.

Maybe you’re parsing a bunch of data from a file. You really should stop on errors, or at least skip bad data. But incorrect parsing here will make you proceed with processing as if the data is correct.

Maybe some ACL only allows you to provide negative numbers, and you use this trick to make it parse as negative in some contexts (e.g. Python), but positive in others (strtoul()).

I even saw a comment saying “when you have requirements as specific as this”. As specific as “parse the number, correctly”?

It should matter that programs do the right thing for any given input. It should matter that APIs can be used correctly.

Knives should have handles. It’s fine if the knives are sharp, but no knife should be void of safe places to hold it.

It should be possible to check for errors.

Can I work around it?

You cannot even assemble the pieces here into a working parser for unsigned long.

Maybe you think you can filter out the incorrect cases, and parse the rest. But no.

You can detect negative numbers with strtol(), range checked and all, and discard all these. But you can’t tell the difference between being off scale low between -2^64…-2^63, and perfectly valid upper half of unsigned long, 2^63-1…2^64-1.

It’s not a solution to go one integer size bigger, either. long is long long is intmax_t on my system.

So what do I do in practice?

Do you need to be able to parse the upper half of unsigned long? If not, then:

  1. use strtol()
  2. Check for less than zero
  3. Cast to unsigned long

If all you need is unsigned int, then maybe on your system sizeof(int)<sizeof(long), and this can work. Just cast to unsigned int in the last step.
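
In function form, that workaround looks something like this (my sketch; it deliberately only covers 0 through LONG_MAX):

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Parse an unsigned long, but only accept the range 0 .. LONG_MAX.
bool parse_ulong_lower_half(const char* in, unsigned long* out)
{
  // Detect empty string.
  if (!*in) {
    fprintf(stderr, "empty string\n");
    return false;
  }

  // Parse as signed, exactly like parse_long() above.
  char* endp = NULL;
  errno = 0;
  const long val = strtol(in, &endp, 0);
  if (errno) {
    fprintf(stderr, "error parsing: %s\n", strerror(errno));
    return false;
  }
  if (*endp) {
    fprintf(stderr, "incomplete parsing\n");
    return false;
  }

  // Reject negative numbers, then cast. Safe since 0 <= val <= LONG_MAX.
  if (val < 0) {
    fprintf(stderr, "negative number\n");
    return false;
  }
  *out = (unsigned long)val;
  return true;
}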

Do you need the upper half? Sorry, you’re screwed. Write your own parser.

These numbers are very high, yes, and maybe you’ll be fine without them. But one day you’ll be asked to parse a 64bit flag field, and you can’t.

0xff02030405060708 cannot be unambiguously parsed by standard parsers, even though there’s ostensibly a perfectly cromulent strtoul() that handles hex numbers and unsigned longs.

Any hope for C++?

Not much, no.

C++ method std::stoul()

bool parse_ulong(const std::string& in, unsigned long* out)
{
  size_t pos;
  *out = std::stoul(in, &pos);
  if (in.size() != pos) {
    return false;
  }
  return true;
}

| Input raw             | Input       | Output raw                   | Output |
|-----------------------|-------------|------------------------------|--------|
| ""                    | just spaces | throws std::invalid_argument |        |
| timmy                 | text        | throws std::invalid_argument |        |
| -1                    | -1          | 18446744073709551615         | 2^64-1 |
| -9223372036854775808  | -2^63       | 9223372036854775808          | 2^63   |
| -9223372036854775809  | -2^63-1     | throws std::out_of_range     |        |

Code is much shorter, again, which is nice.

And std::istringstream(in) >> *out;?

Same.

In conclusion

Why is everything broken? I don’t think it’s too much to ask to turn a string into a number.

In my day job I deal with complex systems with complex tradeoffs. There’s no tradeoff, and nothing complex, about parsing a number.

In Python it’s just int("123"), and it does the obvious thing. But only signed.

Maybe Google is right in saying just basically never use unsigned. I knew the reasons listed there, but I was not previously aware that the C and C++ standard library string to int parsers were also basically fundamentally broken for unsigned types.

But even if you follow that advice sometimes you need to parse a bit field in integer form. And you’re screwed.

Integer handling is broken


Floating point can be tricky. You can’t really check for equality, and with IEEE 754 you have a bunch of fun things like values of not a number, infinities, and positive and negative zero.

But integers are simple, right? Nope.

I’ll use “integers” to refer to all integer types. E.g. C’s int, unsigned int, gid_t, size_t, ssize_t, unsigned long long, and Java’s int, Integer, etc…

Let’s list some problems:

What’s wrong with casting?

Casting an integer from one type to another changes three things:

  1. The type in the language’s type system.
  2. Crops values that don’t fit.
  3. May change the semantic value, by changing sign.

The first is obvious, and is even safe for the language to do implicitly. Why even bother telling the human that a conversion was done?

But think about the other two for a minute. Is there any reason that you want your code to take one number, and because of a type system detail it’ll change it to another number, possibly even changing its sign?

Does your code fulfill its promise if it accepts one number but uses another?
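
A tiny illustration of points 2 and 3, using 2560384, a number that comes back later in this post (the exact result of cropping into a signed type is technically implementation-defined, but this is what gcc and clang on amd64 do):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
  const uint32_t times = 2560384;        // "do this 2560384 times"
  const int8_t cropped = (int8_t)times;  // silently becomes -128
  printf("%u -> %d\n", times, cropped);
  return 0;
}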

Does this make any sense?

# strace -e setsockopt ping -Q 4294967298 -c 1 8.8.8.8
setsockopt(3, SOL_IP, IP_TOS, [2], 4)   = 0      <-- 4294967298%2^32=2
setsockopt(4, SOL_IPV6, IPV6_TCLASS, [2], 4) = 0
setsockopt(5, SOL_IP, IP_TOS, [2], 4)   = 0      <-- 4294967298%2^32=2

# strace -e setsockopt ping -F -4294945720 -c 1 ::1
[…]
setsockopt(4, SOL_IPV6, 0x20 /* IPV6_??? */, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0TH\0\1\1\0\0\0\0\0\0\0\0\0", 32) = 0
                                                                                  ^^
                                                    Flowlabel TH, eh? ------------/

These are negative numbers, and grossly out of range. They should not be accepted.

Basically most languages behave as if when the teacher says your answer is wrong, you reply that your answer was in a different number system, and declare victory.

Go tries to do it right, but kinda makes it worse

This is a compile time error:

var a int32
var b int8
a = b  // cannot use b (variable of type int8) as type int32 in assignment

Sigh, ok. Why, though? It’s safe by definition. Especially since adding an explicit cast can cause errors in the future.

func foo() int8 {
  return someNumber1
}

func bar() int32 {
  return someNumber2
}

func baz(which bool) int16 {
  if which {
    // Bug? No, on closer inspection it's fine for now.
    // But forcing us to use a cast means this will hide bugs
    // if foo() ever changes.
    //
    //
    // So basically this needless explicit cast is not so much a
    // "cast" as it is "please suppress warnings".
    return int16(foo())
  }
  return int16(bar())     // Bug: loss of data.
}

This could have been avoided if the cast were only needed when information risks being cropped or change semantic value. Then the programmer would know to be careful about casting integers, as any time they’d have to cast manually they’d likely be adding a latent bug.

Go forces the programmer to sweep the problem under the rug. Sprinkle some casts and pretend it’s all fine.

The value is still cropped. It’s still the wrong value.

An example of the wrong value

int count = 1;
while(…getopt()) {
  switch (c) {
  case 'c':
    count = atoi(optarg);
    break;
  }
}

With 32bit int on my amd64 Linux this is what I get:

| Input   | Output | What happened? |
|---------|--------|----------------|
| 2^60    | 0      | Wrap :-(       |
| 2^61    | 0      | Wrap :-(       |
| 2^62    | 0      | Wrap :-(       |
| 2^63-2  | -2     | Wrap :-(       |
| 2^63-1  | -1     | Wrap :-(       |
| 2^63    | -1     | Unspecified    |
| 2^64    | -1     | Unspecified    |
| 2^128   | -1     | Unspecified    |
| 2^128+1 | -1     | Unspecified    |

The manpage says the converted value or 0 on error. Though it also says atoi() does not detect errors.

What actually seems to be happening is that it gets parsed as a 64bit signed integer, returning 0 on error (not counting trailing non-numbers as error), counting over/underflow as -1, and then cropping the result into a 32bit integer.

Like… what? How is that “convert a string to an integer”? Oh, let’s not get distracted into what I have already blogged about.

The problem here is actually that the implementation of atoi() in GNU libc is:

int
atoi (const char *nptr)
{
  return (int) strtol (nptr, (char **) NULL, 10);
}

The problem isn’t actually the parsing. As I said, strtol() and strtoll() are actually the only correct number parsers.

Sure, atoi() doesn’t check for trailing characters. This is part of its API, and a known gotcha.

But that harmless looking (int) is the problem, here. It makes it so that given what are clearly valid numbers as input, the returned value will have three different semantic values.

Note here that atoi() returns a 32bit integer.

| Range                           | Semantics                          |
|---------------------------------|------------------------------------|
| -2^31 — +2^31-1                 | Correct & obvious                  |
| -2^63 — -2^31 or 2^31 — 2^63-1  | Truncated using modulus arithmetic |
| <-2^63 or >2^63-1               | -1                                 |

atoi() has a hidden behavioral dependency on the size of long.

The Linux manpage defines the implementation as being a direct call to strtol(), but POSIX only says:

The call atoi(str) shall be equivalent to:
  (int) strtol(str, (char **)NULL, 10)

except that the handling of errors may differ. If the value cannot be
represented, the behavior is undefined.

RETURN VALUE
  The atoi() function shall return the converted value if the value
  can be represented.

Our friend “undefined behavior”. glibc could have chosen to “error” (return 0) when the number cannot fit into the return value type. POSIX does allow for the error handling to be different.

To me the correct implementation of atoi() must be:

int
atoi (const char *nptr)
{
  const long l = strtol (nptr, (char **) NULL, 10);
  if (l < INT_MIN) {
    return 0;
  }
  if (l > INT_MAX) {
    return 0;
  }
  return (int)l;
}

Of course this will give a warning if sizeof long == sizeof int, since the conditionals will in that case always be false.

Cast causes cropping

Seasoned programmers may simply dismiss this as “so what, that’s intended behavior”.

I’m making the case that it’s not.

257 is not 1. 128 is not -128. 2560384 is also not -128. They are different numbers. Changing the type should not change the semantic meaning.

If a user said “do this 2560384 times” then it’s not OK to do it -128 times.

Or if a database transaction says to add 2560384 tokens into a user’s credits, then it’s not OK to subtract 128 tokens.

If you did want cropping, and casting int to int8 for positive numbers, then you can always do:

return (int8_t)(some_int & 0x7f)

See below for the general case.

Cast can cause sign change

package main

import (
  "flag"
  "fmt"
  "log"
)

var (
  num = flag.Int("num", -1, "Some number. Can't be negative")
)

func process(val int16) {
  if val < 0 {
    panic("Can't happen: We've already checked for negative")
  }
  fmt.Println(val)
}

func main() {
  flag.Parse()
  if *num < 0 {
    log.Fatalf("I said -num can't be negative")
  }
  process(int16(*num))
}

Let’s try it out

$ ./a -num=$(python3 -c 'print(2**32-1)')
panic: Can't happen: We've already checked for negative

goroutine 1 [running]:
main.process(0xe180?)
        a.go:15 +0x74
main.main()
        a.go:25 +0x8d

The parsing isn’t the problem. The cast is.

Even haskell is wrong

Prelude> read "9999999999999999999999999999" :: Int
4477988020393345023
Prelude> read "9999999999999999999999999999" :: Integer
9999999999999999999999999999
Prelude> let x = read "9999999999999999999999999999" :: Integer
Prelude> fromIntegral x
9999999999999999999999999999
Prelude> fromIntegral x :: Int
4477988020393345023

So what should I do?

For C and C++ with GCC or Clang, turn on -Wconversion -Wsign-conversion to find the problems. Then, as with all warnings, fix all of them.

In C++

One example, which doesn’t log the reason the cast failed, is this:

#include <cstdint>
#include <cstdlib>
#include <optional>
#include <type_traits>

template<typename To, typename From>
std::optional<To> cast_int(const From from)
{
  // If casting from signed to unsigned then reject negative inputs.
  if (std::is_signed_v<From> && std::is_unsigned_v<To> && from < 0) {
    return {};
  }

  const To to = static_cast<To>(from);

  // If casting from unsigned to signed then the result must be positive.
  if (std::is_unsigned_v<From> && std::is_signed_v<To> && to < 0) {
    return {};
  }

  // If the number fit then it'll be the same number when cast back.
  if (from != static_cast<From>(to)) {
    return {};
  }
  return to;
}

template<typename To, typename From>
To must_cast_int(const From from)
{
  return cast_int<To>(from).value();
}

int main(int argc, char** argv)
{
  // Ignore the parsing problems with atoi() for this example.
  const int a = atoi(argv[1]);

  const auto b = must_cast_int<int8_t>(a);
}

Or if you’re fine with Boost then numeric_cast.
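
From memory the Boost version looks roughly like this (check the docs for the exact header and exception type before trusting me); unlike the optional-returning cast_int above, it throws when the value doesn’t fit:

#include <boost/numeric/conversion/cast.hpp>

#include <cstdint>
#include <iostream>

int main()
{
  try {
    const auto v = boost::numeric_cast<int8_t>(1000);  // 1000 doesn't fit in int8_t.
    std::cout << static_cast<int>(v) << "\n";
  } catch (const boost::numeric::bad_numeric_cast& e) {
    std::cout << "cast failed: " << e.what() << "\n";
  }
}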

C

For C I guess you have to generate that code yourself, like this mkcast.py does (URL subject to change when I merge with the main branch).

But POSIX just says that gid_t is an Integer type. Signed or unsigned? Unspecified. Great.

Although there are POSIX functions (e.g. setregid()) that take a gid_t and use -1 as a special value, implying that gid_t is signed.

Gah, but it also says that they must be positive arithmetic types, which presumably excludes signed!

On my Linux system it’s unsigned, FWIW.

Java

I’m still scarred from before, so leaving this as an exercise for the reader instead of going back into Java land.

Rust

TODO: I only know enough Rust to confirm that it’s a problem here too:

fn main() {
    let foo: i32 = -123;
    // `as` happily reinterprets the bits; no error, no panic.
    let bar = foo as u32;
    println!("Hello world! {}", bar);
}
$ ./r
Hello world! 4294967173

Go

I don’t think the generics type system allows this, so you’ll probably have to write a code generator, like with C.

Haskell

Disclaimer: I’m a haskell newbie. This may be terrible advice.

  1. Always parse as Integer
  2. When casting to Int then do something like the code below.

There are other types too, like Natural in Numeric.Natural, so maybe there are better solutions.

import System.Environment

safe_cast_int from = safe_cast_int' from $ fromIntegral from :: Int
safe_cast_int' from to = do
  case (fromIntegral to :: Integer) == from of
    True -> to
    False -> error $ "Did not return the same value: in=" ++ (show from) ++ " out=" ++ (show to)

main = do
  args <- getArgs
  let str = head args
  putStrLn $ "Input string: " ++ str
  let big = read str :: Integer
  putStrLn $ "Integer: " ++ (show big)
  putStrLn $ "Int: " ++ (show $ safe_cast_int big)
$ ghc cast.hs && ./cast 123
[1 of 1] Compiling Main             ( cast.hs, cast.o )
Linking cast ...
Input string: 123
Integer: 123
Int: 123
$ ./cast $(python3 -c 'print(2**63)')
Input string: 9223372036854775808
Integer: 9223372036854775808
cast: Did not return the same value: in=9223372036854775808 out=-9223372036854775808
CallStack (from HasCallStack):
  error, called at cast.hs:8:14 in main:Main

Python

Python is actually all good, thanks. int will transparently expand into a bigint, and you can’t cast to smaller integers, so no need for special code.

Fast zero copy static web server with KTLS


I’m writing a webserver optimized for serving a static site with as high performance as possible. Counting every syscall, and every copy between userspace and kernel space.

It’s called “tarweb”, because it serves a website entirely from a tar file.

I’m optimizing for latency of requests, throughput of the server, and scalability over number of active connections.

I won’t go so far as to implement a user space network driver to bypass the kernel, because I want to be able to just run it in normal setups, even as non-root.

I’m not even close to done, and the code is a mess, but here are some thoughts for now.

First optimize syscall count

Every syscall costs performance, so we want to minimize those.

The minimum set of syscalls for a webserver handling a request is:

  1. accept() to acquire the new connection.
  2. epoll_ctl() to add the fd.
  3. epoll_wait() & read() or similar. (ideally getting the whole request in one read() call)
  4. epoll_wait() & write() or similar. (again ideally in one call)
  5. close() the connection.

There’s not much to do about accept() and read(), as far as I can see. You need to accept the connection, and you need to read the data.
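
Spelled out as code, the minimal happy path looks roughly like this. It’s a schematic sketch, not tarweb’s actual code: listener setup is squeezed in at the top, error handling and request parsing are omitted, and port 8080 is just a placeholder.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

int main()
{
  // One-time setup; not part of the per-request cost.
  const int listen_fd = socket(AF_INET6, SOCK_STREAM, 0);
  sockaddr_in6 addr{};
  addr.sin6_family = AF_INET6;
  addr.sin6_port = htons(8080);
  bind(listen_fd, reinterpret_cast<sockaddr*>(&addr), sizeof addr);
  listen(listen_fd, SOMAXCONN);
  const int epoll_fd = epoll_create1(0);

  // 1. accept() the new connection. SOCK_NONBLOCK saves a separate fcntl().
  const int fd = accept4(listen_fd, nullptr, nullptr, SOCK_NONBLOCK);

  // 2. epoll_ctl() to add the fd.
  epoll_event ev{};
  ev.events = EPOLLIN;
  ev.data.fd = fd;
  epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fd, &ev);

  // 3. epoll_wait() & read(): ideally the whole request in one read() call.
  epoll_event out{};
  epoll_wait(epoll_fd, &out, 1, -1);
  char buf[4096];
  const ssize_t n = read(fd, buf, sizeof buf);
  (void)n;  // Parse the request here.

  // 4. write() the reply, again ideally in one call.
  const char reply[] = "HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nhi";
  const ssize_t w = write(fd, reply, sizeof reply - 1);
  (void)w;

  // 5. close() the connection (this also drops it from the epoll set).
  close(fd);
  return 0;
}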

Reading the request

You could try to wait a bit before calling read(), hoping that you can coalesce the entire request into one read() call. But that adds latency to the request.

You could call getsockopt(fd, SOL_TCP, TCP_INFO, …) and use tcpi_rtt to try to estimate how long to wait, but it depends on too many factors so it’s likely to add latency no matter what you do.

It would also add another syscall, which is what we wanted to avoid in the first place.
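
For reference, the probe itself is small (a sketch; the helper name is mine, and I use IPPROTO_TCP, which on Linux is the same thing as SOL_TCP):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#include <cstdint>

// Returns the kernel's smoothed RTT estimate in microseconds, or 0 on error.
uint32_t estimated_rtt_us(int fd)
{
  tcp_info info{};
  socklen_t len = sizeof info;
  if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) != 0) {
    return 0;
  }
  return info.tcpi_rtt;
}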

It’s probably best to read data as soon as it’s received. That way you can start parsing headers even though the whole request has not been received. When the request is fully received you’re already ready to send the reply.

There’s also nothing to be done about copying data from kernel space to userspace on read(). Short of implementing the webserver in the kernel (possibly using BPF) or a user space network stack, there’s nothing you can do.

If you use nonblocking sockets you can try a read() right after accept(), in case the client has already landed some data. That saves a syscall if correct, and wastes a syscall if incorrect.
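
A sketch of that optimistic read (the function name is made up, and error handling is minimal):

#include <cerrno>
#include <cstddef>
#include <sys/socket.h>
#include <unistd.h>

void handle_new_connection(int listen_fd, char* buf, size_t buflen)
{
  // SOCK_NONBLOCK at accept time means the read() below can't stall us.
  const int fd = accept4(listen_fd, nullptr, nullptr, SOCK_NONBLOCK);
  if (fd < 0) {
    return;
  }
  const ssize_t n = read(fd, buf, buflen);
  if (n > 0) {
    // Guessed right: saved one trip through epoll_wait(), start parsing.
  } else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
    // Guessed wrong: wasted one read(); register fd with epoll and wait.
  }
  // (n == 0 would mean the client closed the connection already.)
}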

Writing the reply

First I implemented memory mapping the file, and using a single writev() call to write both headers and file content in one syscall.

For minimizing number of syscalls this is great. It also has the benefit of being perfectly portable POSIX. For copying data between kernel and user space it’s not so great, though.

So then I implemented sendfile(). It does mean one more syscall (one writev() for the headers, one sendfile() for the contents), but removes the data copies for the content.

I will add preprocessing of the tar file so that I’ll be able to sendfile() headers and content in one go. That won’t work for Range requests, but most requests are not Range requests.
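
The two reply paths, side by side, look roughly like this (a sketch, not tarweb’s code; return values are mostly passed through and short writes are ignored for brevity):

#include <sys/sendfile.h>
#include <sys/uio.h>
#include <unistd.h>

// One syscall: headers and body (e.g. an mmap()ed file) both copied from userspace.
ssize_t reply_writev(int sock, const char* hdr, size_t hdr_len,
                     const char* body, size_t body_len)
{
  const iovec iov[2] = {
      {const_cast<char*>(hdr), hdr_len},
      {const_cast<char*>(body), body_len},
  };
  return writev(sock, iov, 2);
}

// Two syscalls, but the body never passes through userspace.
ssize_t reply_sendfile(int sock, const char* hdr, size_t hdr_len,
                       int file_fd, off_t offset, size_t body_len)
{
  if (write(sock, hdr, hdr_len) < 0) {
    return -1;
  }
  return sendfile(sock, file_fd, &offset, body_len);
}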

Compression

It’s simple enough to pre-compress files, so that if a client says it accepts gzip you can just serve index.html.gz instead of index.html.
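
The lookup then becomes trivial (a sketch; real Accept-Encoding parsing, with q-values, is more involved than a substring check, and you also need to add Content-Encoding: gzip to the reply headers):

#include <string>

// Pick the pre-compressed variant if the client accepts gzip.
std::string choose_file(const std::string& path, const std::string& accept_encoding)
{
  if (accept_encoding.find("gzip") != std::string::npos) {
    return path + ".gz";
  }
  return path;
}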

TLS

At first I only targeted HTTP, because surely TLS would always make things slow? Felt a bit anachronistic though. Maybe just place a proxy in front, to do TLS? But that will reduce performance, which is the whole point.

But then I read up more on KTLS. It lets you do the handshake in user space, but then hand off the session and symmetric encryption to the kernel. The socket file descriptor transparently becomes encrypted.

This allows me to use the same writev() and sendfile() optimizations after the TLS handshake. If network hardware allows, this can even offload TLS to the network card. I don’t have such hardware, though.

OpenSSL supports KTLS, but as of today not for TLS 1.3 on the read path. So I had to hardcode using TLS 1.2. Alternatively I could use SSL_read() for the read path, and plain sendfile() on the write path.
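
As I understand the OpenSSL 3 API, the handoff looks roughly like this; treat the names (SSL_OP_ENABLE_KTLS, BIO_get_ktls_send(), SSL_sendfile()) as “go read the docs” rather than gospel, since this is a sketch and not my working tarweb code:

#include <sys/types.h>

#include <openssl/ssl.h>

// Handshake in userspace, then let the kernel encrypt the send path.
bool serve_file_over_ktls(SSL* ssl, int file_fd, off_t file_size)
{
  SSL_set_options(ssl, SSL_OP_ENABLE_KTLS);  // Must be set before the handshake.
  if (SSL_accept(ssl) <= 0) {
    return false;
  }
  if (!BIO_get_ktls_send(SSL_get_wbio(ssl))) {
    // Kernel didn't take over (e.g. unsupported cipher/TLS version);
    // fall back to SSL_write() of the file contents.
    return false;
  }
  // Zero copy send of the file, encrypted by the kernel.
  return SSL_sendfile(ssl, file_fd, 0, file_size, 0) == file_size;
}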

So what’s the performance?

There are two ways to measure that. One is how fast you can go on high end hardware. But for something like this you’ll need really fancy stuff, and dedicated machines. And that’s just for the server itself. You’ll also need beefy load generators.

And network gear capable of hundreds of Gbps.

The other is to go with low end hardware, so you can easily saturate it.

I went with the latter. Specifically a Raspberry Pi 4.

The downside to the Raspberry Pi in this regard is that for HTTP it’s not hard to completely saturate the 1Gbps network interface, and for HTTPS it has no hardware AES acceleration (no AES-NI equivalent), so the results don’t scale up to what someone would actually run on a real server.

An alternative is to compare resource usage per request. E.g. how much CPU does writev() use compared to sendfile() for HTTP? And then we can compare that to HTTPS.

When we look at how much CPU time is used we don’t even have to put the system under 100% load.

writev() vs writev()+sendfile()

writev() can write headers and content in one syscall. The sendfile() approach requires two: one writev() for the headers and one sendfile() for the content. So writev() should be faster for small files, and sendfile() faster for large files.

Looks like the cutoff point is 1-2 MB. But the signal is noisy.

I ran 100 requests at every MB from 1 to 100. Then 1000 requests between 0 and 10MB at every 20kB.

writev() vs sendfile() (two graphs: the 1-100 MB run and the 0-10 MB run)

Note that CPU use isn’t everything. At higher speeds you should expect the memory bandwidth to be the bottleneck. sendfile() may be a small gain in CPU, but should be a bigger deal for shuffling bytes across buses.

Supposedly a Raspberry Pi has over 4 GBps of memory bandwidth, so about 32x what the built-in wired network interface can push out.

Future work

Benchmark KTLS vs normal TLS vs HTTP

I would’ve done this, but the default Raspberry Pi kernel doesn’t include KTLS. This is the second time I’ve found it underconfigured. It also doesn’t include nftables bridge support.

Bleh, why not just include everything reasonable?

Optimize and benchmark for NUMA

This requires beefier machines, but I guess the idea is to do it Netflix-style and route requests to the node where the data is.

That, or let network routing take care of it, and give each node a complete copy of the data. Netflix doesn’t do this because they don’t want to waste IPv4 addresses, but for this pet project I don’t have to care about IPv4.

Graph performance of concurrent requests

Probably I’d need some tweaking to fix some inefficiencies in my use of epoll. And increase the socket limits.

Tangent: Benchmarking over loopback

To my surprise just using writev() was about twice as fast as using writev()+sendfile() when testing locally, for large files. And slightly faster even with small files.

But then I saw that over loopback (127.0.0.1) writev() (and sendfile()) will have no problem sending a whole GB in one call. Basically there’s a kernel shortcut somewhere for localhost communication.

writev(7, [{iov_base="HTTP/1.1 200 OK\r\n", iov_len=17}, {iov_base="Content-Length: 1000000000\r\n\r\n", iov_len=30}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=1000000000}], 3) = 1000000047

So always benchmark over an actual network, to get useful results.

Tangent: building on a Pi

In order to get a compiler with more modern C++ I had to build from source. Not exactly a quick thing to do. Over 14 and a half hours on a Raspberry Pi 4.

real    878m54.181s
user    778m17.517s
sys     32m56.963s

I built it without concurrency, because I’ve made the mistake of running out of RAM building GCC before, and it’s not a happy story.
