
Meshtastic quick setup


I wanted a nice offline, mid-range chat app, for when I don’t have data, or when data roaming is too expensive. I also want it to work for people who don’t have an amateur radio license, since my girlfriend stubbornly refuses to be interested in that.

Looks like the answer I’m looking for is Meshtastic, preferably with LoRa. I bought a couple of Heltec V3 ESP32 LoRa OLED boards and the matching cases.

Maybe I’ll buy a battery, but I’m fine just powering it from a USB power bank.

The documentation makes a fair number of assumptions about the user already knowing the name of what they want, and which firmware provides what.

In short, what I think I want is to ignore the Heltec firmware, and instead just treat the Heltec V3 as the hardware that Meshtastic runs on.

The recommended way to flash the device, and in some cases even use it, is the Meshtastic Web UI. It uses browser integration for serial ports and Bluetooth. A nice idea, but it was extremely unreliable for me. The flasher worked for one device, but not the other. The chat client never worked at all.

Here’s what worked reliably for me:

  1. Download the “stable” firmware from https://meshtastic.org/downloads
  2. Unzip it.
  3. Plug in the Heltec V3.
  4. Follow the flash HOWTO, or tl;dr:
    ./device-install.sh -f firmware-heltec-v3-2.2.16.1c6acfd.bin

  5. Install the Meshtastic Android app.
  6. Bang on the app until it submits.

Yay, messages sent both ways. I’ve not tried any range test yet.

So this is just a “skip to the end” HOWTO, to not waste time for either future me or someone else.

PS: I hate the U.FL connector. It’s so fragile, and I’m told it’s not even specced for unplugging, so if it breaks when you unplug it, that’s working as intended.


Use AGW for packet radio applications


When creating packet radio applications, there are several options on how to get the packets “out there”, and get them back. That is, how to interface with the modem.

Sure, you can write your own modem, and have the interface to the outside world be plain audio and PTT (push to talk, i.e. trigger transmit). But now you’re writing a modem, not an application. You should probably split the two, and have an interface between them.

KISS

You can use KISS, but it’s very limited. You can only send individual packets, so it’s only really good for sending unconnected (think UDP) packets like APRS. It’s not good for querying metadata, such as port information and outstanding transmit queue.

Think of KISS as a lower layer that applications shouldn’t have to think about, like Ethernet. Sure, as a good engineer you should know about KISS, but it’s not what your application should be interfacing with.
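
To make the “lower layer” point concrete, here’s a minimal sketch of KISS framing in Rust. The constants are from the KISS spec; the function itself is just illustrative. A frame is the raw AX.25 bytes, escaped, between two FEND delimiters, and that’s the entire interface:

// Minimal sketch of KISS framing: wrap raw AX.25 frame bytes between FEND
// delimiters, escaping any FEND/FESC bytes in the payload. No metadata,
// no connection state, nothing else.
const FEND: u8 = 0xC0;
const FESC: u8 = 0xDB;
const TFEND: u8 = 0xDC;
const TFESC: u8 = 0xDD;

fn kiss_wrap(port: u8, payload: &[u8]) -> Vec<u8> {
    let mut out = vec![FEND, (port & 0x0F) << 4]; // low nibble 0 = "data frame"
    for &b in payload {
        match b {
            FEND => out.extend([FESC, TFEND]),
            FESC => out.extend([FESC, TFESC]),
            other => out.push(other),
        }
    }
    out.push(FEND);
    out
}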

Linux kernel implementation

On Linux you can use AF_AX25 sockets, and program exactly like you do for regular internet/IP programs. SOCK_DGRAM for UI frames (UDP-like), and SOCK_STREAM for connected mode (TCP-like).
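
For illustration, here’s a minimal sketch of creating such a socket via the libc crate, assuming the kernel has AX.25 support loaded. Binding or connecting needs a sockaddr_ax25 with the callsign in shifted AX.25 address format, which I’m leaving out:

// Minimal sketch: create an AX.25 datagram socket on Linux via the libc crate.
// Binding/connecting requires a sockaddr_ax25 with the callsign encoded in
// AX.25 address format, omitted here.
fn main() -> std::io::Result<()> {
    let fd = unsafe { libc::socket(libc::AF_AX25, libc::SOCK_DGRAM, 0) };
    if fd < 0 {
        return Err(std::io::Error::last_os_error());
    }
    println!("got AX.25 socket fd {fd}");
    unsafe { libc::close(fd); }
    Ok(())
}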

But the Linux kernel implementation is way too buggy. SOCK_STREAM works kinda OK, but does not handle all cases well. E.g. I don’t know if my patch making it possible to call write() while a read() is pending made it into the kernel. SOCK_DGRAM is just plain broken.

Sure, the kernel could be fixed. There’s a 180k EUR bounty on fixing it, but to my knowledge nothing is happening with that. And even if it does get fixed, I’m not a fan of the approach as a whole.

Any improvements to the kernel take years to actually get to users. And any added bugs will also take years to be noticed, and therefore we should expect multi-year breakages regularly.

This isn’t the late 1900s anymore, where we scripted downloading the latest kernel version as soon as Linus put it on an FTP server. It’s not realistic to expect people to even apply targeted patches, much less make the latest kernel work with their ZFS out of tree module, and some other vendor patches, like for Raspberry Pi.

I know what I’m doing, and I still don’t want to recompile the kernel on every machine on which I want to play with packet radio.

Also, at this point the kernel implementation is more likely to be removed, than fixed.

AGW

AGW, or AGWPE, is a slightly higher level protocol. The documentation and general usage are a bit unclear about whether AGWPE refers to the protocol or to the reference implementation, so naming is a bit inconsistent out there in the world. I’ll use AGW to refer to the protocol, for now.

AGW is way more capable than KISS. It supports connected mode, querying for ports, and other things. It’s an async protocol, and by the standards of amateur radio (and the time it was designed) it’s not too bad.

It has some reserved fields, and is not super extensible for the future, but it’s also pretty complete. In any case it is what a huge set of programs support, when they support something other than the broken kernel API.

As an example, Direwolf supports talking AGW over TCP, as does soundmodem. So this seems to be the best interface available today to code applications against.
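
To give a feel for the protocol, here’s how I’d sketch the fixed 36-byte AGW frame header in Rust. The offsets and data-kind letters are from my reading of the AGWPE docs, so treat this as a sketch and double-check against the spec before relying on it:

/// Rough sketch of the 36-byte AGW frame header: port, data kind, PID,
/// from/to callsigns (10 bytes each, NUL padded), little-endian payload length.
fn agw_header(port: u8, kind: u8, pid: u8, from: &str, to: &str, len: u32) -> [u8; 36] {
    fn put_call(dst: &mut [u8], call: &str) {
        let n = call.len().min(10);
        dst[..n].copy_from_slice(&call.as_bytes()[..n]);
    }
    let mut h = [0u8; 36];
    h[0] = port;      // radio port; the next three bytes are reserved
    h[4] = kind;      // e.g. b'C' = connect, b'D' = connected data
    h[6] = pid;       // usually 0xF0, "no layer 3"
    put_call(&mut h[8..18], from);
    put_call(&mut h[18..28], to);
    h[28..32].copy_from_slice(&len.to_le_bytes()); // payload follows the header
    h
}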

I’ve not tried it yet, but I think that because direwolf supports both KISS and AGW, I should be able to use the KISS interface to go at 9600bps using a Kenwood TH-D74/D75.

Not that connected mode AX.25 scales linearly from 1200bps to 9600bps, as WB2OSZ has described, but improving bulk transfers is another project of mine.

Rust API for the AGW protocol

I made a Rust library for using AGW; code here. The example code is a simple curses-based terminal.

Unrelated

Today’s surprise was watching a YouTube video teaching people about amateur radio stuff and encountering this, red circle and all, in the video Modern Introduction to Packet Radio - APRS BBS TCP/IP AX25 and NPR.

Unrelated photo

Cross compiling Rust to Ubiquiti access point

SOLVED: I should have put the linker in .cargo/config.toml, not Cargo.toml. See followup blog post.

This is not the right way to do it, as will become abundantly clear. But it works.

Set up build environment

rustup toolchain install nightly
rustup component add rust-src --toolchain nightly
apt install {binutils,gcc}-mips-linux-gnu

Create test project

cargo new foo
cd foo

Build most of it

This will build for a while, then fail.

cargo +nightly build --release -Zbuild-std --target mips-unknown-linux-gnu

For some reason it’s trying to use cc to link. I tried putting this in Cargo.toml, but it does nothing:

[target.mips-unknown-linux-gnu]
linker = "mips-linux-gnu-gcc"

But I found a workaround.

Temporarily change /usr/bin/cc to point to the mips gcc

It does not work if you do this before the previous step.

PREV="$(readlink -v /usr/bin/cc)"
sudo rm /usr/bin/cc
sudo ln -s /usr/bin/mips-linux-gnu-gcc /usr/bin/cc

Same command again

cargo +nightly build --release -Zbuild-std --target mips-unknown-linux-gnu

It should succeed. Yay.

Restore /usr/bin/cc

sudo rm /usr/bin/cc
sudo ln -s "${PREV?}" /usr/bin/cc

Change the “interpreter” to what the Ubiquiti system expects

cd target/mips-unknown-linux-gnu/release
patchelf --remove-needed ld.so.1 foo
patchelf --set-interpreter /lib/ld-musl-mips-sf.so.1 foo

Building it again

Probably easiest to rm -fr target, and go back to the step “Build most of it”.

Does it work?

$ ./foo
Hello, world!

Yay!

  • https://doc.rust-lang.org/rustc/targets/custom.html

Cross compiling Rust -- Fixed


Set up rust build environment

rustup toolchain install nightly
rustup component add rust-src --toolchain nightly
apt install {binutils,gcc}-mips-linux-gnu

Choose glibc or musl as your libc

You can use glibc, but it hasn’t worked as well for me as musl. I got some missing symbols when trying this on larger programs with glibc. So I recommend using musl.

mips-unknown-linux-musl is with musl libc, and mips-unknown-linux-gnu is glibc.

Optional: Set up for glibc

I needed to create a special target for glibc, to turn on soft-float. Soft-float seems to be on by default in mips-unknown-linux-musl.

rustc +nightly \
    -Z unstable-options \
    --print target-spec-json \
    --target mips-unknown-linux-gnu \
    > mips-nofloat-linux-gnu.json

Then edit mips-nofloat-linux-gnu.json

  • Change is-builtin to false
  • Add +soft-float to the features list.

Create test project

cargo new foo
cd foo

Configure linker

mkdir .cargo
cat > .cargo/config.toml
[target.'cfg(target_arch="mips")']
linker = "mips-linux-gnu-gcc"
^D

Build

Using musl (worked better for me):

cargo +nightly build --release -Zbuild-std --target mips-unknown-linux-musl

Using glibc (requires the optional step above):

cargo +nightly build --release -Zbuild-std --target mips-nofloat-linux-gnu.json

Change the “interpreter” to what the Ubiquiti system expects

cd target/mips-unknown-linux-musl/release  # or -gnu, if using glibc
patchelf \
    --remove-needed ld.so.1 \
    --set-interpreter /lib/ld-musl-mips-sf.so.1 \
    foo

Does it work?

$ ./foo
Hello, world!

Yay!

  • https://doc.rust-lang.org/rustc/targets/custom.html
  • https://doc.rust-lang.org/cargo/reference/config.html

Rust is faster than C, even before I added SIMD


I found some old C code of mine from around 2001 or so. I vaguely remember trying to make it as optimized as possible. Sure, I was still a teenager, so it’s not state of the art. But it’s not half bad. I suspect I could do better by optimizing for cache lines, but it’s pretty good.

On my current laptop it does about 12 million passwords per second, single threaded.

Because I’m learning Rust, I decided to port it, and see how fast Rust is.

Imagine my surprise when even the first version in Rust was faster. (Yes, I rebuilt the old C code with a modern compiler and its optimizations)

The first Rust version was about 13 million passwords per second.

Why is that? It’s basically the same as the C code. Maybe Rust can take advantage of knowing there’s no pointer aliasing (the reason usually quoted for why Fortran can be faster than C)? Or maybe the memory layout just happened to become more cache friendly?

In any case, I think we can already say that Rust is at least as fast as C.

The code is on github.

SIMD (with Rust)

I realized, of course, that the main performance tool neither my C nor Rust code took advantage of was SIMD. In theory compilers could generate SIMD instructions, but for best performance the algorithm itself should be SIMD aware.

In my case the compiler may be able to do some smaller math operations in one step, but the real gain can only be had by having the main loop check batches of passwords, instead of one at a time.

I’ve never really done this. I am aware of SIMD libraries for C, where you basically call the exact CPU instructions you want. That means your code is neither ready for another architecture (ARM, RISC-V), nor ready for the future (when you upgrade your hardware, and get AVX2 instructions).

Incidentally, this “being ready for the future” is why I’m looking forward to getting RISC-V hardware with vector instructions, since unlike Intel SIMD instructions, they work on variable length vectors. In theory your code can automatically get 2x faster when the CPU just doubles the SIMD register size. Not even a recompile needed.

The second best thing in terms of speed is a programming API that doesn’t assume what instructions will be used. Rust has an experimental std::simd API. You just code with its batch sizes (variable length may come some day), and it magically becomes SIMD.
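
Here’s a toy example of the std::simd style, on nightly Rust. This isn’t the cracker code, just the flavor of the API: you write lane-wise math on a fixed number of lanes, and the compiler maps it to whatever SIMD the target has, or to scalar code if it has none:

#![feature(portable_simd)]
// Toy std::simd example (nightly): the same arithmetic applied to 16 lanes at once.
use std::simd::Simd;

fn main() {
    let x = Simd::<u32, 16>::from_array(std::array::from_fn(|i| i as u32));
    let y = x * Simd::splat(3) + Simd::splat(1);
    println!("{:?}", y.to_array());
}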

This brought my zip cracker to 36 million passwords per second. Almost a 3x speedup. And that’s on my first attempt. I don’t even know what batch size is best. I just went with 16 because it sounded reasonable. Maybe another batch size is faster.

What if your CPU doesn’t have SIMD?

Then the experimental Rust SIMD library will generate normal non-SIMD instructions. In theory it should then run the same speed as if you’d not used the SIMD API at all.

But it turns out it can be faster anyway. My StarFive VisionFive 2 doesn’t have SIMD or vector instructions, but it became 10-30% faster when using the Rust SIMD API.

Very likely this means that I could have manually tried passwords in batches, without std::simd. But why should I? If the CPU supports it, then extra performance is just a recompile away.

And that’s before multithreading it

My laptop has an Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz. That’s 6 cores, 12 hyperthreads.

With multithreading, and barely any communication between threads, that got me to about 280-290 million passwords per second. An 8x speedup. Not bad, considering that hyperthreads share some execution units, and they all share at least a bit of cache.
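
The threading pattern is nothing fancy. Roughly this shape, with check_range standing in for the real cracking loop (purely illustrative):

// Purely illustrative: split the keyspace into one contiguous chunk per
// thread, with no communication between threads until they all finish.
fn crack_all(total: u64, nthreads: u64, check_range: impl Fn(u64, u64) + Sync) {
    let chunk = (total + nthreads - 1) / nthreads;
    std::thread::scope(|s| {
        for t in 0..nthreads {
            let start = (t * chunk).min(total);
            let end = ((t + 1) * chunk).min(total);
            let check = &check_range;
            s.spawn(move || check(start, end));
        }
    });
}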

John the Ripper, a famous password cracker, does about 130 million passwords a second (john --test --format=PKZIP), making mine over 2x faster.

GPUs

I’m sure password cracking, like all parallelizable number crunching, is fastest when done on a GPU. But that was not the exercise here today.

General CPU code has a nice quality that it’ll work anywhere, and forever. GPU code, last I checked, is less portable than that.

That’s not to say that I wouldn’t want to port this to GPUs. But I would like that to be in Rust, too.

Update: Hashcat 6.2.6 does 1-2 billion passwords a second on my old GeForce 960, according to hashcat -b -m 17220/17225/17230. So yeah, not even close.

Is your TLS resuming?


There are two main ways that a TLS handshake can go: Full handshake, or resume.

There are two benefits to resumption:

  1. it can save a round trip between the client and server.
  2. it saves CPU cost of a public key operation.

Round trip

Saving a round trip is important for latency. Some websites don’t use a CDN, so a roundtrip could take a while. And even those on a CDN can be tens of milliseconds away. That may not matter much to a human, but roundtrips can kill the performance of anything that needs to make sequential connections.

E.g. Australia is far away:

$ ping -c 1 -n www.treasury.gov.au
PING treasury.gov.au (3.104.80.4) 56(84) bytes of data.
64 bytes from 3.104.80.4: icmp_seq=1 ttl=39 time=369 ms

That’s about a third of a second. Certainly noticeable to a human. Especially since rendering a web page usually requires many connections to different hosts.

For TCP-based web requests (in other words: not QUIC), there are usually four roundtrips involved (slightly simplified):

  1. TCP connection establishment.
  2. ClientHello & ServerHello.
  3. Client & Server ChangeCipherSpec.
  4. HTTP request & response.

So from the UK to Australia, that’s about one full second (three roundtrips at roughly 370ms each) just setting up the connection. And then another third of a second for the request.

So that could be better.

CPU cost of public key operation

This can be a problem even if latency is low. Public key operations can take a lot of CPU. Especially because as key sizes grow, the cost grows superlinearly.

$ openssl speed rsa2048 rsa4096
[…]
                  sign    verify    sign/s verify/s
rsa 2048 bits 0.000520s 0.000015s   1921.7  65540.4
rsa 4096 bits 0.003473s 0.000054s    287.9  18544.8

Doubling key size makes signatures (server) almost 7x as expensive, and verifications (client) 3.5x as expensive. (on my laptop. YMMV)

At scale, if you’re doing millions of QPS, this can add up. This is only 287 handshakes per second, per CPU. Not Web Scale™.

TLS before 1.3, and resumptions

The ClientHello sent by the client will contain a session ticket or session ID (the difference doesn’t matter here), basically saying “Hey, we’ve talked before. How about we skip some negotiation, and get right back to where we were?”. If the server agrees, then that skips roundtrip 3.

This takes www.treasury.gov.au down to ~1s for any followup (resumed) connections, but it still takes 1.2s for the first request.

TLS 1.3 and round trips

TLS 1.3 shaves a roundtrip off the initial request by optimistically making assumptions about what the server supports. That way the client can say “Hello”, introduce which ciphers it supports, and select one, assuming that the server supports it too.

In the normal case, this merges roundtrips 2&3, and now it’s only three roundtrips.

In our Australia example, this gets the initial request down to ~1s as well (except that www.treasury.gov.au doesn’t support TLS 1.3).

TLS 1.3 and resumptions

Alright, so with resumptions, TLS brings us down to just two roundtrips, right? ~600ms total request time to Australia!

Not so fast. (heh)

The handshake is now:

  1. TCP connecting.
  2. Hello & optimistically choose cipher parameters.
  3. HTTP GET.

Which one would resuming remove? They’re all needed, and you can’t send anything before you get a reply to the previous step… or can you?

In the general case, no, you can’t. But if you’re willing to sacrifice a specific bit of security, there’s something we can do. Specifically, if the server is fine with being vulnerable to requests being replayed, then we can send the GET without waiting for the handshake to complete.

From the server’s point of view, it now sees “Hello, you and I are resuming, so please use the previous key, and here’s the encrypted payload”. If all looks good then the server can process the request, and return the reply. TLS calls this 0RTT, in that after the TCP connection has been established, there’s only… uh… one RTT left. So 0RTT as in no additional roundtrip?

Sending the data early, like this, is called “early data”, and is new with TLS 1.3. It can’t be used on initial requests, since no session key has been negotiated yet.

TLS 1.3 early data and resumptions

Unlike previous versions, TLS 1.3 only saves a roundtrip on resumption if it has early data. An HTTP POST cannot be sent as early data (because of the replay problem), so it takes the same number of roundtrips whether resumed or not.

So who supports the perfect setup?

For my Australia example I’m actually struggling a bit to find an Australian service that can fully optimize for latency.

  • www.treasury.gov.au doesn’t support TLS 1.3
  • queenslandtech.com.au doesn’t support early data
  • All others I’m randomly guessing use a CDN.
  • I’m too lazy to set up a VM there just to make the numbers bigger.

Oh well, we’re going to have to pretend, using a server closer by. That means we’ll be dealing with tens of milliseconds instead of hundreds, though.

Measuring TLS handshake times

I made a tool: tlshake.

TLS 1.3 with a request in early data

With a GET request in early data, TLS 1.3 resumes, and saves us a roundtrip. Note that the handshake time is approximately the same on both requests, but total time is very much shorter on the resumed request, since TLS and HTTP used the same roundtrip.

$ tlshake --http-get / www.example.com
Connection: initial
  Target:           www.example.com
  Endpoint:         www.example.com:443
  Connect time:     67.414ms
  Handshake time:   42.208ms
  Handshake kind:   Full
  Protocol version: TLSv1_3
  Cipher suite:     TLS13_AES_256_GCM_SHA384
  ALPN protocol:    None
  Early data:       Not attempted
  Request time:     95.782ms
  Reply first line: HTTP/1.1 200 OK
  Total time:       205.450ms

Connection: resume
  Target:           www.example.com
  Endpoint:         www.example.com:443
  Connect time:     57.422ms
  Handshake time:   39.769ms
  Handshake kind:   Resumed
  Protocol version: TLSv1_3
  Cipher suite:     TLS13_AES_256_GCM_SHA384
  ALPN protocol:    None
  Early data:       accepted
  Request time:     62.583ms
  Reply first line: HTTP/1.1 200 OK
  Total time:       159.804ms

Resumption without early data

We can resume the session without using early data. That saves the CPU usage for the handshake, but won’t save any roundtrip. Notice the very similar total time in this case:

$ tlshake --http-get / --disable-early-data www.example.com
Connection: initial
  Target:           www.example.com
  Endpoint:         www.example.com:443
  Connect time:     68.945ms
  Handshake time:   41.140ms
  Handshake kind:   Full
  Protocol version: TLSv1_3
  Cipher suite:     TLS13_AES_256_GCM_SHA384
  ALPN protocol:    None
  Early data:       Not attempted
  Request time:     89.317ms
  Reply first line: HTTP/1.1 200 OK
  Total time:       199.437ms

Connection: resume
  Target:           www.example.com
  Endpoint:         www.example.com:443
  Connect time:     66.183ms
  Handshake time:   39.028ms
  Handshake kind:   Resumed
  Protocol version: TLSv1_3
  Cipher suite:     TLS13_AES_256_GCM_SHA384
  ALPN protocol:    None
  Early data:       Not attempted
  Request time:     90.413ms
  Reply first line: HTTP/1.1 200 OK
  Total time:       195.653ms

Only the TLS handshake

If we only handshake, without sending a request, then it won’t resume at all (note that both connections are of the Full kind).

$ tlshake www.example.com
Connection: initial
  Target:           www.example.com
  Endpoint:         www.example.com:443
  Connect time:     66.218ms
  Handshake time:   43.683ms
  Handshake kind:   Full
  Protocol version: TLSv1_3
  Cipher suite:     TLS13_AES_256_GCM_SHA384
  ALPN protocol:    None
  Early data:       Not attempted
  Total time:       109.930ms

Connection: resume
  Target:           www.example.com
  Endpoint:         www.example.com:443
  Connect time:     66.416ms
  Handshake time:   41.626ms
  Handshake kind:   Full
  Protocol version: TLSv1_3
  Cipher suite:     TLS13_AES_256_GCM_SHA384
  ALPN protocol:    None
  Early data:       Not attempted
  Total time:       108.068ms

This seems to be because the server doesn’t provide the session ticket until it receives a request. It’s pretty clever. If a TLS session has been established, but no request was received, then the client should probably not try to reuse it. There’s a good chance that something went wrong, so we should redo the full handshake next time.

For comparison: TLS 1.2-only server

With TLS 1.2 a resumed handshake has one less roundtrip.

$ tlshake --http-get / www.treasury.gov.au
Connection: initial
  Target:           www.treasury.gov.au
  Endpoint:         www.treasury.gov.au:443
  Connect time:     330.905ms
  Handshake time:   617.084ms
  Handshake kind:   Full
  Protocol version: TLSv1_2
  Cipher suite:     TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
  ALPN protocol:    None
  Early data:       Not attempted
  Request time:     279.625ms
  Reply first line: HTTP/1.1 301 Moved Permanently
  Total time:       1227.643ms

Connection: resume
  Target:           www.treasury.gov.au
  Endpoint:         www.treasury.gov.au:443
  Connect time:     332.880ms
  Handshake time:   306.455ms
  Handshake kind:   Resumed
  Protocol version: TLSv1_2
  Cipher suite:     TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
  ALPN protocol:    None
  Early data:       Not attempted
  Request time:     308.770ms
  Reply first line: HTTP/1.1 301 Moved Permanently
  Total time:       948.132ms

Play around with TLS handshakes for your own service

Check out tlshake, and see if your TLS resumptions work as you expect, and with the expected number of roundtrips.

More info

An AX.25 implementation in Rust


After having written a user space AX.25 stack in C++, I got bitten by the Rust bug. So this is the third time I’ve written an AX.25 stack, and I’ve become exceedingly efficient at it.

Here it is:

The reason for a user space stack remains from last time, but this time:

  1. It’s written in Rust. Yay! I know people say Rust has a honeymoon period, but I guess that’s where I am, still.
  2. It’s a normal library first. The previous C++ implementation started off as microservices, which in retrospect was needlessly complex and put the cart before the horse.

I’ve added an almost excessive amount of comments to the code, to cross-reference with the specs. Specs that, by the way, have a few bugs.

Rust

I’m not an expert in Rust, but it allows for so much more confidence in your code than any other language I’ve tried.

I think I know enough Rust to know what I don’t fully know. Sure, I’ve successfully added lifetime annotations, created macros, and built async code, but I’m not fluent in those yet.

Interestingly, I’ve so far managed to not need any lifetime annotations or macros in rax25. It’s starting to feel a bit like template metaprogramming in C++: you can code for weeks without needing them, but they are a dynamic puzzle piece that can make your whole design fall into place.

What with waiting for multiple timers, incoming packets, and user instructions such as “write data” or “disconnect”, I think async should make for the best API for this in the end. But so far I just made a (pretty bad) sync API.

Microservices

For amateur packet radio there are a few components you’ll need to have. And they need to talk to each other.

You’ll need:

  • A modem (e.g. Bell 202, G3RUH, VARA, VARA FM)
  • A connection stack, let’s call it (e.g. AX.25, VARA, or TCP/IP on top of AX.25 UI frames)
    • For AX.25 there’s Direwolf and the Linux kernel implementation.
  • An application (e.g. axcall, Winlink, axsh)

You’ll also need these pieces to talk to each other:

  • KISS is the de facto standard for talking to a modem. Direwolf, Linux, and even VARA support it.
  • AGW is a protocol for talking to a connection stack. I made an AGW library to be able to use the one in Direwolf.
  • The Linux kernel has a socket interface for the connection stack.

The problem is that basically all of these things suck.

The functionality

  • Bell 202 and even G3RUH are far from state of the art modems.
  • AX.25 doesn’t have FEC.
    • IL2P improves things a bit over standard AX.25, though.
  • VARA is closed source and closed spec.
  • The Linux kernel AX.25 stack is very incomplete, and buggy. And even when it does get fixed it takes years before you’ll get the fixed kernel. Then they break it again, and you’ll have to wait years for the fix to roll out.
    • Some distributions don’t even compile it in, so your kernel may not have AX.25 support, and with things like Raspberry Pi and ZFS patches, you may not want to keep patching and building your kernel. It’s also a hard sell on other people that in order to run your application, they must first recompile their kernel to a supported version.
  • Direwolf is fairly fine. I just have a couple of problems with it, and they’re probably all fixable:
    • You always have to run it with -t 0, or your eyes will get cancer.
    • If you only want to use it for its connection stack, and not its modem, then it seems to take 100% CPU. I think there’s some busyloop reading from /dev/zero or something if you specify ADEVICE null null.
    • It’s too much of an all in one stack. It makes it harder to experiment with different modems but the same connection stack.

The interfaces

  • The AX.25 connection stack doesn’t have what we on the internet would call “ports”; a connection is just between one call+SSID and another. Basically every callsign only has 16 “ports” (SSIDs), which all devices using that callsign have to share. Because of IGates it’s probably not even a good idea to reuse a call+SSID even on different frequencies.
  • KISS doesn’t allow for:
    • Reporting SNR/RSSI.
    • Changing modem parameters according to radio conditions.
    • Interfacing with carrier sense.
    • Queue management.
  • AGW is a pretty terrible protocol. Ok, it gets the job done, but it’s just… eww.

So it’d be great if we could redesign the interfaces between these components, so that we could then rewrite them all. Well, as implementations go, Direwolf would just need to add these interfaces.

I made the mistake with the C++ stack of jumping the gun on these interfaces, instead of starting from an implementation.

How usable is my new crate rax25?

Well, it works. Its API is not great, but works well enough to connect to my local BBS GB7CIP, and seems to interoperate well with the Linux kernel implementation when I test it locally.

It just needs a modem behind a KISS interface, like Direwolf or the Kenwood TH-D75, and you can run it with:

$ cargo build --example client
$ ./target/debug/examples/client -r -p /dev/rfcomm0 -s M0XXX-1 GB7CIP-7

But it still has many TODOs. E.g. if its send queue becomes full, it can’t buffer any more data and just exits instead. I expect to work on these over the next few weeks. But pull requests are welcome.

Connection coalescing breaks the Internet


Connection coalescing is the dumbest idea to ever reach RFC status. I can’t believe nobody stopped it before it got this far.

It breaks everything.

Thus starts my latest opinion post.

What is connection coalescing?

It’s specified in the RFC for HTTP/2 as connection reuse, but tl;dr: If the IP addresses of hosts A and B overlap, and host A presents a TLS cert that also covers B (via an explicit CN/SAN or a wildcard cert), then the client is allowed to send HTTP requests directed at B on the connection that was established to A.

Why did they do that?

To save roundtrips and TLS handshakes. It seems like a good idea if you don’t think about it too much.

Why does it break everything?

I’ll resist just yelling “layering violation”, because that’s not helpful. Instead I’ll be more concrete.

Performing connection coalescing is a client side (e.g. browser) decision. But it implicitly mandates a very strict server architecture. It assumes that ALL affected hostnames are configured exactly the same in many regards, and indeed that the HTTP server even has the config for all hostnames.

Concrete things that this breaks:

  1. The server can’t have a freestanding TLS termination layer, that routes to HTTP servers based on SNI.
  2. The HTTP server can’t reference count HTTP config fragments, since requests can come in for anything.
  3. Hosts with stricter TLS config and/or mTLS cannot prevent the client from leaking headers into a less secure connection by inadvertent request smuggling. Good luck not logging secrets, while still detecting it properly.

I’m sure there are more ways that it breaks everything. It commits all servers everywhere forever to be locked in to how it works. Countless possible architectures can never be, because connection coalescing has already committed all servers into a very specific implementation.

Did the RFC not consider this?

Not really. It has a handwavy “oh the server can(!) send HTTP 421, and the client is then allowed to retry the request on a fresh connection”.

But how is the server even supposed to know? Just being able to detect this happening forces a HUGE restriction on the server.

And it’s too late! The secret requests have already been leaked to the wrong server!

Not to mention that some clients don’t implement handling 421, even if it were possible for the server to detect the situation. Which it can’t, in the general case.

So what do I do?

For any nontrivial server setup, you should probably:

  1. Reject all requests on a connection whose Host doesn’t match the first request’s, and “hope” that SNI matches the first request. Or better yet, verify SNI against the Host header (there’s a sketch of this check below).
  2. Don’t put more than one FQDN in your TLS certs, and definitely don’t use wildcard certs. “Hope” that you catch all cases.
  3. Always use separate IP addresses per hostname. Like in the pre-SNI 1900’s. Again “hope” that you catch all cases.

And obviously hope is not a strategy.
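
For workaround 1, the core check is tiny. Here’s a hedged sketch; the surrounding server plumbing is whatever your stack is, and only the comparison matters:

// Sketch of workaround 1: treat a request as misdirected if its Host header
// doesn't match the SNI name this TLS connection was established for.
// IPv6 literal Hosts and other edge cases are ignored here.
fn misdirected(sni: &str, host_header: &str) -> bool {
    // Strip an optional :port from the Host header before comparing.
    let host = host_header
        .rsplit_once(':')
        .map_or(host_header, |(h, _)| h);
    !sni.eq_ignore_ascii_case(host)
}

A server would then answer 421 (or, as argued below, maybe a 5xx) whenever this returns true.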

Nobody does any of these workarounds. The Internet (well, the web) will be broken forever.

How do I even “reject”? To 421 or not to 421…

Let’s pretend that you can detect all cases of misrouted requests. What do you do? The spec allows you to return 421. But it’s a free Internet, you can do whatever you want.

If you return 421 then some clients will handle this correctly. Others will have not implemented 421 handling (it’s not mandatory), and will break in some other way.

(but remember. It’s already too late. The client has already sent you the secret request that may contain PII)

Arguably you should return some 5xx code, so that you can more easily detect when you’ve screwed up with your certs or other SNI routing. This assumes that you monitor for 500s, in some way. Basically the logic is that it’s better to work 0% of the time than 98% of the time, since you’ll be sure to fix the former, but won’t even know why some people keep complaining when it happens to work just fine for you.

The RFC says “[421] MUST NOT be generated by proxies”. Presumably this only means forward proxies?

“A 421 response is cacheable by default”. What does a cached 421 even mean? A 421 is a layering violation. You might as well say that a TCP SYN is cachable.

Summary

Connection coalescing considered dumb and harmful.

Further reading

  • https://daniel.haxx.se/blog/2016/08/18/http2-connection-coalescing/
  • https://blog.cloudflare.com/connection-coalescing-experiments/

Pike is wrong on bloat


This is my response to Rob Pike’s words On Bloat.

I’m not surprised to see this from Pike. He’s a NIH extremist. And yes, in this aspect he’s my spirit animal when coding for fun. I’ll avoid using a framework or a dependency because it’s not the way that I would have done it, and it doesn’t do it quite right… for me.

And he correctly recognizes the technical debt that an added dependency involves.

But I would say that he has two big blind spots.

  1. He doesn’t recognize that not using the available dependency is also adding huge technical debt. Every line of code you write is code that you have to maintain, forever.

  2. The choice for most software isn’t “use the dependency” vs “implement it yourself”. It’s “use the dependency” vs “don’t do it at all”. If implementing it yourself means adding 10 human years to the product, then most of the time the trade-off makes it not worth doing at all.

He shows a dependency graph of Kubernetes. Great. So are you going to write your own Kubernetes now?

Pike is a good enough coder that he can write his own editor (wikipedia: “Pike has written many text editors”). So am I. I don’t need dependencies to satisfy my own requirements.

But it’s quite different if you need to make a website that suddenly needs ADA support, and now the EU forces a certain cookie behavior, and designers (in collaboration with lawyers) mandate a certain layout of the cookie consent screen, and the third party ad network requires some integration.

What are you going to do? Demand funding for 100 SWE years to implement it yourself? And in the meantime, just not be able to advertise during BFCM? Not launch the product for 10 years? Just live with the fact that no customer can reach your site if they use Opera on mobile?

I feel like Pike is saying “yours is the slowest website that I ever regularly use”, to which the answer is “yeah, but you do use it regularly”. If the site hadn’t launched, then you wouldn’t be able to even choose to use it.

And comparing to the 70s. Please. Come on. If you ask a “modern coder” to solve a “1970s problem”, it’s not going to be slow, is it? They could write it in Python and it wouldn’t even be a remotely fair fight.

Software is slower today not because the problems are more complex in terms of compute (though they very, very much are), but because today’s compute capacity simply affords wasting it, which is what lets us solve those complex problems at all.

People do things because there’s a perceived demand for it. If the demand is “I just like coding”, then as long as you keep coding there’s no failure.

Pike’s technical legacy has very visible scars from these blind spots of his.

Rebuilding FRR with pim6d

$
0
0

Short post today.

Turns out that Debian, in its infinite wisdom, disables pim6d in frr. Here’s a short howto on rebuilding it with pim6d enabled.

$ sudo apt build-dep frr
[…]
$ apt source frr
[…]
$ cd frr-8*
$ DEB_BUILD_PROFILES=pkg.frr.pim6d dpkg-buildpackage -us -uc -b
$ sudo dpkg -i ../frr_*.deb

Then you can enable pim6d in /etc/frr/daemons and restart frr.

Not that I managed to get IPv6 multicast routing to work over WireGuard interfaces anyway. Not sure what’s wrong. Though it didn’t fix it, here’s an interesting command that made stuff like ip -6 mroute look like it should work:

$ sudo smcroutectl  add LAN ff38:40:fd11:222:3333:44:0:1122 wg-foo

Exploring RISC-V vector instructions


It finally happened! A Raspberry Pi-like device, with a RISC-V CPU supporting the v extension. Aka RVV. Aka vector instructions.

I bought one, and explored it a bit.

SIMD background

First some background on SIMD.

SIMD is a set of instructions allowing you to do the same operation to multiple independent pieces of data. As an example, say you have four 8-bit integers, and you want to multiply them all by 2, then add 1. You can do this to all four at once without any special instructions.

    # x86 example assembly.

    mov eax, [myvalues]  # load our four bytes (little endian: eax = 0x04030201)
    mov ebx, 2           # we want to multiply by two
    imul eax, ebx        # single operation, multiple data!
                         # After this, eax contains 0x08060402
    add eax, 0x01010101  # single operation, multiple data!
                         # After this, eax contains 0x09070503
    mov [myvalues], eax  # store back the new values: 3,5,7,9

section .data
  myvalues db 1,2,3,4

Success, right? No, of course not. This naive code doesn’t handle over/underflow, and doesn’t even remotely work for floating point data. For that, we need special SIMD instructions.

x86 and ARM have gone the way of fixed-size registers. In 1997 Intel introduced MMX, to great fanfare. The PR went all “it’s multimedia!”. “Multimedia” was a buzzword at the time. This first generation gave you a whopping 64-bit register size, which you could use for one 64-bit value, two 32-bit values, four 16-bit values, or eight 8-bit values.

A “batch size” of 64 bit, if you prefer.

These new registers got a new set of instructions, to do these SIMD operations. I’m not going to learn the original MMX instructions, but it should look something like this:

  movq mm0, [myvalues]  # load values
  movq mm1, [addconst]  # load our const addition values.
  paddb mm0, mm0        # add to itself means multiply by 2
  paddb mm0, mm1        # Add vector of ones.
  movq [myvalues], mm0  # store the updated value.
  emms                  # state reset.

section .data
  myvalues db 1,2,3,4
  addconst db 1,1,1,1

So far so good.

The problem with SIMD

The problem with SIMD is that it’s so rigid. With MMX, the registers are 64 bits. No more, no less. Intel followed up with SSE, adding floating point support and doubling the register size to 128 bits. That’s four 32-bit floats in one xmm register.

So now we have 64-bit mm registers, 128-bit xmm registers, and uncountably many instructions to work with these two sets of new registers.

Then we got SSE2, SSE3, SSE4. Then AVX, AVX2 (256 bit registers), and even AVX-512 (512 bit registers).

512 bit registers. Not bad. You can do 16 32-bit floating point operations per instruction with that.

But here’s the problem: Only if your code was written to be aware of these new registers and instructions! If your production environment uses the best of the best, with AVX-512, you still can’t use that, if your development/QA environment only has AVX2. Not without having separate binaries.

Or you could maintain 20 different compiled versions, and dynamically choose between them. That’s what volk does. Or you could compile with -march=native (gcc) or -Ctarget-cpu=native (Rust), and create binaries that only work on machines at least as new as the one you built on.

But none of these options allow you to build a binary that will automatically take advantage of future processor improvements.

Vector instructions do.

Vector instructions

Instead of working with fixed sized batches, vector instructions let you specify the size of the data, and the CPU will do as many at once as it can.

Here’s an example:

  # Before we enter the loop, input registers are set thusly:
  # a0: number of elements to process.
  # a1: pointer to input elements.
  # a2: pointer to output elements.

  # Load 2.0 once. We'll need it later.
  flw ft0, two
loop:
  # prepare the CPU to process a batch:
  # * a0: of *up to* a0 elements,
  # * e32: each element is 32 bits. The spec calls this SEW.
  # * m1: group registers in size of 1 (I'll get to this). LMUL in the spec.
  # * ta & ma: ignore these flags, they're deep details.
  #
  # t0 will be set to the number of elements in a "batch"
  vsetvli t0, a0, e32, m1, ta, ma

  # Set t1 to be the number of bytes per "batch".
  # t1 = t0 << 2
  slli t1, t0, 2

  # Load a batch.
  vle32.v v0, (a1)

  # Multiply them all by 2.0.
  vfmul.vf v0, v0, ft0

  # Store them in the output buffer.
  vse32.v v0, (a2)

  # Update pointers and element counters.
  add a1, a1, t1
  add a2, a2, t1
  sub a0, a0, t0

  # Loop until a0 is zero.
  bnez a0, loop

two:    .float 2.0

Write once, and when vector registers get bigger, your code will automatically perform more multiplies per batch. That’s great! You can use an old and slow RISC-V CPU for development, but then when you get your big beefy machine the code is ready to go full speed.

The RISC-V vector spec allows for vector registers up to 64 Kib = 8 KiB, or 2048 32-bit floats. And with m8 (see below), that allows e.g. 16384 32-bit floats being multiplied by 16384 other floats, and then added to yet another 16384 floats, in a single fused multiply-add instruction.

Even more batching

RISC-V has 32 vector registers. On a particular CPU, each register will be the same fixed size, called VLEN. But the instruction set allows us to group the registers, creating mega-registers. That’s what the m1 in vsetvli is about.

If we use m8 instead of m1, that gives you just four vector registers: v0, v8, v16, and v24. But in return they are 8 times as wide.

The spec calls this batching number LMUL.

Basically a pairwise floating point vector multiplication vfmul.vf v0, v0, v8 in m8 mode effectively represents:

   vfmul.vf v0, v0, v8
   vfmul.vf v1, v1, v9
   vfmul.vf v2, v2, v10
   vfmul.vf v3, v3, v11
   vfmul.vf v4, v4, v12
   vfmul.vf v5, v5, v13
   vfmul.vf v6, v6, v14
   vfmul.vf v7, v7, v15

Bigger batching, at the cost of fewer registers. I couldn’t come up with a nice way to multiply two complex numbers with only four registers. Maybe you can? If so, please send a PR adding a mul_cvec_asm_m8_segment function to my repo. Until then, the m4 version is the biggest batch. m8 may still not be faster, since the m2 version of mul_cvec_asm_mX_segment is a little bit faster than the m4 version in my test.

Like with SIMD, there are convenient vector instructions for loading and handling data. For example, if you have a vector of complex floats, then you probably have real and imaginary values alternating. vlseg2e32.v v0, (a0) will then load the real values into v0, and the imaginary values into v1.

Or, if vsetvli was called with m4, the real values will be in v0 through v3, and imaginary values in v4 through v7.

Curiously a “stride” load (vlse32.v v0, (a0), t1), where you can do things like “load every second float”, seems to not have very good performance. Maybe this is specific to the CPU I’m using. I would have expected the L1 cache to make them fairly equal, but apparently not.

So yes, it’s not perfect. On a future CPU it may be cheap to load from L1 cache, so your code should be more wasteful about vector registers, to be optimal. Maybe on a future CPU the stride load is faster than the segmented load. There’s no way to know.

The Orange Pi RV2

It seems that the CPU, a Ky X1, isn’t known to LLVM yet. So you have to manually enable the v extension when compiling. But that’s fine.

$ cat ~/.cargo/config.toml
[target.riscv64gc-unknown-linux-gnu]
rustflags = ["-Ctarget-cpu=native", "-Ctarget-feature=+v"]

I filed a bug with Rust about it, but it seems it may be a missing LLVM feature. It’s apparently not merely checking /proc/cpuinfo for the features in question, but needs the name of the CPU in code or something.

It seems that the vector registers (VLEN) on this hardware are 256 bits wide. This means that with m8 a single multiplication instruction can do 8*256/32=64 32-bit floating point operations. Multiplying two vector registers in one instruction multiplies half a kibibyte (256 bytes per aggregate register).

64 32-bit operations is a lot. We started off in SIMD with just 2. And as I say above, when future hardware gets bigger vector registers, you won’t even have to recompile.

That’s not to say that the Orange Pi RV2 is some sort of supercomputer. It’s much faster than the VisionFive 2, but your laptop is much faster still.

So how much faster is it?

I started a Rust crate to test this out.

$ cargo +nightly bench --target  target-riscv64-no-vector.json -Zbuild-std
[…]
running 10 tests
test bench_mul_cvec_asm_m2_segment ... bench:       2,930.20 ns/iter (+/- 29.96)
test bench_mul_cvec_asm_m4_segment ... bench:       3,036.18 ns/iter (+/- 100.35)
test bench_mul_cvec_asm_m4_stride  ... bench:       4,713.20 ns/iter (+/- 55.09)
test bench_mul_cvec_asm_m8_stride  ... bench:       5,368.08 ns/iter (+/- 15.18)
test bench_mul_cvec_rust           ... bench:       9,957.66 ns/iter (+/- 76.39)
test bench_mul_cvec_rust_v         ... bench:       3,020.23 ns/iter (+/- 21.72)
test bench_mul_fvec_asm_m4         ... bench:         843.94 ns/iter (+/- 22.86)
test bench_mul_fvec_asm_m8         ... bench:         801.36 ns/iter (+/- 27.71)
test bench_mul_fvec_rust           ... bench:       4,097.09 ns/iter (+/- 29.77)
test bench_mul_fvec_rust_v         ... bench:       1,084.47 ns/iter (+/- 13.46)

The “rust” version is normal Rust code. The _v code is Rust code where the compiler is allowed to use vector instructions. As you can see, Rust (well, LLVM) is already pretty good. But my hand coded vector assembly is faster still.

As you can see, the answer for this small benchmark is “about 3-5 times faster”. That multiplier probably goes up the more operations that you do. These benchmarks just do a couple of multiplies.

Note that I’m cheating a bit with the manual assembly. It assumes that the input is an even multiple of the batch size.

Custom target with Rust

To experiment with target settings, I modified the target spec. This is not necessary for normal code (you can just force-enable the v extension, per above), but could be interesting to know. In my case I’m actually using it to turn vectorization off by default, since while Rust lets you enable a target feature per function, it doesn’t let you disable it per function (there’s a sketch of the per-function enable after the commands below).

rustup install nightly
rustup default nightly
rustc  -Zunstable-options \
  --print target-spec-json \
  --target=riscv64gc-unknown-linux-gnu \
  > mytarget.json
rustup default stable
# edit mytarget.json
rustup component add --toolchain nightly rust-src
cargo +nightly bench --target mytarget.json -Zbuild-std
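
As an aside, the per-function enable mentioned above looks roughly like this. This needs nightly, and the function is purely illustrative; there is no corresponding per-function disable:

// Sketch: opt one hot function into the "v" extension so LLVM may vectorize
// it, while the rest of the crate stays baseline. On nightly this needs
// #![feature(riscv_target_feature)] at the crate root.
#[target_feature(enable = "v")]
unsafe fn mul_fvec_v(out: &mut [f32], a: &[f32], b: &[f32]) {
    let n = out.len().min(a.len()).min(b.len());
    for i in 0..n {
        out[i] = a[i] * b[i];
    }
}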

Did anything not “just work”?

I have found that the documentation for the RISC-V vector instructions is a bit lacking, to say the least. I’m used to reading specs, but this one is a bit extreme.

vlseg2e32.v v0, (a0) fails with an illegal instruction when in m8 mode. That’s strange. It works fine in m1 through m4.

Can we check the specs on the CPU? Doesn’t look like it. It’s a “Ky X1”. That’s all we’re told. Is it truly RVV 1.0? Don’t know. Who even makes it? Don’t know. I see guesses and assertions that it may be by SpacemiT, but they don’t list it on their website. Maybe it’s a variant of the K1? Could be. The English page doesn’t load, but the Chinese page seems to have similar marketing phrases as Orange Pi RV2 uses for its CPU.

Ah, maybe this is blocked by the spec:

The EMUL setting must be such that EMUL * NFIELDS ≤ 8, otherwise the instruction encoding is reserved. […] This constraint makes this total no larger than 1/4 of the architectural register file, and the same as for regular operations with EMUL=8.

So for some reason you can’t load half the register space in a single instruction. Oh well, I guess I have to settle for loading 256 bytes at a time.

It’s a strange requirement, though. The instruction encoding allows it, but it just doesn’t do the obvious thing.

ARM

ARM uses SIMD like Intel. I vaguely remember that it also has vector instructions (SVE?), but I’ve not looked into it.

Can I test this out without hardware?

Yes, qemu has both RISC-V support and support for its vector instructions. I didn’t make precise notes when I tried this out months ago, but this should get you started.

tl;dr

Vector instructions are great. I wasn’t aware of this register grouping, and I love it.

io_uring, kTLS and Rust for zero syscall HTTPS server


Around the turn of the century we started to get a bigger need for high capacity web servers. For example there was the C10k problem paper.

At the time, the kind of thing done to reduce per-request work was pre-forking the web server. That way a request could be handled without an expensive process creation.

Because yes, creating a new process for every request used to be something perfectly normal.

Things did get better. People learned how to create threads, making things more lightweight. Then they switched to using poll()/select(), to spare not just the process/thread creation, but the whole context switch.

I remember a comment on Kuro5hin from anakata, the creator of both The Pirate Bay and the web server that powered it, along the lines of “I am select() of borg, resistance is futile”, mocking someone for not understanding how to write a scalable web server.

But select()/poll() also doesn’t scale. If you have ten thousand connections, that’s an array of ten thousand integers that need to be sent to the kernel for every single iteration of your request handling loop.

Enter epoll (kqueue on other operating systems, but I’m focusing on Linux here). Now that’s better. The main loop is now:

  set_up_epoll()
  while True:
    new, read, write = epoll()
    epoll_add_connections(new)
    for con in read:
      process(con.read())
      if con.read_all_we_need:
        epoll_remove_read_op(con)
    for con in write:
      con.write_buffer()
      if con.buffer_empty:
        epoll_remove_write_op(con)

All the syscalls are pretty cheap. epoll() only deals in deltas, and it doesn’t have to be re-told the thousands of active connections.

But they’re not without cost. Once we’ve gotten this far, the cost of a syscall is actually a significant part of the total remaining cost.

We’re here going to ignore improvements like sendfile() and splice(), and instead jump to…

io_uring

Instead of performing a syscall for everything we want to do, commanding the kernel to do this or that, io_uring lets us just keep writing orders to a queue, and letting the kernel consume that queue asynchronously.

For example, we can put accept() into the queue. The kernel will pick that up, wait for an incoming connection, and when it arrives it’ll put a “completion” into the completion queue.

The web server can then check the completion queue. If there’s a completion there, it can act on it.

This way the web server can queue up all kinds of operations that were previously “expensive” syscalls by simply writing them to memory. That’s it. And then it’ll read the results from another part of memory. That’s it.
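
The shape of it, using the io-uring crate. This is just a no-op submission and its completion, not the web server, but accept/recv/send entries are pushed and reaped the same way:

// Minimal io_uring round trip with the io-uring crate: push one entry to the
// submission queue, wait for it, and read its completion.
use io_uring::{opcode, IoUring};

fn main() -> std::io::Result<()> {
    let mut ring = IoUring::new(8)?;
    let nop = opcode::Nop::new().build().user_data(42);
    // Safety: a Nop references no buffers, so there's nothing to keep alive.
    unsafe { ring.submission().push(&nop).expect("submission queue full") };
    ring.submit_and_wait(1)?;
    let cqe = ring.completion().next().expect("expected one completion");
    assert_eq!(cqe.user_data(), 42);
    Ok(())
}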

In order to avoid busy looping, both the kernel and the web server will only busy-loop checking the queue for a little bit (configurable, but think milliseconds), and if there’s nothing new, the web server will do a syscall to “go to sleep” until something gets added to the queue.

Similarly on the kernel side, the kernel will stop busy-looping if there’s nothing new, and needs a syscall to start busylooping again.

This sounds like it would be tricky to optimize, but it’s not. In the end the web server just puts stuff on the queue, and calls a library function that only does that syscall if the kernel actually has stopped busylooping.

This means that a busy web server can serve all of its queries without even once (after setup is done) needing to do a syscall. As long as queues keep getting added to, strace will show nothing.

One thread per core

Since CPUs today have many cores, ideally you want to run exactly one thread per core, bind it to that core, and not share any read-write data structure.

For NUMA hardware, you also want to make sure that a thread only accesses memory on the local NUMA node. This netflix talk has some interesting stuff on NUMA and high volume HTTP delivery.

The request load will still not be perfectly balanced between the threads (and therefore cores), but I guess fixing that would have to be the topic of a future post.

Memory allocations

We will still have memory allocations though, both on the kernel and web server side. Memory allocations in user space will eventually need syscalls.

For the web server side, you can pre-allocate a fixed chunk for every connection, and then have everything about that connection live there. That way new connections don’t need syscalls, memory doesn’t get fragmented, and you don’t run the risk of running out of memory.
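
Something like this, with sizes and names made up, just to show the idea of allocating everything up front:

// Illustrative only: allocate all per-connection buffers once at startup, so
// accepting and serving connections in the steady state never touches the
// allocator (and therefore never needs an allocation syscall).
struct Conn {
    in_buf: Vec<u8>,
    out_buf: Vec<u8>,
}

struct ConnPool {
    conns: Vec<Conn>,
    free: Vec<usize>, // indexes of unused slots
}

impl ConnPool {
    fn new(max_conns: usize, buf_size: usize) -> Self {
        ConnPool {
            conns: (0..max_conns)
                .map(|_| Conn {
                    in_buf: vec![0; buf_size],
                    out_buf: vec![0; buf_size],
                })
                .collect(),
            free: (0..max_conns).collect(),
        }
    }
    fn acquire(&mut self) -> Option<usize> {
        self.free.pop()
    }
    fn release(&mut self, slot: usize) {
        self.free.push(slot);
    }
}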

On the kernel side each connection will still need buffers for incoming and outgoing bytes. This may be somewhat controllable via socket options, but again it’ll have to be the subject of a future post.

Try to not run out of RAM. Bad things tend to happen.

kTLS

kTLS is a feature of the Linux kernel where an application can hand off the job of encryption/decryption to the kernel. The application still has to perform the TLS handshake, but after that it can enable kTLS and pretend that it’s all sent in plaintext.
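
As a taste of what that hand-off means at the socket level, here’s a hedged sketch of the first of the setsockopt() calls involved: attaching the kernel’s “tls” upper layer protocol to the socket. The TLS_TX/TLS_RX calls that follow hand over the actual session keys from the finished handshake, and are omitted here.

// Hedged sketch: step one of kTLS, attaching the "tls" upper layer protocol
// to an established TCP socket. TCP_ULP is from linux/tcp.h. The later
// TLS_TX/TLS_RX calls, which give the kernel the session keys, are omitted.
use std::os::fd::AsRawFd;

fn enable_tls_ulp(sock: &std::net::TcpStream) -> std::io::Result<()> {
    const TCP_ULP: libc::c_int = 31;
    let ulp = b"tls\0";
    let rc = unsafe {
        libc::setsockopt(
            sock.as_raw_fd(),
            libc::SOL_TCP,
            TCP_ULP,
            ulp.as_ptr().cast(),
            ulp.len() as libc::socklen_t,
        )
    };
    if rc != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(())
}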

You may say that this doesn’t actually speed anything up, it just moves where encryption was done. But there are gains:

  1. This means that sendfile() can be used, removing the need to copy a bunch of data between user space and kernel space.
  2. If the network card has hardware support for it, the crypto operation may actually be offloaded from the CPU onto the network card, leaving the CPU to do better things.

tarweb

In order to learn these technologies better, I built a web server incorporating all these things.

It’s named tarweb because it’s a web server that serves the content of a single tar file.

Rust, io_uring, and kTLS. Not exactly the most common combination. I found that io_uring and kTLS didn’t play super well together. Enabling kTLS requires three setsockopt() calls, and io_uring doesn’t support setsockopt (until they merge my PR, that is).

And the ktls crate, part of rustls, only allows you to call the synchronous setsockopt(); it doesn’t export the needed struct for me to pass to my new io_uring setsockopt. Another PR sent.

So with those two PRs merged, it’s working great.

tarweb is far from perfect. The code needs a lot of work, and there’s no guarantee that the TLS library (rustls) doesn’t do memory allocations during handshakes. But it does serve https without even one syscall on a per request basis. And that’s pretty cool.

Benchmarks

I have not done any benchmarks yet. I want to clean the code up first.

io-uring and safety

One thing making io_uring more complex than synchronous syscalls is that any buffer needs to stay in memory until the operation is marked completed by showing up in the completion queue.

For example when submitting a write operation, the memory location of those bytes must not be deallocated or overwritten.

The io-uring crate doesn’t help much with this. The API doesn’t allow the borrow checker to protect you at compile time, and I don’t see it doing any runtime checks either.

I feel like I’m back in C++, where any mistake can blow your whole leg off. It’s a miracle that I’ve not seen a segfault.

Someone should make a safer-ring crate or similar, using the powers of pinning and/or borrows or something, to achieve Rust’s normal “if it compiles, then it’s correct”.
