What I did have, it turned out, was an option to embed an interactive 3D model.

Trouble is I’m not a great fan of JavaScript where it’s not essential, and I try to make things which work without needing the user to enable it on random sites they might not have encountered before. I don’t appreciate how a page will load, and be laid out in a particular fashion, but then decide that’s not what it wanted to look like and start moving everything around. It’s slow, flickery, and distracting.

I also don’t think it’s good form to invite a bunch of different sites to come and run code on other people’s computers when they’re just trying to read something I aspire to keeping as a static site. Even though everybody does it.

And I also don’t think it’s good form to open other people’s computers up to tracking cookies and all that business just because they did me the courtesy of visiting my blog.

I do use utteranc.es and MathJax, but the site should still make sense if you choose to block those, or neglect to enable them. I should try harder to not enable them by default, I guess, but whatever… I also use GitHub and Cloudflare.

So anyway. Without resorting to JavaScript, what I’ve done is to create an iframe and embed a bit of inline HTML in it, and that inline HTML is just an image and a link dressed up like a button which you can click to activate the control. The link loads the actual embed in the iframe, but only when you click on it.

The user *could* do these things with their own browser extensions, taking
responsibility for their own privacy, making their own choices, etc., but I
just don’t want to be too deeply implicated in the horrors of the modern web in
that way.

Now, I don’t presume to have any clue how to do these things correctly. It’s web stuff. It’s not my space. But I have, as is the Gold Standard™ in modern software development, cobbled together enough fragments from Stack Overflow to get something that “seems to work”. At least for me. Mostly.

I had trouble with using `{% render %}` where I thought I should have been able to, so I reverted to `{% include %}`, which causes its own problems. And there were other problems, too. So many “why can’t I just do this the easy way like in the documentation?” moments.

But anyway, here’s the template in all its bumbling glory:

https://github.com/sh1boot/sh1boot.github.io/blob/master/_includes/clickable-embed.liquid

There’s plenty I haven’t bothered to fix. Like where to get the preview images. Mostly I still use the target site to deliver static thumbnails. That allows some measure of nefarious activity, I’m sure, but at least it doesn’t slow everything down with more JavaScript and a heap of noisy back-and-forth.

Also, since the user has already clicked the go button in order to load the embedded content, it’s better to use embed links which auto-start that content. This seems to work for YouTube, but no such luck with Scratch so far.

Here’s what’s then needed to embed a Shadertoy shader:

```
{% include shadertoy.liquid id='clB3RK' %}
```

(using: shadertoy.liquid)

And the result looks like this:

The default here is to get the preview image from that site. That’s probably fine, right?

An embedded Scratch app:

This frustrates me a little because the embed doesn’t contain a link to the project, and also because I haven’t worked out how to get autoplay working. I’m also getting the preview from their site. The platform has its own internal “native” resolution of 480x360, and that’s how it renders the preview, but it’s mostly vector graphics so it scales nicely once it’s loaded.

Except for things drawn with the pen extension. Those stay at 480x360 and get ugly scaling.

Something from Tinkercad:

Tinkercad embeds don’t provide an easy-to-find preview image, so I’ve had to make my own and store them locally. Which is fine because I should really store all previews locally anyway.

And, of course, YouTube (source):

Gets the preview from YouTube by default. This turns out badly for old videos or 4:3 videos or something. Note that the use of the `youtube-nocookie.com` domain probably stops this video appearing in your view history. I’ll have to check that some time.

But the greatest frustration of all is that while I do have the option of putting a misleading preview image in place, I can’t embed a rickroll, because YouTube blocks embedding of every instance of that video that I can find.

Still to do:

- do CSS correctly, or whatever
- turn off other frames when you activate a new one (ie., don’t play five youtube videos at once)
- automatically populate local preview store while the site is being built? (stealing other sites’ assets)
- controls could be nicer, but I resent being forced to make style decisions I would rather be in the hands of the user, and not me

The general problem with this is that the rest of the frame is built from references to data that you don’t yet have. You can’t reconstruct that, and you have to wait for enough other intra blocks to fill all of that in as well. Meanwhile, the parts that you do have may be making reference to other parts that you don’t have, so those references just produce more unknowns.

In order to make it possible for the decoder to enter the stream *somehow*, the
encoder has to maintain a keep-out region of the screen where references are
not allowed so that a decoder joining at an arbitrary point still has a way to
eventually construct a whole frame that doesn’t involve references to things
it’s never seen.

Loop filters make all this a bit hairy. Loop filters bleed regions together a little to help cover blocking artefacts from excessive quantisation, which means that what’s notionally a clean slice can be contaminated by the state of its neighbours, which may be unknown when entering into a rolling intra situation.

They’re small errors but they can build up if you’re unlucky.

Worse: in some of the more unfortunate codec designs these can propagate indefinitely within the same frame, from the very top left of the frame to the bottom. These perturbations are too small to provide any coding advantage but they *do* undermine the decoder’s ability to handle things in arbitrary order. It’s a recurring design flaw that nobody seems to care about fixing.

The reality is that the changes are usually too small to propagate anywhere near that far, but it’s hard (maybe impossible) to *prove* that they won’t, so you’re still stuck with this theoretical causality problem restricting your ability to reorder.

Let’s just ignore that and hope it happens rarely and washes out before anybody notices, because if you’re not using a codec that has fixed it then the best you can do is turn off the loop filter (which creates its own problems).

Back to getting a decoder back to coherency after joining a stream. Here are some techniques which can be used in constructing a solution (some complementary, some mutually exclusive):

The most obvious solution is to only refer back to parts of the screen already painted since a specific point in time – typically the point in time when the intra stripe was at the top of the screen. Any decoder can then discard everything up to that point in time, and then start collecting segments until it has a complete picture. Then it can proceed as normal.

- conceptually simple
- smooths out I-frame cost across whole stream
- reasonable range of reference source data
- implied start point means everything back to that point can be used as a reference (under constraints).

- very high recovery time: decoder must wait for the start of the next frame before beginning to reconstruct, and must complete reconstruction before beginning to display

This allows the client to start reconstruction from any point in the stream, so it has a constant wait time rather than having to discard.

- allows reconstruction to start earlier – on average 33% faster than waiting for top of frame

- inefficient; blocks can only refer one frame back and only to a thin stripe of that previous frame

Only refer back to the frame `n` frames earlier. This creates a set of `n` independent streams, so if one breaks the others can carry on at a reduced frame rate undisturbed.

- every new feature in a scene has to be transmitted `n` times

When the client loses sync it can tell the encoder that it needs help getting back on track. After a frame is lost, ask the encoder to exclude references to the lost data. This has a turn-around-time cost, and can be quite high latency, but it’s much lower latency than having to wait for a full repaint.

- no need to wait for the whole intra reconstruction time

- have to wait for the round-trip time to the server to get things back on track
- reconfiguring the server’s encoding pipeline can cause other delays
- demands a large burst of I-frame data be delivered as quickly as possible, which may lead to more packet loss on a throttled link

Not the small-scale FEC you might see on a CD, but larger block scale. The internet is much more likely to discard whole packets rather than to deliver them with bit errors, so bit error corrections aren’t generally helpful (unless you do some complex transforms).

Send a parity block for every `n` packets so that if one of those `n` is lost it can be reconstructed.

The obvious way to apply this is to split one frame into blocks and then add parity block(s) so that lost parts of the frame can be reconstructed from parity, so that the whole frame survives. Or doesn’t if you lose too many packets.

- no extra latency

- doesn’t always work
- costs extra bandwidth
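As a sketch of the mechanics (packet counts and sizes invented for illustration), single-parity recovery is just XOR:

```c
#include <stdint.h>
#include <string.h>

enum { NPKT = 4, PKTLEN = 8 };

/* The parity packet is the XOR of all the data packets. */
static void make_parity(uint8_t data[NPKT][PKTLEN], uint8_t parity[PKTLEN]) {
    memset(parity, 0, PKTLEN);
    for (int i = 0; i < NPKT; ++i)
        for (int j = 0; j < PKTLEN; ++j)
            parity[j] ^= data[i][j];
}

/* XORing the parity with every surviving packet leaves the lost one. */
static void recover(uint8_t data[NPKT][PKTLEN], uint8_t parity[PKTLEN],
                    int lost, uint8_t out[PKTLEN]) {
    memcpy(out, parity, PKTLEN);
    for (int i = 0; i < NPKT; ++i)
        if (i != lost)
            for (int j = 0; j < PKTLEN; ++j)
                out[j] ^= data[i][j];
}
```

Lose two packets from the same parity group, though, and nothing comes back – hence “doesn’t always work”.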

It’s also possible to spread the error correction over a longer time. Create a parity block spanning a packet of the current frame and a packet of the previous frame as well. This means that if you lose the current frame’s packet then you can reconstruct it from parity combined with previous frame, but also if you lost that piece of the previous frame then you get a second chance at reconstructing it from the current chunk and parity.

While it may be too late to display the salvaged frame, you can still avoid the problem it would cause if you needed to reference it to draw the current frame.

This can be helpful in keeping things together when the latency of requesting a retransmit or re-encode would be too long.

- don’t necessarily have to drop out and begin reconstruction if a frame isn’t completed on time

- increasing the number of blocks covered by a parity block increases the risk of failure

TODO: it would probably make sense to show how to use several techniques in combination, with diagrams.

My own contribution to the field of making up improbable alien number systems involves counting in base 210. Rather than try to define 210 distinct symbols, or to make a 14-upon-15 or 15-upon-14 digit pair system, I chose to break each digit down into three components mod 5, mod 6, and mod 7.

Mod 6 is the fusion of mod 2 and mod 3 to try to make things a little less unwieldy.

This means that when counting in ones, all three components change, but they wrap around at different points, and you can derive the overall magnitude via Chinese Remainder Theorem.

It also implies that you can determine, at a glance, the remainder of a value divided by 2, 3, 5, and 7. And probably a few other things, too…
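To make the “derive the overall magnitude” step concrete, here’s a throwaway sketch (the function name is mine) that does it by brute force:

```c
/* Recover a value below 210 from its residues mod 5, 6 and 7.  A
 * brute-force search is fine here; the Chinese Remainder Theorem says
 * the answer is unique because 5, 6 and 7 are pairwise coprime and
 * 5 * 6 * 7 = 210. */
static int crt210(int r5, int r6, int r7) {
    for (int x = 0; x < 210; ++x)
        if (x % 5 == r5 && x % 6 == r6 && x % 7 == r7)
            return x;
    return -1;  /* can't happen for consistent residues */
}
```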

Can a human learn 210 distinct digits and also break these digits down into components which reveal useful properties for arithmetic? Can these be used to formulate new tricks for arithmetic that avoid having to learn a multiplication table with 44100 entries? I don’t know. Who cares? This is for aliens or whatever, right?

In our decimal world we use a lot of formal techniques and clever tricks for breaking problems down into something easier. This system needs many completely different approaches.

Breaking the whole digit apart into modulos 5, 6, and 7 you can simply add the individual parts with their respective modulos and stick them back together.

This gives you the least significant digit easily, but I’m not sure how to carry to the next digit.
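The least-significant-digit part is easy to check: because 5, 6 and 7 each divide 210, adding componentwise agrees with adding the underlying values mod 210. A sketch (the struct and names are mine, purely for illustration):

```c
/* A base-210 digit broken into its residues mod 5, 6 and 7. */
typedef struct { int r5, r6, r7; } digit210;

static digit210 split210(int x) {
    digit210 d = { x % 5, x % 6, x % 7 };
    return d;
}

/* Add component by component, each part under its own modulus. */
static digit210 add210(digit210 a, digit210 b) {
    digit210 d = { (a.r5 + b.r5) % 5, (a.r6 + b.r6) % 6, (a.r7 + b.r7) % 7 };
    return d;
}
```

What this can’t tell you is whether the true sum passed 210 – that’s the missing carry.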

TODO: figure that out.

TODO: try to coalesce my half-baked thoughts on this

A criticism levelled at metric is that because the units involve nothing but moving a decimal point, mistakes are harder to notice. When you convert from inches to feet, however, more digits change and the perturbation is more apparent.

Maybe this works here, too. Because small changes are very chaotic.

My peculiar numerical specification says nothing about shapes.

Shapes are hard.

They have to be easy to distinguish from each other, which is a hard thing to judge because when they’re novel they all look the same and you have to develop some familiarity before you can decide if they’re still genuinely confusing or not.

They have to be easy to decode in the face of some distortions. Looking at them at an angle without a reference point (we have this with 6 and 9, but we get by because an error of 180 degrees is extreme), or sloppy handwriting, or a font that didn’t take up all the ink it should have, or a pen that didn’t land soon enough during the stroke, etc..

And while it’s not essential if you just learn all the characteristics of all 210 digits individually, I wanted the components to be separable again to reveal the original underlying remainders.

These are the segments I ended up with on my first attempt:

In this scheme, the absence of a feature (an arc, or a ring, or an elbow, or a line descending from the centre) represents divisibility by a corresponding prime. Removing the feature also means that I need one fewer variation of that feature. So I managed to squeeze five states out of four positions of the ring, for example. It might be clearer to put the ring in the centre for the fifth state. I don’t know.

I may have had some other sub-patterns in mind when I planned these, but I don’t recall what they all were. This is left as an exercise for the reader.

Combining these segments gave me the following skeleton for a set of 210 digits:

That’s just a skeleton. The thing to do next would be to round these out into
more plausible, coherent, and distinct glyphs. For example, if they were
written with a pen, what path would that pen *actually* trace? When we learn
to write we learn to trace specific paths (with some variations). Even though
other paths through the same letters would theoretically have the same
outcomes, once that becomes slurred it gets harder to read. For example, we
write 5 starting in the top left corner, and then come back for the top stroke
after finishing the bottom loop. If we didn’t do that it would end up coming
out like an S.

Stroke order is even more important in Chinese. Not only does it ensure that a slurred version of the character is slurred in a way everybody expects and understands, but it defines the character’s place in a sorted list.

And the combination of that stroke order and slurring helps to give each digit a more unique character to make them easier to learn and distinguish at a glance. It also reduces mirror symmetry where the skeleton itself is symmetrical.

But if this is some alien system then they might not use a pen or a brush or a clay-poking tool. They might use stencils or plants or toenail clippings fixed in place with snot.

But I’m *not* satisfied with this skeleton. I think that maybe I can encode
more clues as to the underlying number theory in the relations between the
segments and the other segments with which they fuse.

Sticking with placing things around a ring, here’s another attempt:

This also neglects to address symmetries and how confusing they are (again, this might be addressed in a subsequent pass), and it has worse problems with the discrimination of small angles. It comes out like this:

This time the angles between each component say something about how many cycles have passed, and so we have a notion of the overall magnitude of the number. Distinguishing angles is a dubious prospect, so maybe they could be filled in with tick marks or somesuch. Those ticks could then be evolved into something both easier to draw and more distinct than just ticks.

Clearly there doesn’t need to be any circle at all. I just got hung up on that because of clock analogies, or whatever, and then I ran out of energy for exploring a fairly unconstrained space or figuring out how to better meet the constraints that I’ve given myself, because everything I have right now is way too symmetrical.

It’d probably help if I didn’t start with a circle.

Random thoughts:

- No circle.
- Maybe reserve symmetries for round numbers, or numbers with other interesting properties.
- In that second clock-face example the angle suggests magnitude mod 30 because two hands on a clock can’t help but show that off – how does one work in a third term where the difference between that angle and something else shows the larger magnitude mod 210?
- Figuring out magnitudes given the remainders seems difficult, and may need an additional cue in the design of the digits. Why does the discrete logarithm spring to mind? That just feels like it’s more of exactly the same problem. But maybe…
- On divisibility by primes, for some reason Miller-Rabin springs to mind, but I don’t have a whole thought on that.

The Adler-32 checksum consists of two 16-bit sums. One is the sum of all the bytes in the data, plus one, and the other is the sum of all those intermediate sums, which works out to be the same as the sum of all the bytes multiplied by their distance from the end of the buffer plus the length of the buffer. Both modulo 65521.

Mod isn’t an operation you want to do regularly, so one typically does the sums in larger registers and periodically applies the mod before overflow can occur.

Pretty straightforward stuff on a scalar CPU.
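For reference, the scalar version can be sketched like this (my own minimal take, using zlib’s NMAX bound of 5552 bytes between modulo reductions):

```c
#include <stddef.h>
#include <stdint.h>

#define ADLER_MOD 65521u
#define ADLER_NMAX 5552  /* zlib's bound: max bytes before b can overflow 32 bits */

/* Plain scalar Adler-32 with the modulo applied lazily. */
static uint32_t adler32(const uint8_t *data, size_t len) {
    uint32_t a = 1, b = 0;
    while (len > 0) {
        size_t n = len < ADLER_NMAX ? len : ADLER_NMAX;
        len -= n;
        while (n--) {
            a += *data++;   /* sum of bytes, plus the initial 1 */
            b += a;         /* sum of those intermediate sums */
        }
        a %= ADLER_MOD;
        b %= ADLER_MOD;
    }
    return (b << 16) | a;
}
```

(The well-known test vector: the Adler-32 of `"Wikipedia"` comes out as `0x11E60398`.)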

To vectorise this, though, you can split the stream into N independent streams and then merge them somehow at the end.

Here’s how.

Supposing you have N-way SIMD accumulators, you can quickly calculate N independent Adler-32 checksums by adding incoming bytes to the first accumulator and adding that accumulator into the second accumulator, in parallel without mixing anything between lanes.

If the buffer is not a multiple of N long, you can pad it to a multiple of N by [notionally] stuffing zero bytes in front, because zeroes at the beginning do not affect the cumulative sum (zeroes at the end cause all the values in the B sum to get multiplied differently, so you can’t do that).

This gives you N sums in the form:

\[\begin{align} A_i = &\sum_{j=1}^{length/N} data_{(j-1)N+i} \mod 65521 \\ \\ B_i = &\sum_{j=1}^{length/N} (length/N - j + 1)data_{(j-1)N+i} \mod 65521 \end{align}\]

When what we really wanted was:

\[\begin{align} A = 1 + &\sum_{j=1}^{length} data_{j} \mod 65521 \\ \\ B = length + &\sum_{j=1}^{length} (length - j + 1)data_{j} \mod 65521 \end{align}\]

Which we can derive thus:

\[\begin{align} A = 1 + &\sum_{i=1}^{N} A_{i} \mod 65521 \\ \\ B = length + &\sum_{i=1}^{N} N \times B_{i} + (N - i)A_{i} \mod 65521 \end{align}\]

Or something like that, anyway. I haven’t checked for things like off-by-one in my half-baked conversion from C notation to mathematical notation.

(TODO: make sure they’re actually right)

Having that reduction operation on hand (assuming I expressed it correctly), you can set it aside until you’ve calculated N parallel checksums and then do it as a finalisation step outside of any loops. Everything else is embarrassingly parallel.

Now it might seem like to avoid regular overflow and regular modulo we would need to use 32-bit accumulators with periodic resets. But actually not so much.

Given a loop in the form:

```
while (...) {
    A += *byteptr++;
    B += A;
}
```

If A and B start at zero, then it takes at least 258 iterations for A to overflow a 16-bit counter ($255 \times 257 = 65535$), but only 23 iterations for B to overflow ($255 \times {22(22+1) \over 2} = 64515$). Where $255$ is the largest incoming byte value, and $22(22+1) \over 2$ is the 22nd triangular number. So starting from zero we can do 22 iterations and both A and B will still be less than 65521 and fit in 16-bit counters.

We prefer 16-bit counters because in typical SIMD we get twice the throughput of 32-bit counters.

Then we need to fold those 16-bit sums back into larger counters (and/or do more modulo arithmetic). Like so:

```
while (...) {
    // In the following loop A would get added into B 22 times, but we're
    // setting A16 to zero to keep it small, and so we do those sums ahead of
    // time:
    B += 22 * A;
    uint16_t A16 = 0, B16 = 0;
    for (int i = 0; i < 22; ++i) {
        B16 += A16;
        A16 += *byteptr++;
    }
    B += B16;
    A += A16;
}
```

But since we’re only doing those larger sums every 22 iterations, maybe it’d be better to do the modulos to keep the arithmetic in 16 bits for twice the throughput there as well?

Yes and no.

Assuming A was previously less than 65521, `A += A16` can’t ever be bigger than 71130, so we can simply test it for overflow and subtract 65521 if necessary.

That’s *slightly* complicated by 16-bit arithmetic, but we can make it work
something like this:

```
uint16_t tmp = 65521 - A;
A += A16;
if (A16 >= tmp) A = (A + 65536 - 65521) & 0xffff;
```

That’s more costly than a straight 32-bit sum, *but* having spent just a couple of extra operations on keeping the A accumulator within 16 bits we have ensured that B only has to handle growing by up to $23 \times 65520$ per iteration. Which means that we can confidently do over 2800 outer iterations without worrying about overflow.

Is it also worth reducing B to a 16-bit counter? Well, no, not really. The supposed benefit would be that we can do twice as many 16-bit adds as 32-bit adds, but without the knock-on effects on another accumulator the extra work in doing the modulo doesn’t justify itself. And this time we have to deal with 22 times a 16-bit value, mod 65521; so we’re stuck with that 32-bit temporary arithmetic anyway.

There’s probably a clever trick to do $x \times 22 \mod 65521$, in 16-bit arithmetic but I don’t know what it is. I may try to figure it out for fun, but it’s still not going to help here.

Also, since we’re really only trying to step back from the brink of overflow we don’t necessarily have to get the modulo exactly right. There’s a more expedient option:

```
if (periodic_overflow_control) {
    uint16_t top = B >> 16;  // close to division by 65521, but easier
    B -= 65521u * top;       // unsigned constant avoids signed overflow
}
```

It’s so rare it’s probably not worth optimising for performance, but if you don’t have the necessary instruction available then writing this is much easier than writing a manual divide operation. We just have to use a more conservative reset period to stay safe.
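To convince myself that’s sound: since `B >> 16` never exceeds `B / 65521`, subtracting `65521 * (B >> 16)` can’t go negative, can’t change the value mod 65521, and always pulls the result back under about $2^{20}$. A sketch (function name is mine):

```c
#include <stdint.h>

/* One step of the expedient reduction: treat the top 16 bits as an
 * approximate quotient.  The result is congruent to b mod 65521 and
 * no larger than 16 * 65535. */
static uint32_t approx_reduce(uint32_t b) {
    uint32_t top = b >> 16;   /* close to division by 65521, but easier */
    return b - 65521u * top;
}
```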

If you don’t even have a multiplier then note that 65521 is 15 short of 65536, and calculating $x \times 15$ is the same as $x \times (16 - 1)$, and you’ll be able to work out something from that.

Here’s what that looks like in practice: adler32_simd.c

Where I’m from conventional wisdom has always been that while it’s an objective fact that little endian is the one-true-and-correct endian, bit-streams should always be packed as big-endian data.

It’s a constant source of amusement (actually frustration) to see another data format stumble with this because x86 is little endian and ARM is little endian, and everything that matters is little endian, and somebody thinks they can simplify something by doing their bit packing in little endian as well.

I don’t think that’s a good idea.

Consider three symbols, a, b, and c, of different bit lengths laid out in a little-endian bit string:

Wait, what? That looks wrong.

Well, yeah. The convention is to write our bytes out in big-endian order, but we’re describing a little-endian bit string, which makes things look disjoint. Here’s the same layout but with the least-significant bit of each byte written on the left:

The implication for little-endian is that if you read just the first byte of the bitstream then you get the least-significant bits of some symbol first (depending on its size you may have all the bits of that symbol and the low bits of the symbol after it, but let’s not get ahead of ourselves). If there are more significant bits to be read in, as is the case with symbols b and c above, then they’ll be in the next byte.

This is consistent with a little-endian architecture where loading a word at that address means you can just shift the relevant bits into place.
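For illustration, here’s roughly what that looks like in code (symbol widths and names are arbitrary, not from any particular format):

```c
#include <stdint.h>

/* A little-endian bit string in a 64-bit accumulator: each new symbol
 * is inserted above the bits already written, so the first symbol ends
 * up in the least-significant bits of the first byte. */
typedef struct { uint64_t acc; int nbits; } bitwriter;

static void put_bits(bitwriter *w, uint32_t value, int width) {
    w->acc |= (uint64_t)value << w->nbits;   /* value must fit in width bits */
    w->nbits += width;
}

/* Reading pulls the least-significant bits out first: load, shift, mask. */
static uint32_t get_bits(uint64_t acc, int *pos, int width) {
    uint32_t v = (uint32_t)(acc >> *pos) & ((1u << width) - 1u);
    *pos += width;
    return v;
}
```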

Now looking at Deflate, for example, it’s notionally described as a little-endian format but when it comes to stuffing Huffman codes into the bitstream the codes are expressed in reverse order from the way they’re constructed.

This is because the first bits you can extract are the least-significant bits, and with variable-length codes you can’t know how big the symbol is (how many bits you need to decode it) until you’ve interpreted a portion of the prefix. So the prefix here must be the least-significant bits, which you decode early in order to figure out the length of the whole code.

This makes building Huffman tables a bit of a headache (which, in the case of zlib, is done fairly frequently) because equivalent codes with unused suffixes jump around throughout the table, and it means you can’t pull more bits than you need and use magnitude comparison to decide which family of code lengths a symbol belongs to (which is normally a thing you can do with canonical Huffman, like zlib uses).

Here’s the same data packed as a big-endian bit stream:

(going back to the original convention because it makes sense this time)

Now the next byte always contains the most significant bits of the next symbol. Meaning for canonical Huffman that the prefix must be the most significant bits and so the most significant bits must code the length of the symbol.

Now if you read at least enough bits to decode any symbol, then comparing those bits with a threshold will work, and that can tell you how long the symbol is without a lookup table and in other coding schemes (eg., arithmetic or range coding) it may tell you the value of the symbol as well.

The joke is that there’s nothing inherently better about big-endian bit packing, but it just means that if you want to go little-endian then you should read your bitstream starting from the far end (the higher addresses) where the most-significant bits are.

So recently, when I was digging through Zstd, I was amused to see that while it’s yet another coding system using a little-endian packing, it does indeed start at the far end!

I haven’t analysed it that deeply. I don’t really know how ANS works and I can’t assert that it would all come out in the conventional order by turning the whole thing into a big-endian format. I just found it amusing. So I wrote this post.

In the question of using curl to download a script from a site and piping it directly to bash, a lot of objections can be dismissed as the user already having made implicit trust decisions. Somewhere in that debate man-in-the-middle attacks come up and are routinely dismissed on the basis that all modern sites use TLS.

But that’s not entirely accurate.

Modern sites *offer* TLS. You only get it if you ask for it. If you don’t
then you’ll normally receive a redirection from the insecure site to the secure
one, *telling* you to ask for it. But by then it’s **too late**. That initial
connection was vulnerable to hijacking, and if a hijack did happen then the
unencrypted reply probably won’t contain the proper redirect. It could contain
anything.

If you start with http:// or don’t specify and your user agent guesses HTTP, then you fail to secure the set-up of the connection – even on a TLS-only site. And if you don’t secure the set-up then nothing beyond that can be trusted either.

HSTS provides a mitigation for this in web browsers, but it only works well
for sites that are preloaded. For any other site the
browser will first visit the HTTP site, and if that isn’t intercepted it’ll
follow that to the HTTPS site, which will advertise that the HTTP site is not
to be used any longer. The browser makes a note and then future connections to
that site are safe for a while because your HTTP connections will be
automatically rewritten before being attempted. Notably, the whole of the
`.dev` TLD is preloaded in this list.

On web browsers you can also use HTTPS Everywhere, or follow their instructions to enable the setting in your browser.

Curl supports settings to pick a default protocol instead of guessing, and it supports settings to enable HSTS. But neither of these are on by default.

Further, even after enabling HSTS, you still have to go find a preload file in a format supported by curl. I don’t know where that is.

So anyway… what I noticed the other day was a site using curl-to-bash with a domain and a path, but without specifying a scheme for the transfer. And they used the `-L` command-line switch, which turns out to be for following redirections as would be expected according to my description above.

I did a bit more research and eventually found eleven thousand more occurrences on GitHub (give or take the quality of my filtering of false positives). Almost one thousand of those were piped into sudo.

It’s not ideal.

Some actions you might consider:

- configuring `~/.curlrc` to enable HTTPS by default (does not override explicit http:// links)
- configuring `~/.curlrc` to choose HSTS by default (might override *some* explicit http:// links)
- configuring `~/.curlrc` to disable all unencrypted protocols (this might cause problems, because HTTP is still useful when used responsibly)
- configuring your browser to use HTTPS by default
- eliminating all unnecessary http:// links from your bookmarks, documents, websites, etc..
- never copy-pasting unsafe command lines into your terminal
- never posting unsafe command lines on the internet for others to paste into their terminal
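For the `~/.curlrc` suggestions, something along these lines should work on a reasonably recent curl (option names are from curl’s manual; check your version and paths before relying on them):

```
# ~/.curlrc
# Assume https:// when no scheme is given (--proto-default, curl 7.45+):
proto-default = "https"

# Remember HSTS hosts in a cache file (--hsts, curl 7.74+):
hsts = "/home/you/.curl-hsts"
```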

And check yourself on those last two. Will you *always* notice when something
is unsafe?

I’m sure it’s all very basic stuff for professionals, but here are a few things I had to grind through as somebody who doesn’t really want to get involved in the web at all if possible:

- Inlining SVG
- Drawing SVG with the proper colours
- Liquid iteration to generate regular structures
- Iterating over strings, instead
- Adding colour
- Grouping related objects for mouse-over highlighting
- Optimisation
- SVG viewbox versus width and height

First, unsurprisingly, you can just inline SVG directly inside of markdown:

```
<svg width="100%" height="100" viewBox="0 0 100 100">
  <circle cx="50" cy="50" r="40" />
</svg>
```

Astounding!

To respect dark-mode or other CSS overrides from the user it’s important to
avoid black-on-black diagrams, but it’s also good to avoid
black-on-**white-rectangle** diagrams, which can also be hard to read inside a
dark-themed page.

It turns out you can use `currentColor` in SVG to draw lines in the current text colour wherever that SVG is embedded. One assumes the text colour was reliably chosen to contrast with the background. The background of an SVG is transparent by default, so implicitly consistent with the surrounding context.

Also, to make a shape solid one can use `currentColor` with a low opacity in order to “tint” the background, rather than committing to a specific colour.

Hopefully that’s all working as intended on the circle above.

```
svg {
  stroke: currentColor;
  stroke-width: 1.5;
  fill: currentColor;
  fill-opacity: 0.0625;
}
```

Unfortunately this breaks SVG’s text, which is normally rendered in the fill colour with no outline stroke. A fix-up is needed for that.

Also, I find it most convenient to anchor the text by its centre, so I can easily line it up with the centre of the things that it’s labelling.

```
text {
  stroke: none;
  fill: currentColor;
  fill-opacity: 1.0;
  dominant-baseline: middle;
  text-anchor: middle;
}
```

To draw a bunch of very similar objects it can be easier to generate them programmatically. This Liquid thingumy I seem to be using has loops, but arithmetic is excruciating. It seems to be a language very much in the spirit of COBOL.

```
<svg width="100%" height="120" viewBox="0 0 320 120">
  <defs>
    <clipPath id="clip34">
      <rect x="3" y="3" width="34" height="34" />
    </clipPath>
    {%- for n in (0..15) -%}
    <g id="box{{n}}">
      <rect x="3" y="3" width="34" height="34" />
      <text x="20" y="20" clip-path="url(#clip34)">
        {{-n-}}
      </text>
    </g>
    {%- endfor -%}
  </defs>
  {%- for n in (0..7) -%}
  <use href="#box{{n}}"
       x="{{forloop.index0 | times: 40}}"
       y="0"
  />
  {%- endfor %}
  {%- for n in (0..7) -%}
  <use href="#box{{n | plus: 1 | modulo: 8}}"
       x="{{forloop.index0 | times: 40}}"
       y="40"
  />
  {%- endfor %}
  {%- for n in (0..7) -%}
  <use href="#box{{n | plus: 2 | modulo: 8}}"
       x="{{forloop.index0 | times: 40}}"
       y="80"
  />
  {%- endfor -%}
</svg>
```

That looks a lot like it could use a nested loop, but I can’t figure out how to add two variables together, so I couldn’t make it work that way.

Is it really worth it, trying to generate an SVG file from source, programmatically, rather than just using some kind of editor?

Well, no, probably not but I did it anyway. I change my mind so often that as a project grows it gets progressively more tedious to re-arrange all the components and update the individual elements. Something CSS is meant to simplify.

So onwards I grind…

While arithmetic is painful, you can convert simple ASCII plans for a diagram with a bit of string manipulation. Splitting, mostly. So you can make 2D arrays with two different delimiter characters:

```
<svg width="100%" height="120" viewBox="0 0 320 120">
{%- assign table = " 0 1 2 3 4 5 6 7
: 1 2 3 4 5 6 7 0
: 2 3 4 5 6 7 0 1" %}
{%- assign rows = table | split: ":" %}
{%- for row in rows %}
  {%- assign cells = row | split: " " %}
  {%- for cell in cells %}
  <use href="#box{{cell}}"
       x="{{forloop.index0 | times: 40 | plus: 2}}"
       y="{{forloop.parentloop.index0 | times: 40 | plus: 2}}"
  />
  {%- endfor %}
{%- endfor %}
</svg>
```

Using approximately the same transparency trick as before, it’s possible to define a bunch of colours and then use those colours to tint solid objects, to highlight that they share some property, or whatever. That’s just standard CSS stuff.

Here’s a palette devised by rotating around hue in steps of 360/phi while slowly ramping down the brightness and saturation to try to maximise the distance between colours:

```
<style>
svg {
  {%- for n in (0..20) %}
  --unique-color{{n}}: hsl({{-n | times: 222.5 | modulo: 360}},
                           {{-n | times: -3 | plus: 100}}%,
                           {{-n | times: -2 | plus: 50}}%);
  {%- endfor %}
}
{%- for n in (0..20) %}
.tint{{n}} {
  fill: var(--unique-color{{n}});
  fill-opacity: 0.125;
}
{%- endfor %}
</style>
```

```
<svg [...] >
  <defs>
    {%- for n in (0..7) %}
    <g id="cbox{{n}}" class="tint{{n}}"><use href="#box{{n}}" /></g>
    {%- endfor %}
  </defs>
  [...]
</svg>
```

There. A touch of synaesthesia to emphasise the presence of diagonal stripes if the digits didn’t already do it for you.

To make it possible for the user to choose to emphasise one class of thing
(like all the ‘0’ tiles below), a `:hover` property can be used. It can even
be animated without JavaScript.

```
<style>
@keyframes glow {
  0%   { fill-opacity: 0.5; }
  50%  { fill-opacity: 0.0; }
  100% { fill-opacity: 0.5; }
}
{%- for n in (0..20) %}
.tint{{n}}:hover {
  fill: var(--unique-color{{n}});
  fill-opacity: 0.50;
  font-weight: bold;
  font-size: larger;
  animation-name: glow;
  animation-iteration-count: infinite;
  animation-duration: 1.5s;
}
{%- endfor %}
</style>
```

If you apply the class to a whole `<g>` group, then (at least as far as I’ve
tested) everything inside the group reacts to the `:hover` style in unison:

```
<svg width="100%" height="640" viewBox="0 0 640 640">
{%- assign table = " 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
: 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7
: 4 5 6 7 0 1 2 3 12 13 14 15 8 9 10 11
:12 13 14 15 8 9 10 11 4 5 6 7 0 1 2 3
: 2 3 0 1 6 7 4 5 10 11 8 9 14 15 12 13
:10 11 8 9 14 15 12 13 2 3 0 1 6 7 4 5
: 6 7 4 5 2 3 0 1 14 15 12 13 10 11 8 9
:14 15 12 13 10 11 8 9 6 7 4 5 2 3 0 1
: 1 0 3 2 5 4 7 6 9 8 11 10 13 12 15 14
: 9 8 11 10 13 12 15 14 1 0 3 2 5 4 7 6
: 5 4 7 6 1 0 3 2 13 12 15 14 9 8 11 10
:13 12 15 14 9 8 11 10 5 4 7 6 1 0 3 2
: 3 2 1 0 7 6 5 4 11 10 9 8 15 14 13 12
:11 10 9 8 15 14 13 12 3 2 1 0 7 6 5 4
: 7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8
:15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0" %}
{%- assign pass = "0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15" | split: " " %}
{%- for filter in pass %}
<g class="tint{{filter}}">
  {%- assign rows = table | split: ":" %}
  {%- for row in rows %}
    {%- assign cells = row | split: " " %}
    {%- for cell in cells %}
      {%- if cell == filter %}
  <use href="#box{{cell}}"
       x="{{forloop.index0 | times: 40 | plus: 2}}"
       y="{{forloop.parentloop.index0 | times: 40 | plus: 2}}"
  />
      {%- endif %}
    {%- endfor %}
  {%- endfor %}
</g>
{%- endfor %}
</svg>
```

Enjoy these disco lights by waving your mouse over them:

One might imagine how this could be useful when creating a graph with too many lines to distinguish by colour, but being able to point at the key to highlight that line on the graph itself.

With the last pattern it becomes important to acknowledge the `{%-` and `-%}` I’ve used in the Liquid code. The addition of a hyphen on the left or the right deletes any whitespace on that side of the tag. That’s not generally a big deal, but it builds up if you’re selectively filtering a lot of stuff in nested loops.

I got dinged by some linting tools for generating HTML files which were too big, and I got things under the threshold mostly by just adding those hyphens. I also used `{{-''-}}` and `{{-' '-}}` at the start of lines I wanted to indent, to dissolve those indents in the output.

Compression would be the next obvious step. I suppose it should be possible to gzip SVG data down to a small fraction of its size and to base64-encode it and inline it with `src="data:image/svg+xml;base64,..."`, or outboard it as a separate file, but I’m not sure how those options work with CSS and shared definitions and all that. And I’m not sure there’s a plug-in supported by Pages which would do the translation.

Worth mentioning because it confused me: the view box is the rectangle within
the SVG coordinate space (the units the SVG `<rect>` and `<circle>` objects
use) which will be scaled, uniformly by default, to fit within the `width`
and `height` parameters, whichever is the more constraining, in the context
of whatever contains the SVG. Note the capitalisation: the attribute is
properly spelled `viewBox`.

In the SIMD world precision usually bottoms out at 8 bits. You don’t save much by trying to get less precise than that, beyond marginal savings in how often you have to mitigate overflows.

But what if you slice one matrix into bitplanes and then you replace multiplication with a hardware accelerated parallel conditional addition? You should be able to fit eight conditional adds into something like the same space as an 8-bit multiplier (but with fewer result bits), and you can take eight times as many input rows at once, to compensate for the fact that you’re only doing one eighth of a multiply.

To get the same precision takes eight iterations, over each of the eight planes of the input, with shifts and adds to merge the results, but you gain the opportunity to quit after fewer iterations for a proportional saving in overall run time.
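A scalar sketch of that arithmetic may make it clearer. This is my own toy model, not SIMD code: the function name is mine, and a real version would do each plane's conditional adds across vector lanes rather than in a loop.

```c
#include <stdint.h>

/* Scalar sketch of the bitplane trick: multiply-accumulate 8-bit weights
 * against 8-bit inputs one bitplane at a time.  Each plane needs only
 * conditional adds; shifts merge the planes afterwards, and stopping after
 * fewer planes trades precision for run time. */
static int32_t dot_bitplanes(const uint8_t *w, const uint8_t *x, int n)
{
    int32_t acc = 0;
    for (int plane = 0; plane < 8; plane++) {
        int32_t partial = 0;
        for (int i = 0; i < n; i++)
            if (w[i] & (1u << plane))   /* conditional add in place of multiply */
                partial += x[i];
        acc += partial << plane;        /* merge this plane into the result */
    }
    return acc;
}
```

Quitting the outer loop early gives the proportional saving mentioned above, at the cost of dropping the high-order planes of the weights.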

It seems to fit neatly into a SIMD architecture, but I’ll have to go into that detail a bit later on…

Regarding those quick-sort variants which do things like median-of-k pivot selection before subdividing…

A couple of years ago (before I lost access to the write-up I did at that time) it struck me that the diminishing-returns of using the median of larger samples could be amortised over multiple levels of recursion by keeping the outcome for later; because computing the median is a fair way towards sorting the whole sample.

So I wrote plpqsort, which might have stood for something like “pivot-list-prefixed quick sort”, and maybe it could be pronounced “plippick sort”. It’s hard to say what I was thinking back then.

But don’t worry about the name. It takes a lot more careful thought to do a better quicksort than the ones many people have already put a lot of effort into, and I only wanted to demonstrate a concept and slap a silly name on it so people could draw inspiration from its singular innovation.

The way this works is to take a random sampling of the unsorted data, and to move that sample to the front of the list (doing it this way isn’t efficient, but it represents a simpler conceptual model).

The median of these is going to be our pivot. But rather than computing just the median, the whole sample is sorted fully. This is now our “pivot-list prefix”, and the median is picked from the middle of that prefix.

Then partition the data after that prefix in the usual way.

Then, exchange the top part of the low portion for the top part of the pivot list (again, maybe not an efficient way to go about it), and the second partition now starts at the part of the prefix that was just moved.

Now we have two partitions, one less than or equal to the pivot, and one greater than the pivot; and each partition begins with a sorted prefix. Exclude the old pivot. We’re done with that.

And recurse.

Now we’re beginning to amortise. Since the prefix is already sorted we don’t
need to do that again. It is half the size, but so long as it’s not *too*
small we can still use it. Otherwise collect a fresh sample and sort that.

That’s it.
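Since I no longer have the original write-up, here's a from-scratch toy of the concept. To be clear: this is not the original plpqsort. The names, sample sizes, and sampling scheme are all mine, `qsort` stands in for both the prefix sort and the small-range fallback, and for simplicity it keeps the old pivot in the low side rather than excluding it.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy re-imagining of the idea: each range carries a sorted sample as its
 * prefix, the pivot is the middle of that prefix, and each half of the
 * prefix is handed down so the next level needn't re-sort a sample. */

static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

static int cmp_int(const void *pa, const void *pb)
{
    int a = *(const int *)pa, b = *(const int *)pb;
    return (a > b) - (a < b);
}

/* prefix = number of already-sorted leading elements (0 on the first call) */
static void plpq(int *v, size_t n, size_t prefix)
{
    while (n > 32) {
        if (prefix < 5) {                        /* inherited sample too small: renew */
            prefix = 17;
            for (size_t i = 1; i < prefix; i++)  /* samples at ascending offsets */
                swap_int(&v[i], &v[i + (n - prefix) * i / prefix]);
            qsort(v, prefix, sizeof *v, cmp_int);
        }
        size_t mid = prefix / 2;
        int pivot = v[mid];

        /* partition everything after the sorted prefix */
        size_t lo = prefix, hi = n;
        while (lo < hi) {
            if (v[lo] <= pivot) lo++;
            else swap_int(&v[lo], &v[--hi]);
        }

        /* exchange the upper half of the prefix with the tail of the low
         * portion; going backwards keeps it sorted when the regions overlap */
        size_t upper = prefix - (mid + 1);
        for (size_t i = upper; i-- > 0; )
            swap_int(&v[mid + 1 + i], &v[lo - upper + i]);

        plpq(v, lo - upper, mid + 1);   /* low side inherits v[0..mid] sorted */
        v += lo - upper;                /* iterate on the high side, which */
        n -= lo - upper;                /* inherits the moved upper half */
        prefix = upper;
    }
    qsort(v, n, sizeof *v, cmp_int);    /* small ranges: plain sort */
}
```

The interesting line is the last one in the loop body: the recursion (and the tail-iteration) passes a non-zero `prefix`, so a level only pays for a sample sort when the inherited sample has shrunk too far.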

Does it work? Hard to say. My implementation is dumb, and my benchmark is dumber.

The only metrics I looked at were swaps and compares, and they didn’t turn out great. A serious implementation would probably break out of STL and resort to SIMD optimisations and suchlike, and make a proportion of the operations not-worth-measuring; and I just wasn’t going to spend the time on that because it’s not meaningful until the rest of the algorithm has been carefully designed and tested.

The prefix list also offers insights which I didn’t try to exploit because I didn’t have a realistic benchmark in which to validate that effort.

- Before you sort the prefix, is it already sorted (collect samples at ascending offsets to benefit from this)? That might be worth a closer look.
- After you sort the prefix, is some value grossly overrepresented in the sample? Or at the very least, does the pivot value also appear at some distance threshold from the middle of the prefix (it’s sorted so there’s no scanning involved to figure this out)?
- Pack a list of samples in the range of the block being partitioned into a SIMD register, and for every value visited increment a counter for each lane where the visited value is less than that register’s value. At the end you can use this to interpolate a much more accurate pivot for the next round. Or even use it to pre-lay bins for an American-flag sort.
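The last of those ideas might look something like this scalar stand-in. Everything here is mine, not from any implementation: a real version would keep the thresholds and counters in SIMD lanes, and would interpolate between thresholds rather than just picking the nearest.

```c
#include <stddef.h>

/* Keep a few sorted thresholds in "lanes", bump a lane's counter whenever a
 * visited value falls below that lane's threshold, then use the counts to
 * choose a better pivot for the next round.  Here we simply pick the
 * threshold whose count lands nearest the midpoint. */
static int estimate_pivot(const int *v, size_t n, const int *t, size_t k)
{
    size_t counts[8] = {0};                 /* assumes k <= 8 */
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < k; j++)
            counts[j] += (v[i] < t[j]);     /* per-lane conditional increment */

    size_t half = n / 2, best = 0;
    for (size_t j = 1; j < k; j++) {
        size_t dj = counts[j] > half ? counts[j] - half : half - counts[j];
        size_t db = counts[best] > half ? counts[best] - half : half - counts[best];
        if (dj < db)
            best = j;
    }
    return t[best];
}
```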

But it’s an idea. With a name.

The defaults for `unordered_map` are a bit scary. The function used to reduce
the range of the hash down to the size of the table is remainder, and the
default hash for integers is the identity function. Yikes.

To be fair, no matter how naive or lazy an algorithm seems, it probably has cases where it excels. Here, if you’re making a table of ints distributed uniformly over an unspecified range without many gaps it may be almost the best thing (just optimise that mod), but that’s almost a vector.

Obviously I, with a blog almost entirely dedicated to the subject, believe that the proper way to map a uniformly distributed random bit pattern (as a hash should be) into a fixed integer range is with a single multiply. Not a remainder.

But STL has other ideas, with its identity hash functions and the like.

What might I do instead?

If we treat our hash as a 64-bit fixed-point random number representing a
value in the range [0,1), then by multiplying it by a constant `k` we get a
random number in the range [0,k), with some fractional part which isn’t
useful (yet).

In plain old integer arithmetic that’s `(hash * k) >> 64`, but in C this
would typically overflow and produce an unusable result, and we have to go
through a wider temporary type: `(uint64_t)((__uint128_t)hash * k >> 64)`.

128-bit arithmetic isn’t necessarily cheap, but don’t worry about it. On most
hardware there’s an instruction to do just this operation. I’ll call it
`mulh`. So assume this is presented in the C world like so:

```
static inline uint64_t mulh(uint64_t x, uint64_t y) {
    /* __uint128_t is a GCC/Clang extension */
    return (uint64_t)((__uint128_t)x * y >> 64);
}
```

and that once that’s inlined it’s a single machine instruction which is probably much less costly than division or remainder.

Supposing, first of all, that the hash function you use gives a properly
distributed 64-bit number (a reasonable expectation from `size_t` on a
64-bit platform).

In that case, the operation to convert a hash to a bucket index is:

```
size_t constrain_hash(uint64_t hash, size_t nbuckets) {
    return mulh(hash, nbuckets);
}
```

Great! So easy! So fast!

Couple of problems, though.

First, if your hash isn’t “hashey” then the values might all be clustered at one end of the range, and consequently all the reachable buckets will also be clustered down one end of the range as well. If your hashes are all identity hashes of a bunch of 32-bit ints you’ll never get anything but the first bucket.

The other problem we’ll save for later.

To fix the first problem we need to condition the input to be evenly distributed. Apply some function to it that maps every 64-bit input to a 64-bit output. Basically a hash of the hash – because somebody didn’t do their job right the first time around. This hash should be bijective (and “perfect” in hash-speak), because there’s no compression happening.

What’s a good conditioning function? Well, that’s a really hard question involving trade-offs between performance and quality. If only we didn’t have to compromise. If only…

First, the classic:

```
static inline uint64_t murmurmix64(uint64_t h) {
    h ^= h >> 33;
    h *= 0xff51afd7ed558ccdULL;
    h ^= h >> 33;
    h *= 0xc4ceb9fe1a85ec53ULL;
    h ^= h >> 33;
    return h;
}
```

So you might write:

```
size_t constrain_hash(uint64_t hash, size_t nbuckets) {
    hash = murmurmix64(hash);
    return mulh(hash, nbuckets);
}
```

I’m not sure this is going to be faster than division, but it will at least be exceptionally resistant to collisions.

A cheaper cut-down version of that might also do, given that we do not need to focus on the quality of the low-order bits for what we’re doing:

```
size_t constrain_hash(uint64_t hash, size_t nbuckets) {
    hash *= 0xc4ceb9fe1a85ec53ULL;
    return mulh(hash, nbuckets);
}
```

And an honourable mention to CRC, here, since it’s hardware accelerated. The operation only returns 32 bits, but it can be a 32-bit hash of a 64-bit input. Given a function representing that operation, we can write:

```
size_t constrain_hash(uint64_t hash, size_t nbuckets) {
    /* widen before multiplying so the product keeps all 64 bits */
    return (uint64_t)crc32(hash) * nbuckets >> 32;
}
```

However, this only works for modest values of `nbuckets`. If it gets close to
$2^{32}$ or higher then things get poorly distributed.

TBD: I have another function, which I haven’t thought through fully, for if you know that your low-order bits are well distributed, but I need to figure some details out before I put it in here. The core of it is:

```
/* rotate right, masked so a rotation of zero stays defined */
#define ROTR64(x, n) (((x) >> ((n) & 63)) | ((x) << ((64 - (n)) & 63)))

hash = (hash >> 6) ^ ROTR64(0x2b63207ef09cd4ba, hash & 0x3f);
```

That bit pattern is a de Bruijn sequence, and that operation (on its own) just sprays the low six bits across the whole word in a randomish and bijective way. But six bits isn’t good enough, so I have to make a better function than that.

Fun fact: this whole `mulh` extraction process is kind of like a range coder.
That means that if you keep the low-order bits (the fraction, the bit that
`mulh()` notionally threw away) after extracting your first bucket index, you
can repeat the process to get a fully independent variable – in either the
same range or a new range if you prefer.

So if you come to a point in algorithm design where you might consider
re-seeding the hash and recomputing it to get a different approach to the
table, you don’t *have* to recalculate the whole thing. Just repeat the
`mulh()` operation again on the residual from the last call.

```
static inline uint64_t mull(uint64_t* x, uint64_t y) {
    __uint128_t p = (__uint128_t)*x * y;
    *x = (uint64_t)p;               /* keep the fraction for next time */
    return (uint64_t)(p >> 64);
}

size_t constrain_1st_hash(uint64_t* hash, size_t nbuckets) {
    *hash = murmurmix64(*hash);
    return mull(hash, nbuckets);
}

size_t constrain_nth_hash(uint64_t* hash, size_t nbuckets) {
    return mull(hash, nbuckets);
}
```

Be mindful of the earlier comment, above, about being lazy with the low-order bits. If you mean to consume the whole hash piecemeal then this becomes a little less true.

Don’t do the conditioning operation every time, though. Or do? I haven’t figured out what that says about independence of the results if you do, but if you don’t then the results stay independent for as long as possible (the degree of non-independence is actually a noisy function which is always present but grows from the low-order bits and you probably won’t suffer from it early on).

Funner fact: provided all of your ranges are odd (generally true for prime-sized tables), when the hash runs out of independent parameters it’ll just carry on hallucinating new keys for you which are no longer independent of what you’ve seen before but are, I think (TBD), a unique sequence for each initial hash.

This is because multiplication by an odd number mod a power of two is a bijective operation and doesn’t cause the hash to decay to a predictable value or orbit which might be shared with other initial states (though many will be at different phases on the same orbit).
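That claim is easy to check exhaustively at a toy width. This little sketch is mine, just for demonstration: at 8 bits we can confirm that an odd multiplier permutes all 256 values while an even one collides.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

/* Multiplication by a constant mod 2^8 is a bijection exactly when the
 * constant is odd: every 8-bit value must be hit exactly once. */
static bool multiply_permutes_8bit(uint8_t k)
{
    bool seen[256];
    memset(seen, 0, sizeof seen);
    for (unsigned x = 0; x < 256; x++) {
        uint8_t y = (uint8_t)(x * k);   /* multiply mod 2^8 */
        if (seen[y]) return false;      /* collision: not a bijection */
        seen[y] = true;
    }
    return true;
}
```

The same argument carries to 64 bits; it's just no longer practical to enumerate.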

By naively using `mulh`, when we increase the size of the table the order of
the hashes stays stable and new gaps appear in between existing entries.

This sounds like a good thing, and maybe it is. If nothing else, when it comes time to resize the table that operation could be implemented in a more-or-less linear way, and the CPU can stream into and out of cache efficiently.

But if we were having a high rate of collisions in a particular area of the table (i.e., your hash sucks and your input conditioning isn’t good enough to fix it) then this kind of scaling won’t relieve the problem effectively.

Remainder doesn’t have this problem: the values that map to the same bucket under one modulus are regularly spaced, and won’t generally map to the same bucket under another modulus used for a larger table if the two are co-prime – and in most implementations each modulus is prime.

For this reason one would probably want to tweak the conditioning function to take a parameter – a seed, or salt – and to randomise that parameter every time the table has to grow.
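A hypothetical seeded variant of the earlier mixer (my naming, not an established function) might just fold the salt in up front, leaving everything downstream unchanged:

```c
#include <stdint.h>

/* Seeded take on murmurmix64: XOR a salt into the hash before mixing, so
 * re-randomising the salt re-randomises the whole bucket mapping when the
 * table grows (or whenever collisions get suspicious). */
static inline uint64_t seeded_mix64(uint64_t h, uint64_t seed)
{
    h ^= seed;                      /* fold the salt in up front */
    h ^= h >> 33;
    h *= 0xff51afd7ed558ccdULL;
    h ^= h >> 33;
    h *= 0xc4ceb9fe1a85ec53ULL;
    h ^= h >> 33;
    return h;
}
```

Because the mix is bijective for any fixed seed, two different seeds are guaranteed to disagree on any given input.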

In fact, if the collision rate sucks but the load factor is low, maybe just change the conditioning seed without even growing the table.

This tweak should probably also be in place if there’s a risk that the table might come under attack by contrived input. Just saying.

If this isn’t for you then you can still optimise the remainder operation by hard-coding the constant into a function. The compiler knows how to convert mod-by-constant into a couple of multiplies, shifts, and adds. Then you just need one function for each divisor you might use, and a function pointer to the right one to use.
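As a sketch of that arrangement (the divisor choices and names here are mine, purely for illustration):

```c
#include <stdint.h>
#include <stddef.h>

/* One function per hard-coded divisor, so the compiler can strength-reduce
 * each `%` into multiplies, shifts, and adds.  A real table would use its
 * own growth schedule of primes. */
static size_t mod61(uint64_t h)  { return (size_t)(h % 61);  }
static size_t mod127(uint64_t h) { return (size_t)(h % 127); }
static size_t mod251(uint64_t h) { return (size_t)(h % 251); }

typedef size_t (*bucket_fn)(uint64_t);

/* indexed by the table's current size class */
static const bucket_fn bucket_for[] = { mod61, mod127, mod251 };
```

The table then stores the pointer for its current size class alongside the bucket count and calls through it on every lookup.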

It’s not *as* fast, but it’s still better than division.

That’s not how this works. I’m not here to show how one particular implementation wins at one particular benchmark. This is just a note on some techniques for possible consideration when designing an implementation – whether that be generic or application-specific. There are so many other factors in the design of a hash table, and the interaction between these methods and others in an existing or prospective implementation needs to be tested in that context.
