A reasonably effective hash instruction
As per part one of this post, I have reasoned through and tested a proof of concept for a generic CPU instruction which could achieve most of the mixing required of a good hash function in a single cycle (working to a combinational depth of 20 to 24 gates as a guide).
I wanted something that could be used to permute a 64-bit value, or be used to merge two 64-bit values into a single 64-bit result (because hashing is generally also one-way compression).
I decided to try a simple substitution-permutation network. In one phase, take small groups of bits and replace them via a substitution box (a lookup table), and in another phase rearrange the individual bits to new positions in the output, where they will have new neighbours in subsequent rounds.
substitution
A substitution in this case means that a change in any one bit can become a change in any of n bits, with the specific outcome depending on the values of m other bits.
I chose a four-in-four-out s-box, because it wouldn’t require many levels of mux to implement. In principle this means there are 16! possible tables, but most of these are not useful. To simplify the logic further I decided to try a table derived from a de Bruijn sequence, of which there are only 256 possibilities. Most of these are rotations of other sequences, so there are only 16 unique choices if you don’t count rotations, but different rotations perform differently, leading us back to evaluating all 256 choices.
Use of a binary de Bruijn sequence means that rather than using a lookup table, a shift can achieve the same result. For example:
x = ROTR16(0x4de1, i) & 15;
will return 16 different values for 16 different inputs (i).
To decide which of the remaining 256 possibilities was best I tried all of them in SMHasher3, in various configurations, to see which one was hardest to break.
permutation
To spread the effect of the s-boxes as rapidly as possible, the permutation should take all n bits from one box and distribute them among n different boxes for the next round, ensuring that each bit has never encountered its new neighbours before (until this can no longer be avoided). More specifically – that each new bin contains bits that have never shared influence from a previous s-box.
This can be done by rotating the 6-bit index of the bit by two bits (two being log2 of the size of the s-box) to find its new bit index – effectively multiplying or dividing the bit position by four, and filling in the empty bits with those shifted off the end.
I also tried another de-Bruijn-sequence-based solution which ensured that every bit position eventually marched through the whole word, while also meeting the same unique-neighbour constraint in the first n rounds. That works, and has fun properties, but it doesn’t improve things meaningfully (at least not unless you subscribe to the idea that each bit position has water memory of its old value before the substitution).
I ended up with the following round function:
uint64_t round(uint64_t x) {
  const uint16_t magic = 0x613d;
  // substitution: replace each 4-bit group via the de-Bruijn-derived s-box
  uint64_t y = 0;
  for (int i = 0; i < 64; i += 4) {
    int j = (x >> i) & 15;
    uint64_t k = ROTR16(magic, j) & 15;
    y |= k << i;
  }
  // permutation: bit j of y moves to bit i of z (a rotate of the bit index)
  uint64_t z = 0;
  for (int i = 0; i < 64; ++i) {
    int j = ((i & 15) << 2) | (i >> 4);
    z |= ((y >> j) & 1) << i;
  }
  return z;
}
I believe the shift is still a select-1-of-16, but with size optimisations available (shared results between adjacent bits, unlike a conventional mux), or four layers of binary mux, and the permute is just wires (but maybe fairly long wires). So I think I can call that eight gates deep?
packaging as an opcode
Doing the above in software is clearly absurd. It achieves nothing that can’t be done much faster in simpler operations. But if it were packaged as a machine instruction it could (I believe) do in just two or three cycles what software currently takes maybe a dozen cycles to achieve.
what makes a useful opcode
The function defined so far needs four rounds for influence from one bit to reach every bit, but that’s a weak connection and a few extra rounds would be necessary to get things properly dispersed.
But to get the operation done in a single cycle on a fast CPU, four rounds may be ambitious, and doing it properly may not always be necessary. Two or three rounds seems to work well enough if the instruction is used carefully and if we squeeze in a tiny bit of extra mixing just to be sure.
Because it’s a hashing operation, not only does it need to permute its working state, but it also has to integrate the data being hashed. A two-in, one-out integer instruction suits this purpose. But care must be taken to define the combination of two inputs in such a way that common collision scenarios don’t result in cancellation – where too many different inputs lead to the same output.
a tiny bit of extra mixing
Ideally, hash_op(x, 0), hash_op(0, x), and hash_op(x, x) would all be bijective functions of x (the last one because cases where the same value goes to both operands should not undermine hash strength). So the preconditioning of each operand must be bijective, and the combination of the two, applied to the same original input, must also be a bijective operation.
After the expected minimum of one exclusive-or to combine the two inputs, I hoped to avoid more than one additional exclusive-or to precondition the inputs.
A popular bijective mix function in crypto is to exclusive-or a value with two rotated copies of itself. But I don’t want that many extra gates, and it appears that for the transform to be bijective the number of copies of the input exclusive-ored together must be coprime to the length of the word (i.e., odd). Even if I did do three on one input and three on the other, the total after combining the two would be six – even, and not coprime.
But while exclusive-oring a value with one rotation of itself is not bijective, if you zero one bit in the rotation then it becomes bijective, by virtue of the fact that you can unwind the exclusive-or bit-by-bit starting at the one bit that was exclusive-ored with zero (provided the rotation is also coprime to the length of the word).
I have no idea how to design a pair of these operations which can then be exclusive-ored together to form another bijective function, so I wrote a search program to find a configuration which worked. That gave too many results, so I re-ran the search with a few parameters locked down to satisfy some notion of “evenly spaced” (though that notion is poorly reasoned):
- combine using a rotation of 32 (swap top and bottom halves)
- precondition each input with a rotation near 16 (these must be odd)
- make sure the dropped bits don’t land in the same s-box in the combined output
Here’s what I got:
uint64_t premix0(uint64_t x) {
  return x ^ (ROTR64(x, 15) & ~(uint64_t(1) << 10));
}

uint64_t premix1(uint64_t x) {
  x = ROTR64(x, 32);
  return x ^ (ROTR64(x, 17) & ~(uint64_t(1) << 17));
}
Using the same logic as above, that’s two gates for the exclusive-or, and nothing but wire for the rotation.
Combining all of the above operations, the final instruction looks like this:
uint64_t hash_op(uint64_t x, uint64_t y) {
  x = premix0(x) ^ premix1(y);
  x = round(x);
  x = round(x);
  return x;
}
And I think that’s two gates for the premixes, two for the combine, and 16 for the rounds. 20 gates deep?
There’s something missing from this stage. ROTR64() on its own is a fairly uninteresting permutation, because neighbours before the operation are mostly the same neighbours as after. There would be more mixing of neighbours if the permutation were more convoluted – with the caveat that not every permutation leads to a bijective operation.
This is an example of how crypto operations tend to let us down. They’re defined on simpler operations, at the cost of slower mixing¹, and use many rounds to compensate.
One mitigation is to permute x before and after the whole premix operation. This only does half the job; better still would be uncorrelated permutations for the two inputs to the exclusive-or, chosen to ensure that the overall operation is still bijective. Also, since wiring isn’t actually free, the permutations should be chosen to synthesise sensibly. I have not addressed this here.
usage
This seemed to work (i.e., pass tests) as a generic hash implementation:
uint64_t hash_function(uint8_t const* ptr, size_t len, uint64_t seed = 0) {
  const uint64_t hlen = hash_op(len, hash_op(seed, len));
  uint64_t hash_lag1 = hash_op(seed, 0);
  uint64_t hash = hash_op(0, hash_lag1);
  while (len >= 8) {
    uint64_t data;
    memcpy(&data, ptr, sizeof(data));
    data = hash_op(data, hash_lag1);
    hash_lag1 = hash;
    hash = hash_op(hash, data);
    ptr += 8;
    len -= 8;
  }
  // tail stuff...
  hash = hash_op(hash, hlen);
  hash = hash_op(hash, hash_lag1);
  return hash;
}
The assumption here is that data = hash_op(data, hash_lag1) issues concurrently with the previous iteration of hash = hash_op(hash, data).
Use of the lag1 variant helps to compensate for the limited dispersion of the reduced-round opcode. If two subsequent words in the input have certain structural correlations, then each of their first rounds will be permuted by a different argument. This helps disperse those correlations in a way that passing 0 or a seed value cannot.
Strictly, this argument should be a pseudorandom sequence. Even a Weyl sequence is probably sufficient. Deriving it from a lagged version of the data would require proving that there’s no possibility of self-cancellation, where the hash has fewer than the full range of possible output states. But this magic sequence is available without extra computation, and it seems to work just fine.
Integrating the length into the hash is done after the other data is processed on the basis that the length may not be known in advance (in a more complex use case). But when the length and seed are constant many of those surrounding operations fold down to constants as well.
results
The above functions seem to pass all SMHasher3 tests, so I guess it’s good? There are a lot of popular, high-performance hashes which don’t pass all these tests.
In fact, in the final configuration most of the s-box candidates can pass all tests; but I’ve stayed with the one which also passed in a variety of weaker configurations during development.
In a PRNG configuration:

- hash_op(hash_op(seed0 += k, seed1 += (seed0 < k) ? k : 0), 0) for k = 0x9e3779b97f4a7c15 survives PractRand until at least 32TB, and passes all of the BigCrush suite in TestU01.
- hash_op(hash_op(seed += 0x9e3779b97f4a7c15, 0), 0) persists in the PractRand suite until 1TB, where it fails.
- hash_op(hash_op(hash_op(hash_op(seed++, 0), 0), 0), 0) completes at least 16TB (with two anomalies but no failures) and also passes all of BigCrush.
future work
- be more scientific
  - be less handwavy
  - requirement: pick an audience and decide what they already know
- look at 5-bit s-boxes
- 32-bit version
  - for 32-bit architectures
  - for Feistel network constructions (still not crypto!)
- SIMD version
  - in all likelihood nobody would want to implement a strong mix operation beyond 64-bit boundaries in SIMD even if that would be very, very useful, so a construction would be needed that gets by with something very simple
- try putting the mix in the middle
  - do not be tempted to use different round functions for op1 and op2; it makes it very hard to reason on whether or not the combination is still bijective when they’re equal
  - this does increase area, as op1 and op2 each need a round of their own, but they can be done concurrently for equivalent timing
  - this also seems to weaken use as an RNG conditioning function – I guess it doesn’t benefit from quite as much avalanche from the extra mixing at the very start of the operation
- find a totally different algorithm which has applications relevant to its specific implementation to justify its existence, but which also has the dispersion properties of a deliberate dispersion function
  - find more applications for this algorithm specifically?

¹ For good reason, but not a reason I care about.