Reply To: Extracurricular challenges
@Crackerjack_404 I really like Rust as a middle ground between pure functional and imperative languages. There are aspects of functional languages that I really like, particularly sum types, match, and higher-order functions like map, all of which Rust provides via enum, match and Iterator. I sometimes feel that I have to contort the algorithm in my mind when writing purely functional programs, and it’s nice to have the option to write imperatively (sometimes imperative code is just neater and easier to read). Rust also has default immutability and variable shadowing, small things that greatly reduce the number of stupid mistakes I make. Option and Result are great and I fully understand how they work, whereas all the jargon about Monads (joke explanation: “a monad is just a monoid in the category of endofunctors”) and other weird maths concepts goes a little over my head. I’m sure that all the mathematical concepts that Haskell encodes are very beautiful!
The Rust program I wrote has a very small memory footprint because, as I detail below, I decrypt the ciphertext directly into an array of letter frequencies, so the program is CPU-bound rather than memory-bound. At the risk of going on too much about Rust, I love that it makes parallelism easy to use and safe.
I did end up writing a GPU compute shader, but it was only around 10 times faster than the Rust program (running on 12 cores), which was somewhat disappointing. It’s quite possible that there’s something I’m missing that would make it faster, as this is only the third compute shader I’ve written (the first two were a brute-force SHA-1 searcher and a rainbow table generator). My Hill program uses two separate shaders: a map stage and a reduce stage.
In case you don’t know, GPUs divide work up into workgroups and invocations (and subgroups…), which effectively address points in a virtual 3D cuboid. There are x*y*z workgroups, and each workgroup contains u*v*w invocations. The global invocation index is built from the coordinates of the workgroup and of the invocation within it. I used 64*64*1 workgroups with 16*16*1 invocations.
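In Rust terms (a CPU sketch, not the actual shader; the constants match my 64*64*1 × 16*16*1 layout but the helper names are made up), the global id works out like this:

```rust
// Sketch of how a global invocation id is derived from workgroup and
// local (within-workgroup) coordinates, ignoring the unused z/w axes.
const WG_COUNT: (u32, u32) = (64, 64); // x*y workgroups (z = 1)
const WG_SIZE: (u32, u32) = (16, 16);  // u*v invocations per group (w = 1)

/// Global invocation id along each axis: workgroup coordinate scaled by
/// the workgroup size, plus the local coordinate.
fn global_id(wg: (u32, u32), local: (u32, u32)) -> (u32, u32) {
    (wg.0 * WG_SIZE.0 + local.0, wg.1 * WG_SIZE.1 + local.1)
}

/// Row-major flattening over the full 1024 x 1024 grid of invocations.
fn flat_index(gid: (u32, u32)) -> u32 {
    gid.1 * (WG_COUNT.0 * WG_SIZE.0) + gid.0
}

fn main() {
    let gid = global_id((3, 0), (5, 0));
    assert_eq!(gid, (53, 0)); // 3 * 16 + 5
    assert_eq!(flat_index(gid), 53);
    println!("{:?} -> {}", gid, flat_index(gid));
}
```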
The map stage decodes the global invocation index into a row of the inverse matrix, then decrypts the ciphertext once, incrementing counters for each letter in a 26-element (or whatever the modulus is) array, which keeps the memory requirements down. I compute the IoC from that array; then, using a precomputed array of letter counts from a corpus, I find the relative shift whose rotated count vector is closest in angle to the corpus vector, i.e. the one with the greatest cosine similarity: max_i(cos_sim(corpus_frequencies, rotate(plaintext_frequencies, i))). If that score is above a threshold and the IoC is sufficiently close to the corpus IoC, I atomically (candidates[atomicAdd(index, 1u)] = … in WGSL) add the global invocation index and the score to a fixed-size array of candidates; I used a size of 10 per workgroup. So for each workgroup there’s an array of 10 * sizeof(Candidate) bytes, all contained in one big output buffer of around 320KiB.
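The scoring part looks roughly like this in plain Rust (the real thing is WGSL; the numbers in main are made up for a sanity check, not corpus data):

```rust
/// Index of coincidence of a text, computed from its letter counts alone.
fn index_of_coincidence(counts: &[u64; 26]) -> f64 {
    let n: f64 = counts.iter().sum::<u64>() as f64;
    let num: f64 = counts.iter().map(|&c| c as f64 * (c as f64 - 1.0)).sum();
    num / (n * (n - 1.0))
}

/// Cosine similarity between two 26-dimensional count vectors.
fn cosine(a: &[f64; 26], b: &[f64; 26]) -> f64 {
    let dot: f64 = a.iter().zip(b.iter()).map(|(x, y)| x * y).sum();
    let na: f64 = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb: f64 = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    dot / (na * nb)
}

/// Try all 26 relative shifts of the plaintext counts against the corpus
/// counts and keep the shift with the highest cosine similarity.
fn best_shift(corpus: &[f64; 26], plain: &[f64; 26]) -> (usize, f64) {
    (0..26)
        .map(|i| {
            let mut rot = [0.0f64; 26];
            for j in 0..26 {
                rot[j] = plain[(j + i) % 26];
            }
            (i, cosine(corpus, &rot))
        })
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .unwrap()
}

fn main() {
    // Made-up "corpus" of distinct counts, and a plaintext vector that is
    // the same counts rotated by 3: best_shift should recover the 3.
    let corpus: [f64; 26] = std::array::from_fn(|k| (k + 1) as f64);
    let plain: [f64; 26] = std::array::from_fn(|k| corpus[(k + 23) % 26]);
    let (shift, score) = best_shift(&corpus, &plain);
    assert_eq!(shift, 3);
    assert!(score > 0.999);

    // 12 letters split into three groups of 4: IoC = 3*(4*3) / (12*11).
    let mut counts = [0u64; 26];
    counts[0] = 4; counts[4] = 4; counts[19] = 4;
    assert!((index_of_coincidence(&counts) - 36.0 / 132.0).abs() < 1e-12);
    println!("best shift {shift} with score {score:.4}");
}
```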
I then used a workgroupBarrier(), which ensures that any memory writes made by any invocation before the barrier are visible to reads in any invocation after it. Then I bubble sort the candidates in the first invocation only (when local_invocation_id == 0). Typically each workgroup will have zero or, very occasionally, one candidate, so even though this bubble sort isn’t parallel I don’t think it has a noticeable effect on performance (it’s dwarfed by the time taken to decrypt and analyse the text). I could do this in another stage (possibly with a parallel sorting algorithm), but because I filter out candidates with score < 0.9 there are so few elements to sort that the overhead of another compute pass would probably exceed the cost of this bubble sort.
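The append-then-sort pattern per workgroup, translated into Rust (a sketch using scoped threads in place of invocations; real candidates carry a score as well, but a bare u32 keeps the illustration short):

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::thread;

const CAP: usize = 10; // candidate slots per workgroup

/// Each "invocation" that passes the filters reserves a slot with
/// fetch_add (the WGSL atomicAdd) and writes its candidate there;
/// reservations past CAP are simply dropped.
fn append(cursor: &AtomicU32, slots: &[AtomicU32; CAP], candidate: u32) {
    let slot = cursor.fetch_add(1, Ordering::Relaxed) as usize;
    if slot < CAP {
        slots[slot].store(candidate, Ordering::Relaxed);
    }
}

/// Descending bubble sort of the (tiny) candidate list, as the first
/// invocation does after the barrier.
fn bubble_sort_desc(v: &mut [u32]) {
    for i in 0..v.len() {
        for j in 0..v.len() - 1 - i {
            if v[j] < v[j + 1] {
                v.swap(j, j + 1);
            }
        }
    }
}

fn main() {
    let cursor = &AtomicU32::new(0);
    let slots: &[AtomicU32; CAP] = &std::array::from_fn(|_| AtomicU32::new(0));
    // Three "invocations" found a candidate and append concurrently.
    thread::scope(|s| {
        for id in [7u32, 42, 1000] {
            s.spawn(move || append(cursor, slots, id));
        }
    });
    let n = (cursor.load(Ordering::Relaxed) as usize).min(CAP);
    let mut kept: Vec<u32> =
        slots[..n].iter().map(|a| a.load(Ordering::Relaxed)).collect();
    bubble_sort_desc(&mut kept);
    assert_eq!(kept, vec![1000, 42, 7]);
    println!("{n} candidate(s) kept: {kept:?}");
}
```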
The reduce stage is basically just the merge() function from mergesort. It operates on two lists of candidates and puts the (up to) 10 highest-scoring candidates in the first list. This runs around log2(n) times, where n is the number of workgroups.
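One reduce step looks roughly like this (a CPU sketch of the merge, assuming both input lists are already sorted by descending score; the struct fields are my guess at the candidate layout):

```rust
const KEEP: usize = 10; // at most this many candidates survive a merge

#[derive(Clone, Copy, Debug, PartialEq)]
struct Candidate {
    index: u32, // global invocation index, i.e. which inverse row
    score: f64, // cosine-similarity score
}

/// Merge two score-descending candidate lists, keeping at most KEEP of
/// the best in `a` — the merge() half of mergesort, cut off at KEEP.
fn merge_top(a: &mut Vec<Candidate>, b: &[Candidate]) {
    let mut out = Vec::with_capacity(KEEP);
    let (mut i, mut j) = (0, 0);
    while out.len() < KEEP && (i < a.len() || j < b.len()) {
        let take_a = j >= b.len() || (i < a.len() && a[i].score >= b[j].score);
        if take_a {
            out.push(a[i]);
            i += 1;
        } else {
            out.push(b[j]);
            j += 1;
        }
    }
    *a = out;
}

fn main() {
    let c = |index, score| Candidate { index, score };
    let mut a = vec![c(1, 0.99), c(2, 0.95)];
    let b = vec![c(3, 0.97), c(4, 0.91)];
    merge_top(&mut a, &b);
    let scores: Vec<f64> = a.iter().map(|x| x.score).collect();
    assert_eq!(scores, vec![0.99, 0.97, 0.95, 0.91]);
    println!("{a:?}");
}
```

Running this pairwise over all the per-workgroup lists halves the number of lists each pass, hence the log2(n) passes.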
I divide the space of inverse rows to search into fixed-size chunks so the GPU doesn’t silently drop work (or trip a driver timeout) when given too much at once. For a 5×5 matrix, the map stage takes around 60ms and the reduce stage around 50µs (minuscule), with overall runtime around 200ms.
Something I do in the CPU version is filter out bad inverse rows by taking the IoC of the first 50 characters and stopping early if it’s too far out, which speeds things up a fair amount. IIUC I can’t do this on a GPU because all the invocations run together until the last one completes, and when I tried it it made little difference. The biggest performance improvement was, somewhat surprisingly, allocating a few large buffers that precompute the integers modulo the alphabet length, including one of size matrix_dimension * pow(alphabet_length - 1, 2). My GPU can read the corresponding entry of the buffer much faster than it can do the modulo operation (I guess that’s not all that surprising, because division is generally not a primitive instruction).
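Building that lookup table is simple; the bound comes from the fact that a decryption dot product sums matrix_dimension products of values each below alphabet_length (a Rust sketch of the CPU-side precomputation, assuming the buffer holds exactly x mod m for every reachable x):

```rust
/// Precompute x mod alphabet_length for every x a decryption dot product
/// can produce: the largest such value is matrix_dimension products of
/// (alphabet_length - 1) * (alphabet_length - 1).
fn mod_table(matrix_dimension: usize, alphabet_length: usize) -> Vec<u32> {
    let bound = matrix_dimension * (alphabet_length - 1).pow(2);
    (0..=bound).map(|x| (x % alphabet_length) as u32).collect()
}

fn main() {
    let m = 26;
    let table = mod_table(5, m);
    // A 5x5 Hill decryption dot product never exceeds 5 * 25 * 25 = 3125.
    assert_eq!(table.len(), 5 * 25 * 25 + 1);
    assert_eq!(table[3125], (3125 % m) as u32);
    assert_eq!(table[57], 5); // 57 mod 26
    println!("table of {} entries built", table.len());
}
```

On the GPU the same table lives in a storage buffer and a decryption step becomes a single indexed read instead of a modulo.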