If you were wondering how competitors tackle the complexities of something like last year’s hardest Challenge (10B), our competitor Cribbage compiled the following useful short guide to their thinking. Huge thanks to Cribbage for taking the time to do this!
Here it is:
First step for 10B: study the various indices of coincidence (see pages 15-16 of the guide produced for the Cipher Challenge here: A-beginners-guide-to-codebreaking). Essentially the IoC is a measure of randomness or pattern in a text: a text of “AAAAAAAAAAAAAAA”, for example, would generate an IoC of 1, whereas a completely random text such as “ZUXDMZZITLESRRDBXWNK” (where each letter of the alphabet has an exactly equal 1 in 26 chance of appearing next) would generate an IoC of around 0.0385. It is the scores between 0.0385 and 1.0000 that are interesting. A typical passage of English has a score of around 0.065, because it has patterns and is not fully random – the letter ‘e’, for example, has a much greater chance of appearing than a ‘z’. If a polyalphabetic cipher has been used, i.e. one involving more than one alphabet, such as a Vigenère cipher (see earlier challenges this year), then the IoC score will drop below c. 0.065, but will still be higher than the completely random score of 0.0385. The more alphabets are used in the cipher, the more the text will look like random text, and the closer the score will get to 0.0385. For 10B, the IoC score was 0.041, which suggested several alphabets had been used: a score of about 0.040 can indicate that around 8 alphabets are in play.
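For anyone who wants to compute the IoC themselves, here is a minimal sketch in Python (the function name is my own; the formula is the standard one: the probability that two letters drawn at random from the text are the same):

```python
from collections import Counter

def index_of_coincidence(text):
    """Probability that two letters drawn at random from the text match."""
    letters = [c for c in text.upper() if c.isalpha()]
    n = len(letters)
    if n < 2:
        return 0.0
    counts = Counter(letters)
    # Sum over each letter of (ways to pick the same letter twice),
    # divided by (ways to pick any two letters).
    return sum(f * (f - 1) for f in counts.values()) / (n * (n - 1))

print(index_of_coincidence("AAAAAAAAAAAAAAA"))  # 1.0
```

A short sample will of course not land exactly on 0.0385 or 0.065; those are the long-run expected values for random text and for English.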
To find out exactly how many alphabets, you need to try out different numbers of alphabets. So, if the ciphertext was “ABCDEFGHIJ”, and if you suspected that three alphabets had been used (i.e. one alphabet for the letters at positions 1, 4, 7 etc.; another for letters at positions 2, 5, 8 etc.; and a third for letters at positions 3, 6, 9 etc.), then you would group the letters for each alphabet together. So here: alphabet 1 = “ADGJ”, alphabet 2 = “BEH”, alphabet 3 = “CFI”, and then you would work out the IoC for each alphabet group and then work out the average of all the IoCs. You would then repeat this to see what the IoC average looks like if 4 alphabets were used, dividing up the ciphertext differently, working out the IoCs, finding the average, and so on.
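This grouping-and-averaging step is easy to mechanise. A sketch, reusing the IoC function from above (the “ABCDEFGHIJ” split below is the same toy example as in the text):

```python
from collections import Counter

def index_of_coincidence(text):
    # (IoC helper, as defined earlier.)
    letters = [c for c in text.upper() if c.isalpha()]
    n = len(letters)
    if n < 2:
        return 0.0
    counts = Counter(letters)
    return sum(f * (f - 1) for f in counts.values()) / (n * (n - 1))

def split_into_alphabets(ciphertext, num_alphabets):
    """Group every num_alphabets-th letter together, one group per alphabet."""
    letters = "".join(c for c in ciphertext.upper() if c.isalpha())
    return [letters[i::num_alphabets] for i in range(num_alphabets)]

def average_ioc(ciphertext, num_alphabets):
    """Average the IoC over the groups for an assumed number of alphabets."""
    groups = split_into_alphabets(ciphertext, num_alphabets)
    return sum(index_of_coincidence(g) for g in groups) / num_alphabets

print(split_into_alphabets("ABCDEFGHIJ", 3))  # ['ADGJ', 'BEH', 'CFI']
```

In practice you would loop `average_ioc` over, say, 1 to 12 alphabets on the real ciphertext and look for the number that pushes the average up towards 0.065.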
When this was completed on 10B, the best IoC average was for 7 alphabets: it came out as 0.068, i.e. around that of typical English.
So this suggested there were 7 alphabets being used. There were other clues for this as well: there were 7 different single-letter words in the ciphertext (T, G, Y, C etc.). This also suggested (though not conclusively) that 7 alphabets were used – as it turned out, they were the seven alphabets’ different ways of encoding the word “a” (there were no “I”s in the plaintext on this occasion). Also, if your ciphertext is long enough, there is a chance that the same plaintext will be encoded in the same way in different places, even if several alphabets have been used. In the second paragraph of the ciphertext, it is easy to spot the word ‘JQ’, which appears three times in a couple of lines (lines 4-5 as it appears on the website). If you count the number of characters (not including spaces or punctuation) between the repeating ‘JQ’s, then, taking the ‘J’ of the first ‘JQ’ as position 1, the ‘J’ of the second is at position 29, and the ‘J’ of the third is at position 71. The gaps between these positions are 28 (first to second), 42 (second to third) and 70 (first to third). As well as 2, 7 is a common factor of all of these. When the message is deciphered, it will be seen that all three of these ‘JQ’s represented the English word ‘OF’, and, by chance, they were all encoded with the same two alphabets from the seven used. There are other ‘OF’s in the ciphertext (e.g. ‘FG’ near the end of the first line will also turn out to be an ‘OF’) but they were encoded with two different alphabets from the set of seven. The word ‘JQ’ appears 8 further times in the ciphertext, and the occurrences all lie at multiples of 7 from each other. The last one appears 3402 characters from the very first one in paragraph 2, and 3402 = 486 x 7.
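This repeated-bigram trick (often called the Kasiski examination) can be sketched in a few lines of Python. The positions below are the ones quoted above for the three ‘JQ’s; the function names are my own:

```python
from functools import reduce
from math import gcd

def bigram_positions(text, bigram):
    """1-based start positions of every occurrence of a bigram,
    counting letters only (spaces and punctuation stripped)."""
    t = "".join(c for c in text.upper() if c.isalpha())
    return [i + 1 for i in range(len(t) - 1) if t[i:i + 2] == bigram]

# Positions from the article: the 'J' of each 'JQ' at 1, 29 and 71.
positions = [1, 29, 71]
gaps = [b - a for a, b in zip(positions, positions[1:])]  # [28, 42]

# The greatest common divisor of the gaps is a strong candidate
# (or multiple of a candidate) for the number of alphabets.
print(reduce(gcd, gaps))  # 14, whose factors include both 2 and 7
```

The gcd here is 14, so on its own this narrows the key length to a factor of 14; combined with the IoC evidence, 7 is the obvious choice.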
So far, all of the above would be the same first steps in cracking a Vigenère cipher, but, despite there being many Vigenère ciphers already in this competition, this was not one. A Vigenère basically applies a different Caesar shift to each of its alphabets, and solving one is relatively straightforward once you have the number of alphabets (or key length) used. You simply do a frequency analysis of the collection of ciphertext letters encoded by each alphabet, and then work out how much you need to shift (à la Caesar) each alphabet so that the most frequently occurring letters become E’s, T’s and A’s, i.e. the 3 most frequently occurring letters in English. This method did not work with 10B, so something else was obviously going on, albeit still with 7 different alphabets. My next guess was that perhaps each of the 7 alphabets was a different substitution cipher. This turned out to be correct, but it took a little while before this was confirmed. I did a frequency analysis of each of the 7 pools of letters to find out the most frequently occurring letter in each group, and, in order of the alphabets used, they were: f, s, i, n, p, o, k. So, if the 2nd or 9th or 16th or 23rd (etc.) ciphertext letter was an ‘S’, then, as ‘S’ was the most frequently occurring letter in the second alphabet’s pool of letters, I would assume that it represented the plaintext letter ‘E’. This told me where all the ‘E’s could be found in the message.
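The per-alphabet frequency count can be sketched like this (a toy illustration, not the actual 10B text; the function name is my own). With period 2 and a ciphertext whose odd positions are mostly ‘F’ and even positions mostly ‘S’, the guess would be that ‘F’ and ‘S’ are the two alphabets’ encodings of ‘E’:

```python
from collections import Counter

def top_letter_per_alphabet(ciphertext, period):
    """For each of the `period` alphabets, return the most frequent
    ciphertext letter in that alphabet's pool - the candidate for 'E'."""
    letters = [c for c in ciphertext.upper() if c.isalpha()]
    tops = []
    for i in range(period):
        pool = letters[i::period]          # every period-th letter
        counts = Counter(pool)
        tops.append(counts.most_common(1)[0][0])
    return tops

# Toy ciphertext: alphabet 1 is dominated by 'F', alphabet 2 by 'S'.
print(top_letter_per_alphabet("FSFSFSXY", 2))  # ['F', 'S']
```

For 10B with period 7 this is what produced the list f, s, i, n, p, o, k above.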
‘T’ is the second most frequently occurring letter in English, so I repeated this process looking for the second most frequently occurring letter in each alphabet group, and I then assumed that it was a ‘T’ in each case. For reference, the seven T’s were (in order of alphabet groups): u, j, x, a, d, n, z (which is really close to a Caesar shift of ‘archive’, but sadly isn’t ☹). Using a spreadsheet, I looked at what this told me about the letters uncovered so far, and here I had a bit of luck: the fourth word of the ciphertext came out as E-T-------E-T. Entering these letters into an online crossword solver [Which is just about within the rules – Harry] revealed that the only possible English words that fit this pattern are:
Words 1 and 4 were unlikely (are they even words???). In fact, words 1, 2 and 4 could be ruled out – they all had additional ‘E’s and/or ‘T’s in them, and I was meant to have found all of them already. The word ‘establishment’ fitted the tone of a cipher challenge message and contained no more ‘E’s and ‘T’s. This then gave me a collection of extra letters that I now knew, so I added these to the spreadsheet. Progress after this was still slow for a while. I had solved the substitution ciphers earlier in the competition manually (with the help of some frequency analysis) as it is satisfying to do so. With those, if I had known all the ‘E’s and ‘T’s and the word ‘establishment’, it would have taken a couple of minutes to finish off the cipher but, with 7 alphabets to contend with, it was as though the impact of each breakthrough was divided by 7. If I discovered an “a”, for example, I was only discovering the “a” for one of the seven alphabets used, and there were still six more “a”s associated with the other alphabets to be discovered. For about 20-30 minutes I cautiously added letters to my ‘known’ collection and, eventually, the message started to take shape in front of my eyes, and progress became more rapid as I filled in the remaining gaps.