Examining the SHA-1 collision files

SHA-1 is a cryptographic hash function. You give it a computer file, and it produces a 160-bit hash that is completely determined by the input file, but not in any obvious way. In early 2017, a group of researchers, using advanced mathematics and 6500 CPU-years of computer searching, found the first ever SHA-1 collision: two different files that have the same hash. They presented this discovery in the form of two PDF files that have the same SHA-1 hash, but are clearly different. One has a blue background, and one has a red background.


But never mind the impressive parts of this accomplishment. What I want to look at is the boring details of how they constructed the non-critical parts of their colliding files. Their website about this (https://shattered.it/) has a diagram that supposedly sheds some light on it, but it’s extremely lacking in detail. It does indicate that the differences are in an embedded JPEG image.

Of course, one should not be surprised that two different PDF files would look different. But to the best of my understanding, the researchers were operating with some pretty serious restrictions. They couldn’t just start with two different files, each with a designated “scratch area”, and have the collision search fill in the scratch areas so as to make the hashes the same. That would have made it easy.

Instead, the files had to be exactly the same except for the scratch area (which I’ll also call the “critical” part), and everything before the scratch area had to be chosen, and set in stone, before the search began.

I downloaded the files, shattered-1.pdf and shattered-2.pdf. (I’ll sometimes refer to them and their respective contents as “File1” and “File2”.) They are both 422435 bytes in size. I’ll confirm that they have the same SHA-1 hash:

$ sha1sum shattered-*.pdf
38762cf7f55934b34d179ae6a4c80cadccbb7f0a shattered-1.pdf
38762cf7f55934b34d179ae6a4c80cadccbb7f0a shattered-2.pdf

Using a different hash function, I confirm that they are different:

$ md5sum shattered-*.pdf
ee4aa52b139d925f8d8884402b0a750c shattered-1.pdf
5bd9d8cabc46041579a311230539b8d1 shattered-2.pdf

To figure out which parts are different, I’ll make a hex dump of each file, and compare them:

$ od -A d -t x1 -v shattered-1.pdf > hexdump1.txt
$ od -A d -t x1 -v shattered-2.pdf > hexdump2.txt
$ diff hexdump1.txt hexdump2.txt
< 0000192 73 46 dc 91 66 b6 7e 11 8f 02 9a b6 21 b2 56 0f
< 0000208 f9 ca 67 cc a8 c7 f8 5b a8 4c 79 03 0c 2b 3d e2
< 0000224 18 f8 6d b3 a9 09 01 d5 df 45 c1 4f 26 fe df b3
< 0000240 dc 38 e9 6a c2 2f e7 bd 72 8f 0e 45 bc e0 46 d2
< 0000256 3c 57 0f eb 14 13 98 bb 55 2e f5 a0 a8 2b e3 31
< 0000272 fe a4 80 37 b8 b5 d7 1f 0e 33 2e df 93 ac 35 00
< 0000288 eb 4d dc 0d ec c1 a8 64 79 0c 78 2c 76 21 56 60
< 0000304 dd 30 97 91 d0 6b d0 af 3f 98 cd a4 bc 46 29 b1
> 0000192 7f 46 dc 93 a6 b6 7e 01 3b 02 9a aa 1d b2 56 0b
> 0000208 45 ca 67 d6 88 c7 f8 4b 8c 4c 79 1f e0 2b 3d f6
> 0000224 14 f8 6d b1 69 09 01 c5 6b 45 c1 53 0a fe df b7
> 0000240 60 38 e9 72 72 2f e7 ad 72 8f 0e 49 04 e0 46 c2
> 0000256 30 57 0f e9 d4 13 98 ab e1 2e f5 bc 94 2b e3 35
> 0000272 42 a4 80 2d 98 b5 d7 0f 2a 33 2e c3 7f ac 35 14
> 0000288 e7 4d dc 0f 2c c1 a8 74 cd 0c 78 30 5a 21 56 64
> 0000304 61 30 97 89 60 6b d0 bf 3f 98 cd a8 04 46 29 a1

This show me that the files only differ in the 128 bytes from offset 192 to offset 319. (All offsets in this post are zero based. The first byte is byte number 0, not byte number 1.)

I’m a little surprised that this critical part starts at byte 192, which is not a multiple of the 160-bit (20 byte) block size of SHA-1. But it obviously works, so what do I know? The critical part does end at a block boundary. [Edit: The SHA-1 block size is actually 128 bits. I was aware that the hash and block sizes do not have to be the same, but I mistakenly thought they were the same for SHA-1.]

Given how most hash functions work, any bytes after the last differing block do not matter, as long as they are the same in both files. For example, the first 320 bytes of the files by themselves should have the same SHA-1 hash as each other (though not the same hash as the full files). I’ll verify that:

$ head -c 320 shattered-1.pdf > head1.tmp
$ head -c 320 shattered-2.pdf > head2.tmp
$ sha1sum head1.tmp head2.tmp
f92d74e3874587aaf443d1db961d4e26dde13e9c head1.tmp
f92d74e3874587aaf443d1db961d4e26dde13e9c head2.tmp


Now I’ll use my “Deark” utility to scan the original files for things that look like embedded JPEG files:

$ deark -m jpegscan -o file1 -d shattered-1.pdf
DEBUG: Found possible JPEG file at 149
DEBUG: length=421385
Writing file1.000.jpg
$ deark -m jpegscan -o file2 -d shattered-2.pdf
DEBUG: Found possible JPEG file at 149
DEBUG: length=186896
Writing file2.000.jpg

Fortunately, that seems to have worked. Both PDF files have a JPEG file embedded at offset 149, which is before the start of the critical data at 192. In both cases, the JPEG content continues long past the end of the critical data. So from here on, we can just look at the JPEG files, and ignore the PDF container.

Recall that the differing bytes are at offset 192-319 in the PDF file. Subtract 149, so in the JPEG file they are at offset 43-170.

It is interesting that the extracted JPEG files are such different sizes, 421385 bytes versus 186896. JPEG format does not have a field that tells the file size. A JPEG decoder just keeps reading until it find a special end-of-file marker. After the first 171 bytes, the files are identical. Yet, somehow, my JPEG scanner found end-of-file markers in very different places, hundreds of thousands of bytes later.

What I’m thinking is that the files are doing some “frame shift” trickery. The identical parts of the JPEG files are contrived to have two different valid paths that a JPEG parser can follow, leapfrogging each other through the file. Some difference near the beginning of the files selects the path that will be followed all the way to the end of the file.

Now I’ll examine the files using Deark:

$ deark -d file1.000.jpg
DEBUG: marker 0xd8 (SOI: Start of image) at 0
DEBUG: segment 0xfe (COM: Comment) at 2, dpos=6, dlen=34
DEBUG:  comment: "SHA-1 is dead!!!!!<85>/<EC><09>#9u<9C>9<B1>..."
DEBUG: segment 0xfe (COM: Comment) at 40, dpos=44, dlen=369
DEBUG:  comment: "F<DC><91>f<B6>~<11><8F><02><9A><B6>..."
DEBUG: segment 0xfe (COM: Comment) at 413, dpos=417, dlen=250
$ deark -d file2.000.jpg
DEBUG: Input file: file2.000.jpg
Module: jpeg
DEBUG: marker 0xd8 (SOI: Start of image) at 0
DEBUG: segment 0xfe (COM: Comment) at 2, dpos=6, dlen=34
DEBUG:  comment: "SHA-1 is dead!!!!!<85>/<EC><09>#9u<9C>9<B1>..."
DEBUG: segment 0xfe (COM: Comment) at 40, dpos=44, dlen=381
DEBUG:  comment: "F<DC><93><A6><B6>~<01>;<02><9A><AA>..."
DEBUG: segment 0xe0 (APP0) at 425, dpos=429, dlen=14
DEBUG:  app id: "JFIF", identified as: JFIF

A JPEG file can contain one or more “comment” segments, containing arbitrary data that is usually ignored. Both files have a comment segment at offset 2 through 39. This has to be the same in both files, since the files are identical until byte 43. This comment ends with 16 bytes of seemingly random data. I assume this is an extra scratch area of some sort, to give the collision search algorithm more places to start, or something like that.

This is followed by a second comment, and this second comment is the key to everything. The comment length is given by a 2-byte integer field, the high byte of which is set to 1. The low byte is set to… Well, it’s exactly here, in the middle of a comment length field, that we cross the border into the bytes that differ between the two files.

The differing bytes, remember, were found by a computer search, and were not really under the researchers’ control. The researchers had to live with them, whatever they turned out to be.

The high byte of the comment length was chosen to be 1, so that the comments’ lengths would be between about 256 and 512 bytes. So, no matter what the low bytes turned out to be, each comment would be big enough to contain the remaining 127 critical bytes.

The low bytes are the first bytes of the hex dumps above: 0x73 and 0x7f. Combined with the high byte, this gives length fields of 0x173 (371) and 0x17f (383). (For technical reasons, the actual comment length is two bytes less than this value.)

A parser of File2, seeing the 0x17f comment, finds normal JPEG data after it. But a parser of File1, seeing the 0x173 comment, finds another comment after it, containing what looks suspiciously like real JPEG data. File1 “comments out” the live JPEG data in File2. A File1 parser does eventually find real JPEG image data, after a large amount of comment data, right after what a File2 parser sees as the end of the file.

That would pretty much be the end of the story, but there’s a snag: JPEG’s annoying 64KB segment length limit. A comment segment can be at most 64KB, but the image the researchers chose to use is quite a bit larger than 64KB. It won’t fit into a comment segment. They couldn’t just use multiple comments, because the extra File1 comment headers would inevitably corrupt File2’s image data. How, then, did they safely comment out File2’s large image in File1?

What they did was to encode File2 as a progressive JPEG. Progressive JPEG is a standard type of JPEG in which the image data is divided up into multiple “scans”, which in this case are all less than 64KB. You can’t put comments in the middle of a scan, but you can put them between two scans. File1 has big comments, each containing one File2 scan. Between the scans in File2 are little comments that comment out the headers of the big comments in File1.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s