About that JPEG/ZIP/Shakespeare hybrid file

The other day, a Twitter user (David Buchanan, @David3141593) posted a message that gained some attention. It has an attached JPEG file, with an image of William Shakespeare. If you save a copy of the JPEG file, and unzip it as if it were a ZIP file, it unzips into the complete works of Shakespeare.

Here’s what the image looks like. I stripped out the Shakespeare stuff, because I don’t want everyone who reads this post to have to download it.

[Image: shakespearezip]

He correctly claimed that the hidden content “survives all of Twitter’s scaling, compression, and thumbnailing”. He also said “/how/ is left as an exercise to the reader”.

Okay, I’ll give it a shot.

Preliminary thoughts

How impressive is this? Creating a hybrid ZIP file can be quite easy. To make a ZIP file, you only need to control some data near the end of the file. A lot of formats, including JPEG, will still work fine if you append random junk to the end of them. So you can pretty much just concatenate JPEG and ZIP data together, and get a hybrid file that works as either format. But there are a couple of problems:

  • The file offsets contained in the ZIP portion of the file will be wrong, unless you adjust them. Some unzip programs, some of the time, will still work if you don’t adjust them, but others will not. I discussed this very issue in a previous post.
  • Twitter and other JPEG scaling/sanitizing/whatever processes are highly unlikely to retain random junk at the end of a JPEG file.
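To make the naive approach concrete, here is a minimal Python sketch that concatenates some stand-in JPEG bytes with a small in-memory ZIP archive, and then reads the result back as a ZIP file. The helper name make_hybrid, the fake JPEG bytes, and the member name are all made up for illustration. Python's zipfile module happens to be one of the readers that tolerates unadjusted offsets, so this shows the lenient case, not the strict one.

```python
import io
import zipfile


def make_hybrid(jpeg_bytes: bytes, zip_bytes: bytes) -> bytes:
    """Naively append ZIP data to JPEG data, without fixing any ZIP offsets."""
    return jpeg_bytes + zip_bytes


# Stand-in for a real JPEG: just the SOI and EOI markers (not a viewable image).
fake_jpeg = b"\xff\xd8" + b"\xff\xd9"

# Build a tiny ZIP archive in memory, with one uncompressed member.
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("sonnet18.txt", "Shall I compare thee to a summer's day?")

hybrid = make_hybrid(fake_jpeg, zip_buffer.getvalue())

# The offsets stored in the ZIP part are now too small by len(fake_jpeg) bytes.
# zipfile notices the discrepancy and compensates for it when reading.
with zipfile.ZipFile(io.BytesIO(hybrid)) as zf:
    names = zf.namelist()
    text = zf.read("sonnet18.txt").decode("ascii")
```

A stricter reader could reject the same bytes, which is the first problem in the list above.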

To survive Twitter’s filtering, the custom data needs to be inside the main part of the JPEG file, not appended to it. And that could still be easy, except:

  • Twitter might strip it out anyway, if it isn’t on some Twitter whitelist of kinds-of-data-to-keep.
  • A ZIP signature must appear “near” the end of the file, but the last part of the JPEG file is used for the (post-processed) JPEG image data, which you don’t really control. The definition of “near” depends on what unzip utility will be used, which you also don’t really control. Long story short, this sort of thing will work better with small images, say smaller than about 64KB, so that the ZIP signature can appear before the image, while still being near enough to the end of the file.
  • Due to bad design by the JPEG format inventors, a JPEG metadata segment cannot be larger than about 64KB. You can have multiple segments, but there will necessarily be gaps between the payload data of the different segments. That’s a problem, because Shakespeare wrote more than 64KB worth of stuff, and because within a ZIP file, a member file cannot have any gaps in its compressed data representation. JPEG requires gaps, while ZIP forbids them. This conflict is really what makes this hybrid file interesting.

How well does it work?

I saved a copy of the file, and got something named DqteCf6WsAAhqwV.jpg. In this post, I’m going to include file sizes and MD5 hashes, so that anyone following along can verify that we’re on the same page. So:

DqteCf6WsAAhqwV.jpg size=2033782 md5=0e7dd647d856a3fcd89c480dabbf00e5
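If you want to check your own copy, something like the following Python helper will compute both values. The function name size_and_md5 is my own invention for this sketch, not part of any tool mentioned in this post.

```python
import hashlib
import os


def size_and_md5(path: str) -> tuple:
    """Return (size in bytes, hex MD5 digest) of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so that large files need not fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()
```

Calling size_and_md5("DqteCf6WsAAhqwV.jpg") on a byte-identical download should reproduce the size and hash shown above.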

Using the standard Info-ZIP “unzip” utility:

$ unzip DqteCf6WsAAhqwV.jpg
Archive: DqteCf6WsAAhqwV.jpg
error [DqteCf6WsAAhqwV.jpg]: missing 454 bytes in zipfile
(attempting to process anyway)
extracting: shakespeare.part001.rar
extracting: shakespeare.part002.rar
extracting: shakespeare.part003.rar
...
extracting: shakespeare.part030.rar
extracting: shakespeare.part031.rar

It reports an error, then extracts 31 “.rar” files, apparently successfully. See my previous post on corrupted ZIP files for an explanation of the error.

shakespeare.part001.rar size=64512 md5=dc5b45b3a6c6f5da240d490d612d65bf
shakespeare.part002.rar size=64512 md5=2b0ada8e72264f2a34167e1bf3a655f8
shakespeare.part003.rar size=64512 md5=2513d5e3b5e44366bd00404599577a27
...
shakespeare.part030.rar size=64512 md5=551c3aa0b3405f198adfdb56a40a5705
shakespeare.part031.rar size=2837  md5=53e7ebacca8ad52bf02bc1f19eebf002

So far, it works, though not as well as one might hope. Not all unzip programs can handle it.

The .rar files are in RAR (Roshal Archive) format, which is a fairly popular compressed file format in some circles, but it’s not one that I know all that much about. It is supported by the free 7-Zip software, but 7-Zip apparently does not support this particular flavor of RAR file.

I then tried the official unrar utility on the first RAR file. It found all the other files, and combined and decompressed them…

$ unrar e shakespeare.part001.rar
Extracting from shakespeare.part001.rar
Extracting  shakespeare.html                     3%
Extracting from shakespeare.part002.rar
...         shakespeare.html                     6%

...

Extracting from shakespeare.part030.rar
...         shakespeare.html                    99%
Extracting from shakespeare.part031.rar
...         shakespeare.html                    OK 
All OK

…into a single file named shakespeare.html.

shakespeare.html size=7033657 md5=d070e9815caad2d8b66491520775d5b7

This HTML file does appear to be a valid file containing the complete works of Shakespeare.

Preliminary analysis

My own file analysis utility, named Deark, is, in my very humble opinion, a good place to start with this kind of analysis. (So it turns out that this post is really just an advertisement in disguise. Sorry.) I’ll run it on the .jpg file, and see what happens.

$ deark DqteCf6WsAAhqwV.jpg
Module: jpeg
Writing output.000.icc
Format: JPEG/JFIF

It thinks it’s a JPEG file, and it extracts an ICC profile. An ICC profile is a file or file component that defines in detail how to interpret an image’s colors.

output.000.icc size=2031150 md5=0e52895d87fa76431192fb9aeb18a502

I can use the “-d” option to get more details about the JPEG file’s layout, and it appears to be quite normal. Deark would tell me if there were extraneous data at the end of the file, and there isn’t.

Now I’ll have Deark interpret it as a ZIP file. (The -l option is to prevent it from creating output files. While Deark can decompress some ZIP files, including this one, it’s more useful for analyzing them.)

$ deark -m zip -l DqteCf6WsAAhqwV.jpg
Module: zip
Format: ZIP
Warning: Inconsistent central directory offset. Reported to be 1969492, but based on its reported size, it should be 1969038.
Warning: Local file header found at 182 instead of 636. Assuming offsets are wrong by -454 bytes.
output.000.shakespeare.part001.rar
output.001.shakespeare.part002.rar
output.002.shakespeare.part003.rar
...
output.029.shakespeare.part030.rar
output.030.shakespeare.part031.rar

This mainly just confirms what we already learned from the unzip utility.

In-depth analysis

ICC profiles are always stored uncompressed and otherwise unencoded in JPEG files, so the 2031150 bytes of ICC profile data really do take up 99.87% of the 2033782 bytes in this file — though note that these bytes cannot all be contiguous. The actual JPEG image is in the remaining 0.13%. This gives us a rather large clue as to where Shakespeare’s works are.

Every ZIP file contains the signature byte sequence PK\5\6, normally very close to the end of the file. In this file it can be found at offset 1971177, which is an uncommonly large 62605 bytes from the end of the file.

[DqteCf6WsAAhqwV.jpg]
1971168 72 74 30 33 31 2e 72 61 72 50 4b 05 06 00 00 00  >rt031.rarPK.....<
1971184 00 1f 00 1f 00 5b 08 00 00 54 0d 1e 00 00 00 41  >.....[...T.....A<
1971200 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41  >AAAAAAAAAAAAAAAA<

This data also makes up part of the ICC profile we extracted (at an offset that we don’t care about, because the reconstructed ICC profile is never interpreted as a ZIP file).
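The 22 bytes beginning at PK\5\6 in the hex dump form the ZIP "end of central directory" record. As a sketch, here is how they can be decoded in Python using the standard field layout of that record. The bytes are copied from the dump; the variable names are mine.

```python
import struct

# The 22 bytes starting at file offset 1971177, copied from the hex dump.
EOCD_OFFSET = 1971177
eocd = bytes.fromhex(
    "504b0506"   # signature "PK\5\6"
    "0000"       # number of this disk
    "0000"       # disk where the central directory starts
    "1f00"       # central directory entries on this disk
    "1f00"       # central directory entries in total
    "5b080000"   # size of the central directory, in bytes
    "540d1e00"   # offset of the central directory, as recorded in the file
    "0000"       # length of the archive comment
)

# All fields are little-endian.
(signature, disk_no, cd_disk, entries_here, entries_total,
 cd_size, cd_offset_recorded, comment_len) = struct.unpack("<4sHHHHIIH", eocd)

# The central directory normally ends right where this record begins,
# so its real position can be inferred from its size.
cd_offset_actual = EOCD_OFFSET - cd_size
offset_error = cd_offset_recorded - cd_offset_actual
```

The record lists 31 entries, matching the 31 RAR parts, and offset_error works out to the same 454 bytes that both unzip and Deark complained about.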

The ICC profile specification defines a standard way to split up an ICC profile and store it in multiple JPEG segments, to get around the 64KB size limit. There is still a maximum size of about 16MB, but that’s easily enough for this particular project.

Deark (with -d option) shows me that the ICC profile is split up into 32 segments, in the standard way. Each one, except the last, is the maximum size possible for such a segment. I do not know whether Twitter’s processing ever changes the size of these segments. If an ICC profile segment size were to be changed, the secret ZIP data would be corrupted. Making them the maximum allowed size gives them the best chance of surviving unmolested.
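For reference, the standard splitting scheme looks roughly like this in Python. This is a sketch of the general layout (an APP2 marker, a length field, the "ICC_PROFILE" identifier, a sequence number, and a segment count), not a reconstruction of whatever tool the creator of this file actually used. The function name icc_to_app2_segments is hypothetical, and the all-zero "profile" is just a dummy of the same size.

```python
import struct

ICC_MARKER = b"\xff\xe2"       # JPEG APP2 marker
ICC_ID = b"ICC_PROFILE\x00"    # 12-byte identifier, including the NUL
# Largest length-field value, minus the length field itself, the identifier,
# and the two one-byte counters: 65519 payload bytes per segment.
MAX_CHUNK = 65535 - 2 - len(ICC_ID) - 2


def icc_to_app2_segments(profile: bytes) -> list:
    """Split an ICC profile into JPEG APP2 segments (at most 255 of them)."""
    chunks = [profile[i:i + MAX_CHUNK] for i in range(0, len(profile), MAX_CHUNK)]
    if not 1 <= len(chunks) <= 255:
        raise ValueError("profile is empty or too large to embed")
    segments = []
    for seq, chunk in enumerate(chunks, start=1):
        # The 2-byte length field counts itself, but not the marker.
        length = 2 + len(ICC_ID) + 2 + len(chunk)
        header = (ICC_MARKER + struct.pack(">H", length)
                  + ICC_ID + bytes([seq, len(chunks)]))
        segments.append(header + chunk)
    return segments


# A dummy profile with the same size as the one extracted from the tweet.
segments = icc_to_app2_segments(bytes(2031150))
```

With a 2031150-byte profile, this produces 32 segments, all but the last being 65537 bytes long, which agrees with what Deark reports for the real file.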

My pretty-well-confirmed-at-this-point theory is that the creator compressed the works of Shakespeare as a multi-part RAR archive, with each part being small enough to store in a JPEG ICC profile fragment. He then carefully calculated where to put each part, so that the result would be valid and seemingly normal as a JPEG file.

While there are certainly other kinds of JPEG segments that could be used for this sort of thing, ICC profile seems like a good choice. It’s standard, it works around the 64KB size limit, it’s easy to store arbitrary data inside it, and image processing systems often consider it important enough to copy it unchanged.

When interpreted as an actual ICC profile, the contents of this ICC profile shouldn’t matter too much, provided they are sane enough not to cause the Bard’s skin to turn purple or something. Examining it:

$ deark -d output.000.icc
DEBUG: Input file: output.000.icc
Module: iccprofile
Note: ICC profiles can be parsed, but no files can be extracted from them.
DEBUG: header at 0
DEBUG:  profile size: 2031150
DEBUG:  preferred CMM type: 0x3c212d2d='<!--'
DEBUG:  profile version: 4.0.0
DEBUG:  profile/device class: 0x6d6e7472='mntr'
DEBUG:  colour space: 0x52474220='RGB '
DEBUG:  PCS: 0x58595a20='XYZ '
DEBUG:  file signature: 0x61637370='acsp'
DEBUG:  primary platform: 0x00000000=(none)
DEBUG:  device manufacturer: 0x00000000=(none)
DEBUG:  device model: 0x00000000=(none)
DEBUG:  rendering intent: 0 (perceptual)
DEBUG:  profile creator: 0x00000000=(none)
DEBUG: tag table at 128
DEBUG:  number of tags: 1
DEBUG:  expected start of data segment: 144
DEBUG:  tag #0 'lmao' (?) offs=144 dlen=2031006
DEBUG:   data type: 'PK<03><04>' (?)
No files found to extract!

The ICC profile header looks fairly normal (for all I know), but typically there will be a number of “tags” containing color translation formulas and whatnot. This profile has no standard tags. It just has one tag, whose logical type is “lmao” (hmm…), and whose structural type is alleged to be “PK\3\4”, both of which Deark is unfamiliar with. It has about 2MB of data associated with it. PK\3\4 is the signature found at the beginning of most ZIP files, so apparently we’ve found some ZIP data.

The CMM type is odd, but it is not relevant to this analysis.
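If you don't have Deark handy, the handful of fields that matter here are easy to read by hand. The Python sketch below parses the start of the ICC header and the tag table, using the standard field positions (big-endian, 128-byte header, 12-byte tag entries). The function name parse_icc_summary is made up, and the sample profile is a tiny stand-in I constructed to mimic the values in the Deark output: its tag data is cut down to 4 bytes instead of about 2MB.

```python
import struct


def parse_icc_summary(profile: bytes) -> dict:
    """Read a few header fields and the tag table of an ICC profile."""
    size, cmm, _version, dev_class, colour_space, pcs = struct.unpack_from(
        ">I4sI4s4s4s", profile, 0)
    (tag_count,) = struct.unpack_from(">I", profile, 128)
    tags = []
    for i in range(tag_count):
        # Each tag table entry: 4-byte signature, 4-byte offset, 4-byte length.
        tag_sig, offs, dlen = struct.unpack_from(">4sII", profile, 132 + 12 * i)
        data_type = profile[offs:offs + 4]   # first 4 bytes of the tag data
        tags.append((tag_sig, offs, dlen, data_type))
    return {
        "size": size, "cmm": cmm, "class": dev_class,
        "colour_space": colour_space, "pcs": pcs,
        "signature": profile[36:40], "tag_count": tag_count, "tags": tags,
    }


# A shrunken stand-in profile (148 bytes), built to resemble the real one.
header = struct.pack(">I4sI4s4s4s", 148, b"<!--", 0x04000000,
                     b"mntr", b"RGB ", b"XYZ ")
header = header.ljust(36, b"\x00") + b"acsp"
header = header.ljust(128, b"\x00")
tag_table = struct.pack(">I4sII", 1, b"lmao", 144, 4)
sample = header + tag_table + b"PK\x03\x04"

info = parse_icc_summary(sample)
```

The point is that the single tag's data begins at offset 144, immediately after the tag table, and that its first four bytes are a ZIP local file header signature.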

There are 32 ICC profile segments, but only 31 RAR files. What is the 32nd segment for? It stands to reason that it’s for the ZIP “central directory”, which is the 32nd and final piece of the ZIP puzzle that needs to be present. It stands to reason, but it’s not true. The central directory is actually in the middle of the 31st segment, right after the end of the RAR data. Everything after it, and everything in the 32nd segment, is unused padding. I doubt that there was any important strategy behind this.

Putting it together

At the lowest level, each ICC profile segment (except the last one) is 65537 bytes in size, which is the maximum size possible for a JPEG segment. Breaking it down:

  • Start with 65537 bytes, the total size of each JPEG-ICC application segment. Minus 4 bytes of low-level JPEG segment headers, leaves…
  • 65533 bytes of application data in the JPEG-ICC segment. Minus some headers that identify the segment as an ICC profile fragment, and help to properly interpret it, leaves…
  • 65519 bytes of ICC profile (fragment) payload data. Minus some space for the ICC profile header and tag table (only needed in the first segment, but reserved in all of them), leaves…
  • 64565 bytes of ZIP member (header + “compressed” file contents) data. Minus the ZIP member local file header (signature, file attributes, RAR filename, etc.), leaves…
  • 64512 bytes for the “compressed” ZIP member file contents. It is not actually compressed at this layer, so we’re left with the same…
  • 64512 bytes for the contents of the RAR “part” file.
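The same breakdown can be written out as arithmetic. In this Python sketch, the 4-byte and 14-byte header sizes come from the JPEG and ICC-in-JPEG layouts, and the ZIP local file header is taken to be its 30-byte fixed part plus the filename, assuming no "extra field". The per-segment reserved space is not something I measured directly here; it is simply inferred by subtracting the figures in the list above.

```python
JPEG_SEGMENT_TOTAL = 65537    # marker (2) + largest length-field value (65535)
JPEG_SEGMENT_HEADER = 4       # marker (2) + length field (2)
ICC_FRAGMENT_HEADER = 14      # "ICC_PROFILE\0" (12) + sequence no. (1) + count (1)

# Space set aside in each segment for the ICC header, tag table, and padding.
# Inferred by subtraction from the figures in the list, not measured directly.
RESERVED_PER_SEGMENT = 65519 - 64565

# Fixed part of a ZIP local file header, plus the filename.
# Assumes there is no "extra field".
ZIP_LOCAL_HEADER = 30 + len("shakespeare.part001.rar")

app_data = JPEG_SEGMENT_TOTAL - JPEG_SEGMENT_HEADER
icc_payload = app_data - ICC_FRAGMENT_HEADER
zip_member = icc_payload - RESERVED_PER_SEGMENT
rar_part = zip_member - ZIP_LOCAL_HEADER

print(app_data, icc_payload, zip_member, rar_part)
```

The final value agrees with the 64512-byte size of each full RAR part listed earlier, which is a decent sanity check on the no-extra-field assumption.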

Incidentally, Deark does not yet know anything about RAR format, beyond how to recognize it:

$ deark shakespeare.part001.rar
Module: unsupported
Error: This looks like a RAR archive, which is not a supported format.

That’s about all I have to say about the Shakespeare file. I know this write-up might not be as well edited as it could have been. But I hope it helps explain what is and isn’t impressive about such a feat.
