Being fooled by randomness
This is fine
"The data stream is compressed," said my collaborator. Your application can read a block of compressed data, do whatever with it, and then read the next block.
"Sounds great," I replied. "What's the compression method?"
"zlib," they answered. "There's a block of compressed data, a separator sequence and then another block, another separator, and so on."
"What's the separator sequence?"
"The ASCII sequence SEPSEPSEP," they said.
Something tickled in the back of my mind. "Is that a good idea?" I asked. "What if the separator shows up in the data?"
"That's pretty unlikely, right? What's the probability of that string occurring in real binary data?"
I thought for a moment. "I think we can estimate this, actually. Well-compressed data is basically random, so isn't it just a 1 in 256 chance for each byte?"
"That sounds right."
"So. 256 to the ninth power, or 2^72?" I stated.
"Pretty unlikely!"
"I guess you're right," I admitted. "It seems wrong to rely on that, though."
"We can change it if you really want," they said. "But it would be a pain to do."
"No, a one in 256^9 chance is works out to once every billion years for the volume we're considering," I said. "We'll just use what you've got."
A billion years later
"Hey," I said. "I think your data stream producer is broken."
"What do you mean? It's pretty simple," my collaborator responded, a little defensively. "Here it is in essence."
>>> import zlib
...
... SEPARATOR = b'SEPSEPSEP'
...
... message_1 = b'Four score and seven years ago'
... message_2 = b'our fathers brought forth on this continent'
... message_3 = b'a new nation, conceived in Liberty'
...
... chunk = b''.join(
...     (
...         zlib.compress(message_1),
...         SEPARATOR,
...         zlib.compress(message_2),
...         SEPARATOR,
...         zlib.compress(message_3),
...     )
... )
... send_to_stream(chunk)
"That's about what I expected," I replied. "But I keep getting corrupt messages."
"Well, what does your consumer application look like?"
"Here it is," I said, now defensive myself:
>>> import zlib
... SEPARATOR = b'SEPSEPSEP'
...
... data = get_stream(3)
... compressed_parts = data.split(SEPARATOR)
... [zlib.decompress(p) for p in compressed_parts]
[
b'Four score and seven years ago',
b'our fathers brought forth on this continent',
b'a new nation, conceived in Liberty',
]
"That looks right. What's the problem?"
"Well, those messages were fine. But I keep getting ones that fail to decompress. Here's the hex string for such a message."
534550789c1dcad10980300c04d0556e02773a9a0303
31a96ddc5ff1f3c1631a4ce6832d4317fa14e6aa59db
db2b3fb3c1085c4a700963e9bfba1fc6f10200371775
"Uh, look those first few bytes. 53 45 50, isn't that... S E P in ASCII?"
"Yeah, you're right. I knew it! The separator is appearing in the data! Didn't I warn you about this?"
"Yeah, but then you also said it would only happen once every billion years. That was... yesterday."
"I think we, uh, miscalculated."
Everything is obvious (once you know the answer)
So what went wrong? It's clear in retrospect. We were expecting the data stream to look like this:
first_message | SEP SEP SEP | next_message
Because each message is compressed, and therefore effectively random, we were right that a particular string of nine bytes was unlikely to occur. However, we failed to consider the case where a message ended in the three-byte string SEP. What we would want is this:
...SEP | SEP SEP SEP | next_message
But what we would get is this:
... | SEP SEP SEP | SEP next_message
That is, we'd split too early, and truncate the first message. The next message would then have a spare SEP prepended to it, so it would be corrupt too.
What we thought was a one-in-256^9 chance of corruption was actually more like one in 256^3: a message merely needed to end with the first three bytes of the separator. Our billion years turned into a few dozen minutes. Whoops.
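You can reproduce the failure without waiting for an unlucky compressed message. Here is a minimal sketch with hand-made byte strings standing in for compressed blocks (the hex bytes are made up; all that matters is that the first block happens to end in the bytes SEP):
import string

SEPARATOR = b'SEPSEPSEP'

# Stand-ins for compressed blocks; the first happens to end in b'SEP'.
first_block = b'\x78\x9c\x01\x02SEP'
second_block = b'\x78\x9c\x03\x04'

stream = first_block + SEPARATOR + second_block
parts = stream.split(SEPARATOR)

print(parts[0])  # the first block, minus its trailing b'SEP'
print(parts[1])  # the second block, with a stray b'SEP' prepended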
Do the right thing
What would have been a better way to produce the compressed stream? There are a number of cromulent answers.
One method would be to take advantage of the fact that the zlib format looks like this:
header | DEFLATE block | DEFLATE block | last DEFLATE block | checksum
Since you can tell which DEFLATE block is the last one (see the DEFLATE stream format), you know when you're at the end of a message. There's no need for separators at all!
A simplified reader based on this method is below:
from io import BytesIO
from zlib import decompressobj

def decompress(chunk, SEGMENT_SIZE=8):
    result = bytearray()
    with BytesIO(chunk) as f:
        decompressor = decompressobj()
        while True:
            # If there's leftover data from the previous zlib stream, grab it
            # and start a fresh decompressor for the next stream.
            segment = decompressor.unused_data
            if segment:
                decompressor = decompressobj()
            # Otherwise, read from the "file" and break if we're at the end.
            else:
                segment = f.read(SEGMENT_SIZE)
                if not segment:
                    break
            # Decompress what we read.
            result += decompressor.decompress(segment)
    return result
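As a quick sanity check, feeding it two concatenated zlib streams should give back the concatenated messages (note that the boundaries between messages are gone; a fuller reader would collect one result per stream):
import zlib

chunk = zlib.compress(b'Four score ') + zlib.compress(b'and seven years ago')
decompress(chunk)  # bytearray(b'Four score and seven years ago')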
Another would be to use gzip instead of plain zlib. It adds some framing on top of zlib, in particular a header and trailer. Most gzip tools allow you to just concatenate compressed blocks together:
$ cd /tmp/
$ echo "First" > first.txt
$ echo "Second" > second.txt
$ gzip first.txt
$ gzip second.txt
$ cat first.txt.gz second.txt.gz > combined.txt.gz
$ gunzip combined.txt.gz
$ cat combined.txt
First
Second
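The same trick works from Python, since the gzip module's reader handles concatenated members. A rough sketch:
import gzip

# Producer side: each message becomes a self-contained gzip member.
stream = gzip.compress(b'First\n') + gzip.compress(b'Second\n')

# Consumer side: concatenated members decompress as one stream.
print(gzip.decompress(stream))  # b'First\nSecond\n'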
A third would be to prefix each compressed block with its length. Then the reader knows exactly how much to read.
from io import BytesIO
from struct import Struct
from zlib import decompress

s = Struct('!Q')  # Sizes stored as unsigned 64-bit integers

result = bytearray()
# `stream` holds the raw bytes received from the producer.
with BytesIO(stream) as f:
    while True:
        size_bytes = f.read(s.size)
        if len(size_bytes) < s.size:
            break
        read_size, = s.unpack(size_bytes)
        compressed_block = f.read(read_size)
        if len(compressed_block) < read_size:
            break
        result += decompress(compressed_block)
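For completeness, here's a sketch of the matching producer, reusing the message variables and the hypothetical send_to_stream from the earlier example:
from struct import Struct
from zlib import compress

s = Struct('!Q')  # Must match the reader's length format

def frame(messages):
    parts = []
    for message in messages:
        block = compress(message)
        # Write the length first, then the compressed block itself.
        parts.append(s.pack(len(block)))
        parts.append(block)
    return b''.join(parts)

send_to_stream(frame((message_1, message_2, message_3)))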
In the end
The lesson here is this: never believe anyone when they say "the bad thing probably won't happen."