FTP, "Apr 26 02:xx", and you
Most of us who work with computers have stories of times we've worked on an utterly baffling problem. This is one of mine.
If you found this page trying to research a problem you're having with an FTP client failing: skip to the end - I figured it out!
The affliction
For a while I've been maintaining a machine that, among other things, receives collections of files from various sources, processes them, and makes them available for another system (which we'll call "Heiress") to download.
The machine I maintain hosts an FTP server that Heiress is supposed to poll for new files, download those files when it sees them, and then delete them when it's finished. It calls the FTP directory that it polls its "Inbox."
Heiress is, to put it diplomatically, a little flaky. It's scheduled for replacement, but in the meantime it needs to stay working. I log in on to check on it and I'm not too surprised to see that it's stopped downloading files from its Inbox.
I can't see what's wrong with it, so I call the Heiress vendor's tech support people. They decide, as they usually do, to reboot the system. This typically gets things moving again, but this time it doesn't - after rebooting Heiress still isn't processing the Inbox.
After a while we decide to clear out the Inbox (which now has a lot of files waiting to be downloaded). We do that after a few minutes everything starts working properly again - Heiress sees a new file come in, downloads it, and then deletes it from the Inbox. We submit the backlog of files for re-processing and leave.
False hope
I check in a few hours later and find that things are hung up again. I call up tech support again and instead of rebooting again we try what worked last time - clearing out the Inbox. It works again this time too.
We look to see what was in the Inbox the first time and compare it what was there the second time. There's one that was present both times (call it evil.mpg), which is suggestive.
We decided to do an experiment - with everything working normally we place evil.mpg in the Inbox. As we suspected - Heiress hangs up right away.
I'm curious about what's wrong with the file, but I'm glad we found the culprit. I head off to bed, thinking I'll examine what makes evil.mpg so evil the next day.
The makings of a mystery
I log in the next morning and find the system stuck again. A different file seems to be hanging it up.
Uh oh.
I start watching the FTP server logs for the calls the Heiress system was making to see what's going on. I find that with "good" files in the Inbox directory the server logs look like this:
Heiress (172.18.224.5)> 230 Logged on
...
Heiress (172.18.224.5)> LIST /
Heiress (172.18.224.5)> 150 Opening data channel for directory list.
Heiress (172.18.224.5)> 226 Transfer OK
...
Heiress (172.18.224.5)> RETR //good.mpg
Heiress (172.18.224.5)> 150 Opening data channel for file transfer.
Heiress (172.18.224.5)> 226 Transfer OK
That is, the Heiress logs in and asks for a directory listing. It sees files in the directory and RETRieves them.
With "bad" files in the Inbox directory the server logs look like this:
Heiress (172.18.224.5)> 230 Logged on
...
Heiress (172.18.224.5)> LIST /
Heiress (172.18.224.5)> 150 Opening data channel for directory list.
Heiress (172.18.224.5)> 226 Transfer OK
...
Heiress (172.18.224.5)> 421 Connection timed out.
That is, Heiress does the same directory listing, but doesn't do anything with it. Eventually the connection times out without Heiress having downloaded any files.
This implies is that it doesn't matter what the contents of the "bad" files are - they're never downloaded. So... what could be bad about them?
Ridiculous fishing
Since the directory listing is the last thing we see, I suppose that my machine (the server) is screwing up when serving it up, or that Heiress (the client) is screwing up in interpreting it.
I can't really do anything with the client by myself, and the Heiress tech support people aren't awake yet. I decide to focus on the server first.
When confronted with a problem I don't understand I try to think of what people smarter than me would do (I got this idea from Ridiculous Fish). I decide to ask mental simulations of:
- My boss Gregg
- My friend Ryan
- Dr. House
Mental-Gregg says "What are you doing? The client was written by monkeys and must be causing the problem! Get things going again and tell the Heiress people to go fix it." I decide he is probably right, but if he's not then I'll have to figure out the server side anyway.
Mental-Ryan says "Did you try typing your problem into Google?" I find this to be a compelling idea. I spend some time researching the 421 timeout message, but don't come up with anything useful.
Mental-House says "Everybody lies. The server says 226 Transfer OK, but it's wrong. It's sick. Find out what it has." Hmm...
I think maybe there is a problem with the storage on the server - maybe certain files are stored on a disk with bad sectors, and this is causing some sort of issue when their metadata is accessed for the directory listing?
This is just the sort of thing House would come up with, so I download and install the advanced RAID monitor for the server and check its disks.
Everything looks fine.
The picture of health
"You idiot," says Mental-House, "Don't just guess a diagnosis. Look closer at the symptoms."
I open up an ftp client, log in to the server as the Heiress user, and request a listing. I get back:
-rw-r--r-- 1 ftp ftp 4214396 Apr 26 11:01 good.mpg
-rw-r--r-- 1 ftp ftp 6896446864 Apr 26 02:27 evil.mpg
drwxr-xr-x 1 ftp ftp 0 Apr 26 12:22 reportdir
drwxr-xr-x 1 ftp ftp 0 Apr 26 23:25 tempdir
Nothing stands out here - evil.mpg is the largest file in the directory, but it's small compared to other files that have worked fine.
Before House can call me an idiot again for not directly observing the failure I launch Wireshark and capture of the Heiress system making its LIST call.
Everything looks the same as my test. Not just kind of the same, but byte for byte the same.
I start to tell Mental-House that the server isn't sick, but he angrily gulps some pills and hobbles away.
More false hope
Mental-Gregg declares "I told you it's the client's fault. But something the server is doing might be stimulating bad behavior from the client. Figure out the stimulus and you can tell the monkeys what to fix. Then we'll bill them."
The stimulus must be in the directory listing, I decide, so I start looking through what's there.
The most important thing about the listing is the file name, but surely that's not the issue. Just to make sure I change evil.mpg's name to not_evil.mpg. There's no effect, of course.
The permissions for good.mpg and evil.mpg are the same (-rw-r--r-- 1), so that can't be the problem. The user and group names are the same too. Also, I can't change them - Filezilla sets them both to ftp.
I'm trying to think of what might be weird about evil.mpg's file size, and then I remember something important! These are MPEG-TS files, and they're supposed to be multiples of 188 bytes. Maybe the bad files aren't encoded properly and the client is rejecting them based on that constraint...!
My hopes are quickly dashed:
>>> print("good.mpg's size % 188 = {}".format(4214396 % 188))
good.mpg's size % 188 = 0
>>> print("evil.mpg's size % 188 = {}".format(6896446864 % 188))
evil.mpg's size % 188 = 0
Wait, what?
I've got one column of directory listing left - the file modification timestamp.
On Unix this would be easy to change - just run the touch command on the target file. However this is Windows, so I have to do some weird hack. I change the timestamp for evil.mpg and put it back in the Inbox. And... everything starts working!
I'm happy, but... confused. How can the timestamp be the killer?
evil.mpg's timestamp is Apr 26 02:27. The other "bad" file I found has a timestamp of Apr 26 02:12. These have Apr 26 02 in common, but what's weird about that?
I make up some test files with timestamps between Apr 26 02:00 and Apr 26 02:59. Every one of them causes the system to fail. Everything is fine with Apr 26 01:59 and Apr 26 03:00.
I'm nearly certain the system has been handling files modified between 2 and 3 AM just fine for months. So what's happened? I'm almost afraid to try it, but I make up a file with timestamp Apr 27 02:30. It works, thankfully.
This is good! I can identify which files are bad and also fix them. But I'm not satisfied... I know more, but this is just as mystifying a puzzle as ever.
From a non-linear, non-subjective point of view...
What's special about the two o'clock hour on April 26th?
My first thought is that this is some Daylight Saving Time issue. However, that transition was weeks ago - DST starts on the second Sunday in March.
I of course remember that the rules for DST in the U.S. changed in 2007. DST used to start on the first Sunday in April. But we're past that too.
Mental-Ryan chimes in - "You might have something you can Google now." I'm skeptical that this will produce results, but I type in "apr 26 02" ftp.
There's one result - an Apache mailing list post titled listFiles failing to parse Apr 26 02:45 like dates. Hey, at least I'm not alone!
The post tells the story - I had naively assumed that DST start times hadn't changed besides in 2007. Wrong - from 1967 to 1973 DST started on the last Sunday in April.
Why would that matter? Well... in Unix timestamps are measured in seconds since midnight on January 1st, 1970 (pretty much). The timestamps in the listing don't include a year (Apr 26 02:27), so the system interprets them as for 1970.
Was April 26 the last Sunday in 1970? Yes it was:
>>> datetime.datetime(1970, 4, 26).strftime("%A")
'Sunday'
So now we know the problem - there was no two o'clock hour on April 26th, 1970 - the clock went from 01:59 to 03:00.
You generally want your library to prevent you from doing things like "April 31st, 2014," but arguably there's a bug in how Java's FTPClient (which the monkeys used to write Heiress) interprets FTP server listings - why would we not want to download a file with a bogus timestamp?
Wrapping up
I'm exhilarated. It took several hours of arguing with mental projections, but I cracked it.
I go to log out, but Mental-Gregg says "You still have to fix the bad files."
I write up a script to "touch" all the files that were uploaded between 2014-04-26 02:00 and 02:59.
Unless a disaster occurs Heiress will be decommissioned before next year's April 26th and I won't ever have to deal with this again.