Thanks, Doug. In this case I could roll the log files more often, but then I would have to go back at some point and recombine the small files. For now, I can live with moving the files daily.
I was unable to find a way to trap the "Invalid Sync" exception
(org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210)).
Since my mapper extends AvroMapper, and the exception is raised from
hasNext() rather than from my map() method, I don't know where to trap it.
Another person suggested using low-level Avro functions for this. Perhaps
I need to write an Avro file validator of some sort to be run before the
Map/Reduce job? This seems nasty. But I had another M/R job fail with this
error overnight, and even finding the offending file via the logs is quite
a pain. Any suggestions?

-Terry

On 01/17/2013 04:36 PM, Doug Cutting wrote:
> Folks often move files once they're closed into a directory where
> they're processed to avoid issues with partially written data. Maybe
> you could start a new log file every hour rather than every day?
>
> We could add an ignoreTruncation or ignoreCorruption option to
> DataFileReader that attempts to read files that might be truncated or
> corrupted.
>
> And yes, you can probably just catch those exceptions and exit the map
> at that point.
>
> Doug
>
> On Mon, Jan 14, 2013 at 11:22 AM, Terry Healy <[email protected]> wrote:
>> I have a log collection application that writes .avro files within HDFS.
>> Ideally I would like to include the current day's (open for append) file
>> as one of the input files for a periodic M/R job.
>>
>> I tried this but the Map job exited in error with the dreaded "Invalid
>> Sync!" IOException. I guess I should have expected this, but is there a
>> reasonable way around it? Can I catch the exception and just exit the
>> map at that point?
>>
>> All suggestions appreciated.
>>
>> -Terry
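
P.S. For the archives, here is the rough shape of the validator I have in
mind; this is an untested sketch, and the class name is just a placeholder.
It opens each file with DataFileReader and iterates over every record, so a
truncated or corrupt file surfaces as the same "Invalid sync!" exception,
caught here instead of inside the M/R job. For files in HDFS I'd presumably
open an org.apache.avro.mapred.FsInput instead of a local File.

import java.io.File;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

/**
 * Reads each .avro file named on the command line end-to-end and
 * reports any file that cannot be fully read, so corrupt or truncated
 * files can be set aside before the Map/Reduce job runs.
 */
public class AvroFileValidator {
  public static void main(String[] args) {
    for (String arg : args) {
      try (DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
          new File(arg), new GenericDatumReader<GenericRecord>())) {
        long count = 0;
        while (reader.hasNext()) { // "Invalid sync!" is thrown from here
          reader.next();
          count++;
        }
        System.out.println("OK      " + arg + " (" + count + " records)");
      } catch (Exception e) { // AvroRuntimeException wrapping the IOException
        System.out.println("CORRUPT " + arg + ": " + e);
      }
    }
  }
}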
