Hi all,

Wondering if anyone else has run into this.

We write files to S3 using the SerializedOutputFormat<OurCustomPOJO>. When we 
read them back, sometimes we get deserialization errors where the data seems to 
be corrupt.
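
For context, this is roughly what the write and read look like (simplified 
sketch with made-up paths and a hypothetical buildRecords() helper; 
OurCustomPOJO implements IOReadableWritable):

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // Write side - nothing sets a block size explicitly.
    DataSet<OurCustomPOJO> records = buildRecords(env);
    records.write(new SerializedOutputFormat<OurCustomPOJO>(), "s3://our-bucket/data/");

    // Read side - also no explicit block size, so it falls back to the default.
    SerializedInputFormat<OurCustomPOJO> inputFormat = new SerializedInputFormat<>();
    inputFormat.setFilePath("s3://our-bucket/data/");
    DataSet<OurCustomPOJO> readBack =
        env.createInput(inputFormat, TypeInformation.of(OurCustomPOJO.class));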

After a lot of logging, the weathervane of blame pointed towards the block size 
somehow not being the same between the write (where it’s 64MB) and the read 
(unknown).

When I added a call to SerializedInputFormat.setBlockSize(64MB), the problems 
went away.
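
i.e. the fix was this one extra line when setting up the input format:

    // Force the read-side block size to match the 64MB used during the write.
    inputFormat.setBlockSize(64 * 1024 * 1024);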

It looks like both input and output formats use fs.getDefaultBlockSize() to set 
this value by default, so maybe the root issue is the S3 filesystem somehow 
reporting different default block sizes at write time versus read time.

But it does feel a bit odd that we’re relying on this default setting, rather 
than having the block size recorded in the file during the write phase.

And it’s awkward to set the block size on the write side, as you have to set 
it in the environment conf, which means it applies to all output files in the 
job.
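
For reference, as far as I can tell the write-side block size only comes in via 
the configuration key that BinaryOutputFormat reads in configure(), i.e. 
something like this (from memory, so the constant name may be off):

    Configuration conf = new Configuration();
    // 64MB; picked up in BinaryOutputFormat.configure(). Since this ends up in
    // the job-wide conf, it applies to every output format in the job.
    conf.setLong(BinaryOutputFormat.BLOCK_SIZE_PARAMETER_KEY, 64 * 1024 * 1024);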

— Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra
