I'm seeing a problem using the S3AFileSystem with the ParquetInputFormat that 
causes a non-transient EOF for certain files. I have traced what looks like the 
source of the problem to the use of the "random" input policy in order to 
support seek behavior required by Parquet.

I've written a sample program that reproduces the problem for a given path in 
S3. It does not use Parquet, and it reproduces on any file > 1024K:

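// imports needed by this snippet:
//   org.apache.hadoop.conf.Configuration
//   org.apache.hadoop.fs.FSDataInputStream
//   org.apache.hadoop.fs.FileSystem
//   org.apache.hadoop.fs.Path
// "path" is the Path of the object in S3

final Configuration conf = new Configuration();
conf.set("fs.s3a.readahead.range", "1K");
conf.set("fs.s3a.experimental.input.fadvise", "random");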

final FileSystem fs = FileSystem.get(path.toUri(), conf);

// forward seek reading across readahead boundary
try (FSDataInputStream in = fs.open(path)) {
    final byte[] temp = new byte[5];
    in.readByte();
    in.readFully(1023, temp); // <-- works
}

// forward seek reading from end of readahead boundary
try (FSDataInputStream in = fs.open(path)) {
    final byte[] temp = new byte[5];
    in.readByte();
    in.readFully(1024, temp); // <-- throws EOFException
}
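
For anyone without S3 access, here is a minimal sketch of the boundary arithmetic I suspect is involved. It is a toy model and not the actual S3AInputStream code: the class RangedSeekModel, its fields, and the choice between an inclusive and an exclusive comparison are all assumptions made for illustration. It models one ranged request of "readahead" bytes and a forward seek that skips inside that request.

```java
import java.io.ByteArrayInputStream;

/** Toy model of a stream backed by ranged requests. NOT the real S3AInputStream. */
final class RangedSeekModel {
    private final byte[] object;           // stands in for the S3 object
    private final int readahead;           // stands in for fs.s3a.readahead.range
    private final boolean inclusiveLimit;  // assumption: true = skip when diff <= remaining
    private ByteArrayInputStream wrapped;  // stands in for the open ranged request
    private long pos;                      // current position in the object
    private long rangeEnd;                 // exclusive end of the current request

    RangedSeekModel(int size, int readahead, boolean inclusiveLimit) {
        this.object = new byte[size];
        for (int i = 0; i < size; i++) {
            object[i] = (byte) (i % 251);  // recognisable content per offset
        }
        this.readahead = readahead;
        this.inclusiveLimit = inclusiveLimit;
    }

    /** Open a new "request" covering [target, target + readahead). */
    private void reopen(long target) {
        rangeEnd = Math.min(object.length, target + readahead);
        wrapped = new ByteArrayInputStream(object, (int) target, (int) (rangeEnd - target));
        pos = target;
    }

    /** Seek to target and read one byte; -1 means the current request is exhausted. */
    int readAt(long target) {
        if (wrapped == null) {
            reopen(target);
        } else {
            long diff = target - pos;
            long remaining = rangeEnd - pos;
            boolean skip = diff > 0
                && (inclusiveLimit ? diff <= remaining : diff < remaining);
            if (skip) {
                pos += wrapped.skip(diff);   // stay inside the current request
            } else if (diff != 0 || remaining == 0) {
                reopen(target);              // backward seek, far seek, or exhausted request
            }
        }
        int b = wrapped.read();
        if (b >= 0) {
            pos++;
        }
        return b;
    }

    public static void main(String[] args) {
        RangedSeekModel inclusive = new RangedSeekModel(4096, 1024, true);
        inclusive.readAt(0);
        System.out.println("inclusive, seek to 1024: " + inclusive.readAt(1024));

        RangedSeekModel exclusive = new RangedSeekModel(4096, 1024, false);
        exclusive.readAt(0);
        System.out.println("exclusive, seek to 1024: " + exclusive.readAt(1024));
    }
}
```

In this model, the inclusive comparison lets a seek land exactly on the end of the current request, so the next read returns -1. The exclusive comparison forces a new request instead and the read succeeds. Whether the real stream behaves the same way is something I'd want confirmed.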

I'm wondering a few things:
- Is this a known problem for which I simply haven't found a ticket or 
  question?
- If not, what are the steps to discuss/contribute a fix? (I have a potential 
  solution in S3AInputStream.seekInStream.)
- Is the random input policy not expected to work fully? As it stands, seek 
  (especially backwards seek) against S3 seems to behave differently, although 
  for certain use cases it could avoid having to download the entire file to 
  local storage.

Regards, Dave
