Hi all, I am evaluating Apache Flink for processing large sets of geospatial data. The use case I am working on involves reading a number of GPX files stored on Amazon S3.
GPX files are actually XML files and therefore cannot be read line by line. One GPX file will produce one or more Java objects that contain the geospatial data we need to process (mostly a list of geographical points).

To cover this use case I tried to extend the FileInputFormat class:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.commons.io.IOUtils;
import org.apache.flink.api.common.io.FileInputFormat;
import org.apache.flink.core.fs.FileInputSplit;

public class WholeFileInputFormat extends FileInputFormat<String> {

    private boolean hasReachedEnd = false;

    public WholeFileInputFormat() {
        // a GPX file must be consumed as a whole, so never split it
        unsplittable = true;
    }

    @Override
    public void open(FileInputSplit fileSplit) throws IOException {
        super.open(fileSplit);
        hasReachedEnd = false;
    }

    @Override
    public String nextRecord(String reuse) throws IOException {
        // uses org.apache.commons.io.IOUtils to slurp the whole stream
        String fileContent = IOUtils.toString(stream, StandardCharsets.UTF_8);
        hasReachedEnd = true;
        return fileContent;
    }

    @Override
    public boolean reachedEnd() throws IOException {
        return hasReachedEnd;
    }
}
```

This class returns the content of the whole file as a single string. Is this the right approach? It seems to work when run locally with local files, but I wonder if it would run into problems when tested on a cluster.

Thanks in advance,
Andrea.

--
Andrea Cisternino, Erlangen, Germany
GitHub: http://github.com/acisternino
GitLab: https://gitlab.com/u/acisternino
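
P.S. For context, here is a simplified sketch of what I intend to do with the string that nextRecord returns: parse it with the JDK's built-in DOM parser and extract the track points. The class and method names (GpxPoints, parseTrackPoints) are my own illustration, not an existing API, and a real job would of course need namespace and error handling:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class GpxPoints {

    // Takes the whole-file string produced by WholeFileInputFormat and
    // returns one "lat,lon" string per <trkpt> element, in document order.
    public static List<String> parseTrackPoints(String gpxXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                        gpxXml.getBytes(StandardCharsets.UTF_8)));

        NodeList points = doc.getElementsByTagName("trkpt");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < points.getLength(); i++) {
            Element p = (Element) points.item(i);
            result.add(p.getAttribute("lat") + "," + p.getAttribute("lon"));
        }
        return result;
    }
}
```

I would then register the format with something like env.readFile(new WholeFileInputFormat(), "s3://<bucket>/gpx/") on the ExecutionEnvironment and map each file's content through a parser like the one above.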