Hi all, I am evaluating Apache Flink for processing large sets of geospatial data. The use case I am working on involves reading a number of GPX files stored on Amazon S3.
GPX files are actually XML files and therefore cannot be read line by line. One GPX file will produce one or more Java objects that contain the geospatial data we need to process (mostly a list of geographical points).

To cover this use case I tried to extend the FileInputFormat class:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.commons.io.IOUtils;
import org.apache.flink.api.common.io.FileInputFormat;
import org.apache.flink.core.fs.FileInputSplit;

public class WholeFileInputFormat extends FileInputFormat<String> {

    private boolean hasReachedEnd = false;

    public WholeFileInputFormat() {
        // a GPX file must be consumed as a whole, so never split it
        unsplittable = true;
    }

    @Override
    public void open(FileInputSplit fileSplit) throws IOException {
        super.open(fileSplit);
        hasReachedEnd = false;
    }

    @Override
    public String nextRecord(String reuse) throws IOException {
        // uses org.apache.commons.io.IOUtils to slurp the whole stream
        String fileContent = IOUtils.toString(stream, StandardCharsets.UTF_8);
        hasReachedEnd = true;
        return fileContent;
    }

    @Override
    public boolean reachedEnd() throws IOException {
        return hasReachedEnd;
    }
}
```

This class returns the content of the whole file as a single string. Is this the right approach? It seems to work when run locally with local files, but I wonder if it would run into problems when tested on a cluster.

Thanks in advance,
Andrea.

--
Andrea Cisternino, Erlangen, Germany
GitHub: http://github.com/acisternino
GitLab: https://gitlab.com/u/acisternino
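
P.S. For context, here is a simplified sketch of what I intend to do with the string that nextRecord returns: parse it with the JDK's built-in DOM parser and extract the track points. The class and method names (GpxPoints, parseTrackPoints) are my own illustration, not an existing API, and a real job would of course need namespace and error handling:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class GpxPoints {

    // Takes the whole-file string produced by WholeFileInputFormat and
    // returns one "lat,lon" string per <trkpt> element, in document order.
    public static List<String> parseTrackPoints(String gpxXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                        gpxXml.getBytes(StandardCharsets.UTF_8)));

        NodeList points = doc.getElementsByTagName("trkpt");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < points.getLength(); i++) {
            Element p = (Element) points.item(i);
            result.add(p.getAttribute("lat") + "," + p.getAttribute("lon"));
        }
        return result;
    }
}
```

I would then register the format with something like env.readFile(new WholeFileInputFormat(), "s3://<bucket>/gpx/") on the ExecutionEnvironment and map each file's content through a parser like the one above.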