You can use Mahout's XMLInputFormat with Flink's HadoopInputFormat wrapper. See:
http://stackoverflow.com/questions/29429428/xmlinputformat-for-apache-flink
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Read-XML-from-HDFS-td7023.html

On Tue, Jun 7, 2016 at 10:11 PM, Jamie Grier <ja...@data-artisans.com> wrote:

> Hi Andrea,
>
> How large are these data files? The implementation you've mentioned here
> is only usable if they are very small. If so, you're fine. If not, read
> on...
>
> Processing XML input files in parallel is tricky. It's not a great format
> for this type of processing, as you've seen: XML files are hard to split
> and more complex to iterate through than simpler formats. However, others
> have implemented XMLInputFormat classes for Hadoop. Have you looked at
> those? Mahout has an XMLInputFormat implementation, for example, though I
> haven't used it directly.
>
> Anyway, you can reuse Hadoop InputFormat implementations directly in
> Flink. This is likely a good route. See Flink's HadoopInputFormat class.
>
> -Jamie
>
>
> On Tue, Jun 7, 2016 at 7:35 AM, Andrea Cisternino <a.cistern...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> I am evaluating Apache Flink for processing large sets of geospatial
>> data. The use case I am working on involves reading a number of GPX
>> files stored on Amazon S3.
>>
>> GPX files are XML files and therefore cannot be read on a line-by-line
>> basis. Each GPX file will produce one or more Java objects containing
>> the geospatial data we need to process (mostly a list of geographical
>> points).
>>
>> To cover this use case I tried to extend the FileInputFormat class:
>>
>> public class WholeFileInputFormat extends FileInputFormat<String>
>> {
>>     private boolean hasReachedEnd = false;
>>
>>     public WholeFileInputFormat() {
>>         unsplittable = true;
>>     }
>>
>>     @Override
>>     public void open(FileInputSplit fileSplit) throws IOException {
>>         super.open(fileSplit);
>>         hasReachedEnd = false;
>>     }
>>
>>     @Override
>>     public String nextRecord(String reuse) throws IOException {
>>         // uses org.apache.commons.io.IOUtils
>>         String fileContent = IOUtils.toString(stream, StandardCharsets.UTF_8);
>>         hasReachedEnd = true;
>>         return fileContent;
>>     }
>>
>>     @Override
>>     public boolean reachedEnd() throws IOException {
>>         return hasReachedEnd;
>>     }
>> }
>>
>> This class returns the content of the whole file as a single string.
>>
>> Is this the right approach?
>> It seems to work when run locally against local files, but I wonder
>> whether it would run into problems when tested on a cluster.
>>
>> Thanks in advance.
>> Andrea.
>>
>> --
>> Andrea Cisternino, Erlangen, Germany
>> GitHub: http://github.com/acisternino
>> GitLab: https://gitlab.com/u/acisternino
>>
>
>
> --
>
> Jamie Grier
> data Artisans, Director of Applications Engineering
> @jamiegrier <https://twitter.com/jamiegrier>
> ja...@data-artisans.com
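As a rough sketch of the suggestion above: Mahout ships an XmlInputFormat (in recent versions under org.apache.mahout.text.wikipedia) that splits a file on configurable start/end tags, and Flink's Hadoop compatibility layer can wrap any Hadoop mapreduce InputFormat. The package name, the `xmlinput.start`/`xmlinput.end` configuration keys, and the S3 path below are assumptions based on Mahout's implementation as commonly described; verify them against the Mahout and Flink versions you actually use.

```java
// Sketch only: wraps Mahout's XmlInputFormat (mapreduce API) in Flink's
// HadoopInputFormat. Class locations and config keys should be checked
// against your Mahout/Flink versions; the S3 path is a placeholder.
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.mahout.text.wikipedia.XmlInputFormat;

public class GpxReadJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        Job job = Job.getInstance();
        // Mahout's XmlInputFormat emits everything between these two markers
        // as one record, so each <trk> element becomes one Text value.
        job.getConfiguration().set("xmlinput.start", "<trk>");
        job.getConfiguration().set("xmlinput.end", "</trk>");
        FileInputFormat.addInputPath(job, new Path("s3://my-bucket/gpx/"));

        HadoopInputFormat<LongWritable, Text> xmlFormat =
                new HadoopInputFormat<>(new XmlInputFormat(),
                        LongWritable.class, Text.class, job);

        // Each tuple carries the byte offset and one XML fragment.
        DataSet<Tuple2<LongWritable, Text>> fragments = env.createInput(xmlFormat);
        fragments.map(t -> t.f1.toString()).print();
    }
}
```

Because XmlInputFormat splits on tag boundaries rather than treating the file as unsplittable, large GPX files can be processed in parallel, which is the main advantage over the whole-file approach.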
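Incidentally, the commons-io dependency in the WholeFileInputFormat above is avoidable: on Java 9+ the same whole-stream read can be done with the JDK alone. A minimal sketch (the class and method names here are illustrative, not from the original code):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class WholeFileRead {
    // Drains the stream and decodes it as UTF-8, equivalent to the
    // IOUtils.toString call in nextRecord() above (requires Java 9+).
    static String readAll(InputStream in) throws IOException {
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] gpx = "<gpx><trkpt lat=\"49.59\" lon=\"11.00\"/></gpx>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(gpx)));
    }
}
```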