You can use Mahout's XmlInputFormat with Flink's HadoopInputFormat
wrapper. See:

http://stackoverflow.com/questions/29429428/xmlinputformat-for-apache-flink
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Read-XML-from-HDFS-td7023.html
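
For example, here is a minimal sketch of wiring the two together. The
<trk> record tag and the input path are just placeholders, and the
package of Mahout's XmlInputFormat has moved between releases, so
adjust for your version:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.mahout.text.wikipedia.XmlInputFormat;

public class XmlReadJob {
  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    Job job = Job.getInstance();
    // XmlInputFormat emits everything between a start and an end tag
    // as one record; <trk> is just an example for GPX track elements.
    job.getConfiguration().set("xmlinput.start", "<trk>");
    job.getConfiguration().set("xmlinput.end", "</trk>");
    FileInputFormat.addInputPath(job, new Path("hdfs:///data/tracks"));

    // Wrap the Hadoop (mapreduce API) format as a Flink input format.
    HadoopInputFormat<LongWritable, Text> xmlFormat =
        new HadoopInputFormat<>(new XmlInputFormat(),
            LongWritable.class, Text.class, job);

    DataSet<Tuple2<LongWritable, Text>> fragments = env.createInput(xmlFormat);

    // Each record's value is one raw XML fragment, ready for parsing.
    DataSet<String> xml = fragments.map(
        new MapFunction<Tuple2<LongWritable, Text>, String>() {
          @Override
          public String map(Tuple2<LongWritable, Text> record) {
            return record.f1.toString();
          }
        });

    xml.print();
  }
}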


On Tue, Jun 7, 2016 at 10:11 PM, Jamie Grier <ja...@data-artisans.com>
wrote:

> Hi Andrea,
>
> How large are these data files?  The implementation you've mentioned here
> is only usable if they are very small.  If so, you're fine.  If not,
> read on...
>
> Processing XML input files in parallel is tricky.  As you've seen, XML is
> not a great format for this type of processing: the files are hard to
> split and more complex to iterate through than simpler formats.  However,
> others have implemented XmlInputFormat classes for Hadoop.  Have you
> looked at these?  Mahout has an XmlInputFormat implementation, for
> example, though I haven't used it directly.
>
> Anyway, you can reuse Hadoop InputFormat implementations in Flink
> directly.  This is likely a good route.  See Flink's HadoopInputFormat
> class.
>
> -Jamie
>
>
> On Tue, Jun 7, 2016 at 7:35 AM, Andrea Cisternino <a.cistern...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> I am evaluating Apache Flink for processing large sets of geospatial data.
>> The use case I am working on will involve reading a certain number of GPX
>> files stored on Amazon S3.
>>
>> GPX files are actually XML files and therefore cannot be read on a
>> line-by-line basis.
>> One GPX file will produce one or more Java objects that will contain the
>> geospatial data we need to process (mostly a list of geographical points).
>>
>> To cover this use case, I tried to extend the FileInputFormat class:
>>
>> import java.io.IOException;
>> import java.nio.charset.StandardCharsets;
>>
>> import org.apache.commons.io.IOUtils;
>> import org.apache.flink.api.common.io.FileInputFormat;
>> import org.apache.flink.core.fs.FileInputSplit;
>>
>> public class WholeFileInputFormat extends FileInputFormat<String> {
>>
>>   private boolean hasReachedEnd = false;
>>
>>   public WholeFileInputFormat() {
>>     // A GPX file must be parsed as a whole, so never split it.
>>     unsplittable = true;
>>   }
>>
>>   @Override
>>   public void open(FileInputSplit fileSplit) throws IOException {
>>     super.open(fileSplit);
>>     hasReachedEnd = false;
>>   }
>>
>>   @Override
>>   public String nextRecord(String reuse) throws IOException {
>>     // "stream" is the split's input stream, opened by the superclass.
>>     String fileContent = IOUtils.toString(stream, StandardCharsets.UTF_8);
>>     hasReachedEnd = true;
>>     return fileContent;
>>   }
>>
>>   @Override
>>   public boolean reachedEnd() throws IOException {
>>     // Each split (i.e. each file) yields exactly one record.
>>     return hasReachedEnd;
>>   }
>> }
>>
>> This class returns the content of the whole file as a string.
>>
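>> A minimal usage sketch (the S3 path is just a placeholder for our real
>> bucket):
>>
>> ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
>>
>> // Each file under the path becomes exactly one String record.
>> DataSet<String> gpxFiles =
>>     env.readFile(new WholeFileInputFormat(), "s3://my-bucket/gpx/");
>>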
>> Is this the right approach?
>> It seems to work when run locally with local files, but I wonder if it
>> would run into problems when tested in a cluster.
>>
>> Thanks in advance.
>>   Andrea.
>>
>> --
>> Andrea Cisternino, Erlangen, Germany
>> GitHub: http://github.com/acisternino
>> GitLab: https://gitlab.com/u/acisternino
>>
>
>
>
> --
>
> Jamie Grier
> data Artisans, Director of Applications Engineering
> @jamiegrier <https://twitter.com/jamiegrier>
> ja...@data-artisans.com
>
>
