Re: Reading whole files (from S3)

2016-06-10 Thread Robert Metzger
Hi, setting the unsplittable attribute in the constructor is fine. The field's value will be sent to the cluster. So what happens is that you initialize the input format in your client program. Then it's serialized, sent over the network to the machines, and deserialized again. So the value you've
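
The round trip described in this reply can be sketched with plain Java serialization. This is only an illustration: WholeFileFormat and roundTrip are hypothetical names, not Flink classes, and the sketch assumes the input format is shipped with standard Java serialization, so a field set in the constructor on the client keeps its value after deserialization.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationRoundTrip {

    // Hypothetical stand-in for an input format; NOT a Flink class.
    static class WholeFileFormat implements Serializable {
        private static final long serialVersionUID = 1L;

        private final boolean unsplittable;

        WholeFileFormat() {
            // Set once, on the "client" side, in the constructor.
            this.unsplittable = true;
        }

        boolean isUnsplittable() {
            return unsplittable;
        }
    }

    // Serializes the object to bytes and reads it back, standing in for
    // "send over the network and deserialize on the worker".
    static WholeFileFormat roundTrip(WholeFileFormat format)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(format);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return (WholeFileFormat) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        WholeFileFormat copy = roundTrip(new WholeFileFormat());
        System.out.println("unsplittable after round trip: " + copy.isUnsplittable());
    }
}
```

The constructor does not run again during deserialization; the field value travels inside the serialized bytes.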

Re: Reading whole files (from S3)

2016-06-10 Thread Andrea Cisternino
Hi, I am replying to myself for the record and to provide an update on what I am trying to do. I have looked into Mahout's XmlInputFormat class but unfortunately it doesn't solve my problem. My exploratory work with Flink tries to reproduce the key steps that we already perform in a quite

Re: Reading whole files (from S3)

2016-06-08 Thread Andrea Cisternino
Jamie, Suneel, thanks a lot, your replies have been very helpful. I will definitely take a look at XMLInputFormat. In any case the files are not very big: on average 100-200 kB, up to a max of a couple of MB. On 8 June 2016 at 04:23, Suneel Marthi wrote: > You can use Mahout

Re: Reading whole files (from S3)

2016-06-07 Thread Suneel Marthi
You can use Mahout XMLInputFormat with Flink - HadoopInputFormat definitions. See http://stackoverflow.com/questions/29429428/xmlinputformat-for-apache-flink http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Read-XML-from-HDFS-td7023.html On Tue, Jun 7, 2016 at 10:11 PM, Jamie

Re: Reading whole files (from S3)

2016-06-07 Thread Jamie Grier
Hi Andrea, How large are these data files? The implementation you've mentioned here is only usable if they are very small. If so, you're fine. If not, read on... Processing XML input files in parallel is tricky. It's not a great format for this type of processing as you've seen. They are

Reading whole files (from S3)

2016-06-07 Thread Andrea Cisternino
Hi all, I am evaluating Apache Flink for processing large sets of geospatial data. The use case I am working on will involve reading a certain number of GPX files stored on Amazon S3. GPX files are actually XML files and therefore cannot be read on a line-by-line basis. One GPX file will produce
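
To illustrate why a GPX file has to be handled as a whole document rather than line by line, here is a minimal sketch using only the JDK's DOM parser. The class name GpxWholeFile, the method countTrackPoints, and the sample content are assumptions made for this example; they do not come from the thread, and nothing here touches S3 or Flink.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class GpxWholeFile {

    // Parses the complete GPX document and counts its <trkpt> elements.
    // A single element may span several lines, so no individual line
    // is meaningful to an XML parser on its own.
    static int countTrackPoints(String gpx) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(gpx.getBytes(StandardCharsets.UTF_8)));
        NodeList points = doc.getElementsByTagName("trkpt");
        return points.getLength();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample data, for illustration only.
        String gpx =
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
              + "<gpx version=\"1.1\" creator=\"example\">\n"
              + "  <trk><trkseg>\n"
              + "    <trkpt lat=\"45.0\" lon=\"7.0\">\n"
              + "      <ele>240.0</ele>\n"
              + "    </trkpt>\n"
              + "    <trkpt lat=\"45.1\" lon=\"7.1\">\n"
              + "      <ele>245.0</ele>\n"
              + "    </trkpt>\n"
              + "  </trkseg></trk>\n"
              + "</gpx>\n";
        System.out.println("track points: " + countTrackPoints(gpx));
    }
}
```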