Hi,
setting the unsplittable attribute in the constructor is fine. The field's
value will be sent to the cluster.
So what happens is that you initialize the input format in your client
program. Then it's serialized, sent over the network to the machines, and
deserialized again. So the value you've
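The mechanism described above can be illustrated with a minimal, self-contained sketch (plain JDK serialization, no Flink dependency): a field set in the constructor is part of the object's serialized state, so its value survives the client-to-worker round trip. `MyXmlInputFormat` and the `roundTrip` helper are illustrative names, not Flink API; Flink's real `FileInputFormat` exposes a similar `unsplittable` flag.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Illustrative stand-in for a custom input format; not Flink code.
class MyXmlInputFormat implements Serializable {
    private final boolean unsplittable;

    MyXmlInputFormat() {
        this.unsplittable = true;  // set once, in the client program
    }

    boolean isUnsplittable() {
        return unsplittable;
    }

    // Serialize and deserialize, mimicking how the format is shipped
    // from the client to the worker machines.
    static MyXmlInputFormat roundTrip(MyXmlInputFormat format) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(format);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return (MyXmlInputFormat) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        MyXmlInputFormat onWorker = roundTrip(new MyXmlInputFormat());
        // The constructor-set value is still there after deserialization.
        System.out.println("unsplittable after round trip: " + onWorker.isUnsplittable());
    }
}
```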
Hi,
I am replying to myself for the records and to provide an update on what I
am trying to do.
I have looked into Mahout's XmlInputFormat class but unfortunately it
doesn't solve my problem.
My exploratory work with Flink tries to reproduce the key steps that we
already perform in a quite
Jamie, Suneel thanks a lot, your replies have been very helpful.
I will definitely take a look at XMLInputFormat.
In any case the files are not very big: on average 100-200kB up to a max of
a couple of MB.
On 8 June 2016 at 04:23, Suneel Marthi wrote:
You can use Mahout's XMLInputFormat with Flink's HadoopInputFormat
definitions. See
http://stackoverflow.com/questions/29429428/xmlinputformat-for-apache-flink
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Read-XML-from-HDFS-td7023.html
On Tue, Jun 7, 2016 at 10:11 PM, Jamie
Hi Andrea,
How large are these data files? The implementation you've mentioned here
is only usable if they are very small. If so, you're fine. If not, read
on...
Processing XML input files in parallel is tricky. It's not a great format
for this type of processing as you've seen. They are
Hi all,
I am evaluating Apache Flink for processing large sets of geospatial data.
The use case I am working on will involve reading a certain number of GPX
files stored on Amazon S3.
GPX files are actually XML files and therefore cannot be read on a
line-by-line basis.
One GPX file will produce
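Why GPX can't be split on line boundaries can be sketched with the JDK's built-in DOM parser: the document has to be parsed as a whole before its track points can be extracted. The sample GPX snippet and the `countTrackPoints` helper below are made up for illustration; `trkpt` is the real element name GPX uses for track points.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Sketch: parse a whole GPX (XML) document and count its track points.
class GpxExample {
    static int countTrackPoints(String gpx) throws Exception {
        // DOM parsing consumes the complete document; a partial file
        // (e.g. one produced by splitting on lines) would not be well-formed.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(gpx.getBytes(StandardCharsets.UTF_8)));
        return doc.getElementsByTagName("trkpt").getLength();
    }

    public static void main(String[] args) throws Exception {
        String gpx = "<gpx><trk><trkseg>"
                + "<trkpt lat=\"46.0\" lon=\"11.0\"/>"
                + "<trkpt lat=\"46.1\" lon=\"11.1\"/>"
                + "</trkseg></trk></gpx>";
        System.out.println("track points: " + countTrackPoints(gpx));  // prints 2
    }
}
```

Since the files here are small (100-200 kB, a couple of MB at most), reading each file whole like this and parsing it in a single task is a reasonable approach.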