Hi Sebastian,

I'm not aware of a better way of implementing this in Flink. You could
implement your own XmlInputFormat using Flink's InputFormat abstractions,
but you would end up with almost exactly the same code as the Mahout /
Hadoop implementation. I do wonder why decompression doesn't work with the
XmlInputFormat, though. Did you get any exception?
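
Just to illustrate what I mean, here is a minimal, untested sketch of such
a format based on Flink's DelimitedInputFormat. It assumes records end with
a fixed closing tag (e.g. </page>) and that each record fits into a single
String; the class name and tag are only placeholders:

import org.apache.flink.api.common.io.DelimitedInputFormat;
import org.apache.flink.core.fs.Path;

import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Splits the input at a fixed closing tag and emits one String per record.
// DelimitedInputFormat strips the delimiter from the record, so it is
// re-appended in readRecord().
public class SimpleXmlInputFormat extends DelimitedInputFormat<String> {

    private final String closingTag;

    public SimpleXmlInputFormat(Path filePath, String closingTag) {
        this.closingTag = closingTag;
        setFilePath(filePath);
        setDelimiter(closingTag);
    }

    @Override
    public String readRecord(String reuse, byte[] bytes, int offset, int numBytes)
            throws IOException {
        return new String(bytes, offset, numBytes, StandardCharsets.UTF_8) + closingTag;
    }
}

You would then create the DataSet with something like
env.readFile(new SimpleXmlInputFormat(new Path(inputPath), "</page>"), inputPath).
Note that this naive version keeps whatever precedes the opening tag in each
record, so you would still have to trim that yourself.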

Regards,
Robert


On Wed, Jan 11, 2017 at 4:31 PM, Sebastian Neef <
gehax...@mailbox.tu-berlin.de> wrote:

> Hi,
>
> what's the best way to read a compressed (bz2 / gz) XML file and split
> it at a specific XML tag?
>
> So far I've been using Hadoop's TextInputFormat in combination with
> Mahout's XmlInputFormat ([0]) via env.readHadoopFile(). While the plain
> TextInputFormat can handle compressed data, the XmlInputFormat can't for
> some reason.
>
> Is there a Flink-ish way to accomplish this?
>
> Best regards,
> Sebastian
>
> [0]: https://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java
>
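
For reference, here is an untested sketch of the Hadoop-compatibility route
described in the quoted message, assuming the Mahout XmlInputFormat linked
above with its usual xmlinput.start / xmlinput.end keys; the tag names and
the input path are only placeholders:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.mahout.classifier.bayes.XmlInputFormat;

public class ReadXmlDump {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        Job job = Job.getInstance();
        // Everything between these two markers becomes one record.
        job.getConfiguration().set("xmlinput.start", "<page>");
        job.getConfiguration().set("xmlinput.end", "</page>");

        DataSet<Tuple2<LongWritable, Text>> pages =
                env.readHadoopFile(new XmlInputFormat(), LongWritable.class, Text.class,
                        "hdfs:///path/to/dump.xml.bz2", job);

        // print() triggers execution of the batch program.
        pages.first(10).print();
    }
}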
