Hi Sebastian,

I'm not aware of a better way of implementing this in Flink. You could implement your own XmlInputFormat using Flink's InputFormat abstractions, but you would end up with almost exactly the same code as the Mahout / Hadoop one. I'm also wondering why decompression doesn't work with the XmlInputFormat but does with the plain TextInputFormat. Did you get an exception?
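In case it helps, here is a rough, untested sketch of how the Hadoop path is usually wired up via env.readHadoopFile(), using the XmlInputFormat from your link [0]. The "<page>" start/end tags and the input path are just placeholders for whatever your data actually uses:

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.mahout.classifier.bayes.XmlInputFormat;

    public class ReadXmlFragments {
      public static void main(String[] args) throws Exception {
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Tell XmlInputFormat which tags delimit one record (placeholder tags).
        Job job = Job.getInstance();
        job.getConfiguration().set(XmlInputFormat.START_TAG_KEY, "<page>");
        job.getConfiguration().set(XmlInputFormat.END_TAG_KEY, "</page>");

        // Wrap the Hadoop input format; each record value is one XML fragment
        // between the configured start and end tag.
        DataSet<Tuple2<LongWritable, Text>> fragments = env.readHadoopFile(
            new XmlInputFormat(), LongWritable.class, Text.class,
            "hdfs:///path/to/dump.xml.bz2", job);

        fragments.first(5).print();
      }
    }

If that works on the uncompressed file but fails on the bz2/gz one, the exact exception (or the lack of one) would help narrow the problem down.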
Regards,
Robert

On Wed, Jan 11, 2017 at 4:31 PM, Sebastian Neef <gehax...@mailbox.tu-berlin.de> wrote:
> Hi,
>
> what's the best way to read a compressed (bz2 / gz) XML file, splitting
> it at a specific XML tag?
>
> So far I've been using Hadoop's TextInputFormat in combination with
> Mahout's XmlInputFormat ([0]) via env.readHadoopFile(). Whereas the
> plain TextInputFormat can handle compressed data, the XmlInputFormat
> can't for some reason.
>
> Is there a Flink-ish way to accomplish this?
>
> Best regards,
> Sebastian
>
> [0]: https://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java