Yes, we should split BZip2 input at some point, but I just haven't had a
chance to look at this properly.  Code already exists in Hadoop core that
we should be able to crib from - filed as JENA-893 so it doesn't get lost.

Which reader specifically are you using?

The unit tests explicitly cover reading and writing compressed data, so all
the readers should be able to read .bz2 files just fine provided you've
configured Hadoop appropriately, i.e.

config.set(HadoopIOConstants.IO_COMPRESSION_CODECS,
BZip2Codec.class.getCanonicalName());
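
If you'd rather configure this cluster-wide than in job code, and assuming
HadoopIOConstants.IO_COMPRESSION_CODECS resolves to Hadoop's standard
io.compression.codecs key (worth double-checking against the Elephas
source), the equivalent core-site.xml entry would be roughly:

<!-- core-site.xml: registers the BZip2 codec alongside the default -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>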


I'll add a note on this to the documentation.

Rob

On 04/03/2015 09:49, "Azhar Jassal" <[email protected]> wrote:

>Hi
>
>I have begun using jena-elephas.
>
>Is there any thought on how to deal with compressed (particularly bzip2)
>input files? bzip2 is splittable.
>
>For illustration, the DBpedia "persondata_en.nq" (release 3.9) is 80mb
>compressed (bzip2) and 1.5gb uncompressed. At the moment the jena-elephas
>record reader deals with input based upon filename extensions (using RIOT
>Langs), so .bz2 files hit an obvious unknown serialization error...
>
>Any thoughts on reading bzip2 compressed input files?
>
>Az
