Re: How to parse huge RDF data in a tar.gz file.

Yasunori Yamamoto Wed, 07 Aug 2019 09:58:03 -0700

Hi Andy,

Thank you for your reply.
Is the following code what you had in mind?
If so, it crashed with:
Exception in thread "main" java.lang.NullPointerException


TarArchiveInputStream tarInput = new TarArchiveInputStream(new ...);
TarArchiveEntry currentEntry;
while ((currentEntry = tarInput.getNextTarEntry()) != null) {
...
  parser_object = RDFParserBuilder
    .create()
    .errorHandler(ErrorHandlerFactory.errorHandlerDetailed())
    .source(tarInput)
    .checking(checking)
    .lang(lang)
    .build();
...
}

Error stack follows.
at org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream.read(GzipCompressorInputStream.java:296)
at java.io.InputStream.skip(java.base@9-internal/InputStream.java:351)
at org.apache.commons.compress.utils.IOUtils.skip(IOUtils.java:111)
at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.skipRecordPadding(TarArchiveInputStream.java:344)
at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextTarEntry(TarArchiveInputStream.java:271)
at ... ( where my code calls tarInput.getNextTarEntry() )

Regards,
Yasunori
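
Note on the trace above: the NullPointerException is raised inside GzipCompressorInputStream.read while getNextTarEntry() skips record padding. One possible explanation, which is an assumption and not confirmed anywhere in this thread, is that the shared tar stream had already been closed by the time the loop asked for the next entry. If that turns out to be the case, a common workaround is to hand each consumer a wrapper whose close() does nothing. The sketch below uses only java.io; NonClosingInputStream, CloseTrackingInputStream and consumeAndClose are hypothetical names invented for illustration, and consumeAndClose merely stands in for any consumer that closes its input.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical helper: forwards every read but ignores close(). */
class NonClosingInputStream extends FilterInputStream {
    NonClosingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public void close() {
        // Intentionally empty: whoever opened the wrapped stream closes it.
    }
}

/** Hypothetical stand-in for the shared archive stream; records close(). */
class CloseTrackingInputStream extends FilterInputStream {
    boolean closed = false;

    CloseTrackingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public void close() {
        closed = true;
    }
}

class CloseShieldSketch {
    /** Stand-in for a consumer that reads to end of input and then closes it. */
    static int consumeAndClose(InputStream in) {
        try (InputStream s = in) {
            int count = 0;
            while (s.read() != -1) {
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CloseTrackingInputStream shared =
            new CloseTrackingInputStream(new ByteArrayInputStream(new byte[] {1, 2, 3}));

        // The consumer closes only the wrapper, so the shared stream stays open.
        int read = consumeAndClose(new NonClosingInputStream(shared));
        System.out.println("bytes read: " + read + ", shared closed: " + shared.closed);

        // The owner closes the shared stream once, after all consumers are done.
        shared.close();
    }
}
```

In the loop above, this would amount to passing new NonClosingInputStream(tarInput) to .source(...) and calling tarInput.close() once after the loop. Whether that removes the exception depends on the assumption stated above.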

On Wed, Aug 7, 2019 at 18:04, Andy Seaborne <[email protected]> wrote:
>
> Yasunori,
>
> It should be possible to pass the InputStream for the tar entry contents
> directly to the RDFParserBuilder.source, no need to convert to a string
> first.
>
> IIRC TarArchiveInputStream is a bit weird - it signals "end of file" at
> the end of the tar archive entry, then the app moves to the next entry
> and the input stream is then for that entry and can be passed to a new
> RDFParserBuilder call.
>
> An RDFParser does not close an inputStream it is passed.
>
> It will need a new RDFParser for each entry.
>
> If that is not what is happening, please let us know.
>
>      Andy
>
>
> On 06/08/2019 23:31, Yasunori Yamamoto wrote:
> > Files in a tar are in RDF/XML or Turtle.
> >
> > Yasunori
> >
> > On 2019/08/07 3:11, ajs6f <[email protected]> wrote:
> >
> > In what format are these RDF files?
> >
> > ajs6f
> >
> >> On Aug 6, 2019, at 10:05 AM, Yasunori Yamamoto <[email protected]> wrote:
> >>
> >> Hello, I'm trying to learn how to parse RDF data archived in a tar.gz
> >> file (e.g., rdfdatasets.tar.gz that contains a set of RDF data files)
> >> within my Java program.
> >> The following code works, but it is inefficient because it reads
> >> the entire RDF data of each entry in the given tar.gz file into
> >> main memory before parsing.
> >> Could you please suggest a more memory-efficient way to do this?
> >>
> >> TarArchiveInputStream tarInput = new TarArchiveInputStream(
> >>     new GzipCompressorInputStream(new FileInputStream(filename)));
> >> TarArchiveEntry currentEntry;
> >> PipedRDFIterator<Triple> iter =
> >>     new PipedRDFIterator<Triple>(buffersize, false, pollTimeout, maxPolls);
> >> final PipedRDFStream<Triple> inputStream = new PipedTriplesStream(iter);
> >>
> >> while ((currentEntry = tarInput.getNextTarEntry()) != null) {
> >>   String currentFile = currentEntry.getName();
> >>   Lang lang = RDFLanguages.filenameToLang(currentFile);
> >>   parser_object = RDFParserBuilder
> >>     .create()
> >>     .errorHandler(ErrorHandlerFactory.errorHandlerDetailed())
> >>     .source(new StringReader(CharStreams.toString(new InputStreamReader(tarInput))))
> >>     .checking(checking)
> >>     .lang(lang)
> >>     .build();
> >>   parser_object.parse(inputStream);
> >> }
> >> tarInput.close();
> >>
> >> Sincerely yours,
> >> Yasunori Yamamoto
