On 26/04/15 17:44, Daniel Hernández wrote:
Daniel,

Then I'm baffled as to where the space is going.  While 15e6
predicates is unusual, a 10G heap should be far larger than needed and I don't
see how it affects the loading at the point shown.

For tdbloader2, it seems to be the stats space - that would be
affected by 15e6 predicates. A very large heap should merely slow down
loading, not still run out of space. (I'll look at adding a
"--no-stats" flag anyway.)

Could I get a copy of your data to try out in my development setup
with a profiler?

Thanks
Andy

Andy, I published the file at wd.degu.cl/rawfiles/d3.nt.gz.

Thanks a lot!
Daniel

Daniel,

Yes - it looks like stats collection going wild (and, of course, given the shape of this data, fairly pointless!).

I ran a modified Jena and loaded 114e6 triples at 76K triples/s end-to-end with tdbloader2. I only used the default heap size.

With stats collecting, I saw what you were seeing, where it goes into GC overload. I suspect the resizing of the statistics map is not helping performance either.

For tdbloader2, there is now a "--nostats" argument, but currently it must appear in this position: --loc=... --nostats FILES. To test that it is working, load a small file and check that stats.opt is no longer created in the database directory.
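
The stats.opt check can be scripted. This is only a rough Python sketch, not part of Jena: the function name is made up for illustration, and it assumes stats.opt, when written, sits directly in the directory passed as --loc.

```python
from pathlib import Path


def stats_file_present(db_dir):
    """Return True if a stats.opt file exists directly in the database directory.

    Assumption: db_dir is the same directory that was given to --loc.
    """
    return (Path(db_dir) / "stats.opt").is_file()
```

After loading a small file with --nostats, stats_file_present("DB") should come back False if the flag was picked up ("DB" stands in for whatever location you used).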


There is now a version of tdbloader in the codebase which takes a --nostats argument, but I'm still investigating how it does on your data profile - the wide range of RDF terms used makes it slower (the triple/unique node ratio is significant in loading).

There'll be a development build in the next 24 hours.


This is all only in Jena3 - that's another issue.

The data is quite messy.

In Jena2, "riot --validate" will show warnings about a lot of things. Jena2 was mostly RDF 1.1 but without introducing incompatibilities.

In RDF 1.1, and Jena3 is RDF 1.1, the data is illegal syntax because of these characters in IRIs: '{', '}', '|', '^', '\'. For testing, I translated those to "_". If you use the \u forms of these characters, they will at least get into the database, though there may be problems later in Jena2 or Jena3. Cleaning the data is better, even if you only use it a couple of times.
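
The translation of those characters could be sketched like this in Python. It is not the script used for the test load, just an illustration under assumptions: the input is line-based N-Triples, literals are well-formed quoted strings, and only the five characters listed above are handled. The names clean_iri and clean_line are invented for the example. A backslash that already starts a \uXXXX or \UXXXXXXXX escape is left alone.

```python
import re

# Either a quoted literal (with backslash escapes) or an IRI in angle brackets.
# Literals are matched so that anything inside them is left untouched.
_TOKEN = re.compile(r'"(?:[^"\\]|\\.)*"|<([^<>]*)>')

# '{', '}', '|', '^', or a backslash that does not begin a \u / \U escape.
_BAD = re.compile(r'[{}|^]|\\(?!u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8})')


def clean_iri(iri, use_escapes=False):
    """Replace the offending characters with "_" or with their \\uXXXX form."""
    if use_escapes:
        return _BAD.sub(lambda m: "\\u%04X" % ord(m.group(0)), iri)
    return _BAD.sub("_", iri)


def clean_line(line, use_escapes=False):
    """Clean every IRI on one N-Triples line; literals pass through unchanged."""
    def fix(match):
        if match.group(1) is None:  # matched a literal, not an IRI
            return match.group(0)
        return "<" + clean_iri(match.group(1), use_escapes) + ">"
    return _TOKEN.sub(fix, line)
```

Run over the file line by line, clean_line gives the "_" translation; with use_escapes=True it writes the \u forms instead.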

For the Turtle/TriG/N-Triples/N-Quads syntaxes these are violations of the grammar. In RDF 1.0, N-Triples was not well defined as a general data syntax.

        Andy
