On 26/04/15 17:44, Daniel Hernández wrote:
Daniel,
Then I'm baffled as to where the space is going. While 15e6
predicates is unusual, a 10G heap should be far more than enough, and I
don't see how it affects the loading at the point shown.
For tdbloader2, it seems to be the stats space - that would be
affected by 15e6 predicates. A very large heap should merely slow down
loading, not still run out of space. (I'll look at adding a
"--no-stats" flag anyway.)
Could I get a copy of your data to try out in my development setup
with a profiler?
Thanks
Andy
Andy, I have published the file at wd.degu.cl/rawfiles/d3.nt.gz.
Thanks a lot!
Daniel
Daniel,
Yes - it looks like stats collection going wild (and, of course, given
the shape of this data, fairly pointless!).
I ran a modified Jena and loaded 114e6 triples at 76K/s end-to-end with
tdbloader2. I only used the default heap size.
With stats collecting, I saw what you were seeing, where it goes into GC
overload. I suspect the resizing of the statistics map is not
helping performance either.
For tdbloader2, there is now a "--nostats" argument, but currently it
must appear like this: --loc=... --nostats FILES. Test that it is
working properly by loading a small file and checking that stats.opt is
no longer created in the database directory.
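
A minimal sketch of that check in Python, for anyone who wants to
script it. The function name and the "DB" directory below are
hypothetical; the only thing taken from the message above is that the
file to look for is stats.opt in the database directory.

```python
from pathlib import Path


def stats_file_present(db_dir):
    """Return True if stats.opt exists directly inside the database directory."""
    return (Path(db_dir) / "stats.opt").is_file()


if __name__ == "__main__":
    # "DB" is a placeholder for whatever was passed to --loc.
    if stats_file_present("DB"):
        print("stats.opt found: stats collection still appears to be on")
    else:
        print("no stats.opt: stats collection appears to be off")
```

Run it after loading a small file with --nostats into an empty
directory; a leftover stats.opt from an earlier load would give a
misleading result.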
There is now a version of tdbloader in the codebase which takes a
--nostats argument, but I'm still investigating how it performs on your
data profile - the wide range of RDFTerms used makes it slower (the
triple/unique node ratio is significant in loading).
There'll be a development build in the next 24 hours.
This is all only in Jena3 - that's another issue.
The data is quite messy.
In Jena2, "riot --validate" will show warnings about a lot of things.
Jena2 was mostly RDF 1.1 but without introducing incompatibilities.
In RDF 1.1, and Jena3 is RDF 1.1, it's illegal syntax because of these
characters in IRIs: '{', '}', '|', '^', '\'. For testing, I translated
those to "_". If you use the \u forms of these characters, they will
at least get into the database - there may be problems later in Jena2 or
Jena3. Cleaning the data is better, even if you only use it a couple of
times.
For the Turtle/TriG/N-Triples/N-Quads syntaxes, these are violations of
the grammar. In RDF 1.0, N-Triples was not well defined as a general
data syntax.
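
A rough sketch of the kind of translation described above, in Python.
This is not the script used for the test load; the function names and
the regular expressions are assumptions for illustration. It rewrites
only the five characters listed, only inside <...> IRIs, and leaves
quoted literals alone. A backslash that already begins a \uXXXX or
\UXXXXXXXX escape is kept.

```python
import re

# Matches either a quoted literal (kept as-is) or an <IRI> (cleaned).
_TOKEN = re.compile(r'"(?:[^"\\]|\\.)*"|<([^<>"\s]*)>')

# The five characters named above. A backslash is only treated as bad
# when it does not start an existing \uXXXX or \UXXXXXXXX escape.
_BAD = re.compile(r'[{}|^]|\\(?!u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8})')


def clean_iri(iri, escape=False):
    """Replace the bad characters with "_", or with \\uXXXX if escape=True."""
    if escape:
        return _BAD.sub(lambda m: "\\u%04X" % ord(m.group(0)), iri)
    return _BAD.sub("_", iri)


def clean_line(line, escape=False):
    """Clean every IRI on one N-Triples/N-Quads line; literals are untouched."""
    def fix(m):
        if m.group(1) is None:  # matched a literal, not an IRI
            return m.group(0)
        return "<" + clean_iri(m.group(1), escape) + ">"
    return _TOKEN.sub(fix, line)
```

For a file such as d3.nt.gz, one option is to stream it with
gzip.open(path, "rt", encoding="utf-8"), pass each line through
clean_line, and write the result to a new file. Note that replacing
characters with "_" can merge IRIs that were previously distinct, so
check whether that matters for the data before relying on it.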
Andy