Thanks Andy, I will continue trying with the new versions of
Jena next week and then report the results of loading. I know that the
data is dirty; I'm also working on cleaning it.
Daniel
On Thu, 2015-04-30 at 15:18 +0100, Andy Seaborne wrote:
> On 26/04/15 17:44, Daniel Hernández wrote:
> >> Daniel,
> >>
> >> Then I'm baffled as to where the space is going. While the 15e6
> >> predicates is unusual, a 10G heap should be wildly too large and I don't
> >> see how it affects the loading at the point shown.
> >>
> >> For tdbloader2, it seems to be the stats space - that would be
> >> affected by 15e6 predicates. A very large heap should merely slow down
> >> loading, not still run out of space. (I'll look at adding a
> >> "--no-stats" flag anyway)
> >>
> >> Could I get a copy of your data to try out in my development setup
> >> with a profiler?
> >>
> >> Thanks
> >> Andy
> >
> > Andy, I have published the file at wd.degu.cl/rawfiles/d3.nt.gz.
> >
> > Thanks a lot!
> > Daniel
>
> Daniel,
>
> Yes - it looks like stats collection going wild (and, of course, given
> the shape of this data, fairly pointless!).
>
> I ran a modified Jena and loaded 114e6 triples at 76K/s end-to-end with
> tdbloader2. I only used the default heap size.
>
> With stats collecting, I saw what you were seeing, where it goes into GC
> overload. I suspect the resizing of the statistics map is not
> helping performance either.
>
> For tdbloader2, there is now a "--nostats" argument, but currently it must
> appear like this: --loc=... --nostats FILES. Test that it is
> working properly by loading a small file and checking that stats.opt is
> no longer created in the database directory.
>
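The check described above can be scripted. This is only a minimal sketch: the function name `stats_file_present` is made up for illustration, and it assumes stats.opt sits directly inside the database directory passed via --loc, as the message describes.

```python
from pathlib import Path


def stats_file_present(db_dir):
    """Return True if a stats.opt file exists directly inside db_dir.

    db_dir is assumed to be the directory given to the loader via --loc.
    """
    return (Path(db_dir) / "stats.opt").is_file()
```

After loading a small file with --nostats, `stats_file_present(db_dir)` should return False if the flag took effect.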
>
> There is now a version of tdbloader in the codebase which takes a
> --nostats argument, but I'm still investigating how it does on your data
> profile - the wide range of RDF terms used makes it slower (the
> triple/unique node ratio is significant in loading).
>
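To get a rough feel for the triple/unique node ratio and the number of distinct predicates before loading, something like the following could be used. It is a sketch under assumptions, not a real N-Triples parser: the `TERM` pattern and the `term_stats` helper are invented here, lines that do not yield exactly three terms are skipped, and all distinct terms are held in memory, so it is meant for a sample of the file rather than all of it.

```python
import re

# Rough term pattern: an <IRI>, a blank node label, or a quoted literal with
# an optional language tag or datatype. Not a full N-Triples grammar.
TERM = re.compile(
    r'<[^<>\s]*>'
    r'|_:\S+'
    r'|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^<>\s]*>)?'
)


def term_stats(lines):
    """Return (triples, unique_terms, unique_predicates) for N-Triples lines."""
    triples = 0
    terms = set()
    predicates = set()
    for line in lines:
        found = TERM.findall(line)
        if len(found) != 3:
            continue  # comment, blank line, or something this sketch cannot read
        triples += 1
        terms.update(found)
        predicates.add(found[1])
    return triples, len(terms), len(predicates)
```

Dividing the first value by the second gives the ratio. Lines can be fed from `gzip.open(path, "rt", encoding="utf-8")` for a .nt.gz file.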
> There'll be a development build in the next 24 hours.
>
>
> This is all only in Jena3 - that's another issue.
>
> The data is quite messy.
>
> In Jena2, "riot --validate" will show warnings about a lot of things.
> Jena2 was mostly RDF 1.1 but without introducing incompatibilities.
>
> In RDF 1.1, and Jena3 is RDF 1.1, it's illegal syntax because of
> characters in IRIs: '{', '}', '|', '^', '\'. For testing, I translated
> those to "_". If you use the \u forms of these characters, they will
> at least get into the database - there may be problems later in Jena2 or
> Jena3. Cleaning the data is better, even if it is only used a couple of times.
>
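The two options mentioned above (translating the characters to "_", or writing them in \u form) could be sketched roughly as follows. This is not what was actually run for the test load: the `clean_line` helper and both patterns are assumptions made for illustration. It covers only the five characters listed, leaves quoted literals alone, keeps any existing \uXXXX or \UXXXXXXXX escapes, and will not recognise an IRI that contains whitespace.

```python
import re

# Either a quoted literal (returned unchanged) or an <IRI> token (group 1).
TOKEN = re.compile(r'"(?:[^"\\]|\\.)*"|<([^<>\s]*)>')

# The characters named in the thread; a backslash counts only when it does
# not already start a \uXXXX or \UXXXXXXXX escape.
BAD = re.compile(r'[{}|^]|\\(?!u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8})')


def clean_line(line, escape=False):
    """Rewrite the listed characters inside the IRIs of one N-Triples line.

    escape=False replaces each with "_"; escape=True writes its \\uXXXX form.
    """
    def fix_char(m):
        return "\\u%04X" % ord(m.group(0)) if escape else "_"

    def fix_token(m):
        if m.group(1) is None:  # a literal: leave it untouched
            return m.group(0)
        return "<" + BAD.sub(fix_char, m.group(1)) + ">"

    return TOKEN.sub(fix_token, line)
```

Note that replacing with "_" can merge IRIs that were distinct in the original data, while the \u form keeps them distinct but only defers the problem.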
> For the Turtle/TriG/N-Triples/N-Quads syntaxes, these are violations of the
> grammar. In RDF 1.0, N-Triples was not well defined as a general data
> syntax.
>
> Andy
>