Thanks Andy, I will continue trying with the new versions of Jena next
week and then report the loading results. I know that the data is dirty;
I'm also working on cleaning it.

Daniel

On Thu, 2015-04-30 at 15:18 +0100, Andy Seaborne wrote:
> On 26/04/15 17:44, Daniel Hernández wrote:
> >> Daniel,
> >>
> >> Then I'm baffled as to where the space is going.  While 15e6
> >> predicates is unusual, a 10G heap should be wildly too large and I
> >> don't see how it affects the loading at the point shown.
> >>
> >> For tdbloader2, it seems to be the stats space - that would be
> >> affected by 15e6 predicates. A very large heap should merely slow down
> >> loading, not still run out of space. (I'll look at adding a
> >> "--no-stats" flag anyway)
> >>
> >> Could I get a copy of your data to try out in my development setup
> >> with a profiler?
> >>
> >> Thanks
> >> Andy
> >
> > Andy, I published the file at wd.degu.cl/rawfiles/d3.nt.gz.
> >
> > Thanks a lot!
> > Daniel
> 
> Daniel,
> 
> Yes - it looks like stats collection going wild (and, of course, given 
> the shape of this data, fairly pointless!).
> 
> I ran a modified Jena and loaded 114e6 triples at 76K/s end-to-end with 
> tdbloader2.  I only used the default heap size.
> 
> With stats collection on, I saw what you were seeing, where it goes into 
> GC overload. I suspect the resizing of the statistics map is not helping 
> performance either.
> 
> For tdbloader2, there is now a "--nostats" argument, but currently it 
> must appear like this: --loc=... --nostats FILES.  Test that it is 
> running properly by loading a small file and checking that stats.opt is 
> no longer created in the database directory.
> 
> 
> There is now a version of tdbloader in the codebase which takes a 
> --nostats argument, but I'm still investigating how it does on your data 
> profile - the wide range of RDFTerms used makes it slower (the 
> triple/unique node ratio is significant in loading).
> 
> There'll be a development build in the next 24 hours.
> 
> 
> This is all only in Jena3 - that's another issue.
> 
> The data is quite messy.
> 
> In Jena2, "riot --validate" will show warnings about a lot of things. 
> Jena2 was mostly RDF1.1 but without introducing incompatibilities.
> 
> In RDF 1.1, and Jena3 is RDF 1.1, it's illegal syntax because of these 
> characters in IRIs: '{', '}', '|', '^', '\'.  For testing, I translated 
> those to "_".   If you use the \u forms of these characters, they will 
> at least get into the database - there may be problems later in Jena2 or 
> Jena3.  Cleaning the data is better, even if it is only used a couple of 
> times.
> 
> For Turtle/TriG/N-Triples/N-Quads syntaxes these are violations of the 
> grammar.  In RDF 1.0, N-Triples was not well defined as a general data 
> syntax.
> 
>       Andy
> 
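
For reference, the "translate those characters to _" step mentioned above
could be sketched roughly as follows. This is only an illustration, not the
script Andy used: the function name clean_line and both regular expressions
are assumptions, and it assumes one triple per line with no whitespace
inside IRIs. Existing \uXXXX and \UXXXXXXXX escapes are left alone.

```python
import re

# Characters named in the message as illegal in RDF 1.1 IRIs.
# A backslash is only replaced when it does not start a \u or \U escape.
ILLEGAL = re.compile(r'[{}|^]|\\(?!u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8})')

# Matches either a quoted literal (left untouched) or an <IRI> (cleaned).
TOKEN = re.compile(r'"(?:[^"\\]|\\.)*"|<([^<>\s]*)>')


def clean_line(line):
    """Replace illegal IRI characters with '_' in one N-Triples line."""
    if line.lstrip().startswith("#"):
        return line  # comment line, nothing to do

    def fix(match):
        if match.group(1) is None:
            return match.group(0)  # a literal: keep as-is
        return "<" + ILLEGAL.sub("_", match.group(1)) + ">"

    return TOKEN.sub(fix, line)


if __name__ == "__main__":
    import sys
    # Usage (hypothetical file names): python clean.py < in.nt > out.nt
    for raw in sys.stdin:
        sys.stdout.write(clean_line(raw))
```

Literals are skipped on purpose, since these characters are legal inside
quoted strings; only the text between < and > is rewritten.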

