Hi Jörn,

after having a lot trouble with our group internal DBpedia mirror, I decided
to do a fresh install.

I downloaded all the files (we only need a subset of the dbpedia: {en,de}) and
followed this guide yesterday:
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtBulkRDFLoaderExampleDbpedia

It uses the
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtBulkRDFLoaderScript
to load the dumps (btw: the script could greatly benefit from a few comments).

Now today I found these lines in the logs:

============
23:33:20 PL LOG: Loader started

               Sat Aug 07 2010
00:06:49 PL LOG:  File
/usr/local/data/dbpedia/3.5.1/en/external_links_en.nt.gz error 23000 SR133:
Can not set NULL to not nullable column 'DB.DBA.RDF_QUAD.O'
00:32:56 Checkpoint started
00:34:39 Checkpoint finished, log reused
02:05:30 PL LOG: File /usr/local/data/dbpedia/3.5.1/en/ page_links_en.nt.gz error 23000 SR133: Can not set NULL to not nullable column 'DB.DBA.RDF_QUAD.O'
02:47:44 PL LOG: No more files to load. Loader has finished,
============

So I went to investigate where this comes from and it seems that inside the ld_file procedure of the VirtBulkRDFLoaderScript the error is caught. It uses
the TTLP procedure to load the data.

1. Is there a possibility to for example get the line number where the error occurred? I did a few checks on the external_links_en.nt.gz file with zcat and
grep and I think that a very long URL is the problem (see appendix).


The loader process creates a table called load_list which records which datasets where not loaded completely and for what reason.

Please try:

        select * from load_list where where ll_error is not null;

This should give you good indication of what is going on, including the line number where the failure happened.


2. Can I somehow tell virtuoso not to quit TTLP on such lines, but to either
ignore or truncate them?

There are some flags to the TTLP code that would probably skip this errror, but in certain cases that could insert partial data into the database, which is much harder to clean up, so we have not made it default.


3. How much data was not inserted? The error handler calls a "rollback work", but it seems that in the rdf_loader_run procedure a "commit work" is done only after every 100 files loaded, which would mean that all which was inserted
before is lost? At the same time log_enable(2, 1) is set, which means
autocommit for every row, no log, so why is there a commit at all?


The commit is there because you can overrule the log_enable when calling the rdf_loader_run call. Since you did not, it was on autocommit so all the triples loaded before the bad triple have been committed.


4. How do I continue? Can I simply restart with just these two files after
fixing?


When you have found the offending line, you can dump all the remaining triples of this file to a new file and just load that partial result. Since all records before this point in the file are already in, that should complete without this one bad record.


"Appendix":
=========== did a few checks: ==============
SPARQL select count(*) where { ?s <http://dbpedia.org/property/reference > ?o .
};
results in:
348955

At the same time the external_links_en.nt.gz file has:
5081932 lines (zcat external_links_en.nt.gz | wc -l)

The corresponding lines in the file look like this
(zcat external_links_en.nt.gz | cat -n | head -n $((348955+1)) | tail -n 2 ):
348955  <http://dbpedia.org/resource/Fourteen_Points>
<http://dbpedia.org/property/reference>
<http://wwi.lib.byu.edu/index.php/ President_Wilson's_Fourteen_Points> .
348956  <http://dbpedia.org/resource/Fourteen_Points>
<http://dbpedia.org/property/reference>
<http://www.mtholyoke.edu/acad/intrel/doc31.htm> .

Notice the 's in line 348955, but actually as we got 348955 triples, the problem should've occured in the line after that one, but in line 348956 I see
nothing wrong.

Also notice that
sparql select * where { <http://dbpedia.org/resource/Fourteen_Points>
<http://dbpedia.org/property/reference> ?o .};
results in:
http://www.loc.gov/exhibits/treasures/trm053.html
http://wwi.lib.byu.edu/index.php/Main_Page
http://wwi.lib.byu.edu/index.php/President_Wilson's_Fourteen_Points
http://www.mtholyoke.edu/acad/intrel/doc31.htm
http://www.ourdocuments.gov/doc.php?flash=true&doc=62
http://web.jjay.cuny.edu/jobrien/reference/ob34.html

So line 348956 is imported.

Using nested intervals i found this:
356036  <http://dbpedia.org/resource/Antisocial_personality_disorder>
<http://dbpedia.org/property/reference>
<http://www.faculty.missouristate.edu/M/MichaelCarlie/what_I_learned_about/GANGS/WHYFORM/pathological.htm >
.
356037  <http://dbpedia.org/resource/Hugo_Simberg>
<http://dbpedia.org/property/reference>
<http://www.fng.fi/cgibin/art.pl?fi_collecti_ateneum_group_symbolis_hsimberg_all=hsimberg&w=X0769600&w=X0150100&w=X0152200&w=X0055000&w=X0710200&w=X0150500&w=X0750200&w=X0150900&w=X0148300&w=Y0119400&w=Y0183300&w=X0564300&w=X0168300&w=X0148700&w=X0168700&w=X0564700&w=X0491800&w=X0459400&w=X0788000&w=X0295700&w=X0788400&w=R9394800&w=X0786700&w=A8611800&w=X0151000&w=C8859000&w=X0412300&w=X0151400&w=X0660300&w=X0652200&w=X0151800&w=X0149200&w=X0147500&w=X0565200&w=X0654300&w=X0644500&w=A0600400&w=X0167500&w=X0472700&w=X0149600&w=X0147900&w=X0468400&w=X0167900&w=X0787200&w=X0769300&w=X0787600&w=X0769700&w=X0150200&w=X0150600&w=X0148000&w=X0405500&w=X0168000&w=X0564000&w=X0514600&w=X0093400&w=X0029500&w=X0148400&w=X0564400&w=X0168400&w=X0473600&w=X0457400&w=X0764300&w=X0148800&w=A0281100&w=X0491900&w=X0564800&w=A0605500&w=X0786400&w=X0788500&w=X0786800&w=X0778700&w=X0679800&w=A0598100&w=X0151100&w=X0151500&w=X0680000&w=X0024900&w=X0151900&w=X0149300&w=X0147600&w=R9395000&w=X01 67600&w=X0149700&w=A0513500&w=F0106300&w=X0078900&w=X0769400&w=X0775800&w=X0787700&w=X0769800&w=X0710000&w=X0152000&w=X0150300&w=C9169600&w=C8858300&w=X0813100&w=X0150700&w=X0164300&w=X0168100&w=X0815200&w=X0148500&w=X0566200&w=X0564500&w=X0168500&w=X0790400&w=X0473700&w=X0764400&w=X0148900&w=X0788200&w=X0786500&w=X0788600&w=X0786900&w=X0679900&w=F0125900&w=X0151200&w=X0101800&w=C8859200&w=X0151600&w=X0149000&w=X0246200&w=X0743200&w=X0644300&w=X0149400&w=X0147700&w=X0654500&w=X0563700&w=X0167700&w=X0787000&w=C9002000&w=X0149800&w=X0654900&w=X0466900&w=X0787400&w=A8610400&w=X0769900&w=C8923300&w=X0150000&w=X0150400&w=X0150800&w=X0463200&w=X0148200&w=X0655000&w=X0168200&w=X0491300&w=X0564200&w=X0457200&w=X0148600&w=X0566300&w=X0564600&w=X0168600&w=X0764500&w=X0459700&w=X0788300&w=X0786600&w=X0151300&w=X0101900&w=X0660200&w=X0151700&w=X0149100&w=X0147400&w=X0743300&w=X0149500&w=X0147800&w=X0654600&w=X0563800&w=X0167800&w=X0337900&w=X0787100&w=X0149900&version=html4#first >
.
356038  <http://dbpedia.org/resource/Street>
<http://dbpedia.org/property/reference>
<http://www.jimwegryn.com/Names/Streets.htm> .

Line 356036 can still be found with sparql, 356038 can't. So most likely the
very long URL in line 356037 causes the problem (it has 1966 chars)?

Yes, that is probably the record that is failing. The next dbpedia dump will have a fix for this.



Best regards,


Patrick

Reply via email to