Hi all,
after having a lot of trouble with our group-internal DBpedia mirror, I decided
to do a fresh install.
I downloaded all the files (we only need a subset of DBpedia: {en,de}) and
followed this guide yesterday:
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtBulkRDFLoaderExampleDbpedia
It uses the
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtBulkRDFLoaderScript
to load the dumps (btw: the script could greatly benefit from a few comments).
Now today I found these lines in the logs:
============
23:33:20 PL LOG: Loader started
Sat Aug 07 2010
00:06:49 PL LOG: File /usr/local/data/dbpedia/3.5.1/en/external_links_en.nt.gz error 23000 SR133: Can not set NULL to not nullable column 'DB.DBA.RDF_QUAD.O'
00:32:56 Checkpoint started
00:34:39 Checkpoint finished, log reused
02:05:30 PL LOG: File /usr/local/data/dbpedia/3.5.1/en/page_links_en.nt.gz error 23000 SR133: Can not set NULL to not nullable column 'DB.DBA.RDF_QUAD.O'
02:47:44 PL LOG: No more files to load. Loader has finished,
============
So I went to investigate where this comes from and it seems that inside the
ld_file procedure of the VirtBulkRDFLoaderScript the error is caught. It uses
the TTLP procedure to load the data.
1. Is there a way to get, for example, the line number where the error
occurred? I did a few checks on the external_links_en.nt.gz file with zcat and
grep, and I think that a very long URL is the problem (see appendix).
2. Can I somehow tell Virtuoso not to abort TTLP on such lines, but to either
ignore or truncate them?
3. How much data was not inserted? The error handler calls a "rollback work",
but it seems that in the rdf_loader_run procedure a "commit work" is done only
after every 100 files loaded, which would mean that everything inserted
before that point is lost? At the same time log_enable(2, 1) is set, which means
autocommit for every row, no log, so why is there a commit at all?
4. How do I continue? Can I simply restart with just these two files after
fixing?
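Regarding questions 2 and 4: a possible workaround outside of Virtuoso,
assuming an overlong line really is the cause, would be to split the dump
into the lines that are short enough and the ones that are not, and then try
loading only the first part. Below is a minimal Python sketch of that idea.
The function name split_by_length, the output file names and the max_len
threshold are all assumptions made for illustration; the actual length limit
in Virtuoso is not known here.

```python
import gzip
from typing import Tuple


def split_by_length(src: str, kept: str, rejected: str, max_len: int) -> Tuple[int, int]:
    """Copy lines of at most max_len bytes to `kept`, longer ones to `rejected`.

    Lines are handled as raw bytes so the copies stay byte-identical.
    Returns (kept_count, rejected_count).
    """
    kept_count = 0
    rejected_count = 0
    with gzip.open(src, "rb") as fin, \
            gzip.open(kept, "wb") as fkept, \
            gzip.open(rejected, "wb") as frej:
        for line in fin:
            # Measure the line without its trailing newline characters.
            if len(line.rstrip(b"\r\n")) > max_len:
                frej.write(line)
                rejected_count += 1
            else:
                fkept.write(line)
                kept_count += 1
    return kept_count, rejected_count
```

The rejected file then holds exactly the lines that were left out, so they
can be inspected or shortened by hand later.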
Cheers,
Jörn
"Appendix":
=========== did a few checks: ==============
SPARQL select count(*) where { ?s <http://dbpedia.org/property/reference> ?o .
};
results in:
348955
At the same time the external_links_en.nt.gz file has:
5081932 lines (zcat external_links_en.nt.gz | wc -l)
The corresponding lines in the file look like this
(zcat external_links_en.nt.gz | cat -n | head -n $((348955+1)) | tail -n 2 ):
348955 <http://dbpedia.org/resource/Fourteen_Points>
<http://dbpedia.org/property/reference>
<http://wwi.lib.byu.edu/index.php/President_Wilson's_Fourteen_Points> .
348956 <http://dbpedia.org/resource/Fourteen_Points>
<http://dbpedia.org/property/reference>
<http://www.mtholyoke.edu/acad/intrel/doc31.htm> .
Notice the 's in line 348955. However, since we got 348955 triples, the
problem should have occurred in the line after that one, yet I see nothing
wrong in line 348956.
Also notice that
sparql select * where { <http://dbpedia.org/resource/Fourteen_Points>
<http://dbpedia.org/property/reference> ?o .};
results in:
http://www.loc.gov/exhibits/treasures/trm053.html
http://wwi.lib.byu.edu/index.php/Main_Page
http://wwi.lib.byu.edu/index.php/President_Wilson's_Fourteen_Points
http://www.mtholyoke.edu/acad/intrel/doc31.htm
http://www.ourdocuments.gov/doc.php?flash=true&doc=62
http://web.jjay.cuny.edu/jobrien/reference/ob34.html
So line 348956 is imported.
Using nested intervals (bisecting over the line numbers), I found this:
356036 <http://dbpedia.org/resource/Antisocial_personality_disorder>
<http://dbpedia.org/property/reference>
<http://www.faculty.missouristate.edu/M/MichaelCarlie/what_I_learned_about/GANGS/WHYFORM/pathological.htm>
.
356037 <http://dbpedia.org/resource/Hugo_Simberg>
<http://dbpedia.org/property/reference>
<http://www.fng.fi/cgibin/art.pl?fi_collecti_ateneum_group_symbolis_hsimberg_all=hsimberg&w=X0769600&w=X0150100&w=X0152200&w=X0055000&w=X0710200&w=X0150500&w=X0750200&w=X0150900&w=X0148300&w=Y0119400&w=Y0183300&w=X0564300&w=X0168300&w=X0148700&w=X0168700&w=X0564700&w=X0491800&w=X0459400&w=X0788000&w=X0295700&w=X0788400&w=R9394800&w=X0786700&w=A8611800&w=X0151000&w=C8859000&w=X0412300&w=X0151400&w=X0660300&w=X0652200&w=X0151800&w=X0149200&w=X0147500&w=X0565200&w=X0654300&w=X0644500&w=A0600400&w=X0167500&w=X0472700&w=X0149600&w=X0147900&w=X0468400&w=X0167900&w=X0787200&w=X0769300&w=X0787600&w=X0769700&w=X0150200&w=X0150600&w=X0148000&w=X0405500&w=X0168000&w=X0564000&w=X0514600&w=X0093400&w=X0029500&w=X0148400&w=X0564400&w=X0168400&w=X0473600&w=X0457400&w=X0764300&w=X0148800&w=A0281100&w=X0491900&w=X0564800&w=A0605500&w=X0786400&w=X0788500&w=X0786800&w=X0778700&w=X0679800&w=A0598100&w=X0151100&w=X0151500&w=X0680000&w=X0024900&w=X0151900&w=X0149300&w=X0147600&w=R9395000&w=X0167600&w=X0149700&w=A0513500&w=F0106300&w=X0078900&w=X0769400&w=X0775800&w=X0787700&w=X0769800&w=X0710000&w=X0152000&w=X0150300&w=C9169600&w=C8858300&w=X0813100&w=X0150700&w=X0164300&w=X0168100&w=X0815200&w=X0148500&w=X0566200&w=X0564500&w=X0168500&w=X0790400&w=X0473700&w=X0764400&w=X0148900&w=X0788200&w=X0786500&w=X0788600&w=X0786900&w=X0679900&w=F0125900&w=X0151200&w=X0101800&w=C8859200&w=X0151600&w=X0149000&w=X0246200&w=X0743200&w=X0644300&w=X0149400&w=X0147700&w=X0654500&w=X0563700&w=X0167700&w=X0787000&w=C9002000&w=X0149800&w=X0654900&w=X0466900&w=X0787400&w=A8610400&w=X0769900&w=C8923300&w=X0150000&w=X0150400&w=X0150800&w=X0463200&w=X0148200&w=X0655000&w=X0168200&w=X0491300&w=X0564200&w=X0457200&w=X0148600&w=X0566300&w=X0564600&w=X0168600&w=X0764500&w=X0459700&w=X0788300&w=X0786600&w=X0151300&w=X0101900&w=X0660200&w=X0151700&w=X0149100&w=X0147400&w=X0743300&w=X0149500&w=X0147800&w=X0654600&w=X0563800&w=X0167800&w=X0337900&w=X0787100&w=X0149900&version=html4#first>
.
356038 <http://dbpedia.org/resource/Street>
<http://dbpedia.org/property/reference>
<http://www.jimwegryn.com/Names/Streets.htm> .
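The nested-interval search behind these three lines is essentially a
bisection over line numbers. A minimal Python sketch follows. It assumes that
the loader stops at the first bad line, so that every earlier line is in the
store and no later one is. That assumption can be violated, for example when
the same triple also occurs earlier in the file. The predicate is_loaded is
hypothetical: in practice it would have to look up the triple from the given
line with a SPARQL query, which is not shown here.

```python
from typing import Callable


def first_missing_line(total_lines: int, is_loaded: Callable[[int], bool]) -> int:
    """Return the smallest 1-based line number whose triple is not loaded.

    Assumes is_loaded is True for every line before the first failure and
    False from there on. Returns total_lines + 1 if every line is loaded.
    """
    lo, hi = 1, total_lines + 1  # the answer always lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if is_loaded(mid):
            lo = mid + 1  # the failure is somewhere after mid
        else:
            hi = mid      # mid itself is missing, or an earlier line is
    return lo


# Simulated predicate standing in for a real lookup in the store.
print(first_missing_line(5081932, lambda n: n < 356037))  # prints 356037
```

For a file of 5081932 lines this needs about 23 lookups instead of a linear
scan.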
Line 356036 can still be found with SPARQL; 356038 can't. So most likely the
very long URL in line 356037 (it has 1966 characters) causes the problem?
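To check whether other lines in the dumps are similarly long, the files can
be scanned directly instead of probing the store. The following is a rough
Python sketch. The function name long_lines and the max_len threshold are
assumptions made for illustration, and lengths are counted in bytes per line,
not per URL.

```python
import gzip
from typing import Iterator, Tuple


def long_lines(path: str, max_len: int) -> Iterator[Tuple[int, int]]:
    """Yield (line_number, length_in_bytes) for lines longer than max_len."""
    with gzip.open(path, "rb") as handle:
        for line_no, line in enumerate(handle, start=1):
            # Ignore the trailing newline when measuring the line.
            length = len(line.rstrip(b"\r\n"))
            if length > max_len:
                yield line_no, length
```

Calling list(long_lines("external_links_en.nt.gz", 1900)) would list every
candidate line together with its length; the value 1900 is only a guess
derived from the 1966-character URL above.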