Hi Jörn,
after having a lot trouble with our group internal DBpedia mirror, I
decided
to do a fresh install.
I downloaded all the files (we only need a subset of the dbpedia:
{en,de}) and
followed this guide yesterday:
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtBulkRDFLoaderExampleDbpedia
It uses the
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtBulkRDFLoaderScript
to load the dumps (btw: the script could greatly benefit from a few
comments).
Now today I found these lines in the logs:
============
23:33:20 PL LOG: Loader started
Sat Aug 07 2010
00:06:49 PL LOG: File
/usr/local/data/dbpedia/3.5.1/en/external_links_en.nt.gz error 23000
SR133:
Can not set NULL to not nullable column 'DB.DBA.RDF_QUAD.O'
00:32:56 Checkpoint started
00:34:39 Checkpoint finished, log reused
02:05:30 PL LOG: File /usr/local/data/dbpedia/3.5.1/en/
page_links_en.nt.gz
error 23000 SR133: Can not set NULL to not nullable column
'DB.DBA.RDF_QUAD.O'
02:47:44 PL LOG: No more files to load. Loader has finished,
============
So I went to investigate where this comes from and it seems that
inside the
ld_file procedure of the VirtBulkRDFLoaderScript the error is
caught. It uses
the TTLP procedure to load the data.
1. Is there a possibility to for example get the line number where
the error
occurred? I did a few checks on the external_links_en.nt.gz file
with zcat and
grep and I think that a very long URL is the problem (see appendix).
The loader process creates a table called load_list which records
which datasets where not loaded completely and for what reason.
Please try:
select * from load_list where where ll_error is not null;
This should give you good indication of what is going on, including
the line number where the failure happened.
2. Can I somehow tell virtuoso not to quit TTLP on such lines, but
to either
ignore or truncate them?
There are some flags to the TTLP code that would probably skip this
errror, but in certain cases that could insert partial data into the
database, which is much harder to clean up, so we have not made it
default.
3. How much data was not inserted? The error handler calls a
"rollback work",
but it seems that in the rdf_loader_run procedure a "commit work" is
done only
after every 100 files loaded, which would mean that all which was
inserted
before is lost? At the same time log_enable(2, 1) is set, which means
autocommit for every row, no log, so why is there a commit at all?
The commit is there because you can overrule the log_enable when
calling the rdf_loader_run call. Since you did not, it was on
autocommit so all the triples loaded before the bad triple have been
committed.
4. How do I continue? Can I simply restart with just these two files
after
fixing?
When you have found the offending line, you can dump all the remaining
triples of this file to a new file and just load that partial result.
Since all records before this point in the file are already in, that
should complete without this one bad record.
"Appendix":
=========== did a few checks: ==============
SPARQL select count(*) where { ?s <http://dbpedia.org/property/reference
> ?o .
};
results in:
348955
At the same time the external_links_en.nt.gz file has:
5081932 lines (zcat external_links_en.nt.gz | wc -l)
The corresponding lines in the file look like this
(zcat external_links_en.nt.gz | cat -n | head -n $((348955+1)) |
tail -n 2 ):
348955 <http://dbpedia.org/resource/Fourteen_Points>
<http://dbpedia.org/property/reference>
<http://wwi.lib.byu.edu/index.php/
President_Wilson's_Fourteen_Points> .
348956 <http://dbpedia.org/resource/Fourteen_Points>
<http://dbpedia.org/property/reference>
<http://www.mtholyoke.edu/acad/intrel/doc31.htm> .
Notice the 's in line 348955, but actually as we got 348955 triples,
the
problem should've occured in the line after that one, but in line
348956 I see
nothing wrong.
Also notice that
sparql select * where { <http://dbpedia.org/resource/Fourteen_Points>
<http://dbpedia.org/property/reference> ?o .};
results in:
http://www.loc.gov/exhibits/treasures/trm053.html
http://wwi.lib.byu.edu/index.php/Main_Page
http://wwi.lib.byu.edu/index.php/President_Wilson's_Fourteen_Points
http://www.mtholyoke.edu/acad/intrel/doc31.htm
http://www.ourdocuments.gov/doc.php?flash=true&doc=62
http://web.jjay.cuny.edu/jobrien/reference/ob34.html
So line 348956 is imported.
Using nested intervals i found this:
356036 <http://dbpedia.org/resource/Antisocial_personality_disorder>
<http://dbpedia.org/property/reference>
<http://www.faculty.missouristate.edu/M/MichaelCarlie/what_I_learned_about/GANGS/WHYFORM/pathological.htm
>
.
356037 <http://dbpedia.org/resource/Hugo_Simberg>
<http://dbpedia.org/property/reference>
<http://www.fng.fi/cgibin/art.pl?fi_collecti_ateneum_group_symbolis_hsimberg_all=hsimberg&w=X0769600&w=X0150100&w=X0152200&w=X0055000&w=X0710200&w=X0150500&w=X0750200&w=X0150900&w=X0148300&w=Y0119400&w=Y0183300&w=X0564300&w=X0168300&w=X0148700&w=X0168700&w=X0564700&w=X0491800&w=X0459400&w=X0788000&w=X0295700&w=X0788400&w=R9394800&w=X0786700&w=A8611800&w=X0151000&w=C8859000&w=X0412300&w=X0151400&w=X0660300&w=X0652200&w=X0151800&w=X0149200&w=X0147500&w=X0565200&w=X0654300&w=X0644500&w=A0600400&w=X0167500&w=X0472700&w=X0149600&w=X0147900&w=X0468400&w=X0167900&w=X0787200&w=X0769300&w=X0787600&w=X0769700&w=X0150200&w=X0150600&w=X0148000&w=X0405500&w=X0168000&w=X0564000&w=X0514600&w=X0093400&w=X0029500&w=X0148400&w=X0564400&w=X0168400&w=X0473600&w=X0457400&w=X0764300&w=X0148800&w=A0281100&w=X0491900&w=X0564800&w=A0605500&w=X0786400&w=X0788500&w=X0786800&w=X0778700&w=X0679800&w=A0598100&w=X0151100&w=X0151500&w=X0680000&w=X0024900&w=X0151900&w=X0149300&w=X0147600&w=R9395000&w=X01
67600&w=X0149700&w=A0513500&w=F0106300&w=X0078900&w=X0769400&w=X0775800&w=X0787700&w=X0769800&w=X0710000&w=X0152000&w=X0150300&w=C9169600&w=C8858300&w=X0813100&w=X0150700&w=X0164300&w=X0168100&w=X0815200&w=X0148500&w=X0566200&w=X0564500&w=X0168500&w=X0790400&w=X0473700&w=X0764400&w=X0148900&w=X0788200&w=X0786500&w=X0788600&w=X0786900&w=X0679900&w=F0125900&w=X0151200&w=X0101800&w=C8859200&w=X0151600&w=X0149000&w=X0246200&w=X0743200&w=X0644300&w=X0149400&w=X0147700&w=X0654500&w=X0563700&w=X0167700&w=X0787000&w=C9002000&w=X0149800&w=X0654900&w=X0466900&w=X0787400&w=A8610400&w=X0769900&w=C8923300&w=X0150000&w=X0150400&w=X0150800&w=X0463200&w=X0148200&w=X0655000&w=X0168200&w=X0491300&w=X0564200&w=X0457200&w=X0148600&w=X0566300&w=X0564600&w=X0168600&w=X0764500&w=X0459700&w=X0788300&w=X0786600&w=X0151300&w=X0101900&w=X0660200&w=X0151700&w=X0149100&w=X0147400&w=X0743300&w=X0149500&w=X0147800&w=X0654600&w=X0563800&w=X0167800&w=X0337900&w=X0787100&w=X0149900&version=html4#first
>
.
356038 <http://dbpedia.org/resource/Street>
<http://dbpedia.org/property/reference>
<http://www.jimwegryn.com/Names/Streets.htm> .
Line 356036 can still be found with sparql, 356038 can't. So most
likely the
very long URL in line 356037 causes the problem (it has 1966 chars)?
Yes, that is probably the record that is failing. The next dbpedia
dump will have a fix for this.
Best regards,
Patrick