Hi all,

After having a lot of trouble with our group-internal DBpedia mirror, I decided 
to do a fresh install.

I downloaded all the files (we only need a subset of DBpedia: {en,de}) and 
followed this guide yesterday:
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtBulkRDFLoaderExampleDbpedia

It uses the 
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtBulkRDFLoaderScript 
to load the dumps (btw: the script could greatly benefit from a few comments).

Now today I found these lines in the logs:

============
23:33:20 PL LOG: Loader started

                Sat Aug 07 2010
00:06:49 PL LOG:  File /usr/local/data/dbpedia/3.5.1/en/external_links_en.nt.gz error 23000 SR133: Can not set NULL to not nullable column 'DB.DBA.RDF_QUAD.O'
00:32:56 Checkpoint started
00:34:39 Checkpoint finished, log reused
02:05:30 PL LOG:  File /usr/local/data/dbpedia/3.5.1/en/page_links_en.nt.gz error 23000 SR133: Can not set NULL to not nullable column 'DB.DBA.RDF_QUAD.O'
02:47:44 PL LOG: No more files to load. Loader has finished,
============

So I went to investigate where this comes from and it seems that inside the 
ld_file procedure of the VirtBulkRDFLoaderScript the error is caught. It uses 
the TTLP procedure to load the data.

1. Is there a way to get, for example, the line number where the error 
occurred? I ran a few checks on the external_links_en.nt.gz file with zcat and 
grep, and I think that a very long URL is the problem (see appendix).

2. Can I somehow tell Virtuoso not to abort TTLP on such lines, but to either 
ignore or truncate them?

3. How much data was not inserted? The error handler calls "rollback work", 
but it seems that the rdf_loader_run procedure only does a "commit work" 
after every 100 files loaded. Would that mean that everything inserted before 
the error is lost? At the same time log_enable(2, 1) is set, which means 
autocommit for every row, no log, so why is there a commit at all?

4. How do I continue? Can I simply restart with just these two files after 
fixing?
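
To make "fixing" concrete: a minimal sketch, assuming the over-long URL really 
is the cause, would be to copy each dump and drop the lines that exceed some 
length limit. The function name drop_long_lines and the default limit of 1900 
are made up for illustration; the limit is a guess, not a documented Virtuoso 
value.

```python
import gzip


def drop_long_lines(src_path, dst_path, max_len=1900):
    """Copy a gzipped N-Triples file, skipping over-long lines.

    Lengths are measured in bytes, without the trailing newline.
    max_len is an assumed threshold, not a known Virtuoso limit.
    Returns a list of (line_number, length) for every skipped line.
    """
    skipped = []
    with gzip.open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        for line_no, line in enumerate(src, start=1):
            length = len(line.rstrip(b"\r\n"))
            if length > max_len:
                # Remember what was dropped so it can be reviewed later.
                skipped.append((line_no, length))
            else:
                dst.write(line)
    return skipped
```

The returned list shows exactly which lines were left out, so the loss is 
explicit rather than silent.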

Cheers,
Jörn




"Appendix":
=========== did a few checks: ==============
SPARQL select count(*) where { ?s <http://dbpedia.org/property/reference> ?o . 
};
results in:
348955

At the same time the external_links_en.nt.gz file has:
5081932 lines (zcat external_links_en.nt.gz | wc -l)

The corresponding lines in the file look like this
(zcat external_links_en.nt.gz | cat -n | head -n $((348955+1)) | tail -n 2 ):
348955  <http://dbpedia.org/resource/Fourteen_Points> 
<http://dbpedia.org/property/reference> 
<http://wwi.lib.byu.edu/index.php/President_Wilson's_Fourteen_Points> .
348956  <http://dbpedia.org/resource/Fourteen_Points> 
<http://dbpedia.org/property/reference> 
<http://www.mtholyoke.edu/acad/intrel/doc31.htm> .

Notice the 's in line 348955. But since we got 348955 triples, the problem 
should have occurred in the line after that one, and in line 348956 I see 
nothing wrong.

Also notice that
sparql select * where { <http://dbpedia.org/resource/Fourteen_Points> 
<http://dbpedia.org/property/reference> ?o .};
results in:
http://www.loc.gov/exhibits/treasures/trm053.html
http://wwi.lib.byu.edu/index.php/Main_Page
http://wwi.lib.byu.edu/index.php/President_Wilson's_Fourteen_Points
http://www.mtholyoke.edu/acad/intrel/doc31.htm
http://www.ourdocuments.gov/doc.php?flash=true&doc=62
http://web.jjay.cuny.edu/jobrien/reference/ob34.html

So line 348956 is imported.

Using nested intervals (bisecting the line range), I found this:
356036  <http://dbpedia.org/resource/Antisocial_personality_disorder> 
<http://dbpedia.org/property/reference> 
<http://www.faculty.missouristate.edu/M/MichaelCarlie/what_I_learned_about/GANGS/WHYFORM/pathological.htm>
 
.
356037  <http://dbpedia.org/resource/Hugo_Simberg> 
<http://dbpedia.org/property/reference> 
<http://www.fng.fi/cgibin/art.pl?fi_collecti_ateneum_group_symbolis_hsimberg_all=hsimberg&w=X0769600&w=X0150100&w=X0152200&w=X0055000&w=X0710200&w=X0150500&w=X0750200&w=X0150900&w=X0148300&w=Y0119400&w=Y0183300&w=X0564300&w=X0168300&w=X0148700&w=X0168700&w=X0564700&w=X0491800&w=X0459400&w=X0788000&w=X0295700&w=X0788400&w=R9394800&w=X0786700&w=A8611800&w=X0151000&w=C8859000&w=X0412300&w=X0151400&w=X0660300&w=X0652200&w=X0151800&w=X0149200&w=X0147500&w=X0565200&w=X0654300&w=X0644500&w=A0600400&w=X0167500&w=X0472700&w=X0149600&w=X0147900&w=X0468400&w=X0167900&w=X0787200&w=X0769300&w=X0787600&w=X0769700&w=X0150200&w=X0150600&w=X0148000&w=X0405500&w=X0168000&w=X0564000&w=X0514600&w=X0093400&w=X0029500&w=X0148400&w=X0564400&w=X0168400&w=X0473600&w=X0457400&w=X0764300&w=X0148800&w=A0281100&w=X0491900&w=X0564800&w=A0605500&w=X0786400&w=X0788500&w=X0786800&w=X0778700&w=X0679800&w=A0598100&w=X0151100&w=X0151500&w=X0680000&w=X0024900&w=X0151900&w=X0149300&w=X0147600&w=R9395000&w=X0167600&w=X0149700&w=A0513500&w=F0106300&w=X0078900&w=X0769400&w=X0775800&w=X0787700&w=X0769800&w=X0710000&w=X0152000&w=X0150300&w=C9169600&w=C8858300&w=X0813100&w=X0150700&w=X0164300&w=X0168100&w=X0815200&w=X0148500&w=X0566200&w=X0564500&w=X0168500&w=X0790400&w=X0473700&w=X0764400&w=X0148900&w=X0788200&w=X0786500&w=X0788600&w=X0786900&w=X0679900&w=F0125900&w=X0151200&w=X0101800&w=C8859200&w=X0151600&w=X0149000&w=X0246200&w=X0743200&w=X0644300&w=X0149400&w=X0147700&w=X0654500&w=X0563700&w=X0167700&w=X0787000&w=C9002000&w=X0149800&w=X0654900&w=X0466900&w=X0787400&w=A8610400&w=X0769900&w=C8923300&w=X0150000&w=X0150400&w=X0150800&w=X0463200&w=X0148200&w=X0655000&w=X0168200&w=X0491300&w=X0564200&w=X0457200&w=X0148600&w=X0566300&w=X0564600&w=X0168600&w=X0764500&w=X0459700&w=X0788300&w=X0786600&w=X0151300&w=X0101900&w=X0660200&w=X0151700&w=X0149100&w=X0147400&w=X0743300&w=X0149500&w=X0147800&w=X0654600&w=X0563800&w=X0167800&w=X0337900&w=X0787100&w=X0149900&version=html4#first>
 
.
356038  <http://dbpedia.org/resource/Street> 
<http://dbpedia.org/property/reference> 
<http://www.jimwegryn.com/Names/Streets.htm> .
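
The nested-interval search behind these line numbers can be sketched roughly 
as below. It assumes the loader stops at the first bad line, so every line 
before it is in the store and every line from it onwards is not. The 
is_loaded callback is hypothetical: in practice it would look up the triple 
on line n with a SPARQL query, which is not shown here.

```python
def first_missing_line(is_loaded, lo, hi):
    """Return the smallest line number in (lo, hi] that is not loaded.

    Preconditions (assumptions of this sketch):
      * is_loaded(lo) is true and is_loaded(hi) is false,
      * the answer flips from true to false exactly once in between.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_loaded(mid):
            lo = mid   # mid made it into the store: look further down the file
        else:
            hi = mid   # mid is missing: the break is at mid or before it
    return hi
```

For a file of about 5 million lines this needs only around 23 lookups. A 
triple that also occurs on an earlier line would break the single-flip 
assumption, so the result is worth checking by hand.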

Line 356036 can still be found with sparql, 356038 can't. So most likely the 
very long URL in line 356037 causes the problem (it has 1966 chars)?
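
To check whether other lines are similarly long, the dump can be scanned for 
its longest lines. This is only a sketch: longest_lines is a name chosen for 
illustration, and lengths are counted in bytes for the whole line, not just 
for the URL.

```python
import gzip
import heapq


def longest_lines(path, top=5):
    """Return up to `top` (length, line_number) pairs, longest first.

    Lengths are in bytes and exclude the trailing newline.
    """
    with gzip.open(path, "rb") as f:
        lengths = (
            (len(line.rstrip(b"\r\n")), line_no)
            for line_no, line in enumerate(f, start=1)
        )
        # nlargest consumes the generator while the file is still open.
        return heapq.nlargest(top, lengths)
```

Running it over page_links_en.nt.gz as well would show whether the second 
failing file contains comparable outliers.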
