On 08/09/12 08:10, Phani Sajja wrote:
Thanks Andy,


2/ Better: run "riot" on the files first to validate them and convert to
N-Triples, keep the N-Triples output and load those.

Much better to "check then load" than have a large load crash due to bad
data.

Parsing of complex formats like RDF/XML slows the bulk loader down.


I followed the above steps:
1. Validate the RDF/XML
2. Convert RDF/XML to N-Triples using the *rdfparse* command line tool
3. Load the N-Triples output to TDB using the *tdbloader* command line tool

Command: *tdbloader* --loc ~/development/odp-rdf/ content.n3

.nt is N-Triples

.n3 is N3 (it so happens N-Triples is a subset of N3, and the Jena N3 parser isn't an N3 parser: it's Turtle!).


Loading finished with three types of warnings:

    - {W107} Bad URI:
    - {W131} String not in Unicode Normal Form C:
    - {W121} String is not legal in XML 1.1;

After loading it gives me
Completed: 22,389,276 triples loaded in 4,309.30 seconds [Rate: 5,195.57
per second]

I tried to count the triples using SPARQL query

SELECT (count(*) AS ?count) { ?s ?p ?o }

Triple count = 21669903

Does tdbloader omit the triples that produced warnings?

Why is there a change in the number of triples?

Duplicates.

The "Completed:" figure is the number of triples seen; each duplicate counts.

The query sees unique triples.

The indexing step should show 21,669,903
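
With the numbers above, 22,389,276 seen minus 21,669,903 stored leaves 719,373 duplicates. As a rough illustration of the seen-versus-unique distinction (not how tdbloader works internally), here is a minimal Python sketch that counts both for N-Triples lines. The function name count_triples and the sample data are made up for the example, and it only treats character-identical lines as duplicates, so it may undercount if the same triple is written with different whitespace.

```python
def count_triples(lines):
    """Return (seen, unique) for an iterable of N-Triples lines.

    Assumption: one triple per line, duplicates are textually identical.
    """
    seen = 0
    unique = set()
    for line in lines:
        line = line.strip()
        # Skip blank lines and comments; they are not triples.
        if not line or line.startswith("#"):
            continue
        seen += 1
        unique.add(line)
    return seen, len(unique)


# Hypothetical sample data: the first triple appears twice.
sample = [
    '<http://example.org/a> <http://example.org/p> "x" .',
    '<http://example.org/a> <http://example.org/p> "x" .',
    '<http://example.org/b> <http://example.org/p> "y" .',
]

seen, unique = count_triples(sample)
print(seen, unique, seen - unique)  # prints: 3 2 1
```

For a real file, pass an open file handle instead of the sample list. Holding every line in a set needs a lot of memory at 22 million triples, so this is only meant to show the idea.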

        Andy



