I solved it by modifying the properties file so that the warnings are written to a separate file. The half-made quads were caused by the warnings: they were printed at the same time as the quads and cut them in two.
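
For anyone who hits the same problem: a minimal sketch of a pre-load check that looks for log lines or cut-off statements in a merged N-Triples/N-Quads file. It is not a Jena tool, and the function name `check_statements` is hypothetical. It assumes one statement per line, as riot writes them, and it is only a heuristic: a fragment that happens to start with a term and end with a dot still passes, so the parser remains the real check.

```shell
#!/bin/sh
# Heuristic check of a merged N-Triples/N-Quads file (one statement per line).
# Prints "file:line: text" for every non-blank, non-comment line that does not
# start with an IRI ("<") or a blank node ("_:"), or that does not end with ".".
# Returns non-zero if at least one such line was found.
check_statements() {
    awk '
        BEGIN { bad = 0 }
        /^[ \t]*$/ { next }    # skip blank lines
        /^[ \t]*#/ { next }    # skip comment lines
        !/^[ \t]*(<|_:)/ || !/\.[ \t]*$/ {
            print FILENAME ":" FNR ": " $0
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

Calling `check_statements data.nt` before the bulk load lists the suspicious lines with their line numbers, which is usually quicker than waiting for the loader to stop on the first one.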
> From: [email protected]
> To: [email protected]
> Subject: RE: Bulk load on several files
> Date: Fri, 11 Jul 2014 10:44:49 -0300
>
> I tried by the two ways, first I made a Re-Extraction of all files, this time
> with nquads format.
> Then I ran the riot command. And when I tried to do the bulk load it crashes:
>
> INFO Add: 1,400,000 Data (Batch: 83,056 / Avg: 82,440)
> ERROR [line: 1447837, col: 16] Expected BNode or IRI: Got: [KEYWORD:riot]
> org.apache.jena.riot.RiotException: [line: 1447837, col: 16] Expected BNode
> or IRI: Got: [KEYWORD:riot]
>
> I checked on the large document that line:
> 19:34:36 WARN riot :: [line: 35, col: 74] Bad IRI:
> <http://www.w3.org/1999/xhtml/vocab#lytebox[vacation]> Code:
> 0/ILLEGAL_CHARACTER in FRAGMENT: The character violates the grammar rules for
> URIs/IRIs.
>
> It was generated by riot. This is the real quad:
> <http://www.chip.de/artikel/Acer-Iconia_Tab_W500-Tablet-Test_49999382.html>
> <http://www.w3.org/1999/xhtml/vocab#lytebox[vacation]>
> <http://www.chip.de/artikel/Acer-Iconia_Tab_W500-Tablet-Test_49999382.html//ii/1/1/8/6/5/1/4/0/usb-d71b549bd8d991fc.JPG>
> <http://www.chip.de/artikel/Acer-Iconia_Tab_W500-Tablet-Test_49999382.html> .
>
> So I proceeded to delete all the lines which contains WARN riot (because
> they were many), and then it started to crash with half made quads (also
> tried to delete the problematic quads, but they were too many).
>
> I decided to try again with the multiple file load. I generated a nt file
> containing the name of all the nq documents. Like this:
> _:uri137256 <http://example.org/name>
> <file:/home/guidoz/workspace/extraidosconreview/https%3A%2F%2Formigo.com%2Fprodukte%2Fsonstiges%2Fwebdesign%2F.nq>
> .
> _:uri137257 <http://example.org/name>
> <file:/home/guidoz/workspace/extraidosconreview/https%3A%2F%2Formigo.com%2Fprodukte%2Fsonstiges%2Fweb-development%2F.nq>
> .
>
> I got this exception again
>
> guidoz@lifia-4872:~/workspace$ cat listaExtraidosConReview.nt | tdbloader2
> --loc /home/guidoz/workspace/rdfMaven/database
> 10:43:10 -- TDB Bulk Loader Start
> 10:43:10 Data phase
> File does not exist: -
>
>
> > Date: Mon, 7 Jul 2014 19:53:47 +0100
> > From: [email protected]
> > To: [email protected]
> > Subject: Re: Bulk load on several files
> >
> > On 07/07/14 14:49, Guido Zuccarelli wrote:
> > > I did the following script:
> > > for f in ~/workspace/extraidos/*; do
> > > riot $f >> data.nt
> > > done
> > > It was extremately slow, so I will have to do it using the first way.
> >
> > It's faster then the loading ... if nothing else, loading does that and
> > also other work.
> >
> > > What did you mean when you said "the input must be in nt"?.
> >
> > tdloader can't know the syntax of stdin, is it assumes N-triples
> > (commonly used for dumps)
> >
> > > A file with the name of all the ttl files written in triples?
> > > for example:
> > >
> > > _:uri1 <name> 3D2658086.ttl
> > > _:uri2 <name> 3D1218343208681.ttl
> > > ...
> >
> > That isn't legal TTL or NT : This is illegal --> 3D2658086.ttl
> >
> > Andy
> >
> > >
> > > Thanks again,
> > > Guido
> > >
> > >> Date: Fri, 4 Jul 2014 21:12:23 +0100
> > >> From: [email protected]
> > >> To: [email protected]
> > >> Subject: Re: Bulk load on several files
> > >>
> > >> On 04/07/14 19:04, Guido Zuccarelli wrote:
> > >>> Thank you! I think this would be the easier way.
> > >>> You can go from ttl files to nt that easy?
> > >>
> > >> "riot" will output N-triples/N-Quads.
> > >>
> > >> In fact, it's a good idea to parse your files before loading - it
> > >> catches syntax problems (inc warnings) that it's good to know about
> > >> before loading.
> > >>
> > >> Andy
> > >>
> > >>
> > >>> Best regards
> > >>>
> > >>>
> > >>>> Date: Fri, 4 Jul 2014 18:48:12 +0100
> > >>>> From: [email protected]
> > >>>> To: [email protected]
> > >>>> Subject: Re: Bulk load on several files
> > >>>>
> > >>>> On 04/07/14 18:27, Andy Seaborne wrote:
> > >>>>> On 04/07/14 17:20, Guido Zuccarelli wrote:
> > >>>>>> Hello,
> > >>>>>>
> > >>>>>> I have a directory with 200,000+ ttl files that I want to
> > >>>>>> load into a TDB database. The command help only specifies the
> > >>>>>> sintaxis
> > >>>>>> for one file load.
> > >>>>>
> > >>>>> tdbloader2 --help
> > >>>>> ==>
> > >>>>> Usage: tdbloader2 --loc location datafile ...
> > >>>>>
> > >>>>> "..." indicates as many files as you like.
> > >>>>>
> > >>>>>> I tried with the following command:
> > >>>>>>
> > >>>>>> cat ../listaExtraidos.txt | tdbloader2 --loc
> > >>>>>> /home/guidoz/workspace/rdfMaven/database
> > >>>>>
> > >>>>> if it's reading from stdin, then the input must be N-quads (N-triples)
> > >>>>>
> > >>>>>>
> > >>>>>> where listaExtraidos.txt is a space-separated list of ttl files
> > >>>>>> obtained by the ls command.
> > >>>>>> It hits me this exception:
> > >>>>>>
> > >>>>>> 12:35:17 -- TDB Bulk Loader Start
> > >>>>>> 12:35:17 Data phase
> > >>>>>> File does not exist: -
> > >>>>>
> > >>>>> A minor bug - just now fixed.
> > >>>>>
> > >>>>>>
> > >>>>>> Is there any way to do this, or I will need to join the files?
> > >>>>
> > >>>> PS
> > >>>>
> > >>>> better to put all on the tdbloader2 if you can get 200K files there
> > >>>> else ...
> > >>>>
> > >>>> Do not join files if they have any blank nodes.
> > >>>>
> > >>>> _:a is the same blank node within a file.
> > >>>>
> > >>>> If you do a blank node with label, after concatenation, it will be the
> > >>>> same blank node in all files.
> > >>>>
> > >>>>
> > >>>> for each file:
> > >>>> riotcmd.riot file.ttl >> data.nt
> > >>>>
> > >>>> then tdbloader --loc whatever "data.nt" (or tdbloader2)
> > >>>>
> > >>>> The parser command "riot" will generate stable identifiers that don't
> > >>>> clash.
> > >>>>
> > >>>>
> > >>>> Andy
> > >>>>
> > >>>>>>
> > >>>>>> Guido.
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>>
> > >>
> > >
> > >
> >
>
