Re: [Virtuoso-users] strange error when bulk-loading Turtle files

2018-12-18 Thread Hugh Williams
Hi Peter,

I generated the datasets with your Python script and loaded them into a local 
Virtuoso Open Source instance multiple times, but did not see any occurrence 
of the error:

SQL> select * from load_list;
ll_file                ll_graph                ll_state  ll_started                    ll_done                       ll_host  ll_work_time  ll_error
VARCHAR NOT NULL       VARCHAR                 INTEGER   TIMESTAMP                     TIMESTAMP                     INTEGER  INTEGER       VARCHAR
_______________________________________________________________________________

./wikidata/test00.ttl  http://test.nuance.com  2         2018.12.19 0:47.54 826749000  2018.12.19 0:47.54 983316000  0        NULL          NULL
./wikidata/test01.ttl  http://test.nuance.com  2         2018.12.19 0:47.54 826749000  2018.12.19 0:47.55 10566      0        NULL          NULL
./wikidata/test02.ttl  http://test.nuance.com  2         2018.12.19 0:47.54 826749000  2018.12.19 0:47.55 233562000  0        NULL          NULL
./wikidata/test03.ttl  http://test.nuance.com  2         2018.12.19 0:47.54 826749000  2018.12.19 0:47.55 371457000  0        NULL          NULL
./wikidata/test04.ttl  http://test.nuance.com  2         2018.12.19 0:47.54 826749000  2018.12.19 0:47.55 483846000  0        NULL          NULL
./wikidata/test05.ttl  http://test.nuance.com  2         2018.12.19 0:47.54 826749000  2018.12.19 0:47.55 621974000  0        NULL          NULL
./wikidata/test06.ttl  http://test.nuance.com  2         2018.12.19 0:47.54 826749000  2018.12.19 0:47.55 742255000  0        NULL          NULL
./wikidata/test07.ttl  http://test.nuance.com  2         2018.12.19 0:47.54 826749000  2018.12.19 0:47.55 860062000  0        NULL          NULL
./wikidata/test08.ttl  http://test.nuance.com  2         2018.12.19 0:47.54 826749000  2018.12.19 0:47.55 993561000  0        NULL          NULL
./wikidata/test09.ttl  http://test.nuance.com  2         2018.12.19 0:47.54 826749000  2018.12.19 0:47.56 140431000  0        NULL          NULL
./wikidata/test10.ttl  http://test.nuance.com  2         2018.12.19 0:47.54 82779      2018.12.19 0:47.54 985386000  0        NULL          NULL
./wikidata/test11.ttl  http://test.nuance.com  2         2018.12.19 0:47.54 82779      2018.12.19 0:47.55 109072000  0        NULL          NULL
./wikidata/test12.ttl  http://test.nuance.com  2         2018.12.19 0:47.54 82779      2018.12.19 0:47.55 230846000  0        NULL          NULL
./wikidata/test13.ttl  http://test.nuance.com  2         2018.12.19 0:47.54 82779      2018.12.19 0:47.55 375427000  0        NULL          NULL
./wikidata/test14.ttl  http://test.nuance.com  2         2018.12.19 0:47.54 82779      2018.12.19 0:47.55 486963000  0        NULL          NULL
./wikidata/test15.ttl  http://test.nuance.com  2

Re: [Virtuoso-users] strange error when bulk-loading Turtle files

2018-12-18 Thread Peter F. Patel-Schneider
I created some synthetic data that tickles the bug reliably on my machine with
a standard virtuoso.ini (just adding the directory for the files to the
allowed list).  I'm attaching the generator program for the files and a
loading script.

peter


On 12/18/18 9:46 AM, Peter F. Patel-Schneider wrote:
> I did a bit of digging and it sure looks as if there is a race condition in
> rdf_rl_lang_id in ttlpv.sql.   This code appears to check to see if the
> language tag is already in DB.DBA.RDF_LANGUAGE and adds it if not.  But
> another thread could do the same insert between the check and the insert, as
> far as I can tell.
> 
> It looks to me as if the right solution is to do a soft insert and a
> subsequent query instead of a hard insert.
> 
> However, I don't understand how locking works in SQL so there may be something
> that prevents another thread from interfering.
> 
> peter
> 
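#!/usr/local/bin/python2.7

# Write 20 Turtle files, test00.ttl .. test19.ttl.  Each file holds 10 subjects,
# and each subject gets 676 literals tagged @laa .. @lzz plus a final "JUNK".
for x in range(0, 20):
    with open('test{:0>2d}.ttl'.format(x), 'w') as out:
        for k in range(0, 10):
            # The subject and predicate IRIs were lost from this line in the
            # archived copy; the example.org IRIs below are placeholders only.
            out.write('<http://example.org/s{:0>2d}{:0>2d}> <http://example.org/p>\n'.format(x, k))
            for y in range(ord('a'), ord('z') + 1):
                for z in range(ord('a'), ord('z') + 1):
                    out.write('"description {:0>2d}{:0>3d}{:0>3d}"@l{:s}{:s},\n'.format(x, y, z, chr(y), chr(z)))
            out.write('  "JUNK".\n')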
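
For orientation, here is a small sketch of what the generator's literal line
('@l{:s}{:s}' with two letters from a to z) implies. It only recomputes the
language tags and counts; the variable names are mine and nothing here talks
to Virtuoso.

```python
import string

# Language tags emitted for each subject: 'l' followed by two lowercase letters.
tags = ['l' + a + b
        for a in string.ascii_lowercase
        for b in string.ascii_lowercase]

subjects_per_file = 10  # the inner "for k in range(0,10)" loop
tagged_literals_per_file = subjects_per_file * len(tags)

print(len(tags))                 # distinct language tags per file
print(tags[0], tags[-1])         # first and last tag
print(tagged_literals_per_file)  # language-tagged literals per file
```

Every file therefore introduces the same 676 tags, in the same order, within
the literals of its first subject. That would make concurrent loaders meet
each new tag at roughly the same point, which seems to be what the synthetic
data is designed to provoke.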


test.sh
Description: application/shellscript

Re: [Virtuoso-users] strange error when bulk-loading Turtle files

2018-12-18 Thread Peter F. Patel-Schneider
I did a bit of digging and it sure looks as if there is a race condition in
rdf_rl_lang_id in ttlpv.sql.   This code appears to check to see if the
language tag is already in DB.DBA.RDF_LANGUAGE and adds it if not.  But
another thread could do the same insert between the check and the insert, as
far as I can tell.

It looks to me as if the right solution is to do a soft insert and a
subsequent query instead of a hard insert.

However, I don't understand how locking works in SQL so there may be something
that prevents another thread from interfering.
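
A minimal sketch of the suspected pattern and of the proposed fix, using
Python's sqlite3 module as a stand-in. This is not the Virtuoso code from
ttlpv.sql: the table "lang", its columns and all function names are
hypothetical, the two "loaders" are simulated by ordering the steps by hand on
one connection, and SQLite's INSERT OR IGNORE is only an analogue of a soft
insert.

```python
import sqlite3


def make_table():
    """Create an in-memory table with a unique language-tag column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE lang ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "tag TEXT NOT NULL UNIQUE)"
    )
    return conn


def is_known(conn, tag):
    """Check step: is the tag already in the table?"""
    row = conn.execute("SELECT 1 FROM lang WHERE tag = ?", (tag,)).fetchone()
    return row is not None


def racy_pair(conn, tag):
    """Two loaders both check before either inserts, then both hard-insert."""
    a_sees_missing = not is_known(conn, tag)
    b_sees_missing = not is_known(conn, tag)  # stale once loader A inserts
    errors = []
    for sees_missing in (a_sees_missing, b_sees_missing):
        if sees_missing:
            try:
                conn.execute("INSERT INTO lang (tag) VALUES (?)", (tag,))
            except sqlite3.IntegrityError as exc:
                errors.append(str(exc))
    return errors


def soft_insert_then_lookup(conn, tag):
    """Insert if absent, ignore a duplicate, then read back the id."""
    conn.execute("INSERT OR IGNORE INTO lang (tag) VALUES (?)", (tag,))
    return conn.execute("SELECT id FROM lang WHERE tag = ?", (tag,)).fetchone()[0]


racy_errors = racy_pair(make_table(), "ki")

conn = make_table()
id_a = soft_insert_then_lookup(conn, "ki")
id_b = soft_insert_then_lookup(conn, "ki")
```

In this model the second hard insert fails with a uniqueness error, while the
soft-insert variant gives both callers the same id no matter who gets there
first. Whether Virtuoso's locking already rules out the first interleaving is
exactly the open question.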

peter


On 12/18/18 8:55 AM, Peter F. Patel-Schneider wrote:
> I'm loading the Turtle Wikidata RDF complete dump, split into pieces and
> loaded with 10 active readers.   About half the time the load fails with one
> or more of these errors.  The errors are always near the beginning of the
> load---in the first group of 10 files to be loaded and near the beginning of
> the files (generally in the first couple of hundred lines in a file of size
> well over 1 GB).  No errors occur for any files beyond the first ten.
> 
> I could provide the files, but they total to about 340GB.
> 
> It sure looks as if there is some sort of bug when loading RDF language-tagged
> strings, where a race condition means that two threads are trying to load the
> same language tag into DB.DBA.RDF_LANGUAGE.  This would explain why the
> problem occurs only at the beginning of the load, when the language tags are
> being added to DB.DBA.RDF_LANGUAGE, and not later.  It would also explain why
> the errors are different between different runs.  (The only other explanation
> would be hardware errors, but this doesn't seem to be viable.)
> 
> It seems to me that a quick patch for this problem would be to change the
> insert into a soft insert, but I don't know where to make this change in the 
> code.
> 
> peter
> 


___
Virtuoso-users mailing list
Virtuoso-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/virtuoso-users


Re: [Virtuoso-users] strange error when bulk-loading Turtle files

2018-12-18 Thread Peter F. Patel-Schneider
I'm loading the Turtle Wikidata RDF complete dump, split into pieces and
loaded with 10 active readers.   About half the time the load fails with one
or more of these errors.  The errors are always near the beginning of the
load---in the first group of 10 files to be loaded and near the beginning of
the files (generally in the first couple of hundred lines in a file of size
well over 1 GB).  No errors occur for any files beyond the first ten.

I could provide the files, but they total to about 340GB.

It sure looks as if there is some sort of bug when loading RDF language-tagged
strings, where a race condition means that two threads are trying to load the
same language tag into DB.DBA.RDF_LANGUAGE.  This would explain why the
problem occurs only at the beginning of the load, when the language tags are
being added to DB.DBA.RDF_LANGUAGE, and not later.  It would also explain why
the errors are different between different runs.  (The only other explanation
would be hardware errors, but this doesn't seem to be viable.)
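
To illustrate why such a race would only bite at the start of a load, here is
a deliberately pessimistic toy model in Python. It assumes that every loader
checks for a tag before any of them inserts it, which real threads will only
do some of the time; the function name and the tag stream are made up for the
example, and no Virtuoso code is involved.

```python
def lockstep_load(num_loaders, tag_stream):
    """Model loaders that all meet the same tag at the same moment."""
    table = set()     # stands in for the language-tag table
    collisions = []   # (position, tag) for every duplicate insert attempt
    for position, tag in enumerate(tag_stream):
        # Phase 1: every loader runs its check before anyone inserts.
        wants_insert = [tag not in table for _ in range(num_loaders)]
        # Phase 2: the loaders insert one after another.
        for wants in wants_insert:
            if wants:
                if tag in table:
                    collisions.append((position, tag))
                else:
                    table.add(tag)
    return collisions


# Three new tags at the start, then the same tags repeated many times.
stream = ["kg", "ki", "en"] + ["en", "kg", "ki"] * 100
collisions = lockstep_load(10, stream)

print(len(collisions))                    # duplicate insert attempts
print(max(pos for pos, _ in collisions))  # last position with a collision
```

All collisions in this model fall on the first occurrence of each tag. Once a
tag is in the table every check succeeds and no further inserts are attempted,
which matches errors that show up only near the beginning of the first files.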

It seems to me that a quick patch for this problem would be to change the
insert into a soft insert, but I don't know where to make this change in the 
code.

peter




On 12/11/18 7:11 PM, Hugh Williams wrote:
> Hi Peter,
> 
> The triple values do indeed appear to be valid, but the problem could be
> somewhere else in the dataset file and not necessarily on the reported line or
> the line before it.
> 
> Is it a public dataset you are loading, and if so can you provide a copy for
> local testing?
> 
> Best Regards
> Hugh Williams
> Professional Services
> OpenLink Software
> Home Page: http://www.openlinksw.com
> Community Support: https://community.openlinksw.com
> Weblogs (Blogs):
> Company Blog: https://medium.com/openlink-software-blog
> Virtuoso Blog: https://medium.com/virtuoso-blog
> Data Access Drivers
> Blog: https://medium.com/openlink-odbc-jdbc-ado-net-data-access-drivers
> LinkedIn -- http://www.linkedin.com/company/openlink-software/
> Twitter  -- http://twitter.com/OpenLink
> Google+  -- http://plus.google.com/100570109519069333827/
> Facebook -- http://www.facebook.com/OpenLinkSoftware
> Universal Data Access, Integration, and Management Technology Providers
> 
> 
> 
> 
>> On 11 Dec 2018, at 17:45, Peter F. Patel-Schneider wrote:
>>
>> I'm loading a bunch of Turtle files and I'm getting the error
>>
>> 2300 TURTLE RDF loader, line 1012: SR197: Non unique primary key on
>> DB.DBA.RDF_LANGUAGE
>>
>> The line in question looks fine:
>>
>>   "Wikimedia template"@ki,
>>
>> The line before it may indicate the issue
>>
>>    "Wikimedia template"@kg,
>>
>> Nonetheless this should be valid RDF so there appears to be a bug in Virtuoso
>> here.
>>
>> Is there any workaround?
>>
>>
>> This is in Virtuoso 07.20.3230.
>>
>> peter