Dear Apache Jena Users,

you'll also find this mail as a Stack Overflow question at
https://stackoverflow.com/questions/63486767/how-can-i-get-the-fuseki-api-via-sparqlwrapper-to-properly-report-a-detailed-err

In the last few weeks I tried out some graph databases in the Python
environment, namely:

- weaviate http://wiki.bitplan.com/index.php/Weaviate

- dgraph http://wiki.bitplan.com/index.php/Dgraph

- ruruki https://pypi.org/project/ruruki/

and created a test project documented at
http://wiki.bitplan.com/index.php/DgraphAndWeaviateTest and open source at:
https://github.com/WolfgangFahl/DgraphAndWeaviateTest

After some ups and downs in the evaluation process I decided to try out
Apache Jena / Fuseki / SPARQL as an alternative and added:

https://github.com/WolfgangFahl/DgraphAndWeaviateTest/blob/master/storage/sparql.py
and
https://github.com/WolfgangFahl/DgraphAndWeaviateTest/blob/master/tests/testSPARQL.py

to allow for a "round trip" operation between Python lists of dicts and
Jena/SPARQL-based storage.
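
For readers who don't want to dig through the repository, here is a
minimal sketch of the round trip idea; the endpoint URLs, the cr: prefix
and the naming are placeholders, and the real code is more elaborate:

from SPARQLWrapper import SPARQLWrapper, JSON, POST

# placeholder endpoint URLs and prefix - adjust to your Fuseki setup
UPDATE_URL = "http://localhost:3030/example/update"
QUERY_URL = "http://localhost:3030/example/query"
PREFIX = "PREFIX cr: <http://example.bitplan.com/crossref#>\n"

def insertListOfDicts(records):
    """store a list of dicts as a single INSERT DATA update"""
    triples = ""
    for record in records:
        # the real code sanitizes the id into a valid local name
        subject = "cr:Event__%s" % record["eventId"]
        for key, value in record.items():
            # naive literal formatting, everything as a plain string -
            # the real code adds datatypes; this naivety is exactly
            # what breaks for exotic records, as described below
            triples += '  %s cr:Event_%s "%s".\n' % (subject, key, value)
    sparql = SPARQLWrapper(UPDATE_URL)
    sparql.setMethod(POST)
    sparql.setQuery(PREFIX + "INSERT DATA {\n%s}" % triples)
    sparql.query()

def queryListOfDicts():
    """read the stored triples back into one dict per subject"""
    sparql = SPARQLWrapper(QUERY_URL)
    sparql.setReturnFormat(JSON)
    sparql.setQuery(PREFIX + "SELECT ?s ?p ?o WHERE { ?s ?p ?o }")
    rows = sparql.query().convert()["results"]["bindings"]
    records = {}
    for row in rows:
        record = records.setdefault(row["s"]["value"], {})
        record[row["p"]["value"]] = row["o"]["value"]
    return list(records.values())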

The approach performs very well for my use case, but after trying it out
for a while I am getting into more details that need to be addressed.

The Stack Overflow question
https://stackoverflow.com/questions/63435157/listofdict-to-rdf-conversion-in-python-targeting-apache-jena-fuseki/63440396#63440396
addresses the initial issues, and the closed issues 2-5 at
https://github.com/WolfgangFahl/DgraphAndWeaviateTest/issues?q=is%3Aissue+is%3Aclosed
show some detail problems that have already been fixed.

Now I am working with some 180,000 records I'd like to import from 6
different data sources, and each data source seems to have new exotic
records that make the approach fail.

E.g. one batch of records gives me the following log:

read 45601 events in   0.6 s
storing 45601 events to sparql
  batch for         1 -      2000 of     45601 cr:Event in    0.6 s ->    0.6 s
  batch for      2001 -      4000 of     45601 cr:Event in    0.5 s ->    1.1 s
  batch for      4001 -      6000 of     45601 cr:Event in    0.5 s ->    1.6 s
  batch for      6001 -      8000 of     45601 cr:Event in    0.5 s ->    2.1 s
  batch for      8001 -     10000 of     45601 cr:Event in    0.5 s ->    2.6 s
  batch for     10001 -     12000 of     45601 cr:Event in    0.7 s ->    3.2 s
======================================================================
ERROR: testCrossref (tests.test_Crossref.TestCrossref)
test loading crossref data
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/wf/Library/Python/3.8/lib/python/site-packages/SPARQLWrapper/Wrapper.py", line 1073, in _query
    response = urlopener(request)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request

SPARQLWrapper.SPARQLExceptions.QueryBadFormed: QueryBadFormed: a bad request has been sent to the endpoint, probably the sparql query is bad formed.

Response:
b'Error 400: Bad Request\n'
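
As an aside: catching the exception and re-posting the update directly
confirms that there is no more detail in the response body. A minimal
sketch, assuming a local update endpoint (tryUpdate is a hypothetical
helper, not part of my repository):

import requests
from SPARQLWrapper import SPARQLWrapper, POST
from SPARQLWrapper.SPARQLExceptions import QueryBadFormed

def tryUpdate(updateQuery, endpoint="http://localhost:3030/example/update"):
    """run an update and print whatever error detail is available"""
    sparql = SPARQLWrapper(endpoint)
    sparql.setMethod(POST)
    sparql.setQuery(updateQuery)
    try:
        sparql.query()
    except QueryBadFormed as qbf:
        # SPARQLWrapper puts the endpoint's response body into the
        # message - here it is just b'Error 400: Bad Request\n'
        print(qbf)
        # bypass SPARQLWrapper: re-post the update directly to see the
        # raw status line and body that Fuseki sends back
        response = requests.post(endpoint, data={"update": updateQuery})
        print("HTTP %d: %s" % (response.status_code, response.text))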

Since I don't get any details on what the problem is, I am working with
a binary search. From the error above I only know that the problem is
with a record with a batchIndex between 12000 and 14000, so I am
setting the limit to 14000 and the batchSize to 100 to get closer.
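
The bisection could also be scripted, e.g. along these lines, where
storeBatch is a hypothetical stand-in for my batch insert that raises
QueryBadFormed when the batch contains a bad record:

from SPARQLWrapper.SPARQLExceptions import QueryBadFormed

def findCulprit(records, lo, hi, storeBatch):
    """bisect records[lo:hi+1] for the single bad record;
    storeBatch(batch) is assumed to raise QueryBadFormed on failure"""
    while lo < hi:
        mid = (lo + hi) // 2
        try:
            storeBatch(records[lo:mid + 1])
            lo = mid + 1  # batch up to mid went through - culprit is above
        except QueryBadFormed:
            hi = mid      # culprit is at or below mid
    return lo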

  batch for     13301 -     13400 of     14000 cr:Event in    0.0 s ->    4.3 s

is now the last successful batch. Bisecting by hand gives: 13450 fails,
13425 fails, 13412 ok, 13418 ok, 13422 fails, 13420 ok, 13421 ok.
So record 13422 is the culprit, and I switch on debug mode to see the
INSERT DATA created for the record:

  cr:Event__102140gtm20003 cr:Event_name "Higher local fields".
  cr:Event__102140gtm20003 cr:Event_location "M\\"unster, Germany".
  cr:Event__102140gtm20003 cr:Event_source "crossref".
  cr:Event__102140gtm20003 cr:Event_eventId "10.2140/gtm.2000.3".
  cr:Event__102140gtm20003 cr:Event_title "Invitation to higher local fields".
  cr:Event__102140gtm20003 cr:Event_startDate "1999-08-29"^^<http://www.w3.org/2001/XMLSchema#date>.
  cr:Event__102140gtm20003 cr:Event_year 1999.
  cr:Event__102140gtm20003 cr:Event_month 9.
  cr:Event__102140gtm20003 cr:Event_endDate "1999-09-05"^^<http://www.w3.org/2001/XMLSchema#date>.

So the umlaut encoding in the location "Münster" is the culprit here:
the TeX-style \"u from the source data ends up as \\" in the literal,
where \\ is read as an escaped backslash and the now unescaped quote
terminates the string early. I will work around this issue by escaping
the literal properly.
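
A sketch of such an escaping helper, following the SPARQL/Turtle string
escape rules (the backslash has to be handled first); rdflib's
Literal(...).n3() should produce correctly escaped literals as well:

def escapeSparqlLiteral(value: str) -> str:
    """escape a Python string for use in a double-quoted SPARQL literal"""
    # escape the backslash first so the other replacements
    # don't get double-escaped
    for char, escape in [
        ("\\", "\\\\"),
        ('"', '\\"'),
        ("\n", "\\n"),
        ("\r", "\\r"),
        ("\t", "\\t"),
    ]:
        value = value.replace(char, escape)
    return value

# the TeX-style umlaut now survives the round trip:
# M\"unster, Germany  ->  "M\\\"unster, Germany"
print('"%s"' % escapeSparqlLiteral('M\\"unster, Germany'))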

The real question is: how can I get the Fuseki API via SPARQLWrapper to
properly report a detailed error message, e.g. something like: error in
line # cr:Event__102140gtm20003 cr:Event_location "M\\"unster, Germany".
is not a valid triple?


Yours

   Wolfgang

-- 

BITPlan - smart solutions
Wolfgang Fahl
Pater-Delp-Str. 1, D-47877 Willich Schiefbahn
Tel. +49 2154 811-480, Fax +49 2154 811-481
Web: http://www.bitplan.de
BITPlan GmbH, Willich - HRB 6820 Krefeld, Steuer-Nr.: 10258040548, 
Geschäftsführer: Wolfgang Fahl 
