Dear Apache Jena users,

you will also find this mail as https://stackoverflow.com/questions/63486767/how-can-i-get-the-fuseki-api-via-sparqlwrapper-to-properly-report-a-detailed-err
In the last few weeks I tried out some graph databases in the Python environment, namely:

- Weaviate, see http://wiki.bitplan.com/index.php/Weaviate
- Dgraph, see http://wiki.bitplan.com/index.php/Dgraph
- ruruki, see https://pypi.org/project/ruruki/

and created a test project documented at http://wiki.bitplan.com/index.php/DgraphAndWeaviateTest and open source at https://github.com/WolfgangFahl/DgraphAndWeaviateTest.

After some ups and downs in the evaluation process I decided to try out Apache Jena / Fuseki / SPARQL as an alternative and added

- https://github.com/WolfgangFahl/DgraphAndWeaviateTest/blob/master/storage/sparql.py
- https://github.com/WolfgangFahl/DgraphAndWeaviateTest/blob/master/tests/testSPARQL.py

to allow for a "round trip" operation between a Python list of dicts and Jena/SPARQL-based storage. The approach performs very well for my use case, but after trying it out for a while I am getting into details that need to be addressed. The Stack Overflow question https://stackoverflow.com/questions/63435157/listofdict-to-rdf-conversion-in-python-targeting-apache-jena-fuseki/63440396#63440396 addresses the initial issues, and issues 2-5 at https://github.com/WolfgangFahl/DgraphAndWeaviateTest/issues?q=is%3Aissue+is%3Aclosed show some detail problems that were already fixed.

Now I am working with some 180000 records I'd like to import from 6 different data sources, and each data source seems to have new exotic records that make the approach fail. E.g.
one batch of records gives me the following log:

  read 45601 events in 0.6 s
  storing 45601 events to sparql
  batch for     1 -  2000 of 45601 cr:Event in 0.6 s -> 0.6 s
  batch for  2001 -  4000 of 45601 cr:Event in 0.5 s -> 1.1 s
  batch for  4001 -  6000 of 45601 cr:Event in 0.5 s -> 1.6 s
  batch for  6001 -  8000 of 45601 cr:Event in 0.5 s -> 2.1 s
  batch for  8001 - 10000 of 45601 cr:Event in 0.5 s -> 2.6 s
  batch for 10001 - 12000 of 45601 cr:Event in 0.7 s -> 3.2 s

  ======================================================================
  ERROR: testCrossref (tests.test_Crossref.TestCrossref)
  test loading crossref data
  ----------------------------------------------------------------------
  Traceback (most recent call last):
    File "/Users/wf/Library/Python/3.8/lib/python/site-packages/SPARQLWrapper/Wrapper.py", line 1073, in _query
      response = urlopener(request)
    File "/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 222, in urlopen
      return opener.open(url, data, timeout)
    File "/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 531, in open
      response = meth(req, response)
    File "/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 640, in http_response
      response = self.parent.error(
    File "/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 569, in error
      return self._call_chain(*args)
    File "/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 502, in _call_chain
      result = func(*args)
    File "/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 649, in http_error_default
      raise HTTPError(req.full_url, code, msg, hdrs, fp)
  urllib.error.HTTPError: HTTP Error 400: Bad Request

  SPARQLWrapper.SPARQLExceptions.QueryBadFormed: QueryBadFormed: a bad request has been sent to the endpoint, probably the sparql query is bad formed.
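As a side note on how such a failure could at least be inspected: a possible workaround (a sketch of my own, not what sparql.py currently does; the endpoint URL is an assumption, and Fuseki does accept updates POSTed with Content-Type application/sparql-update per the SPARQL 1.1 Protocol) is to send the failing batch with plain urllib and read the body that the server attaches to the HTTP error. How much detail that body contains still depends on the server:

```python
import urllib.error
import urllib.request


def run_update(endpoint_url, sparql_update):
    """POST a SPARQL update; on an HTTP error, raise with the server's body.

    SPARQLWrapper hides the response body of a 400, but urllib keeps it
    on the HTTPError object, where e.read() returns it.
    """
    request = urllib.request.Request(
        endpoint_url,
        data=sparql_update.encode("utf-8"),
        headers={"Content-Type": "application/sparql-update"},
    )
    try:
        with urllib.request.urlopen(request) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as e:
        # e.read() returns whatever body the server sent with the error
        detail = e.read().decode("utf-8", errors="replace")
        raise RuntimeError(
            "HTTP %d from %s:\n%s" % (e.code, endpoint_url, detail)
        ) from e


# hypothetical usage against a local Fuseki dataset:
# run_update("http://localhost:3030/example/update", "INSERT DATA { ... }")
```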
  Response: b'Error 400: Bad Request\n'

Since I don't get any details on what the problem is, I am working with a binary search. From the error above I only know that the problem is in a record with a batchIndex between 12000 and 14000, so I set the limit to 14000 and the batchSize to 100 to get closer.

  batch for 13301 - 13400 of 14000 cr:Event in 0.0 s -> 4.3 s

is now the last successful batch. So I continue with a binary search: 13450 fail, 13425 fail, 13412 ok, 13418 ok, 13422 fail, 13420 ok, 13421 ok. So record 13422 is the culprit, and I switch on debug mode to see the INSERT DATA created for the record:

  cr:Event__102140gtm20003 cr:Event_name "Higher local fields".
  cr:Event__102140gtm20003 cr:Event_location "M\\"unster, Germany".
  cr:Event__102140gtm20003 cr:Event_source "crossref".
  cr:Event__102140gtm20003 cr:Event_eventId "10.2140/gtm.2000.3".
  cr:Event__102140gtm20003 cr:Event_title "Invitation to higher local fields".
  cr:Event__102140gtm20003 cr:Event_startDate "1999-08-29"^^<http://www.w3.org/2001/XMLSchema#date>.
  cr:Event__102140gtm20003 cr:Event_year 1999.
  cr:Event__102140gtm20003 cr:Event_month 9.
  cr:Event__102140gtm20003 cr:Event_endDate "1999-09-05"^^<http://www.w3.org/2001/XMLSchema#date>.

So the Umlaut encoding "\\u" in the location "Münster" is the culprit here. I will work around this issue. The real question is:

*How can I get the Fuseki API via SPARQLWrapper to properly report a detailed error message, e.g. with something like: error in line #: cr:Event__102140gtm20003 cr:Event_location "M\\"unster, Germany". is not a valid triple?*

Yours
  Wolfgang

-- 
BITPlan - smart solutions
Wolfgang Fahl
Pater-Delp-Str. 1, D-47877 Willich Schiefbahn
Tel. +49 2154 811-480, Fax +49 2154 811-481
Web: http://www.bitplan.de
BITPlan GmbH, Willich - HRB 6820 Krefeld, Steuer-Nr.: 10258040548, Geschäftsführer: Wolfgang Fahl
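P.S.: for completeness, the escaping itself is easy to work around. A minimal sketch of a literal escaper (a helper of my own, not part of SPARQLWrapper), following the ECHAR escape sequences of the SPARQL 1.1 grammar:

```python
def escape_sparql_literal(value):
    """Escape a Python string for use inside a double-quoted SPARQL literal.

    Covers the ECHAR escapes of the SPARQL 1.1 grammar; the backslash is
    handled first so that the escapes added afterwards are not doubled.
    """
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r")
                 .replace("\t", "\\t"))


# a TeX-style umlaut like the one in the failing record
location = 'M\\"unster, Germany'   # i.e. M\"unster, Germany
print('cr:Event__102140gtm20003 cr:Event_location "%s".'
      % escape_sparql_literal(location))
```

With the raw backslash and quote escaped, the triple parses instead of terminating the literal early.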