Good Day,
My name is Jomari Peterson and I am relatively new to Semantic Web
applications. My expertise is actually in process management and
strategy. However, I am trying to learn Jena very quickly for a project
that I want to develop as a demonstration. I have done most of the
tutorial work from the Jena site using a portion of the RDF data
downloaded from BaseKB's website. That data is a modified version of
Freebase's data dumps. I am relying heavily on combining work that has
already been done and the examples set by others.
This leads me to my problem today. First, however, I would like to
thank everyone who has made this possible for me. I appreciate the
documentation that is out there. I have been taking notes as I glean
information from the site and other sources, so that once I feel I have
the base knowledge I can pass them along to assist future developers.
At this point, I am able to upload, query and manipulate data from
BaseKB. This is primarily because BaseKB is divided into separate,
smaller files, which makes things more manageable. The files are in
N-Triples syntax. I also downloaded Freebase's recently released RDF data
dump and wanted to use it, since it comes directly from the source. I
waited to use it until I had taught myself about TDB and Fuseki, since
the file is over 30 GB.
After testing with my BaseKB files, I was able to query and manipulate
the data. However, when I went to upload the Freebase RDF data dump, I
received the following output:
22:57:16 INFO Fuseki :: [7] POST http://localhost:3030/ds/upload
22:57:16 INFO Fuseki :: [7] Upload: Filename: freebase-rdf-2012-11-27-15-46.ttl, Content-Type=application/octet-stream, Charset=null => Turtle
22:57:18 WARN Fuseki :: [line: 100091, col: 54] Bad IRI: <http://ja.wikipedia.org/wiki/イヴ・サン=ローラン> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
22:57:18 WARN Fuseki :: [line: 100091, col: 54] Bad IRI: <http://ja.wikipedia.org/wiki/イヴ・サン=ローラン> Code: 56/COMPATIBILITY_CHARACTER in PATH: TODO
22:57:19 WARN Fuseki :: [line: 182783, col: 33] Language not valid: es-419
22:57:19 WARN Fuseki :: [line: 182874, col: 24] Language not valid: es-419
22:57:19 WARN Fuseki :: [line: 230804, col: 54] Bad IRI: <http://ja.wikipedia.org/wiki/エシュ=シュル=アルゼット> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
22:57:19 WARN Fuseki :: [line: 230804, col: 54] Bad IRI: <http://ja.wikipedia.org/wiki/エシュ=シュル=アルゼット> Code: 56/COMPATIBILITY_CHARACTER in PATH: TODO
22:57:20 WARN Fuseki :: [line: 263095, col: 33] Language not valid: es-419
22:57:20 WARN Fuseki :: [line: 263271, col: 24] Language not valid: es-419
22:57:20 WARN Fuseki :: [line: 291130, col: 54] Bad IRI: <http://rpggeek.com/rpg/426/Changeling: The Dreaming> Code: 17/WHITESPACE in PATH: A single whitespace character. These match no grammar rules of URIs/IRIs. These characters are permitted in RDF URI References, XML system identifiers, and XML Schema anyURIs.
22:57:20 WARN Fuseki :: [line: 298926, col: 36] Bad IRI: <http://http:urbis.com> Code: 0/ILLEGAL_CHARACTER in PORT: The character violates the grammar rules for URIs/IRIs.
22:57:22 WARN Fuseki :: [line: 320172, col: 55] Bad IRI: <http://pt.wikipedia.org/wiki/Estudo_Transcendental_Nº12> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
22:57:22 WARN Fuseki :: [line: 320172, col: 55] Bad IRI: <http://pt.wikipedia.org/wiki/Estudo_Transcendental_Nº12> Code: 56/COMPATIBILITY_CHARACTER in PATH: TODO
22:57:22 WARN Fuseki :: [line: 331805, col: 47] Bad IRI: <http://www.skygate-int.com/ (defunct)> Code: 17/WHITESPACE in PATH: A single whitespace character. These match no grammar rules of URIs/IRIs. These characters are permitted in RDF URI References, XML system identifiers, and XML Schema anyURIs.
22:57:22 WARN Fuseki :: [line: 334838, col: 55] Bad IRI: <http://de.wikipedia.org/wiki/Éclairs_sur_l’Au-delà_…> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
22:57:22 WARN Fuseki :: [line: 334838, col: 55] Bad IRI: <http://de.wikipedia.org/wiki/Éclairs_sur_l’Au-delà_…> Code: 56/COMPATIBILITY_CHARACTER in PATH: TODO
22:57:22 ERROR Fuseki :: [line: 338972, col: 114] Broken IRI (bad character: '<'): http://www.nhs.uk/Services/Hospitals/Overview/DefaultView.aspx?id
22:57:22 INFO Fuseki :: [7] 500 Server Error
I learned about URIs in the course of this work, and I gather from my
reading that IRIs are an internationalized extension of URIs that permits
additional characters. I don't know whether the ViolationsCodes section
would have helped, but it is currently not available on the "Support for
Internationalised Resource Identifiers in Jena" page on the Apache site.
Now the question(s): Is there a way to skip these types of errors and
continue with the rest of the dataset/file? Would it be better to try
loading it into TDB first? And, probably outside the scope of this work:
how could I go about fixing or deleting the broken IRI characters in such
a big file? I appreciate your time and help.
(Freebase users informed me that there are about 7800 of these types of
errors in this data dump, so advice on what I would need to do to delete
or skip them would be appreciated. The BaseKB dump does not have this
issue, but according to the owner of that dump it is spread across 1000+
different files.)
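To make the "delete them" question concrete, here is a minimal sketch in Python of a streaming pre-filter that copies clean lines to one file and diverts suspect lines to another, so the 30 GB file never has to fit in memory. It is not a Jena or Fuseki feature. It assumes the dump holds one complete triple per line and uses only double-quoted literals, and it judges IRIs against the character set allowed by the Turtle IRIREF grammar rule, so it also drops the lines that Fuseki only warned about for whitespace. The function names and file names are hypothetical.

```python
import re

# String literals are blanked out first so '<' or '>' inside text is ignored.
LITERAL = re.compile(r'"(?:[^"\\]|\\.)*"')
# Anything between '<' and the next '>' is treated as an IRI candidate.
IRI_CANDIDATE = re.compile(r'<([^>]*)>')
# Characters (or \u / \U escapes) allowed inside <...> by Turtle's IRIREF rule.
VALID_IRI = re.compile(
    r'(?:[^\x00-\x20<>"{}|^`\\]|\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8})*'
)


def line_is_clean(line):
    """Return True if every <...> on the line looks like a well-formed IRI."""
    stripped = LITERAL.sub('""', line)
    for match in IRI_CANDIDATE.finditer(stripped):
        if VALID_IRI.fullmatch(match.group(1)) is None:
            return False
    # An unmatched '<' or '>' outside a literal also indicates a broken IRI.
    leftover = IRI_CANDIDATE.sub('', stripped)
    return '<' not in leftover and '>' not in leftover


def filter_file(src_path, good_path, bad_path):
    """Stream src_path line by line; return (kept, dropped) line counts."""
    kept = dropped = 0
    with open(src_path, encoding='utf-8') as src, \
         open(good_path, 'w', encoding='utf-8') as good, \
         open(bad_path, 'w', encoding='utf-8') as bad:
        for line in src:
            if line_is_clean(line):
                good.write(line)
                kept += 1
            else:
                bad.write(line)
                dropped += 1
    return kept, dropped
```

A call such as `filter_file("freebase.ttl", "freebase-clean.ttl", "freebase-rejected.ttl")` would leave the rejected lines available for later inspection or repair. The check is a heuristic and does not catch the NFKC or language-tag warnings.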
I manually gave the Freebase data dump a .ttl extension because its
documentation states that it is in Turtle syntax. The beginning of the
data looks like this:
@prefix ns: <http://rdf.freebase.com/ns/>.
@prefix key: <http://rdf.freebase.com/key/>.
@prefix owl: <http://www.w3.org/2002/07/owl#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
ns:american_football.football_historical_roster_position.number ns:type.object.name "Number"@en.
ns:american_football.football_historical_roster_position.number ns:type.property.unique true.
ns:american_football.football_historical_roster_position.number ns:type.object.type ns:type.property.
ns:american_football.football_historical_roster_position.number rdfs:label "Number"@en.
ns:american_football.football_historical_roster_position.number ns:type.property.expected_type ns:type.int.
ns:american_football.football_historical_roster_position.number ns:type.property.schema ns:american_football.football_historical_roster_position.
ns:american_football.football_historical_roster_position.number rdf:type owl:FunctionalProperty.
ns:american_football.football_historical_roster_position.number rdfs:domain ns:american_football.football_historical_roster_position.
ns:american_football.football_historical_roster_position.number rdfs:range ns:type.int.
ns:american_football.football_player.footballdb_id ns:type.property.expected_type ns:type.enumeration.
ns:american_football.football_player.footballdb_id ns:type.object.type ns:type.property.
ns:american_football.football_player.footballdb_id ns:type.property.unique true.
ns:american_football.football_player.footballdb_id ns:type.property.schema ns:american_football.football_player.
ns:american_football.football_player.footballdb_id rdfs:label "footballdb ID"@en.
ns:american_football.football_player.footballdb_id ns:type.object.name "footballdb ID"@en.
ns:american_football.football_player.footballdb_id rdf:type owl:FunctionalProperty.
ns:american_football.football_player.footballdb_id rdfs:domain ns:american_football.football_player.
ns:american_football.football_player.footballdb_id rdfs:range ns:type.enumeration.
ns:astronomy.astronomical_observatory.discoveries ns:type.property.expected_type
--
Jomari Peterson
"Creating the Context for Miracles"
707-373-1093