Yes, I am using the same data, but the February 2018 snapshot, since that is
when I started experimenting. For example, the following code produces the
error shown below.
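package ldbqPack;
import java.util.Iterator;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.sparql.core.DatasetGraph;
import org.apache.jena.sparql.core.Quad;
public class ReadQuadInJena {
    public static void main(String[] args) {
        String fileName = "/home/samita/Dyldo_DS_4Feb2018/data.nq";
        // Parses the whole N-Quads file into an in-memory dataset; a syntax
        // error anywhere in the file aborts the load with a RiotException.
        DatasetGraph dsg = RDFDataMgr.loadDatasetGraph(fileName);
        Iterator<Quad> iterQuad = dsg.find();
        while (iterQuad.hasNext()) {
            System.out.println(iterQuad.next());
        }
    }
}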
Exception in thread "main" org.apache.jena.riot.RiotException: [line: 89841,
col: 232] Illegal character in IRI (codepoint 0x7C, '|'):
<http://fonts.googleapis.com/css?family=Nunito[|]...>
at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.fatal(ErrorHandlerFactory.java:147)
at org.apache.jena.riot.lang.LangEngine.raiseException(LangEngine.java:148)
at org.apache.jena.riot.lang.LangEngine.nextToken(LangEngine.java:105)
at org.apache.jena.riot.lang.LangNQuads.parseOne(LangNQuads.java:67)
at org.apache.jena.riot.lang.LangNQuads.runParser(LangNQuads.java:54)
at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:41)
at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:195)
at org.apache.jena.riot.RDFParser.read(RDFParser.java:334)
at org.apache.jena.riot.RDFParser.parseURI(RDFParser.java:303)
at org.apache.jena.riot.RDFParser.parse(RDFParser.java:277)
at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:498)
at org.apache.jena.riot.RDFDataMgr.parseFromURI(RDFDataMgr.java:890)
at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:519)
at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:486)
at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:439)
at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:419)
at org.apache.jena.riot.RDFDataMgr.loadDatasetGraph(RDFDataMgr.java:392)
at ldbqPack.ReadQuadInJena.main(ReadQuadInJena.java:19)
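
One way to find lines like the one reported above before loading is a rough
character scan. The sketch below uses only the Java standard library and flags
characters that the N-Quads grammar does not permit inside <...> IRIs (the '|'
in the error is one of them). The class name IriPrecheck, its method, and the
example.org URLs are made up for illustration. The scan is much looser than
Jena's own parser, so treat it as a triage aid, not as a validator.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical helper: rough scan for characters not allowed inside N-Quads IRIs. */
final class IriPrecheck {
    // Characters excluded by the N-Quads IRIREF rule (controls and space are checked separately).
    private static final String FORBIDDEN = "<>\"{}|^`\\";

    /** Returns the 1-based column of the first illegal IRI character, or -1 if none is found. */
    static int firstIllegalColumn(String line) {
        boolean inIri = false;
        boolean inLiteral = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inLiteral) {
                if (c == '\\') {
                    i++;                       // skip the escaped character
                } else if (c == '"') {
                    inLiteral = false;         // end of the string literal
                }
            } else if (inIri) {
                if (c == '>') {
                    inIri = false;             // end of the IRI
                } else if (c == '\\' && i + 1 < line.length()
                        && (line.charAt(i + 1) == 'u' || line.charAt(i + 1) == 'U')) {
                    i++;                       // numeric escapes (backslash-u, backslash-U) are legal
                } else if (c <= 0x20 || FORBIDDEN.indexOf(c) >= 0) {
                    return i + 1;              // report as a 1-based column
                }
            } else if (c == '<') {
                inIri = true;
            } else if (c == '"') {
                inLiteral = true;
            } else if (c == '#') {
                break;                         // rest of the line is a comment
            }
        }
        return -1;
    }

    /** Prints line and column for every line of the given file that contains a hit. */
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            String line;
            int lineNo = 0;
            while ((line = in.readLine()) != null) {
                lineNo++;
                int col = firstIllegalColumn(line);
                if (col > 0) {
                    System.out.println("line " + lineNo + ", col " + col + ": '" + line.charAt(col - 1) + "'");
                }
            }
        }
    }
}
```

A '|' inside a quoted literal is not flagged, only one inside an IRI. Whether
to repair such lines (for example by percent-encoding) or to drop them is a
separate decision that this sketch does not make.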
________________________________
From: Andy Seaborne <[email protected]>
Sent: 16 April 2018 22:13:36
To: [email protected]
Subject: Re: TDB 2 Store Parameters
I downloaded
http://swse.deri.org/dyldo/data/2016-03-27/data.nq.gz
(the latest I could find)
and used tdbloader.
Is that the data you are using?
Andy
On 16/04/18 17:32, ajs6f wrote:
> You should be able to check the validity of any of your files just by running
> them through Jena's `riot` command.
>
> You can try loading them into a TDB1 or TDB2 db by using the `tdbloader` or
> `tdb2.tdbloader` commands.
>
> ajs6f
>
>> On Apr 16, 2018, at 12:28 PM, Samita Bai / PhD CS Scholar @ City Campus
>> <[email protected]> wrote:
>>
>> OK Andy I got your point. Can you please share the code that you used to
>> read the Dynamic Linked Data Observatory dataset?
>>
>>
>>
>> Regards,
>>
>> Samita Bai
>>
>> ________________________________
>> From: Andy Seaborne <[email protected]>
>> Sent: 16 April 2018 15:34:07
>> To: [email protected]
>> Subject: Re: TDB 2 Store Parameters
>>
>> If you wish to process the data as it is parsed, then see StreamRDF and
>> either
>>
>> NxParser, which is not part of Jena, is not a validating parser.
>>
>> If the data is not valid, then you will have problems at some point,
>> either loading, querying or outputting later.
>>
>> Adam has explained that TDB2 indexes heavily so that querying is well
>> served.
>>
>> We can't help with the parser errors without knowing what they are.
>>
>> Which files from Dynamic Linked Data Observatory are you processing?
>> Don't the later ones replace the earlier ones?
>>
>> I found that the last n-quads file was 42 million triples and all valid.
>>
>> Andy
>>
>> On 16/04/18 11:05, ajs6f wrote:
>>> If there are syntax errors in your RDF (and it sounds like that is why Jena
>>> will not read it directly), you are doing yourself no service by taking
>>> unusual pains to force TDB to ingest your data.
>>>
>>> Please show us the errors that Jena is throwing trying to read your data
>>> and an appropriate sample of the data in question.
>>>
>>>
>>> ajs6f
>>>
>>>> On Apr 16, 2018, at 4:42 AM, Samita Bai / PhD CS Scholar @ City Campus
>>>> <[email protected]> wrote:
>>>>
>>>> In addition to my previous query: it takes a lot of time to first parse
>>>> the dataset using NxParser, then check each object, create the quad
>>>> again, and store it in TDB. It would be much simpler if we could take the
>>>> quad, check its object, and insert it in TDB.
>>>>
>>>>
>>>> But Jena is not helping me with this 😞
>>>>
>>>>
>>>> So I have to create the quads again and store them in TDB.
>>>>
>>>>
>>>> Any help is surely appreciated.
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Samita Bai
>>>>
>>>> ________________________________
>>>> From: Samita Bai / PhD CS Scholar @ City Campus
>>>> Sent: 16 April 2018 13:33:51
>>>> To: [email protected]
>>>> Subject: Re: TDB 2 Store Parameters
>>>>
>>>>
>>>> Thank you Andy and Adam for the help. Actually, I am only indexing the
>>>> quads whose object is either a literal or a foreign URI (i.e. an object
>>>> belonging to a different dataset than the subject). I am using NxParser (as
>>>> Jena is giving various parsing errors) to parse the dataset, and then I am
>>>> storing the quads in TDB2 in the following manner.
>>>>
>>>>
>>>>
>>>> // Fragment of a larger class; QuadList is a field declared elsewhere.
>>>> public void SetQuadsList(String sub, String pred, String obj, String context) {
>>>>     Node subjects = NodeFactory.createURI(sub);
>>>>     // Note: every object is created as a URI node, including literal values.
>>>>     Node objects = NodeFactory.createURI(obj);
>>>>     Node contexts = NodeFactory.createURI(context);
>>>>     Node predicates = NodeFactory.createURI(pred);
>>>>     Quad quads = Quad.create(contexts, subjects, predicates, objects);
>>>>     QuadList.add(quads);
>>>> }
>>>>
>>>> public List<Quad> GetQuadsList() {
>>>>     return QuadList;
>>>> }
>>>>
>>>> public void QuadsToTDB(List<Quad> quadList) {
>>>>     final String DATASET_DIR_NAME = "DyLDO1000K_Index";
>>>>     Dataset dataset = TDB2Factory.connectDataset(DATASET_DIR_NAME);
>>>>
>>>>     dataset.begin(ReadWrite.WRITE);
>>>>     try {
>>>>         DatasetGraph dsg = dataset.asDatasetGraph();
>>>>         System.out.println("Size of Quad List: " + quadList.size());
>>>>         for (Quad quad : quadList) {
>>>>             dsg.add(quad);
>>>>         }
>>>>         // DatasetGraph.size() reports the number of named graphs, not quads.
>>>>         System.out.println("dsg created of size " + dsg.size());
>>>>         dataset.commit();
>>>>         System.out.println("committed dataset.");
>>>>     } catch (Exception e) {
>>>>         e.printStackTrace(System.err);
>>>>         dataset.abort(); // discard the partial write on failure
>>>>     } finally {
>>>>         dataset.end();
>>>>     }
>>>>     System.out.println("end method.");
>>>> }}
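
The selection rule described above (keep a quad only if its object is a
literal or a foreign URI) could be checked before any Node or Quad is built.
The following is a minimal sketch using only the Java standard library. It
rests on two assumptions that are not stated in the thread: "belonging to a
different dataset" is approximated by "having a different host name", and
literals arrive in N-Quads lexical form (starting with a double quote) while
URIs arrive without angle brackets. The class and method names are
hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper: decides whether a quad's object qualifies for indexing. */
final class ObjectFilter {

    /** ASSUMPTION: a literal is passed in N-Quads lexical form, so it starts with a double quote. */
    static boolean isLiteral(String objectTerm) {
        return objectTerm.startsWith("\"");
    }

    /** Returns the host part of a URI, or null if it cannot be parsed or has no host. */
    static String host(String uri) {
        try {
            return new URI(uri).getHost();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    /** ASSUMPTION: "foreign" means the object's host differs from the subject's host. */
    static boolean isForeign(String subjectUri, String objectUri) {
        String s = host(subjectUri);
        String o = host(objectUri);
        return s != null && o != null && !s.equalsIgnoreCase(o);
    }

    /** True if the quad should be kept: the object is a literal or a foreign URI. */
    static boolean keep(String subjectUri, String objectTerm) {
        return isLiteral(objectTerm) || isForeign(subjectUri, objectTerm);
    }
}
```

With a check like this, only the quads that pass keep(...) would need to be
turned into Jena nodes and added to the dataset. Terms whose host cannot be
determined are treated as not foreign, which is a conservative choice and may
not match the intended definition.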
>>>>
>>>>
>>>> I have indexed 40,000 files (I have split the dataset into files
>>>> according to context) and the index size has reached 120 GB. I have a total
>>>> of 135,600 files whose total size is only 19.8 GB.
>>>>
>>>>
>>>> Why is TDB producing such a big index? I am confused :( Is there any
>>>> problem in my code?
>>>>
>>>>
>>>> Please suggest any improvements.
>>>>
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Samita Bai
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> ________________________________
>>>> From: ajs6f <[email protected]>
>>>> Sent: 15 April 2018 03:07:59
>>>> To: [email protected]
>>>> Subject: Re: TDB 2 Store Parameters
>>>>
>>>> 42 million quads is nothing like so many that either TDB version should
>>>> have any problem doing normal indexing (assuming very little in the way of
>>>> hardware-- I ingest datasets like that on my laptop all the time).
>>>>
>>>> Do you have some extraordinary hardware limitations?
>>>>
>>>> Adam
>>>>
>>>>> On Apr 14, 2018, at 11:42 AM, Andy Seaborne <[email protected]> wrote:
>>>>>
>>>>> Hi Samita,
>>>>>
>>>>> Firstly - as Adam points out - if there are no indexes then access to the
>>>>> data will be very slow. For a GSPO index, that means queries must be
>>>>> "GRAPH <uri> { ... }" and probably "GRAPH <uri> { <fixedSubject>.. }".
>>>>>
>>>>> GSPO means lookup by G then S within those G and the same for P then O.
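
To illustrate why a single GSPO index favours lookups that fix G (and then S),
here is a small sketch using only the Java standard library. It keeps quads in
a sorted set ordered by graph, subject, predicate, object. This is not how TDB
implements its indexes; the class GspoIndexSketch and its methods are invented
for illustration, and a TreeSet merely stands in for a sorted on-disk index.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

/** Hypothetical illustration of a GSPO-ordered index. */
final class GspoIndexSketch {
    // Compare quads held as [g, s, p, o] arrays position by position: the GSPO order.
    static final Comparator<String[]> GSPO = (a, b) -> {
        for (int i = 0; i < 4; i++) {
            int c = a[i].compareTo(b[i]);
            if (c != 0) {
                return c;
            }
        }
        return 0;
    };

    private final NavigableSet<String[]> index = new TreeSet<>(GSPO);

    void add(String g, String s, String p, String o) {
        index.add(new String[] {g, s, p, o});
    }

    /** Prefix lookup: seeks straight to the (g, s) range in the sorted order. */
    List<String[]> findByGraphAndSubject(String g, String s) {
        String[] from = {g, s, "", ""};
        // s followed by a NUL character is the smallest string greater than s.
        String[] to = {g, s + "\0", "", ""};
        return new ArrayList<>(index.subSet(from, true, to, false));
    }

    /** No usable prefix: a lookup by object alone has to visit every entry. */
    List<String[]> findByObject(String o) {
        List<String[]> hits = new ArrayList<>();
        for (String[] quad : index) {
            if (quad[3].equals(o)) {
                hits.add(quad);
            }
        }
        return hits;
    }
}
```

The first lookup touches only the matching range, while the second scans the
whole set. That is the cost of dropping the other index orderings: patterns
that do not start with G degrade to a full scan.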
>>>>>
>>>>> I looked at the data and it seems to be about 42 million quads.
>>>>>
>>>>> Using TDB1 (the loader is faster at this scale currently) is likely to be
>>>>> a better choice.
>>>>>
>>>>> Looking at StoreParams in TDB2:
>>>>>
>>>>> The code below creates the database at TDB2Factory.connectDataset so any
>>>>> StoreParams after that do not affect indexing.
>>>>>
>>>>> I tried to make it work in the release but the code ignores provided
>>>>> StoreParams - sorry. Even if it did work, it hits a test to make sure
>>>>> there is basic indexing (Adam's point).
>>>>>
>>>>> Andy
>>>>>
>>>>>
>>>>> On 13/04/18 13:42, Samita Bai / PhD CS Scholar @ City Campus wrote:
>>>>>> I wrote the following code to build only one type of triple and quad
>>>>>> index but it is still creating all indexes 😞
>>>>>> package ldbqPack;
>>>>>> import org.apache.jena.query.Dataset;
>>>>>> import org.apache.jena.tdb2.TDB2Factory;
>>>>>> import org.apache.jena.tdb2.setup.StoreParams;
>>>>>> import org.apache.jena.tdb2.sys.DatabaseConnection;
>>>>>> import org.apache.jena.dboe.base.block.FileMode;
>>>>>> import org.apache.jena.dboe.base.file.Location;
>>>>>> import org.apache.jena.tdb2.setup.StoreParamsFactory;
>>>>>> public class StrPrms {
>>>>>> static String[] tindexes= {"SPO"};
>>>>>> static String[] qindexes= {"GSPO"};
>>>>>> static String[] pindexes= {"GPU"};
>>>>>> static final StoreParams pApp = StoreParams.builder()
>>>>>> .blockSize(12) // Not dynamic
>>>>>> .nodeMissCacheSize(12) // Dynamic
>>>>>> .build();
>>>>>> static final StoreParams pLoc = StoreParams.builder()
>>>>>> .blockSize(0)
>>>>>> .nodeMissCacheSize(0).build();
>>>>>> static final StoreParams pDft = StoreParams.builder()
>>>>>> .fileMode(FileMode.mapped)
>>>>>> .blockSize(8192)
>>>>>> .blockReadCacheSize(5000)
>>>>>> .blockWriteCacheSize(1000)
>>>>>> .node2NodeIdCacheSize(200000)
>>>>>> .nodeId2NodeCacheSize(750000)
>>>>>> .nodeMissCacheSize(1000)
>>>>>> .nodeTableBaseName("nodes")
>>>>>> .primaryIndexTriples("SPO")
>>>>>> .tripleIndexes(tindexes)
>>>>>> .primaryIndexQuads("GSPO")
>>>>>> .quadIndexes(qindexes)
>>>>>> .prefixTableBaseName("prefixes")
>>>>>> .primaryIndexPrefix("GPU")
>>>>>> .prefixIndexes(pindexes)
>>>>>> .build();
>>>>>> public static void main(String[] args) {
>>>>>> // TODO Auto-generated method stub
>>>>>> final String DATASET_DIR_NAME = "DyLDO100";
>>>>>> Dataset dataset = TDB2Factory.connectDataset(DATASET_DIR_NAME);
>>>>>> Location location = Location.create(DATASET_DIR_NAME);
>>>>>> StoreParams custom_params = StoreParamsFactory.decideStoreParams(location, true, pApp, pLoc, pDft);
>>>>>> DatabaseConnection.connectCreate(location, custom_params);
>>>>>> StoreParams params = StoreParams.getSmallStoreParams();
>>>>>> System.out.println(params);
>>>>>> }
>>>>>> }
>>>>>> Please help.
>>>>>> Regards,
>>>>>> Samita Bai
>>>>>> ________________________________
>>>>>> P : Please consider the environment before printing this e-mail
>>>>>> ________________________________
>>>>>> CONFIDENTIALITY / DISCLAIMER NOTICE: This e-mail and any attachments may
>>>>>> contain confidential and privileged information. If you are not the
>>>>>> intended recipient, please notify the sender immediately by return
>>>>>> e-mail, delete this e-mail and destroy any copies. Any dissemination or
>>>>>> use of this information by a person other than the intended recipient is
>>>>>> unauthorized and may be illegal.
>>>>>> ________________________________
>>>>
>>>>
>>>
>>
>