Re: TDB 2 Store Parameters

Samita Bai / PhD CS Scholar @ City Campus Mon, 16 Apr 2018 14:14:37 -0700

Dear Andy,


I downloaded the same dataset from the link you gave, i.e.


http://swse.deri.org/dyldo/data/2016-03-27/data.nq.gz


Then I extracted it and ran the following code:


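import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.jena.tdb.TDBLoader;
import org.apache.jena.tdb.base.file.Location;
import org.apache.jena.tdb.setup.DatasetBuilderStd;
import org.apache.jena.tdb.store.DatasetGraphTDB;

public class ReadQuadInJena {

    public static void main(String[] args) {
        TDBLoader tlobj = new TDBLoader();
        String Ds = "/home/samita/data.nq";
        Location location = Location.create("/home/samita/Load_TDB");
        DatasetGraphTDB dgtdb = DatasetBuilderStd.create(location);
        // try-with-resources closes the stream whether loading succeeds or fails
        try (InputStream is = new FileInputStream(Ds)) {
            tlobj.loadDataset(dgtdb, is);
        } catch (IOException e) {
            // report I/O problems instead of silently ignoring them
            e.printStackTrace();
        }
    }
}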

It ended up with this error.

Exception in thread "main" org.apache.jena.riot.RiotException: [line: 30506, col: 232] Illegal character in IRI (codepoint 0x7C, '|'): <http://fonts.googleapis.com/css?family=Nunito[|]...>
    at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.fatal(ErrorHandlerFactory.java:147)
    at org.apache.jena.riot.lang.LangEngine.raiseException(LangEngine.java:148)
    at org.apache.jena.riot.lang.LangEngine.nextToken(LangEngine.java:105)
    at org.apache.jena.riot.lang.LangNQuads.parseOne(LangNQuads.java:67)
    at org.apache.jena.riot.lang.LangNQuads.runParser(LangNQuads.java:54)
    at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:41)
    at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:195)
    at org.apache.jena.riot.RDFParser.read(RDFParser.java:334)
    at org.apache.jena.riot.RDFParser.parseNotUri(RDFParser.java:324)
    at org.apache.jena.riot.RDFParser.parse(RDFParser.java:273)
    at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:498)
    at org.apache.jena.riot.RDFDataMgr.parseFromInputStream(RDFDataMgr.java:870)
    at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:693)
    at org.apache.jena.tdb.store.bulkloader.BulkLoader.loadQuads$(BulkLoader.java:152)
    at org.apache.jena.tdb.store.bulkloader.BulkLoader.loadDataset(BulkLoader.java:115)
    at org.apache.jena.tdb.TDBLoader.loadDataset$(TDBLoader.java:256)
    at org.apache.jena.tdb.TDBLoader.loadDataset(TDBLoader.java:191)
    at ldbqPack.ReadQuadInJena.main(ReadQuadInJena.java:47)

If it ran fine at your end, what is wrong with my code? Please help me.
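
For reference, the N-Quads grammar does not allow `|` (codepoint 0x7C) inside a `<...>` IRI, which is the character the exception above reports. Below is a minimal sketch, using only the JDK and no Jena classes, that lists the lines of a file containing such characters. The class name `IriCharScan` is made up for illustration, the forbidden set is my reading of the IRIREF production in the N-Quads grammar, and backslash escapes inside IRIs are not validated. The reported column is a plain character offset and may not match Jena's column numbering.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical helper: finds characters that N-Quads does not allow inside <...> IRIs. */
final class IriCharScan {

    // Characters excluded from IRIs by the N-Quads grammar, in addition to U+0000..U+0020.
    // Assumption of this sketch: backslash escapes such as \u00E9 are skipped, not checked.
    private static final String FORBIDDEN = "<>\"{}|^`";

    /** Returns the 0-based index of the first illegal IRI character on the line, or -1. */
    static int firstIllegalIriChar(String line) {
        int i = 0;
        while (i < line.length()) {
            char c = line.charAt(i);
            if (c == '"') {
                // Skip a quoted literal, stepping over backslash escapes.
                i++;
                while (i < line.length() && line.charAt(i) != '"') {
                    i += (line.charAt(i) == '\\') ? 2 : 1;
                }
                i++; // move past the closing quote
            } else if (c == '<') {
                // Check every character of the IRI up to the closing '>'.
                i++;
                while (i < line.length() && line.charAt(i) != '>') {
                    char ch = line.charAt(i);
                    if (ch <= 0x20 || FORBIDDEN.indexOf(ch) >= 0) {
                        return i;
                    }
                    i++;
                }
                i++; // move past the closing '>'
            } else {
                i++;
            }
        }
        return -1;
    }

    /** Prints line number, 1-based column and character for each offending line. */
    public static void main(String[] args) throws IOException {
        try (BufferedReader in =
                Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            String line;
            long lineNo = 0;
            while ((line = in.readLine()) != null) {
                lineNo++;
                int col = firstIllegalIriChar(line);
                if (col >= 0) {
                    System.out.printf("line %d, col %d: '%c'%n",
                            lineNo, col + 1, line.charAt(col));
                }
            }
        }
    }
}
```

Run it with the path of the extracted .nq file as the only argument. Only the first offending character of each line is reported, which is enough to find the lines to inspect.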





________________________________
From: Andy Seaborne <a...@apache.org>
Sent: 16 April 2018 22:13:36
To: users@jena.apache.org
Subject: Re: TDB 2 Store Parameters

I downloaded

http://swse.deri.org/dyldo/data/2016-03-27/data.nq.gz

(the latest I could find)

and used tdbloader.

Is that the data you are using?

     Andy

On 16/04/18 17:32, ajs6f wrote:
> You should be able to check the validity of any of your files just by running 
> them through Jena's `riot` command.
>
> You can try loading them into a TDB1 or TDB2 db by using the `tdbloader` or 
> `tdb2.tdbloader` commands.
>
> ajs6f
>
>> On Apr 16, 2018, at 12:28 PM, Samita Bai / PhD CS Scholar @ City Campus 
>> <s...@iba.edu.pk> wrote:
>>
>> OK Andy I got your point. Can you please share the code that you used to 
>> read the Dynamic Linked Data Observatory dataset?
>>
>>
>>
>> Regards,
>>
>> Samita Bai
>>
>> ________________________________
>> From: Andy Seaborne <a...@apache.org>
>> Sent: 16 April 2018 15:34:07
>> To: users@jena.apache.org
>> Subject: Re: TDB 2 Store Parameters
>>
>> If you wish to process the data as it is parsed, then see StreamRDF and
>> either
>>
>> NxParser, which is not part of Jena, is not a validating parser.
>>
>> If the data is not valid, then you will have problems at some point,
>> either loading, querying or outputting later.
>>
>> Adam has explained that TDB2 indexes heavily so that querying is well
>> served.
>>
>> We can't help with the parser errors without knowing what they are.
>>
>> Which files from Dynamic Linked Data Observatory are you processing?
>> Don't the later ones replace the earlier ones?
>>
>> I found that the last n-quads file was 42 million triples and all valid.
>>
>>      Andy
>>
>> On 16/04/18 11:05, ajs6f wrote:
>>> If there are syntax errors in your RDF (and it sounds like that is why Jena 
>>> will not read it directly) you are doing yourself no service by taking 
>>> unusual pains to force TDB to ingest your data.
>>>
>>> Please show us the errors that Jena is throwing trying to read your data 
>>> and an appropriate sample of the data in question.
>>>
>>>
>>> ajs6f
>>>
>>>> On Apr 16, 2018, at 4:42 AM, Samita Bai / PhD CS Scholar @ City Campus 
>>>> <s...@iba.edu.pk> wrote:
>>>>
>>>> In addition to my previous query: it takes a lot of time to first parse 
>>>> the dataset using NXParser, then check each object, create the quad 
>>>> again, and store it in TDB. It would be much simpler if we could take 
>>>> each quad, check its object, and insert it into TDB directly.
>>>>
>>>>
>>>> But Jena is not helping me with this 😞
>>>>
>>>>
>>>> So I have to create the quads again and store them in TDB.
>>>>
>>>>
>>>> Any help is surely appreciated.
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Samita Bai
>>>>
>>>> ________________________________
>>>> From: Samita Bai / PhD CS Scholar @ City Campus
>>>> Sent: 16 April 2018 13:33:51
>>>> To: users@jena.apache.org
>>>> Subject: Re: TDB 2 Store Parameters
>>>>
>>>>
>>>> Thank you Andy and Adam for the help. Actually, I am only indexing the 
>>>> quads whose object is either a literal or a foreign URI (i.e. an object 
>>>> belonging to a different dataset than the subject). I am using NXParser 
>>>> (as Jena is giving various parsing errors) to parse the dataset, and then 
>>>> I am storing the quads in TDB2 in the following manner.
>>>>
>>>>
>>>>
>>>> public void SetQuadsList(String sub, String pred, String obj,
>>>>         String context) {
>>>>     Node subjects = NodeFactory.createURI(sub);
>>>>     Node objects = NodeFactory.createURI(obj);
>>>>     Node contexts = NodeFactory.createURI(context);
>>>>     //Node rdfSeeAlso = RDFS.seeAlso.asNode();
>>>>     Node predicates = NodeFactory.createURI(pred);
>>>>     //Quad quads = Quad.create(contexts, objects, rdfSeeAlso, subjects);
>>>>     Quad quads = Quad.create(contexts, subjects, predicates, objects);
>>>>     QuadList.add(quads);
>>>>     //System.out.println("Number of backlinks:" + QuadList.size());
>>>>     //System.out.println("quad written");
>>>>     //System.out.println("Quad"+quads.toString());
>>>> }
>>>>
>>>> public List<Quad> GetQuadsList() {
>>>>     return QuadList;
>>>> }
>>>>
>>>> public void QuadsToTDB(List<Quad> quadList) {
>>>>     final String DATASET_DIR_NAME = "DyLDO1000K_Index";
>>>>     Dataset dataset = TDB2Factory.connectDataset(DATASET_DIR_NAME);
>>>>
>>>>     dataset.begin(ReadWrite.WRITE);
>>>>     try {
>>>>         DatasetGraph dsg = dataset.asDatasetGraph();
>>>>         Iterator<Quad> quads = quadList.iterator();
>>>>         System.out.println("Size of Quad List: " + quadList.size());
>>>>         while (quads.hasNext()) {
>>>>             //System.out.println("here");
>>>>             Quad quad = quads.next();
>>>>             dsg.add(quad);
>>>>             //System.out.println(quad.toString()+ "added");
>>>>             //RDFDataMgr.writeQuads(System.out, quads);
>>>>             //RDFDataMgr.write(System.out, dsg, Lang.NQUADS);
>>>>         }
>>>>         System.out.println("dsg created of size " + dsg.size());
>>>>         //RDFDataMgr.write(System.out, dsg, Lang.NQUADS);
>>>>         System.out.println("written dsg using datamgr.");
>>>>
>>>>         //System.out.println(dataset.isEmpty());
>>>>         //RDFDataMgr.write(System.out, dsg, Lang.NQUADS);
>>>>         dataset.commit();
>>>>
>>>>         System.out.println("committed dataset.");
>>>>     } catch (Exception e) {
>>>>         e.printStackTrace(System.err);
>>>>         //dataset.abort();
>>>>     } finally {
>>>>         //RDFDataMgr.write(System.out, dsg, Lang.NQUADS);
>>>>         dataset.end();
>>>>     }
>>>>     System.out.println("end method.");
>>>> }
>>>> }
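
The selection rule described above (keep a quad when its object is a literal or a URI from a different dataset than the subject) can be checked per quad as it is read, so no second pass is needed. A minimal sketch using only the JDK follows. It rests on a loud assumption that is not in the thread: "different dataset" is approximated by "different host name". The class and method names are hypothetical, and how the caller learns that an object is a literal depends on the parser in use.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

/** Hypothetical filter: keeps quads whose object is a literal or a "foreign" URI. */
final class ForeignObjectFilter {

    /** Lower-cased host of an absolute URI, or null if it cannot be determined. */
    static String hostOf(String uri) {
        try {
            String host = new URI(uri).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            // Malformed URIs (for example ones containing '|') have no usable host.
            return null;
        }
    }

    /**
     * True when the object is a literal, or a URI whose host differs from the
     * subject's host. Assumption of this sketch: host equality stands in for
     * "same dataset", and URIs without a parseable host are not kept.
     */
    static boolean keep(String subjectUri, String object, boolean objectIsLiteral) {
        if (objectIsLiteral) {
            return true;
        }
        String subjectHost = hostOf(subjectUri);
        String objectHost = hostOf(object);
        return subjectHost != null && objectHost != null
                && !subjectHost.equals(objectHost);
    }
}
```

With such a check, only the quads for which `keep(...)` returns true need to be turned into `Quad` objects and added to the dataset. Literal objects would also need a literal node rather than a URI node when the quad is built.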
>>>>
>>>>
>>>> I have indexed 40,000 files (I have split the dataset into files 
>>>> according to context) and the index size has reached 120 GB. I have a 
>>>> total of 135,600 files whose combined size is only 19.8 GB.
>>>>
>>>>
>>>> Why is TDB producing such a big index? I am confused :( Is there any 
>>>> problem in my code?
>>>>
>>>>
>>>> Please suggest any improvements.
>>>>
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Samita Bai
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> ________________________________
>>>> From: ajs6f <aj...@apache.org>
>>>> Sent: 15 April 2018 03:07:59
>>>> To: users@jena.apache.org
>>>> Subject: Re: TDB 2 Store Parameters
>>>>
>>>> 42 million quads is nothing like so many that either TDB version should 
>>>> have any problem doing normal indexing (assuming very little in the way of 
>>>> hardware-- I ingest datasets like that on my laptop all the time).
>>>>
>>>> Do you have some extraordinary hardware limitations?
>>>>
>>>> Adam
>>>>
>>>>> On Apr 14, 2018, at 11:42 AM, Andy Seaborne <a...@apache.org> wrote:
>>>>>
>>>>> Hi Samita,
>>>>>
>>>>> Firstly - as Adam points out - if there are no indexes then access to the 
>>>>> data will be very slow.  For a GSPO index, that means queries must be 
>>>>> "GRAPH <uri> { ... }" and probably "GRAPH <uri> { <fixedSubject>.. }".
>>>>>
>>>>> GSPO means lookup by G then S within those G and the same for P then O.
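
To illustrate the point about GSPO ordering, here is a small in-memory sketch using only the JDK. It is an analogy, not TDB code: a sorted set ordered by (G, S, P, O) stands in for the index, and all names are hypothetical. A lookup with G (or G and S) fixed is a contiguous range scan, while a lookup by predicate alone has no usable prefix and must visit every entry.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical in-memory stand-in for a single GSPO index. */
final class GspoIndexSketch {

    // Lexicographic order over (G, S, P, O), mirroring the key order of a GSPO index.
    private static final Comparator<String[]> GSPO = (a, b) -> {
        for (int i = 0; i < 4; i++) {
            int c = a[i].compareTo(b[i]);
            if (c != 0) {
                return c;
            }
        }
        return 0;
    };

    private final TreeSet<String[]> index = new TreeSet<>(GSPO);

    void add(String g, String s, String p, String o) {
        index.add(new String[] {g, s, p, o});
    }

    /** Range scan for all quads starting with the given terms: (G), (G, S), ... */
    List<String[]> lookup(String... prefix) {
        String[] low = new String[4];
        Arrays.fill(low, ""); // the empty string sorts before every other string
        System.arraycopy(prefix, 0, low, 0, prefix.length);
        List<String[]> out = new ArrayList<>();
        for (String[] quad : index.tailSet(low, true)) {
            if (!startsWith(quad, prefix)) {
                break; // matching quads are contiguous, so the scan can stop here
            }
            out.add(quad);
        }
        return out;
    }

    /** Lookup by predicate only: GSPO order gives no prefix, so all entries are visited. */
    List<String[]> findByPredicate(String p) {
        List<String[]> out = new ArrayList<>();
        for (String[] quad : index) {
            if (quad[2].equals(p)) {
                out.add(quad);
            }
        }
        return out;
    }

    private static boolean startsWith(String[] quad, String[] prefix) {
        for (int i = 0; i < prefix.length; i++) {
            if (!quad[i].equals(prefix[i])) {
                return false;
            }
        }
        return true;
    }
}
```

This is why a store with only GSPO answers "GRAPH <uri> { <fixedSubject> ... }" patterns quickly but has to scan for anything that does not fix the graph first.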
>>>>>
>>>>> I looked at the data and it seems to be about 42 million quads.
>>>>>
>>>>> Using TDB1 (the loader is faster at this scale currently) is likely to be 
>>>>> a better choice.
>>>>>
>>>>> Looking at StoreParams in TDB2:
>>>>>
>>>>> The code below creates the database at TDB2Factory.connectDataset so any 
>>>>> StoreParams after that do not affect indexing.
>>>>>
>>>>> I tried to make it work in the release but the code ignores provided 
>>>>> StoreParams - sorry.  Even if it did work, it hits a test to make sure 
>>>>> there is basic indexing (Adam's point).
>>>>>
>>>>>    Andy
>>>>>
>>>>>
>>>>> On 13/04/18 13:42, Samita Bai  / PhD CS Scholar @ City Campus wrote:
>>>>>> I wrote the following code to build only one type of triple and quad 
>>>>>> index but it is still creating all indexes 😞
>>>>>> package ldbqPack;
>>>>>>
>>>>>> import org.apache.jena.query.Dataset;
>>>>>> import org.apache.jena.tdb2.TDB2Factory;
>>>>>> import org.apache.jena.tdb2.setup.StoreParams;
>>>>>> import org.apache.jena.tdb2.sys.DatabaseConnection;
>>>>>> import org.apache.jena.dboe.base.block.FileMode;
>>>>>> import org.apache.jena.dboe.base.file.Location;
>>>>>> import org.apache.jena.tdb2.setup.StoreParamsFactory;
>>>>>>
>>>>>> public class StrPrms {
>>>>>>     static String[] tindexes = {"SPO"};
>>>>>>     static String[] qindexes = {"GSPO"};
>>>>>>     static String[] pindexes = {"GPU"};
>>>>>>
>>>>>>     static final StoreParams pApp = StoreParams.builder()
>>>>>>         .blockSize(12)              // Not dynamic
>>>>>>         .nodeMissCacheSize(12)      // Dynamic
>>>>>>         .build();
>>>>>>     static final StoreParams pLoc = StoreParams.builder()
>>>>>>         .blockSize(0)
>>>>>>         .nodeMissCacheSize(0)
>>>>>>         .build();
>>>>>>     static final StoreParams pDft = StoreParams.builder()
>>>>>>         .fileMode(FileMode.mapped)
>>>>>>         .blockSize(8192)
>>>>>>         .blockReadCacheSize(5000)
>>>>>>         .blockWriteCacheSize(1000)
>>>>>>         .node2NodeIdCacheSize(200000)
>>>>>>         .nodeId2NodeCacheSize(750000)
>>>>>>         .nodeMissCacheSize(1000)
>>>>>>         .nodeTableBaseName("nodes")
>>>>>>         .primaryIndexTriples("SPO")
>>>>>>         .tripleIndexes(tindexes)
>>>>>>         .primaryIndexQuads("GSPO")
>>>>>>         .quadIndexes(qindexes)
>>>>>>         .prefixTableBaseName("prefixes")
>>>>>>         .primaryIndexPrefix("GPU")
>>>>>>         .prefixIndexes(pindexes)
>>>>>>         .build();
>>>>>>
>>>>>>     public static void main(String[] args) {
>>>>>>         // TODO Auto-generated method stub
>>>>>>         final String DATASET_DIR_NAME = "DyLDO100";
>>>>>>         Dataset dataset = TDB2Factory.connectDataset(DATASET_DIR_NAME);
>>>>>>         Location location = Location.create(DATASET_DIR_NAME);
>>>>>>         StoreParams custom_params =
>>>>>>             StoreParamsFactory.decideStoreParams(location, true, pApp, pLoc, pDft);
>>>>>>         DatabaseConnection.connectCreate(location, custom_params);
>>>>>>         StoreParams params = StoreParams.getSmallStoreParams();
>>>>>>         System.out.println(params);
>>>>>>     }
>>>>>> }
>>>>>> Please help.
>>>>>> Regards,
>>>>>> Samita Bai
>>>>>> ________________________________
>>>>>> P : Please consider the environment before printing this e-mail
>>>>>> ________________________________
>>>>>> CONFIDENTIALITY / DISCLAIMER NOTICE: This e-mail and any attachments may 
>>>>>> contain confidential and privileged information. If you are not the 
>>>>>> intended recipient, please notify the sender immediately by return 
>>>>>> e-mail, delete this e-mail and destroy any copies. Any dissemination or 
>>>>>> use of this information by a person other than the intended recipient is 
>>>>>> unauthorized and may be illegal.
>>>>>> ________________________________
>>>>
>>>>
>>>
>>
>

