Re: TDB 2 Store Parameters

Samita Bai / PhD CS Scholar @ City Campus Mon, 16 Apr 2018 09:28:48 -0700

OK Andy I got your point. Can you please share the code that you used to read 
the Dynamic Linked Data Observatory dataset?




Regards,

Samita Bai

________________________________
From: Andy Seaborne <a...@apache.org>
Sent: 16 April 2018 15:34:07
To: users@jena.apache.org
Subject: Re: TDB 2 Store Parameters

If you wish to process the data as it is parsed, then see StreamRDF and
either

NxParser, which is not part of Jena, is not a validating parser.

If the data is not valid, then you will have problems at some point,
either loading, querying or outputting later.

Adam has explained that TDB2 indexes heavily so that querying is well
served.

We can't help with the parser errors without knowing what they are.

Which files from Dynamic Linked Data Observatory are you processing?
Don't the later ones replace the earlier ones?

I found that the last n-quads file was 42 million triples and all valid.

     Andy

On 16/04/18 11:05, ajs6f wrote:
> If there are syntax errors in your RDF (and it sounds like that is why Jena 
> will not read it directly) you are doing yourself no service by taking 
> unusual pains to force TDB to ingest your data.
>
> Please show us the errors that Jena is throwing trying to read your data and 
> an appropriate sample of the data in question.
>
>
> ajs6f
>
>> On Apr 16, 2018, at 4:42 AM, Samita Bai / PhD CS Scholar @ City Campus 
>> <s...@iba.edu.pk> wrote:
>>
>> In addition to my previous query: it is taking a lot of time to first parse 
>> the dataset using NXParser, then check the object, create the quad again and 
>> store it in TDB. It would be much simpler if we could take the quad, check 
>> its object and insert it in TDB.
>>
>>
>> But Jena is not helping me with this 😞
>>
>>
>> So I have to create quads again and store it in TDB.
>>
>>
>> Any help is surely appreciated.
>>
>>
>> Regards,
>>
>> Samita Bai
>>
>> ________________________________
>> From: Samita Bai / PhD CS Scholar @ City Campus
>> Sent: 16 April 2018 13:33:51
>> To: users@jena.apache.org
>> Subject: Re: TDB 2 Store Parameters
>>
>>
>> Thank you Andy and Adam for the help. Actually, I am just indexing the quads 
>> where the object is either a literal or a foreign URI (i.e. an object 
>> belonging to a different dataset than the subject). I am using NXParser (as 
>> Jena is giving various parsing errors) to parse the dataset and then I am 
>> storing it in TDB2 in the following manner.
>>
>>
>>
>> public void SetQuadsList(String sub, String pred, String obj, String context) {
>>     Node subjects = NodeFactory.createURI(sub);
>>     Node objects = NodeFactory.createURI(obj);
>>     Node contexts = NodeFactory.createURI(context);
>>     //Node rdfSeeAlso = RDFS.seeAlso.asNode();
>>     Node predicates = NodeFactory.createURI(pred);
>>
>>     //Quad quads = Quad.create(contexts, objects, rdfSeeAlso, subjects);
>>     Quad quads = Quad.create(contexts, subjects, predicates, objects);
>>     QuadList.add(quads);
>>
>>     //System.out.println("Number of backlinks:" + QuadList.size());
>>     //System.out.println("quad written");
>>     //System.out.println("Quad"+quads.toString());
>> }
>>
>> public List<Quad> GetQuadsList() {
>>     return QuadList;
>> }
>>
>> public void QuadsToTDB(List<Quad> quadList) {
>>     final String DATASET_DIR_NAME = "DyLDO1000K_Index";
>>     Dataset dataset = TDB2Factory.connectDataset(DATASET_DIR_NAME);
>>
>>     dataset.begin(ReadWrite.WRITE);
>>     try {
>>         DatasetGraph dsg = dataset.asDatasetGraph();
>>         Iterator<Quad> quads = quadList.iterator();
>>         System.out.println("Size of Quad List: " + quadList.size());
>>         while (quads.hasNext()) {
>>             //System.out.println("here");
>>             Quad quad = quads.next();
>>             dsg.add(quad);
>>             //System.out.println(quad.toString()+ "added");
>>             //RDFDataMgr.writeQuads(System.out, quads);
>>             //RDFDataMgr.write(System.out, dsg, Lang.NQUADS);
>>         }
>>         System.out.println("dsg created of size " + dsg.size());
>>         //RDFDataMgr.write(System.out, dsg, Lang.NQUADS);
>>         System.out.println("written dsg using datamgr.");
>>
>>         //System.out.println(dataset.isEmpty());
>>         //RDFDataMgr.write(System.out, dsg, Lang.NQUADS);
>>         dataset.commit();
>>         System.out.println("committed dataset.");
>>     } catch (Exception e) {
>>         e.printStackTrace(System.err);
>>         //dataset.abort();
>>     } finally {
>>         //RDFDataMgr.write(System.out, dsg, Lang.NQUADS);
>>         dataset.end();
>>     }
>>     System.out.println("end method.");
>> }
>> }
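
The selection rule described before the code above (keep a quad when its object is a literal or a URI from a different dataset than the subject) can be sketched in plain Java without any RDF library. This is only an illustration under stated assumptions: terms are assumed to arrive in N-Quads syntax (`<...>` for IRIs, `"..."` for literals, `_:` for blank nodes), and "different dataset" is approximated by comparing URI hosts, which may not match the definition actually used. The class and method names are hypothetical. Note that the posted method passes every object to `NodeFactory.createURI`, so a literal object would apparently be stored as a URI node; distinguishing the term kind first avoids that.

```java
import java.net.URI;

/** Hypothetical helper: decides whether a quad should be kept, from its subject and object terms. */
final class ObjectFilterSketch {
    enum TermKind { IRI, LITERAL, BLANK_NODE }

    /** Classifies a term written in N-Quads syntax by its leading characters. */
    static TermKind kindOf(String term) {
        if (term.startsWith("<")) return TermKind.IRI;
        if (term.startsWith("\"")) return TermKind.LITERAL;
        if (term.startsWith("_:")) return TermKind.BLANK_NODE;
        throw new IllegalArgumentException("Not an N-Quads term: " + term);
    }

    /** Host of an IRI term such as <http://example.org/a>, or null if it cannot be determined. */
    static String hostOf(String iriTerm) {
        String iri = iriTerm.substring(1, iriTerm.length() - 1); // strip < and >
        try {
            return URI.create(iri).getHost();
        } catch (IllegalArgumentException e) {
            return null; // syntactically invalid IRI: treat the host as unknown
        }
    }

    /** Assumption: "foreign" means the object's host differs from the subject's host. */
    static boolean keep(String subject, String object) {
        TermKind objectKind = kindOf(object);
        if (objectKind == TermKind.LITERAL) return true;
        if (objectKind != TermKind.IRI || kindOf(subject) != TermKind.IRI) return false;
        String subjectHost = hostOf(subject);
        String objectHost = hostOf(object);
        return subjectHost != null && objectHost != null
                && !subjectHost.equalsIgnoreCase(objectHost);
    }

    public static void main(String[] args) {
        System.out.println(keep("<http://a.example/s>", "\"hello\"@en"));
        System.out.println(keep("<http://a.example/s>", "<http://b.example/o>"));
        System.out.println(keep("<http://a.example/s>", "<http://a.example/o>"));
    }
}
```

In this sketch blank-node objects and IRIs with an unknown host are dropped; whether that is the desired behaviour depends on how the dataset boundaries are defined.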
>>
>>
>> I have indexed 40,000 files (I have split the dataset into files 
>> according to context) and the index size has become 120 GB. I have a total 
>> of 135,600 files whose total size is only 19.8 GB.
>>
>>
>> Why is TDB making such a big index? I am confused :( Is there any 
>> problem in my code?
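
To put the reported numbers in proportion, the arithmetic can be worked through as below. This is a rough sketch that assumes all files are about the same size and that the index grows linearly with the number of files loaded; neither assumption is confirmed in the thread, and the class and method names are hypothetical.

```java
/** Back-of-the-envelope estimate from the figures quoted in the message above. */
final class IndexSizeEstimate {

    /** Linear projection of the final index size from a partial load. */
    static double projectedIndexGb(double indexGbSoFar, long filesLoaded, long filesTotal) {
        return indexGbSoFar * filesTotal / filesLoaded;
    }

    /** Index size so far divided by the estimated input consumed so far. */
    static double expansionFactor(double indexGbSoFar, double inputGbTotal,
                                  long filesLoaded, long filesTotal) {
        // Assumption: files are of similar size, so input consumed scales with file count.
        double inputGbSoFar = inputGbTotal * filesLoaded / filesTotal;
        return indexGbSoFar / inputGbSoFar;
    }

    public static void main(String[] args) {
        System.out.println(projectedIndexGb(120.0, 40_000, 135_600));
        System.out.println(expansionFactor(120.0, 19.8, 40_000, 135_600));
    }
}
```

Under those assumptions the 40,000 files loaded so far correspond to about 5.8 GB of input, so the index is roughly 20 times larger than the data it was built from, and loading all 135,600 files the same way would project to around 407 GB.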
>>
>>
>> Please suggest any improvements.
>>
>>
>>
>> Regards,
>>
>> Samita Bai
>>
>>
>>
>>
>>
>>
>> ________________________________
>> From: ajs6f <aj...@apache.org>
>> Sent: 15 April 2018 03:07:59
>> To: users@jena.apache.org
>> Subject: Re: TDB 2 Store Parameters
>>
>> 42 million quads is nothing like so many that either TDB version should have 
>> any problem doing normal indexing (assuming very little in the way of 
>> hardware-- I ingest datasets like that on my laptop all the time).
>>
>> Do you have some extraordinary hardware limitations?
>>
>> Adam
>>
>>> On Apr 14, 2018, at 11:42 AM, Andy Seaborne <a...@apache.org> wrote:
>>>
>>> Hi Samita,
>>>
>>> Firstly - as Adam points out - if there are no indexes then access to the 
>>> data will be very slow.  For a GSPO index, that means queries must be 
>>> "GRAPH <uri> { ... }" and probably "GRAPH <uri> { <fixedSubject>.. }".
>>>
>>> GSPO means lookup by G then S within those G and the same for P then O.
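
To illustrate the point about GSPO ordering, here is a toy sketch in plain Java. It is not how TDB lays out its indexes on disk; the class `GspoIndexSketch` and its methods are hypothetical names made up for this illustration. It only shows why a single index sorted by graph, subject, predicate, object can answer a lookup with G (and then S) fixed by a short range scan, while a lookup by object alone has to visit every entry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy model of one sorted quad index; an illustration only. */
final class GspoIndexSketch {
    // Each entry is {graph, subject, predicate, object}, kept sorted in that order.
    private final TreeSet<String[]> index = new TreeSet<>(GspoIndexSketch::compareTuples);

    /** Lexicographic comparison; a shorter tuple sorts before any tuple it is a prefix of. */
    private static int compareTuples(String[] a, String[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = a[i].compareTo(b[i]);
            if (c != 0) return c;
        }
        return Integer.compare(a.length, b.length);
    }

    void add(String g, String s, String p, String o) {
        index.add(new String[] {g, s, p, o});
    }

    /** Range scan: efficient when the leading positions (G, then S, ...) are fixed. */
    List<String[]> findByPrefix(String... prefix) {
        List<String[]> out = new ArrayList<>();
        // tailSet jumps to the first entry >= prefix; stop once the prefix no longer matches.
        for (String[] quad : index.tailSet(prefix, true)) {
            if (!startsWith(quad, prefix)) break;
            out.add(quad);
        }
        return out;
    }

    /** With only the GSPO order available, a lookup by object must scan all entries. */
    List<String[]> findByObject(String o) {
        List<String[]> out = new ArrayList<>();
        for (String[] quad : index) {
            if (quad[3].equals(o)) out.add(quad);
        }
        return out;
    }

    private static boolean startsWith(String[] quad, String[] prefix) {
        for (int i = 0; i < prefix.length; i++) {
            if (!quad[i].equals(prefix[i])) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        GspoIndexSketch idx = new GspoIndexSketch();
        idx.add("g1", "s1", "p1", "o1");
        idx.add("g1", "s1", "p2", "o2");
        idx.add("g2", "s1", "p1", "o1");
        System.out.println(idx.findByPrefix("g1", "s1").size());
        System.out.println(idx.findByObject("o1").size());
    }
}
```

With only this one ordering, `findByPrefix("g1", "s1")` stops as soon as the prefix no longer matches, whereas `findByObject` touches every entry, which is the slow access path described above.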
>>>
>>> I looked at the data and it seems to be about 42 million quads.
>>>
>>> Using TDB1 (the loader is faster at this scale currently) is likely to be a 
>>> better choice.
>>>
>>> Looking at StoreParams in TDB2:
>>>
>>> The code below creates the database at TDB2Factory.connectDataset so any 
>>> StoreParams after that do not affect indexing.
>>>
>>> I tried to make it work in the release but the code ignores provided 
>>> StoreParams - sorry.  Even if it did work, it hits a test to make sure 
>>> there is basic indexing (Adam's point).
>>>
>>>    Andy
>>>
>>>
>>> On 13/04/18 13:42, Samita Bai  / PhD CS Scholar @ City Campus wrote:
>>>> I wrote the following code to build only one type of triple and quad index 
>>>> but it is still creating all indexes 😞
>>>>
>>>> package ldbqPack;
>>>>
>>>> import org.apache.jena.query.Dataset;
>>>> import org.apache.jena.tdb2.TDB2Factory;
>>>> import org.apache.jena.tdb2.setup.StoreParams;
>>>> import org.apache.jena.tdb2.sys.DatabaseConnection;
>>>> import org.apache.jena.dboe.base.block.FileMode;
>>>> import org.apache.jena.dboe.base.file.Location;
>>>> import org.apache.jena.tdb2.setup.StoreParamsFactory;
>>>>
>>>> public class StrPrms {
>>>>     static String[] tindexes = {"SPO"};
>>>>     static String[] qindexes = {"GSPO"};
>>>>     static String[] pindexes = {"GPU"};
>>>>
>>>>     static final StoreParams pApp = StoreParams.builder()
>>>>         .blockSize(12)              // Not dynamic
>>>>         .nodeMissCacheSize(12)      // Dynamic
>>>>         .build();
>>>>
>>>>     static final StoreParams pLoc = StoreParams.builder()
>>>>         .blockSize(0)
>>>>         .nodeMissCacheSize(0)
>>>>         .build();
>>>>
>>>>     static final StoreParams pDft = StoreParams.builder()
>>>>         .fileMode(FileMode.mapped)
>>>>         .blockSize(8192)
>>>>         .blockReadCacheSize(5000)
>>>>         .blockWriteCacheSize(1000)
>>>>         .node2NodeIdCacheSize(200000)
>>>>         .nodeId2NodeCacheSize(750000)
>>>>         .nodeMissCacheSize(1000)
>>>>         .nodeTableBaseName("nodes")
>>>>         .primaryIndexTriples("SPO")
>>>>         .tripleIndexes(tindexes)
>>>>         .primaryIndexQuads("GSPO")
>>>>         .quadIndexes(qindexes)
>>>>         .prefixTableBaseName("prefixes")
>>>>         .primaryIndexPrefix("GPU")
>>>>         .prefixIndexes(pindexes)
>>>>         .build();
>>>>
>>>>     public static void main(String[] args) {
>>>>         // TODO Auto-generated method stub
>>>>         final String DATASET_DIR_NAME = "DyLDO100";
>>>>         Dataset dataset = TDB2Factory.connectDataset(DATASET_DIR_NAME);
>>>>         Location location = Location.create(DATASET_DIR_NAME);
>>>>         StoreParams custom_params =
>>>>             StoreParamsFactory.decideStoreParams(location, true, pApp, pLoc, pDft);
>>>>         DatabaseConnection.connectCreate(location, custom_params);
>>>>         StoreParams params = StoreParams.getSmallStoreParams();
>>>>         System.out.println(params);
>>>>     }
>>>> }
>>>>
>>>> Please help.
>>>> Regards,
>>>> Samita Bai
>>>> ________________________________
>>>> P : Please consider the environment before printing this e-mail
>>>> ________________________________
>>>> CONFIDENTIALITY / DISCLAIMER NOTICE: This e-mail and any attachments may 
>>>> contain confidential and privileged information. If you are not the 
>>>> intended recipient, please notify the sender immediately by return e-mail, 
>>>> delete this e-mail and destroy any copies. Any dissemination or use of 
>>>> this information by a person other than the intended recipient is 
>>>> unauthorized and may be illegal.
>>>> ________________________________
>>
>
