Re: TDB 2 Store Parameters

ajs6f Mon, 16 Apr 2018 09:32:41 -0700

You should be able to check the validity of any of your files just by running 
them through Jena's `riot` command.


You can try loading them into a TDB1 or TDB2 db by using the `tdbloader` or 
`tdb2.tdbloader` commands.

ajs6f

> On Apr 16, 2018, at 12:28 PM, Samita Bai / PhD CS Scholar @ City Campus 
> <[email protected]> wrote:
> 
> OK Andy I got your point. Can you please share the code that you used to read 
> the Dynamic Linked Data Observatory dataset?
> 
> 
> 
> Regards,
> 
> Samita Bai
> 
> ________________________________
> From: Andy Seaborne <[email protected]>
> Sent: 16 April 2018 15:34:07
> To: [email protected]
> Subject: Re: TDB 2 Store Parameters
> 
> If you wish to process the data as it is parsed, then see StreamRDF and
> either
> 
> NxParser, which is not part of Jena, is not a validating parser.
> 
> If the data is not valid, then you will have problems at some point,
> either loading, querying or outputting later.
> 
> Adam has explained that TDB2 indexes heavily so that querying is well
> served.
> 
> We can't help with the parser errors without knowing what they are.
> 
> Which files from Dynamic Linked Data Observatory are you processing?
> Don't the later ones replace the earlier ones?
> 
> I found that the last n-quads file was 42 million triples and all valid.
> 
>     Andy
> 
> On 16/04/18 11:05, ajs6f wrote:
>> If there are syntax errors in your RDF (and it sounds like that is why Jena 
>> will not read it directly) you are doing yourself no service by taking 
>> unusual pains to force TDB to ingest your data.
>> 
>> Please show us the errors that Jena is throwing trying to read your data and 
>> an appropriate sample of the data in question.
>> 
>> 
>> ajs6f
>> 
>>> On Apr 16, 2018, at 4:42 AM, Samita Bai / PhD CS Scholar @ City Campus 
>>> <[email protected]> wrote:
>>> 
>>> In addition to my previous query: it takes a lot of time to first parse 
>>> the dataset using NxParser, then check each object, create the quad 
>>> again, and store it in TDB. It would be much simpler if we could take the 
>>> quad, check its object, and insert it into TDB.
>>> 
>>> 
>>> But Jena is not helping me with this 😞
>>> 
>>> 
>>> So I have to create quads again and store it in TDB.
>>> 
>>> 
>>> Any help is surely appreciated.
>>> 
>>> 
>>> Regards,
>>> 
>>> Samita Bai
>>> 
>>> ________________________________
>>> From: Samita Bai / PhD CS Scholar @ City Campus
>>> Sent: 16 April 2018 13:33:51
>>> To: [email protected]
>>> Subject: Re: TDB 2 Store Parameters
>>> 
>>> 
>>> Thank you Andy and Adam for the help. Actually, I am just indexing the 
>>> quads where object is either literal or foreign URI (i.e. Object belonging 
>>> to different dataset than subject), I am using NXParser (as Jena is giving 
>>> various parsing errors) to parse the dataset and then I am storing in TDB2 
>>> in the following manner.
>>> 
>>> 
>>> 
>>> // Build a quad from four strings and buffer it in QuadList.
>>> // Note: every term, including the object, is created as a URI node here;
>>> // a literal object would need NodeFactory.createLiteral(...) instead.
>>> public void SetQuadsList(String sub, String pred, String obj, String context) {
>>>     Node subjects = NodeFactory.createURI(sub);
>>>     Node predicates = NodeFactory.createURI(pred);
>>>     Node objects = NodeFactory.createURI(obj);
>>>     Node contexts = NodeFactory.createURI(context);
>>>     Quad quads = Quad.create(contexts, subjects, predicates, objects);
>>>     QuadList.add(quads);
>>> }
>>> 
>>> public List<Quad> GetQuadsList() {
>>>     return QuadList;
>>> }
>>> 
>>> // Add all buffered quads to the TDB2 database in one write transaction.
>>> public void QuadsToTDB(List<Quad> quadList) {
>>>     final String DATASET_DIR_NAME = "DyLDO1000K_Index";
>>>     Dataset dataset = TDB2Factory.connectDataset(DATASET_DIR_NAME);
>>> 
>>>     dataset.begin(ReadWrite.WRITE);
>>>     try {
>>>         DatasetGraph dsg = dataset.asDatasetGraph();
>>>         System.out.println("Size of Quad List: " + quadList.size());
>>>         for (Quad quad : quadList) {
>>>             dsg.add(quad);
>>>         }
>>>         System.out.println("dsg created of size " + dsg.size());
>>>         dataset.commit();
>>>         System.out.println("committed dataset.");
>>>     } catch (Exception e) {
>>>         e.printStackTrace(System.err);
>>>     } finally {
>>>         // end() discards the write transaction if commit() was not reached.
>>>         dataset.end();
>>>     }
>>>     System.out.println("end method.");
>>> }}
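
The object test described above (keep a quad when its object is a literal or a URI from a different dataset than the subject) can be sketched on its own, without NxParser or Jena. This is only a minimal sketch under two assumptions that are not from the thread: "different dataset" is approximated by "different URI host", and a literal is recognised by its leading double quote as in N-Quads syntax. The class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

public class ForeignObjectFilter {

    /** Lower-cased host of an absolute URI, or null if it cannot be determined. */
    public static String hostOf(String uri) {
        try {
            String host = new URI(uri).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    /** Assumption: a literal term starts with a double quote, as in N-Quads. */
    public static boolean isLiteral(String term) {
        return term.startsWith("\"");
    }

    /** True when the object is a literal or a URI on a different host than the subject. */
    public static boolean keep(String subject, String object) {
        if (isLiteral(object)) {
            return true;
        }
        String subjectHost = hostOf(subject);
        String objectHost = hostOf(object);
        // Terms whose host cannot be determined (e.g. blank nodes) are dropped.
        return subjectHost != null && objectHost != null
                && !subjectHost.equals(objectHost);
    }

    public static void main(String[] args) {
        System.out.println(keep("http://a.example/s", "\"hello\""));            // true
        System.out.println(keep("http://a.example/s", "http://b.example/o"));   // true
        System.out.println(keep("http://a.example/s", "http://a.example/o"));   // false
    }
}
```

A check like this only needs the subject and object strings, so it can run on each quad as it is read rather than after a full list has been built.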
>>> 
>>> 
>>> I have indexed 40,000 files (I have split the dataset into files 
>>> according to context) and the index size has grown to 120 GB. I have a 
>>> total of 135,600 files whose total size is only 19.8 GB.
>>> 
>>> 
>>> Why is TDB producing such a big index? I am confused :( Is there any 
>>> problem in my code?
>>> 
>>> 
>>> Please suggest any improvements.
>>> 
>>> 
>>> 
>>> Regards,
>>> 
>>> Samita Bai
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> ________________________________
>>> From: ajs6f <[email protected]>
>>> Sent: 15 April 2018 03:07:59
>>> To: [email protected]
>>> Subject: Re: TDB 2 Store Parameters
>>> 
>>> 42 million quads is nothing like so many that either TDB version should 
>>> have any problem doing normal indexing (assuming very little in the way of 
>>> hardware-- I ingest datasets like that on my laptop all the time).
>>> 
>>> Do you have some extraordinary hardware limitations?
>>> 
>>> Adam
>>> 
>>>> On Apr 14, 2018, at 11:42 AM, Andy Seaborne <[email protected]> wrote:
>>>> 
>>>> Hi Samita,
>>>> 
>>>> Firstly - as Adam points out - if there are no indexes then access to the 
>>>> data will be very slow.  For a GSPO index, that means queries must be 
>>>> "GRAPH <uri> { ... }" and probably "GRAPH <uri> { <fixedSubject>.. }".
>>>> 
>>>> GSPO means lookup by G then S within those G and the same for P then O.
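
To illustrate the point about GSPO ordering: a sorted index keyed on G, then S, then P, then O answers "G bound" or "G and S bound" lookups with a short range scan, while a lookup by object alone has to visit every entry. The sketch below uses a plain `TreeSet` of strings purely for illustration. It is not how TDB stores its indexes, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class GspoIndexSketch {
    // Separator that sorts below every other character used in the terms.
    private static final String SEP = "\u0000";

    // Keys are "G SEP S SEP P SEP O SEP", kept in sorted (GSPO) order.
    private final TreeSet<String> index = new TreeSet<>();

    public void add(String g, String s, String p, String o) {
        index.add(g + SEP + s + SEP + p + SEP + o + SEP);
    }

    /** Range scan over the leading bound terms: (G), (G,S), (G,S,P) or (G,S,P,O). */
    public List<String[]> lookupPrefix(String... bound) {
        String prefix = String.join(SEP, bound) + SEP;
        List<String[]> out = new ArrayList<>();
        // Entries sharing a prefix are contiguous, so stop at the first mismatch.
        for (String key : index.tailSet(prefix, true)) {
            if (!key.startsWith(prefix)) {
                break;
            }
            out.add(key.split(SEP));
        }
        return out;
    }

    /** No index starts with O, so an object lookup must scan every entry. */
    public List<String[]> scanByObject(String o) {
        List<String[]> out = new ArrayList<>();
        for (String key : index) {
            String[] terms = key.split(SEP);
            if (terms[3].equals(o)) {
                out.add(terms);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        GspoIndexSketch idx = new GspoIndexSketch();
        idx.add("g1", "s1", "p1", "o1");
        idx.add("g1", "s2", "p1", "o2");
        idx.add("g10", "s1", "p1", "o1");
        System.out.println(idx.lookupPrefix("g1").size());        // 2
        System.out.println(idx.lookupPrefix("g1", "s1").size());  // 1
        System.out.println(idx.scanByObject("o1").size());        // 2
    }
}
```

With only this one ordering available, any query that does not fix the graph first falls back to the full scan.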
>>>> 
>>>> I looked at the data and it seems to be about 42 million quads.
>>>> 
>>>> Using TDB1 (the loader is faster at this scale currently) is likely to be 
>>>> a better choice.
>>>> 
>>>> Looking at StoreParams in TDB2:
>>>> 
>>>> The code below creates the database at TDB2Factory.connectDataset so any 
>>>> StoreParams after that do not affect indexing.
>>>> 
>>>> I tried to make it work in the release but the code ignores provided 
>>>> StoreParams - sorry.  Even if it did work, it hits a test to make sure 
>>>> there is basic indexing (Adam's point).
>>>> 
>>>>   Andy
>>>> 
>>>> 
>>>> On 13/04/18 13:42, Samita Bai  / PhD CS Scholar @ City Campus wrote:
>>>>> I wrote the following code to build only one type of triple and quad 
>>>>> index but it is still creating all indexes 😞
>>>>> package ldbqPack;
>>>>> 
>>>>> import org.apache.jena.query.Dataset;
>>>>> import org.apache.jena.tdb2.TDB2Factory;
>>>>> import org.apache.jena.tdb2.setup.StoreParams;
>>>>> import org.apache.jena.tdb2.sys.DatabaseConnection;
>>>>> import org.apache.jena.dboe.base.block.FileMode;
>>>>> import org.apache.jena.dboe.base.file.Location;
>>>>> import org.apache.jena.tdb2.setup.StoreParamsFactory;
>>>>> 
>>>>> public class StrPrms {
>>>>>     static String[] tindexes = {"SPO"};
>>>>>     static String[] qindexes = {"GSPO"};
>>>>>     static String[] pindexes = {"GPU"};
>>>>> 
>>>>>     static final StoreParams pApp = StoreParams.builder()
>>>>>         .blockSize(12)              // Not dynamic
>>>>>         .nodeMissCacheSize(12)      // Dynamic
>>>>>         .build();
>>>>> 
>>>>>     static final StoreParams pLoc = StoreParams.builder()
>>>>>         .blockSize(0)
>>>>>         .nodeMissCacheSize(0)
>>>>>         .build();
>>>>> 
>>>>>     static final StoreParams pDft = StoreParams.builder()
>>>>>         .fileMode(FileMode.mapped)
>>>>>         .blockSize(8192)
>>>>>         .blockReadCacheSize(5000)
>>>>>         .blockWriteCacheSize(1000)
>>>>>         .node2NodeIdCacheSize(200000)
>>>>>         .nodeId2NodeCacheSize(750000)
>>>>>         .nodeMissCacheSize(1000)
>>>>>         .nodeTableBaseName("nodes")
>>>>>         .primaryIndexTriples("SPO")
>>>>>         .tripleIndexes(tindexes)
>>>>>         .primaryIndexQuads("GSPO")
>>>>>         .quadIndexes(qindexes)
>>>>>         .prefixTableBaseName("prefixes")
>>>>>         .primaryIndexPrefix("GPU")
>>>>>         .prefixIndexes(pindexes)
>>>>>         .build();
>>>>> 
>>>>>     public static void main(String[] args) {
>>>>>         final String DATASET_DIR_NAME = "DyLDO100";
>>>>>         Dataset dataset = TDB2Factory.connectDataset(DATASET_DIR_NAME);
>>>>>         Location location = Location.create(DATASET_DIR_NAME);
>>>>>         StoreParams custom_params =
>>>>>             StoreParamsFactory.decideStoreParams(location, true, pApp, pLoc, pDft);
>>>>>         DatabaseConnection.connectCreate(location, custom_params);
>>>>>         StoreParams params = StoreParams.getSmallStoreParams();
>>>>>         System.out.println(params);
>>>>>     }
>>>>> }
>>>>> Please help.
>>>>> Regards,
>>>>> Samita Bai
>>>>> ________________________________
>>>>> P : Please consider the environment before printing this e-mail
>>>>> ________________________________
>>>>> CONFIDENTIALITY / DISCLAIMER NOTICE: This e-mail and any attachments may 
>>>>> contain confidential and privileged information. If you are not the 
>>>>> intended recipient, please notify the sender immediately by return 
>>>>> e-mail, delete this e-mail and destroy any copies. Any dissemination or 
>>>>> use of this information by a person other than the intended recipient is 
>>>>> unauthorized and may be illegal.
>>>>> ________________________________
