Thank you, Andy and Adam, for the help. I am only indexing the quads whose object is either a literal or a foreign URI (i.e. the object belongs to a different dataset than the subject). I am using NXParser to parse the dataset (as Jena is giving various parsing errors), and then I am storing the quads in TDB2 in the following manner.

    public void SetQuadsList(String sub, String pred, String obj, String context) {
        Node subjects = NodeFactory.createURI(sub);
        Node objects = NodeFactory.createURI(obj);
        Node contexts = NodeFactory.createURI(context);
        //Node rdfSeeAlso = RDFS.seeAlso.asNode();
        Node predicates = NodeFactory.createURI(pred);
        //Quad quads = Quad.create(contexts, objects, rdfSeeAlso, subjects);
        Quad quads = Quad.create(contexts, subjects, predicates, objects);
        QuadList.add(quads);
        //System.out.println("Number of backlinks:" + QuadList.size());
        //System.out.println("quad written");
        //System.out.println("Quad"+quads.toString());
    }

    public List<Quad> GetQuadsList() {
        return QuadList;
    }

    public void QuadsToTDB(List<Quad> quadList) {
        final String DATASET_DIR_NAME = "DyLDO1000K_Index";
        Dataset dataset = TDB2Factory.connectDataset(DATASET_DIR_NAME);
        dataset.begin(ReadWrite.WRITE);
        try {
            DatasetGraph dsg = dataset.asDatasetGraph();
            Iterator<Quad> quads = quadList.iterator();
            System.out.println("Size of Quad List: " + quadList.size());
            while (quads.hasNext()) {
                //System.out.println("here");
                Quad quad = quads.next();
                dsg.add(quad);
                //System.out.println(quad.toString()+ "added");
                //RDFDataMgr.writeQuads(System.out, quads);
                //RDFDataMgr.write(System.out, dsg, Lang.NQUADS);
            }
            System.out.println("dsg created of size " + dsg.size());
            //RDFDataMgr.write(System.out, dsg, Lang.NQUADS);
            System.out.println("written dsg using datamgr.");
            //System.out.println(dataset.isEmpty());
            //RDFDataMgr.write(System.out, dsg, Lang.NQUADS);
            dataset.commit();
            System.out.println("committed dataset.");
        } catch (Exception e) {
            e.printStackTrace(System.err);
            //dataset.abort();
        } finally {
            //RDFDataMgr.write(System.out, dsg, Lang.NQUADS);
            dataset.end();
        }
        System.out.println("end method.");
    }
    }

I have indexed 40,000 files so far (I split the dataset into files according to context) and the index size has already reached 120 GB. In total I have 135,600 files, whose combined size is only 19.8 GB.
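To put those numbers in perspective, here is a rough back-of-the-envelope sketch in Java. It assumes the index keeps growing linearly with the number of files indexed, which may well not hold. The class and method names are made up for illustration and are not part of Jena.

```java
public class IndexSizeEstimate {

    /**
     * Linear extrapolation (an assumption, not a measurement):
     * scale the size observed so far by total files / files indexed.
     */
    public static double projectedSizeGb(double sizeSoFarGb, long filesIndexed, long filesTotal) {
        return sizeSoFarGb * filesTotal / filesIndexed;
    }

    /** How many times larger the index is than the source data. */
    public static double ratio(double indexGb, double sourceGb) {
        return indexGb / sourceGb;
    }

    public static void main(String[] args) {
        // Figures taken from the message above.
        double projected = projectedSizeGb(120.0, 40_000, 135_600);
        System.out.println("Projected index size (GB): " + projected);
        System.out.println("Projected index / source:  " + ratio(projected, 19.8));
    }
}
```

Under that linear assumption the full index would come to roughly 407 GB, about 20 times the size of the 19.8 GB of source files.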

Why is TDB producing such a big index? I am confused :( Is there a problem in my code? Please suggest any improvements I could make.

Regards,
Samita Bai

________________________________
From: ajs6f <aj...@apache.org>
Sent: 15 April 2018 03:07:59
To: users@jena.apache.org
Subject: Re: TDB 2 Store Parameters

42 million quads is nowhere near enough for either TDB version to have any problem doing normal indexing, even with very modest hardware (I ingest datasets like that on my laptop all the time). Do you have some extraordinary hardware limitations?

Adam

> On Apr 14, 2018, at 11:42 AM, Andy Seaborne <a...@apache.org> wrote:
>
> Hi Samita,
>
> Firstly - as Adam points out - if there are no indexes then access to the
> data will be very slow. For a GSPO index, that means queries must be
> "GRAPH <uri> { ... }" and probably "GRAPH <uri> { <fixedSubject>.. }".
>
> GSPO means lookup by G, then S within that G, and the same for P then O.
>
> I looked at the data and it seems to be about 42 million quads.
>
> Using TDB1 (the loader is faster at this scale currently) is likely to be a
> better choice.
>
> Looking at StoreParams in TDB2:
>
> The code below creates the database at TDB2Factory.connectDataset so any
> StoreParams after that do not affect indexing.
>
> I tried to make it work in the release but the code ignores provided
> StoreParams - sorry. Even if it did work, it hits a test to make sure there
> is basic indexing (Adam's point).
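
The point about GSPO ("lookup by G then S within those G") can be pictured with a toy model. The sketch below uses a sorted in-memory set as a stand-in for the index; it is not how TDB2 stores data. The class and its methods are hypothetical, and quads are plain string arrays rather than Jena objects.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

public class GspoIndexSketch {

    // Orders quads (String[4] = {g, s, p, o}) by G, then S, then P, then O.
    static final Comparator<String[]> GSPO = (a, b) -> {
        for (int i = 0; i < 4; i++) {
            int c = a[i].compareTo(b[i]);
            if (c != 0) {
                return c;
            }
        }
        return 0;
    };

    private final TreeSet<String[]> index = new TreeSet<>(GSPO);

    public void add(String g, String s, String p, String o) {
        index.add(new String[] {g, s, p, o});
    }

    /** G and S fixed: a range scan over one contiguous slice of the sorted set. */
    public List<String[]> byGraphAndSubject(String g, String s) {
        String[] from = {g, s, "", ""};        // smallest possible quad with this (g, s)
        String[] to = {g, s + "\0", "", ""};   // s + NUL is the next string after s
        return new ArrayList<>(index.subSet(from, true, to, false));
    }

    /** Only O fixed: the G-S-P-O ordering gives no usable prefix, so every quad is visited. */
    public List<String[]> byObject(String o) {
        List<String[]> hits = new ArrayList<>();
        for (String[] q : index) {
            if (q[3].equals(o)) {
                hits.add(q);
            }
        }
        return hits;
    }

    /** Small made-up data set for trying the two lookups. */
    public static GspoIndexSketch sample() {
        GspoIndexSketch idx = new GspoIndexSketch();
        idx.add("g1", "s1", "p1", "o1");
        idx.add("g1", "s1", "p2", "o2");
        idx.add("g1", "s2", "p1", "o1");
        idx.add("g2", "s1", "p1", "o3");
        return idx;
    }

    public static void main(String[] args) {
        GspoIndexSketch idx = sample();
        System.out.println("g1/s1 matches: " + idx.byGraphAndSubject("g1", "s1").size());
        System.out.println("o1 matches:    " + idx.byObject("o1").size());
    }
}
```

With only this one ordering available, a query that fixes the graph and subject touches a small slice of the data, while a query that fixes only the object has to walk everything, which is why access without the other indexes is so slow.
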
>
> Andy
>
>
> On 13/04/18 13:42, Samita Bai / PhD CS Scholar @ City Campus wrote:
>> I wrote the following code to build only one type of triple and quad index
>> but it is still creating all indexes 😞
>>
>> package ldbqPack;
>>
>> import org.apache.jena.query.Dataset;
>> import org.apache.jena.tdb2.TDB2Factory;
>> import org.apache.jena.tdb2.setup.StoreParams;
>> import org.apache.jena.tdb2.sys.DatabaseConnection;
>> import org.apache.jena.dboe.base.block.FileMode;
>> import org.apache.jena.dboe.base.file.Location;
>> import org.apache.jena.tdb2.setup.StoreParamsFactory;
>>
>> public class StrPrms {
>>     static String[] tindexes = {"SPO"};
>>     static String[] qindexes = {"GSPO"};
>>     static String[] pindexes = {"GPU"};
>>
>>     static final StoreParams pApp = StoreParams.builder()
>>             .blockSize(12)          // Not dynamic
>>             .nodeMissCacheSize(12)  // Dynamic
>>             .build();
>>
>>     static final StoreParams pLoc = StoreParams.builder()
>>             .blockSize(0)
>>             .nodeMissCacheSize(0).build();
>>
>>     static final StoreParams pDft = StoreParams.builder()
>>             .fileMode(FileMode.mapped)
>>             .blockSize(8192)
>>             .blockReadCacheSize(5000)
>>             .blockWriteCacheSize(1000)
>>             .node2NodeIdCacheSize(200000)
>>             .nodeId2NodeCacheSize(750000)
>>             .nodeMissCacheSize(1000)
>>             .nodeTableBaseName("nodes")
>>             .primaryIndexTriples("SPO")
>>             .tripleIndexes(tindexes)
>>             .primaryIndexQuads("GSPO")
>>             .quadIndexes(qindexes)
>>             .prefixTableBaseName("prefixes")
>>             .primaryIndexPrefix("GPU")
>>             .prefixIndexes(pindexes)
>>             .build();
>>
>>     public static void main(String[] args) {
>>         // TODO Auto-generated method stub
>>         final String DATASET_DIR_NAME = "DyLDO100";
>>         Dataset dataset = TDB2Factory.connectDataset ( DATASET_DIR_NAME );
>>         Location location = Location.create(DATASET_DIR_NAME);
>>         StoreParams custom_params =
>>             StoreParamsFactory.decideStoreParams(location, true, pApp, pLoc, pDft);
>>         DatabaseConnection.connectCreate(location, custom_params);
>>         StoreParams params = StoreParams.getSmallStoreParams();
>>         System.out.println(params);
>>     }
>> }
>>
>> Please help.
>>
>> Regards,
>> Samita Bai
>> ________________________________
>> P : Please consider the environment before printing this e-mail
>> ________________________________
>> CONFIDENTIALITY / DISCLAIMER NOTICE: This e-mail and any attachments may
>> contain confidential and privileged information. If you are not the intended
>> recipient, please notify the sender immediately by return e-mail, delete
>> this e-mail and destroy any copies. Any dissemination or use of this
>> information by a person other than the intended recipient is unauthorized
>> and may be illegal.
>> ________________________________