You should be able to check the validity of any of your files just by running them through Jena's `riot` command.
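
If it is more convenient to run the same kind of check from Java, here is a minimal sketch, assuming a reasonably recent Jena with RIOT on the classpath. The class name `CheckRdfFile` and the helper `isValid` are made up for illustration; the syntax is picked from the file extension (e.g. `.nq` for N-Quads), and the strict error handler treats warnings as failures too.

```java
import org.apache.jena.riot.RDFParser;
import org.apache.jena.riot.RiotException;
import org.apache.jena.riot.system.ErrorHandlerFactory;
import org.apache.jena.riot.system.StreamRDFLib;

// Hypothetical helper class, not part of Jena.
public class CheckRdfFile {

    /**
     * Parses the file and throws away the output.
     * Returns false as soon as the parser reports a warning or an error.
     */
    static boolean isValid(String path) {
        try {
            RDFParser.source(path)
                     // Strict handler: throws RiotException on warnings and errors.
                     .errorHandler(ErrorHandlerFactory.errorHandlerStrict)
                     // Null sink: nothing is stored, the data is only parsed.
                     .parse(StreamRDFLib.sinkNull());
            return true;
        } catch (RiotException e) {
            System.err.println(path + ": " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        for (String path : args) {
            System.out.println(path + (isValid(path) ? " OK" : " INVALID"));
        }
    }
}
```

Running it over a handful of the files that fail should surface the actual parser messages, which is what we need to see.
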
You can try loading them into a TDB1 or TDB2 database by using the `tdbloader` or `tdb2.tdbloader` commands.

ajs6f

> On Apr 16, 2018, at 12:28 PM, Samita Bai / PhD CS Scholar @ City Campus
> <[email protected]> wrote:
>
> OK Andy I got your point. Can you please share the code that you used to read
> the Dynamic Linked Data Observatory dataset?
>
> Regards,
>
> Samita Bai
>
> ________________________________
> From: Andy Seaborne <[email protected]>
> Sent: 16 April 2018 15:34:07
> To: [email protected]
> Subject: Re: TDB 2 Store Parameters
>
> If you wish to process the data as it is parsed, then see StreamRDF and
> either
>
> NxParser, which is not part of Jena, is not a validating parser.
>
> If the data is not valid, then you will have problems at some point,
> either loading, querying or outputting later.
>
> Adam has explained that TDB2 indexes heavily so that querying is well
> served.
>
> We can't help with the parser errors without knowing what they are.
>
> Which files from Dynamic Linked Data Observatory are you processing?
> Don't the later ones replace the earlier ones?
>
> I found that the last n-quads file was 42 million triples and all valid.
>
> Andy
>
> On 16/04/18 11:05, ajs6f wrote:
>> If there are syntax errors in your RDF (and it sounds like that is why Jena
>> will not read it directly) you are doing yourself no service by taking
>> unusual pains to force TDB to ingest your data.
>>
>> Please show us the errors that Jena is throwing trying to read your data and
>> an appropriate sample of the data in question.
>>
>> ajs6f
>>
>>> On Apr 16, 2018, at 4:42 AM, Samita Bai / PhD CS Scholar @ City Campus
>>> <[email protected]> wrote:
>>>
>>> In addition to my previous query: it is taking a lot of time to first parse
>>> the dataset using NXParser, then check the object, create the quad
>>> again and store it in TDB. It could be very simple if we could take the
>>> quad, check its object and insert it in TDB.
>>>
>>> But Jena is not helping me with this 😞
>>>
>>> So I have to create quads again and store them in TDB.
>>>
>>> Any help is surely appreciated.
>>>
>>> Regards,
>>>
>>> Samita Bai
>>>
>>> ________________________________
>>> From: Samita Bai / PhD CS Scholar @ City Campus
>>> Sent: 16 April 2018 13:33:51
>>> To: [email protected]
>>> Subject: Re: TDB 2 Store Parameters
>>>
>>> Thank you Andy and Adam for the help. Actually, I am just indexing the
>>> quads where the object is either a literal or a foreign URI (i.e. an object
>>> belonging to a different dataset than the subject). I am using NXParser (as
>>> Jena is giving various parsing errors) to parse the dataset and then I am
>>> storing in TDB2 in the following manner.
>>>
>>> public void SetQuadsList(String sub, String pred, String obj, String context) {
>>>     Node subjects = NodeFactory.createURI(sub);
>>>     Node objects = NodeFactory.createURI(obj);
>>>     Node contexts = NodeFactory.createURI(context);
>>>     //Node rdfSeeAlso = RDFS.seeAlso.asNode();
>>>
>>>     Node predicates = NodeFactory.createURI(pred);
>>>
>>>     //Quad quads = Quad.create(contexts, objects, rdfSeeAlso, subjects);
>>>
>>>     Quad quads = Quad.create(contexts, subjects, predicates, objects);
>>>
>>>     QuadList.add(quads);
>>>
>>>     //System.out.println("Number of backlinks:" + QuadList.size());
>>>     //System.out.println("quad written");
>>>     //System.out.println("Quad"+quads.toString());
>>> }
>>>
>>> public List<Quad> GetQuadsList(){
>>>     return QuadList;
>>> }
>>>
>>> public void QuadsToTDB(List<Quad> quadList) {
>>>     final String DATASET_DIR_NAME = "DyLDO1000K_Index";
>>>     Dataset dataset = TDB2Factory.connectDataset ( DATASET_DIR_NAME );
>>>
>>>     dataset.begin ( ReadWrite.WRITE );
>>>     try {
>>>         DatasetGraph dsg = dataset.asDatasetGraph();
>>>         Iterator<Quad> quads = quadList.iterator();
>>>         System.out.println("Size of Quad List: "+quadList.size());
>>>         while ( quads.hasNext() ) {
>>>             //System.out.println("here");
>>>             Quad quad = quads.next();
>>>             dsg.add(quad);
>>>             //System.out.println(quad.toString()+ "added");
>>>             //RDFDataMgr.writeQuads(System.out, quads);
>>>             // RDFDataMgr.write(System.out, dsg, Lang.NQUADS);
>>>         }
>>>         System.out.println("dsg created of size "+dsg.size());
>>>         //RDFDataMgr.write(System.out, dsg, Lang.NQUADS);
>>>         System.out.println("written dsg using datamgr.");
>>>
>>>         //System.out.println(dataset.isEmpty());
>>>         //RDFDataMgr.write(System.out, dsg, Lang.NQUADS);
>>>         dataset.commit();
>>>
>>>         System.out.println("committed dataset.");
>>>     } catch ( Exception e ) {
>>>         e.printStackTrace(System.err);
>>>         //dataset.abort();
>>>     } finally {
>>>         //RDFDataMgr.write(System.out, dsg, Lang.NQUADS);
>>>         dataset.end();
>>>     }
>>>     System.out.println("end method.");
>>> }
>>> }
>>>
>>> I have indexed 40,000 files (as I have split the dataset into files
>>> according to context) and the index size has become 120 GB. I have a total
>>> of 1,35,600 files whose total size is 19.8 GB only.
>>>
>>> Why is TDB making such a big index? I am confused :( Is there any
>>> problem in my code?
>>>
>>> Please suggest if there can be some improvements.
>>>
>>> Regards,
>>>
>>> Samita Bai
>>>
>>> ________________________________
>>> From: ajs6f <[email protected]>
>>> Sent: 15 April 2018 03:07:59
>>> To: [email protected]
>>> Subject: Re: TDB 2 Store Parameters
>>>
>>> 42 million quads is nothing like so many that either TDB version should
>>> have any problem doing normal indexing (assuming very little in the way of
>>> hardware-- I ingest datasets like that on my laptop all the time).
>>>
>>> Do you have some extraordinary hardware limitations?
>>>
>>> Adam
>>>
>>>> On Apr 14, 2018, at 11:42 AM, Andy Seaborne <[email protected]> wrote:
>>>>
>>>> Hi Samita,
>>>>
>>>> Firstly - as Adam points out - if there are no indexes then access to the
>>>> data will be very slow.
>>>> For a GSPO index, that means queries must be
>>>> "GRAPH <uri> { ... }" and probably "GRAPH <uri> { <fixedSubject>.. }".
>>>>
>>>> GSPO means lookup by G then S within those G and the same for P then O.
>>>>
>>>> I looked at the data and it seems to be about 42 million quads.
>>>>
>>>> Using TDB1 (the loader is faster at this scale currently) is likely to be
>>>> a better choice.
>>>>
>>>> Looking at StoreParams in TDB2:
>>>>
>>>> The code below creates the database at TDB2Factory.connectDataset so any
>>>> StoreParams after that do not affect indexing.
>>>>
>>>> I tried to make it work in the release but the code ignores provided
>>>> StoreParams - sorry. Even if it did work, it hits a test to make sure
>>>> there is basic indexing (Adam's point).
>>>>
>>>> Andy
>>>>
>>>> On 13/04/18 13:42, Samita Bai / PhD CS Scholar @ City Campus wrote:
>>>>> I wrote the following code to build only one type of triple and quad
>>>>> index but it is still creating all indexes 😞
>>>>>
>>>>> package ldbqPack;
>>>>>
>>>>> import org.apache.jena.query.Dataset;
>>>>> import org.apache.jena.tdb2.TDB2Factory;
>>>>> import org.apache.jena.tdb2.setup.StoreParams;
>>>>> import org.apache.jena.tdb2.sys.DatabaseConnection;
>>>>> import org.apache.jena.dboe.base.block.FileMode;
>>>>> import org.apache.jena.dboe.base.file.Location;
>>>>> import org.apache.jena.tdb2.setup.StoreParamsFactory;
>>>>>
>>>>> public class StrPrms {
>>>>>     static String[] tindexes= {"SPO"};
>>>>>     static String[] qindexes= {"GSPO"};
>>>>>     static String[] pindexes= {"GPU"};
>>>>>     static final StoreParams pApp = StoreParams.builder()
>>>>>         .blockSize(12) // Not dynamic
>>>>>         .nodeMissCacheSize(12) // Dynamic
>>>>>         .build();
>>>>>     static final StoreParams pLoc = StoreParams.builder()
>>>>>         .blockSize(0)
>>>>>         .nodeMissCacheSize(0).build();
>>>>>     static final StoreParams pDft = StoreParams.builder()
>>>>>         .fileMode(FileMode.mapped)
>>>>>         .blockSize(8192)
>>>>>         .blockReadCacheSize(5000)
>>>>>         .blockWriteCacheSize(1000)
>>>>>         .node2NodeIdCacheSize(200000)
>>>>>         .nodeId2NodeCacheSize(750000)
>>>>>         .nodeMissCacheSize(1000)
>>>>>         .nodeTableBaseName("nodes")
>>>>>         .primaryIndexTriples("SPO")
>>>>>         .tripleIndexes(tindexes)
>>>>>         .primaryIndexQuads("GSPO")
>>>>>         .quadIndexes(qindexes)
>>>>>         .prefixTableBaseName("prefixes")
>>>>>         .primaryIndexPrefix("GPU")
>>>>>         .prefixIndexes(pindexes)
>>>>>         .build();
>>>>>
>>>>>     public static void main(String[] args) {
>>>>>         // TODO Auto-generated method stub
>>>>>         final String DATASET_DIR_NAME = "DyLDO100";
>>>>>         Dataset dataset = TDB2Factory.connectDataset ( DATASET_DIR_NAME );
>>>>>         Location location = Location.create(DATASET_DIR_NAME);
>>>>>         StoreParams custom_params =
>>>>>             StoreParamsFactory.decideStoreParams(location, true, pApp, pLoc, pDft);
>>>>>         DatabaseConnection.connectCreate(location, custom_params);
>>>>>         StoreParams params = StoreParams.getSmallStoreParams();
>>>>>         System.out.println(params);
>>>>>     }
>>>>> }
>>>>>
>>>>> Please help.
>>>>> Regards,
>>>>> Samita Bai
>>>>> ________________________________
>>>>> P : Please consider the environment before printing this e-mail
>>>>> ________________________________
>>>>> CONFIDENTIALITY / DISCLAIMER NOTICE: This e-mail and any attachments may
>>>>> contain confidential and privileged information. If you are not the
>>>>> intended recipient, please notify the sender immediately by return
>>>>> e-mail, delete this e-mail and destroy any copies. Any dissemination or
>>>>> use of this information by a person other than the intended recipient is
>>>>> unauthorized and may be illegal.
>>>>> ________________________________
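
On the filtering step described in the quoted thread (keep a quad only if its object is a literal or a "foreign" URI), the decision itself can be written without any parser. This is a rough sketch under one loud assumption: "belonging to a different dataset" is approximated here by "has a different host name than the subject", which may or may not match the definition actually intended. The class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper, not part of Jena or NxParser.
public class ForeignObjectFilter {

    /** Lower-cased host of an absolute IRI, or null if it cannot be parsed or has no host. */
    static String hostOf(String iri) {
        try {
            String host = new URI(iri).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    /**
     * Keep the quad if the object is a literal, or if subject and object
     * are IRIs with different hosts (ASSUMPTION: host stands in for "dataset").
     * IRIs whose host cannot be determined are not treated as foreign.
     */
    static boolean keep(String subject, String object, boolean objectIsLiteral) {
        if (objectIsLiteral) {
            return true;
        }
        String s = hostOf(subject);
        String o = hostOf(object);
        return s != null && o != null && !s.equals(o);
    }

    public static void main(String[] args) {
        System.out.println(keep("http://a.example/x", "http://b.example/y", false));
        System.out.println(keep("http://a.example/x", "http://a.example/z", false));
    }
}
```

A check like this could be applied to each quad as it is parsed, rather than after rebuilding the quads in a list.
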
