Dear Andy,
I downloaded the same dataset from the link you gave, i.e. http://swse.deri.org/dyldo/data/2016-03-27/data.nq.gz

Then I extracted it and ran the following code:

    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.InputStream;

    import org.apache.jena.tdb.TDBLoader;
    import org.apache.jena.tdb.base.file.Location;
    import org.apache.jena.tdb.setup.DatasetBuilderStd;
    import org.apache.jena.tdb.store.DatasetGraphTDB;

    public class ReadQuadInJena {
        public static void main(String[] args) {
            TDBLoader tlobj = new TDBLoader();
            // Path to the extracted N-Quads file
            String Ds = "/home/samita/data.nq";
            // Directory that holds the TDB database
            Location location = Location.create("/home/samita/Load_TDB");
            DatasetGraphTDB dgtdb = DatasetBuilderStd.create(location);
            try {
                InputStream is = new FileInputStream(Ds);
                tlobj.loadDataset(dgtdb, is);
            } catch (FileNotFoundException e) {
                // Report a missing input file instead of ignoring it
                e.printStackTrace();
            }
        }
    }

It ended up with this error:

    Exception in thread "main" org.apache.jena.riot.RiotException: [line: 30506, col: 232] Illegal character in IRI (codepoint 0x7C, '|'): <http://fonts.googleapis.com/css?family=Nunito[|]...>
        at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.fatal(ErrorHandlerFactory.java:147)
        at org.apache.jena.riot.lang.LangEngine.raiseException(LangEngine.java:148)
        at org.apache.jena.riot.lang.LangEngine.nextToken(LangEngine.java:105)
        at org.apache.jena.riot.lang.LangNQuads.parseOne(LangNQuads.java:67)
        at org.apache.jena.riot.lang.LangNQuads.runParser(LangNQuads.java:54)
        at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:41)
        at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:195)
        at org.apache.jena.riot.RDFParser.read(RDFParser.java:334)
        at org.apache.jena.riot.RDFParser.parseNotUri(RDFParser.java:324)
        at org.apache.jena.riot.RDFParser.parse(RDFParser.java:273)
        at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:498)
        at org.apache.jena.riot.RDFDataMgr.parseFromInputStream(RDFDataMgr.java:870)
        at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:693)
        at org.apache.jena.tdb.store.bulkloader.BulkLoader.loadQuads$(BulkLoader.java:152)
        at org.apache.jena.tdb.store.bulkloader.BulkLoader.loadDataset(BulkLoader.java:115)
        at org.apache.jena.tdb.TDBLoader.loadDataset$(TDBLoader.java:256)
        at org.apache.jena.tdb.TDBLoader.loadDataset(TDBLoader.java:191)
        at ldbqPack.ReadQuadInJena.main(ReadQuadInJena.java:47)

If it ran fine at your end, what is wrong with my code? Please help me.

________________________________
From: Andy Seaborne <a...@apache.org>
Sent: 16 April 2018 22:13:36
To: users@jena.apache.org
Subject: Re: TDB 2 Store Parameters

I downloaded http://swse.deri.org/dyldo/data/2016-03-27/data.nq.gz (the latest I could find) and used tdbloader.

Is that the data you are using?

    Andy

On 16/04/18 17:32, ajs6f wrote:
> You should be able to check the validity of any of your files just by running
> them through Jena's `riot` command.
>
> You can try loading them into a TDB1 or TDB2 db by using the `tdbloader` or
> `tdb2.tdbloader` commands.
>
> ajs6f
>
>> On Apr 16, 2018, at 12:28 PM, Samita Bai / PhD CS Scholar @ City Campus
>> <s...@iba.edu.pk> wrote:
>>
>> OK Andy, I got your point. Can you please share the code that you used to
>> read the Dynamic Linked Data Observatory dataset?
>>
>> Regards,
>>
>> Samita Bai
>>
>> ________________________________
>> From: Andy Seaborne <a...@apache.org>
>> Sent: 16 April 2018 15:34:07
>> To: users@jena.apache.org
>> Subject: Re: TDB 2 Store Parameters
>>
>> If you wish to process the data as it is parsed, then see StreamRDF and
>> either
>>
>> NxParser, which is not part of Jena, is not a validating parser.
>>
>> If the data is not valid, then you will have problems at some point,
>> either loading, querying or outputting later.
>>
>> Adam has explained that TDB2 indexes heavily so that querying is well
>> served.
>>
>> We can't help with the parser errors without knowing what they are.
>>
>> Which files from Dynamic Linked Data Observatory are you processing?
>> Don't the later ones replace the earlier ones?
>>
>> I found that the last n-quads file was 42 million triples and all valid.
>>
>> Andy
>>
>> On 16/04/18 11:05, ajs6f wrote:
>>> If there are syntax errors in your RDF (and it sounds like that is why Jena
>>> will not read it directly), you are doing yourself no service by taking
>>> unusual pains to force TDB to ingest your data.
>>>
>>> Please show us the errors that Jena is throwing trying to read your data
>>> and an appropriate sample of the data in question.
>>>
>>> ajs6f
>>>
>>>> On Apr 16, 2018, at 4:42 AM, Samita Bai / PhD CS Scholar @ City Campus
>>>> <s...@iba.edu.pk> wrote:
>>>>
>>>> In addition to my previous query: it takes a lot of time to first parse
>>>> the dataset using NXParser, then check the object, then create the quad
>>>> again and store it in TDB. It would be much simpler if we could take each
>>>> quad, check its object and insert it in TDB.
>>>>
>>>> But Jena is not helping me with this 😞
>>>>
>>>> So I have to create the quads again and store them in TDB.
>>>>
>>>> Any help is surely appreciated.
>>>>
>>>> Regards,
>>>>
>>>> Samita Bai
>>>>
>>>> ________________________________
>>>> From: Samita Bai / PhD CS Scholar @ City Campus
>>>> Sent: 16 April 2018 13:33:51
>>>> To: users@jena.apache.org
>>>> Subject: Re: TDB 2 Store Parameters
>>>>
>>>> Thank you Andy and Adam for the help. Actually, I am only indexing the
>>>> quads whose object is either a literal or a foreign URI (i.e. an object
>>>> belonging to a different dataset than the subject). I am using NXParser (as
>>>> Jena is giving various parsing errors) to parse the dataset, and then I am
>>>> storing the quads in TDB2 in the following manner.
>>>>
>>>>     public void SetQuadsList(String sub, String pred, String obj,
>>>>             String context) {
>>>>         Node subjects = NodeFactory.createURI(sub);
>>>>         // createURI treats obj as an IRI; a literal object would need
>>>>         // NodeFactory.createLiteral instead
>>>>         Node objects = NodeFactory.createURI(obj);
>>>>         Node contexts = NodeFactory.createURI(context);
>>>>         //Node rdfSeeAlso = RDFS.seeAlso.asNode();
>>>>         Node predicates = NodeFactory.createURI(pred);
>>>>         //Quad quads = Quad.create(contexts, objects, rdfSeeAlso, subjects);
>>>>         Quad quads = Quad.create(contexts, subjects, predicates, objects);
>>>>         QuadList.add(quads);
>>>>     }
>>>>
>>>>     public List<Quad> GetQuadsList() {
>>>>         return QuadList;
>>>>     }
>>>>
>>>>     public void QuadsToTDB(List<Quad> quadList) {
>>>>         final String DATASET_DIR_NAME = "DyLDO1000K_Index";
>>>>         Dataset dataset = TDB2Factory.connectDataset(DATASET_DIR_NAME);
>>>>
>>>>         // All additions happen inside one write transaction
>>>>         dataset.begin(ReadWrite.WRITE);
>>>>         try {
>>>>             DatasetGraph dsg = dataset.asDatasetGraph();
>>>>             Iterator<Quad> quads = quadList.iterator();
>>>>             System.out.println("Size of Quad List: " + quadList.size());
>>>>             while (quads.hasNext()) {
>>>>                 Quad quad = quads.next();
>>>>                 dsg.add(quad);
>>>>             }
>>>>             // size() counts named graphs, not quads
>>>>             System.out.println("dsg created of size " + dsg.size());
>>>>             System.out.println("written dsg using datamgr.");
>>>>             dataset.commit();
>>>>             System.out.println("committed dataset.");
>>>>         } catch (Exception e) {
>>>>             e.printStackTrace(System.err);
>>>>             // Discard the partial write if anything failed
>>>>             dataset.abort();
>>>>         } finally {
>>>>             dataset.end();
>>>>         }
>>>>         System.out.println("end method.");
>>>>     }}
>>>>
>>>> I have indexed 40,000 files (I have split the dataset into files
>>>> according to context) and the index size has become 120 GB. I have a total
>>>> of 135,600 files whose total size is only 19.8 GB.
>>>>
>>>> Why is TDB producing such a BIG index? I am confused :( Is there a
>>>> problem in my code?
>>>>
>>>> Please suggest any improvements.
>>>>
>>>> Regards,
>>>>
>>>> Samita Bai
>>>>
>>>> ________________________________
>>>> From: ajs6f <aj...@apache.org>
>>>> Sent: 15 April 2018 03:07:59
>>>> To: users@jena.apache.org
>>>> Subject: Re: TDB 2 Store Parameters
>>>>
>>>> 42 million quads is nothing like so many that either TDB version should
>>>> have any problem doing normal indexing (assuming very little in the way of
>>>> hardware -- I ingest datasets like that on my laptop all the time).
>>>>
>>>> Do you have some extraordinary hardware limitations?
>>>>
>>>> Adam
>>>>
>>>>> On Apr 14, 2018, at 11:42 AM, Andy Seaborne <a...@apache.org> wrote:
>>>>>
>>>>> Hi Samita,
>>>>>
>>>>> Firstly - as Adam points out - if there are no indexes then access to the
>>>>> data will be very slow. For a GSPO index, that means queries must be
>>>>> "GRAPH <uri> { ... }" and probably "GRAPH <uri> { <fixedSubject>.. }".
>>>>>
>>>>> GSPO means lookup by G then S within those G and the same for P then O.
>>>>>
>>>>> I looked at the data and it seems to be about 42 million quads.
>>>>>
>>>>> Using TDB1 (the loader is faster at this scale currently) is likely to be
>>>>> a better choice.
>>>>>
>>>>> Looking at StoreParams in TDB2:
>>>>>
>>>>> The code below creates the database at TDB2Factory.connectDataset so any
>>>>> StoreParams after that do not affect indexing.
>>>>>
>>>>> I tried to make it work in the release but the code ignores provided
>>>>> StoreParams - sorry. Even if it did work, it hits a test to make sure
>>>>> there is basic indexing (Adam's point).
>>>>>
>>>>> Andy
>>>>>
>>>>>
>>>>> On 13/04/18 13:42, Samita Bai / PhD CS Scholar @ City Campus wrote:
>>>>>> I wrote the following code to build only one type of triple and quad
>>>>>> index but it is still creating all indexes 😞
>>>>>>
>>>>>>     package ldbqPack;
>>>>>>
>>>>>>     import org.apache.jena.query.Dataset;
>>>>>>     import org.apache.jena.tdb2.TDB2Factory;
>>>>>>     import org.apache.jena.tdb2.setup.StoreParams;
>>>>>>     import org.apache.jena.tdb2.sys.DatabaseConnection;
>>>>>>     import org.apache.jena.dboe.base.block.FileMode;
>>>>>>     import org.apache.jena.dboe.base.file.Location;
>>>>>>     import org.apache.jena.tdb2.setup.StoreParamsFactory;
>>>>>>
>>>>>>     public class StrPrms {
>>>>>>         static String[] tindexes = {"SPO"};
>>>>>>         static String[] qindexes = {"GSPO"};
>>>>>>         static String[] pindexes = {"GPU"};
>>>>>>
>>>>>>         static final StoreParams pApp = StoreParams.builder()
>>>>>>                 .blockSize(12)          // Not dynamic
>>>>>>                 .nodeMissCacheSize(12)  // Dynamic
>>>>>>                 .build();
>>>>>>         static final StoreParams pLoc = StoreParams.builder()
>>>>>>                 .blockSize(0)
>>>>>>                 .nodeMissCacheSize(0).build();
>>>>>>         static final StoreParams pDft = StoreParams.builder()
>>>>>>                 .fileMode(FileMode.mapped)
>>>>>>                 .blockSize(8192)
>>>>>>                 .blockReadCacheSize(5000)
>>>>>>                 .blockWriteCacheSize(1000)
>>>>>>                 .node2NodeIdCacheSize(200000)
>>>>>>                 .nodeId2NodeCacheSize(750000)
>>>>>>                 .nodeMissCacheSize(1000)
>>>>>>                 .nodeTableBaseName("nodes")
>>>>>>                 .primaryIndexTriples("SPO")
>>>>>>                 .tripleIndexes(tindexes)
>>>>>>                 .primaryIndexQuads("GSPO")
>>>>>>                 .quadIndexes(qindexes)
>>>>>>                 .prefixTableBaseName("prefixes")
>>>>>>                 .primaryIndexPrefix("GPU")
>>>>>>                 .prefixIndexes(pindexes)
>>>>>>                 .build();
>>>>>>
>>>>>>         public static void main(String[] args) {
>>>>>>             // TODO Auto-generated method stub
>>>>>>             final String DATASET_DIR_NAME = "DyLDO100";
>>>>>>             Dataset dataset = TDB2Factory.connectDataset(DATASET_DIR_NAME);
>>>>>>             Location location = Location.create(DATASET_DIR_NAME);
>>>>>>             StoreParams custom_params =
>>>>>>                 StoreParamsFactory.decideStoreParams(location, true, pApp, pLoc, pDft);
>>>>>>             DatabaseConnection.connectCreate(location, custom_params);
>>>>>>             StoreParams params = StoreParams.getSmallStoreParams();
>>>>>>             System.out.println(params);
>>>>>>         }
>>>>>>     }
>>>>>>
>>>>>> Please help.
>>>>>> Regards,
>>>>>> Samita Bai
>>>>>> ________________________________
>>>>>> P : Please consider the environment before printing this e-mail
>>>>>> ________________________________
>>>>>> CONFIDENTIALITY / DISCLAIMER NOTICE: This e-mail and any attachments may
>>>>>> contain confidential and privileged information. If you are not the
>>>>>> intended recipient, please notify the sender immediately by return
>>>>>> e-mail, delete this e-mail and destroy any copies. Any dissemination or
>>>>>> use of this information by a person other than the intended recipient is
>>>>>> unauthorized and may be illegal.
>>>>>> ________________________________
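
A note on the RiotException reported at the top of this thread: the message points at a '|' inside an IRI on line 30506 of the extracted file. The N-Quads grammar does not allow '|' inside <...> (nor space, <, ", {, }, ^, ` or a bare \), so the exception appears to come from the data on that line, not from the loading calls. Below is a minimal sketch of a pre-check that lists such lines before a load is attempted. It assumes plain Java with no Jena dependency; the class name IriCharScan and its methods are hypothetical, the scanner is far simpler than a real parser, and its column numbers may not match the ones riot reports.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Jena: flags characters that the
// N-Quads IRIREF production excludes from the inside of <...>.
class IriCharScan {
    // Excluded in addition to U+0000..U+0020. A backslash is only legal as
    // the start of a \\uXXXX style escape, which is handled separately below.
    private static final String EXCLUDED = "<\"{}|^`\\";

    // Returns the 1-based column of the first excluded character found
    // inside an IRI on this line, or -1 if none is found.
    static int firstBadColumn(String line) {
        boolean inIri = false;
        boolean inLiteral = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inLiteral) {
                if (c == '\\') {
                    i++;                    // skip the escaped character
                } else if (c == '"') {
                    inLiteral = false;
                }
            } else if (inIri) {
                if (c == '>') {
                    inIri = false;
                } else if (c == '\\' && i + 1 < line.length()
                        && (line.charAt(i + 1) == 'u' || line.charAt(i + 1) == 'U')) {
                    i++;                    // start of a unicode escape
                } else if (c <= 0x20 || EXCLUDED.indexOf(c) >= 0) {
                    return i + 1;
                }
            } else if (c == '<') {
                inIri = true;
            } else if (c == '"') {
                inLiteral = true;
            } else if (c == '#') {
                break;                      // rest of the line is a comment
            }
        }
        return -1;
    }

    // Scans every line and collects "line N, col M" for each suspect line.
    static List<String> scan(BufferedReader in) throws IOException {
        List<String> problems = new ArrayList<>();
        String line;
        int lineNo = 0;
        while ((line = in.readLine()) != null) {
            lineNo++;
            int col = firstBadColumn(line);
            if (col > 0) {
                problems.add("line " + lineNo + ", col " + col);
            }
        }
        return problems;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java IriCharScan <file.nq>");
            return;
        }
        try (BufferedReader in = Files.newBufferedReader(
                Paths.get(args[0]), StandardCharsets.UTF_8)) {
            scan(in).forEach(System.out::println);
        }
    }
}
```

Run it against the extracted file, e.g. `java IriCharScan /home/samita/data.nq`. If it prints any lines, those quads would need to be repaired or dropped before loading; Jena's own `riot` command, as ajs6f suggested, remains the authoritative validity check.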