I'm glad you got what you wanted, but you should also be aware that if you're just trying to load RDF into a TDB instance, there is no need to write Java code at all. The tdbloader and tdbloader2 CLI utilities work very well for that.
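Something along these lines is a minimal sketch of that route, assuming the Jena command-line tools (`riot`, `tdbloader`) are on your PATH. The function name `load_nquads` is made up for illustration, and the paths in the usage comment are the ones from the code quoted below.

```shell
# Sketch only: validate an N-Quads file, then bulk-load it into a TDB1 database.
# Assumes Jena's command-line tools (riot, tdbloader) are installed and on PATH.
# load_nquads is a hypothetical helper name, not something that ships with Jena.
#
# Usage: load_nquads /home/samita/data.nq /home/samita/Load_TDB
load_nquads() {
    data="$1"   # path to the .nq file
    db="$2"     # TDB database directory

    # Fail early with distinct codes for a missing file or missing tools.
    [ -f "$data" ] || { echo "no such file: $data" >&2; return 2; }
    command -v riot >/dev/null 2>&1 || { echo "riot not on PATH" >&2; return 3; }
    command -v tdbloader >/dev/null 2>&1 || { echo "tdbloader not on PATH" >&2; return 3; }

    # Parse without loading; stop here if riot exits with a failure status.
    riot --validate "$data" || return 1

    # Bulk-load the file into the database directory.
    tdbloader --loc "$db" "$data"
}
```

For a TDB2 database, `tdb2.tdbloader --loc` would take the place of `tdbloader --loc`.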
ajs6f

> On Apr 17, 2018, at 1:03 AM, Samita Bai / PhD CS Scholar @ City Campus
> <[email protected]> wrote:
>
> Dear Andy & Adam,
>
> Thanks a lot for the help, I got my code running finally. I just caught the
> RiotException, that was all that was needed. Feeling so happy.
>
> I really appreciate your time and efforts :)
>
> Best regards,
>
> Samita Bai
>
> ________________________________
> From: Samita Bai / PhD CS Scholar @ City Campus <[email protected]>
> Sent: 17 April 2018 02:13:32
> To: [email protected]
> Subject: Re: TDB 2 Store Parameters
>
> Dear Andy,
>
> I downloaded the same dataset from the link you gave, i.e.
>
> http://swse.deri.org/dyldo/data/2016-03-27/data.nq.gz
>
> Then I extracted it and ran the following code
>
> public class ReadQuadInJena {
>
>     public static void main(String[] args) {
>         // TODO Auto-generated method stub
>         TDBLoader tlobj = new TDBLoader();
>         String Ds = "/home/samita/data.nq";
>         Location location = Location.create("/home/samita/Load_TDB");
>         DatasetGraphTDB dgtdb = DatasetBuilderStd.create(location);
>         try {
>             InputStream is = new FileInputStream(Ds);
>             tlobj.loadDataset(dgtdb, is);
>         } catch (FileNotFoundException e) {}
>     }
> }
>
> It ended up with this error.
>
> Exception in thread "main" org.apache.jena.riot.RiotException: [line: 30506,
> col: 232] Illegal character in IRI (codepoint 0x7C, '|'):
> <http://fonts.googleapis.com/css?family=Nunito[|]...>
>     at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.fatal(ErrorHandlerFactory.java:147)
>     at org.apache.jena.riot.lang.LangEngine.raiseException(LangEngine.java:148)
>     at org.apache.jena.riot.lang.LangEngine.nextToken(LangEngine.java:105)
>     at org.apache.jena.riot.lang.LangNQuads.parseOne(LangNQuads.java:67)
>     at org.apache.jena.riot.lang.LangNQuads.runParser(LangNQuads.java:54)
>     at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:41)
>     at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:195)
>     at org.apache.jena.riot.RDFParser.read(RDFParser.java:334)
>     at org.apache.jena.riot.RDFParser.parseNotUri(RDFParser.java:324)
>     at org.apache.jena.riot.RDFParser.parse(RDFParser.java:273)
>     at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:498)
>     at org.apache.jena.riot.RDFDataMgr.parseFromInputStream(RDFDataMgr.java:870)
>     at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:693)
>     at org.apache.jena.tdb.store.bulkloader.BulkLoader.loadQuads$(BulkLoader.java:152)
>     at org.apache.jena.tdb.store.bulkloader.BulkLoader.loadDataset(BulkLoader.java:115)
>     at org.apache.jena.tdb.TDBLoader.loadDataset$(TDBLoader.java:256)
>     at org.apache.jena.tdb.TDBLoader.loadDataset(TDBLoader.java:191)
>     at ldbqPack.ReadQuadInJena.main(ReadQuadInJena.java:47)
>
> If it ran fine at your end, what is wrong with my code? Please help me.
>
> ________________________________
> From: Andy Seaborne <[email protected]>
> Sent: 16 April 2018 22:13:36
> To: [email protected]
> Subject: Re: TDB 2 Store Parameters
>
> I downloaded
>
> http://swse.deri.org/dyldo/data/2016-03-27/data.nq.gz
>
> (the latest I could find)
>
> and used tdbloader.
>
> Is that the data you are using?
>
> Andy
>
> On 16/04/18 17:32, ajs6f wrote:
>> You should be able to check the validity of any of your files just by
>> running them through Jena's `riot` command.
>>
>> You can try loading them into a TDB1 or TDB2 db by using the `tdbloader` or
>> `tdb2.tdbloader` commands.
>>
>> ajs6f
>>
>>> On Apr 16, 2018, at 12:28 PM, Samita Bai / PhD CS Scholar @ City Campus
>>> <[email protected]> wrote:
>>>
>>> OK Andy I got your point. Can you please share the code that you used to
>>> read the Dynamic Linked Data Observatory dataset?
>>>
>>> Regards,
>>>
>>> Samita Bai
>>>
>>> ________________________________
>>> From: Andy Seaborne <[email protected]>
>>> Sent: 16 April 2018 15:34:07
>>> To: [email protected]
>>> Subject: Re: TDB 2 Store Parameters
>>>
>>> If you wish to process the data as it is parsed, then see StreamRDF and
>>> either
>>>
>>> NxParser, which is not part of Jena, is not a validating parser.
>>>
>>> If the data is not valid, then you will have problems at some point,
>>> either loading, querying or outputting later.
>>>
>>> Adam has explained that TDB2 indexes heavily so that querying is well
>>> served.
>>>
>>> We can't help with the parser errors without knowing what they are.
>>>
>>> Which files from Dynamic Linked Data Observatory are you processing?
>>> Don't the later ones replace the earlier ones?
>>>
>>> I found that the last n-quads file was 42 million triples and all valid.
>>>
>>> Andy
>>>
>>> On 16/04/18 11:05, ajs6f wrote:
>>>> If there are syntax errors in your RDF (and it sounds like that is why
>>>> Jena will not read it directly) you are doing yourself no service by
>>>> taking unusual pains to force TDB to ingest your data.
>>>>
>>>> Please show us the errors that Jena is throwing trying to read your data
>>>> and an appropriate sample of the data in question.
>>>>
>>>> ajs6f
>>>>
>>>>> On Apr 16, 2018, at 4:42 AM, Samita Bai / PhD CS Scholar @ City Campus
>>>>> <[email protected]> wrote:
>>>>>
>>>>> In addition to my previous query: it is taking a lot of time to first
>>>>> parse the dataset using NXParser, then check the object, then create the
>>>>> quad again and store it in TDB. It could be very simple if we could take
>>>>> the quad, check its object and insert it in TDB.
>>>>>
>>>>> But Jena is not helping me with this 😞
>>>>>
>>>>> So I have to create quads again and store them in TDB.
>>>>>
>>>>> Any help is surely appreciated.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Samita Bai
>>>>>
>>>>> ________________________________
>>>>> From: Samita Bai / PhD CS Scholar @ City Campus
>>>>> Sent: 16 April 2018 13:33:51
>>>>> To: [email protected]
>>>>> Subject: Re: TDB 2 Store Parameters
>>>>>
>>>>> Thank you Andy and Adam for the help. Actually, I am just indexing the
>>>>> quads where the object is either a literal or a foreign URI (i.e. an
>>>>> object belonging to a different dataset than the subject). I am using
>>>>> NXParser (as Jena is giving various parsing errors) to parse the dataset
>>>>> and then I am storing in TDB2 in the following manner.
>>>>>
>>>>> public void SetQuadsList(String sub, String pred, String obj, String context) {
>>>>>     Node subjects = NodeFactory.createURI(sub);
>>>>>     Node objects = NodeFactory.createURI(obj);
>>>>>     Node contexts = NodeFactory.createURI(context);
>>>>>     //Node rdfSeeAlso = RDFS.seeAlso.asNode();
>>>>>     Node predicates = NodeFactory.createURI(pred);
>>>>>     //Quad quads = Quad.create(contexts, objects, rdfSeeAlso, subjects);
>>>>>     Quad quads = Quad.create(contexts, subjects, predicates, objects);
>>>>>     QuadList.add(quads);
>>>>>     //System.out.println("Number of backlinks:" + QuadList.size());
>>>>>     //System.out.println("quad written");
>>>>>     //System.out.println("Quad"+quads.toString());
>>>>> }
>>>>>
>>>>> public List<Quad> GetQuadsList() {
>>>>>     return QuadList;
>>>>> }
>>>>>
>>>>> public void QuadsToTDB(List<Quad> quadList) {
>>>>>     final String DATASET_DIR_NAME = "DyLDO1000K_Index";
>>>>>     Dataset dataset = TDB2Factory.connectDataset ( DATASET_DIR_NAME );
>>>>>
>>>>>     dataset.begin ( ReadWrite.WRITE );
>>>>>     try {
>>>>>         DatasetGraph dsg = dataset.asDatasetGraph();
>>>>>         Iterator<Quad> quads = quadList.iterator();
>>>>>         System.out.println("Size of Quad List: "+quadList.size());
>>>>>         while ( quads.hasNext() ) {
>>>>>             //System.out.println("here");
>>>>>             Quad quad = quads.next();
>>>>>             dsg.add(quad);
>>>>>             //System.out.println(quad.toString()+ "added");
>>>>>             //RDFDataMgr.writeQuads(System.out, quads);
>>>>>             // RDFDataMgr.write(System.out, dsg, Lang.NQUADS);
>>>>>         }
>>>>>         System.out.println("dsg created of size "+dsg.size());
>>>>>         //RDFDataMgr.write(System.out, dsg, Lang.NQUADS);
>>>>>         System.out.println("written dsg using datamgr.");
>>>>>
>>>>>         //System.out.println(dataset.isEmpty());
>>>>>         //RDFDataMgr.write(System.out, dsg, Lang.NQUADS);
>>>>>         dataset.commit();
>>>>>
>>>>>         System.out.println("committed dataset.");
>>>>>     } catch ( Exception e ) {
>>>>>         e.printStackTrace(System.err);
>>>>>         //dataset.abort();
>>>>>     } finally {
>>>>>         //RDFDataMgr.write(System.out, dsg, Lang.NQUADS);
>>>>>         dataset.end();
>>>>>     }
>>>>>     System.out.println("end method.");
>>>>> }
>>>>> }
>>>>>
>>>>> I have indexed 40,000 files (as I have split the dataset into files
>>>>> according to context) and the index size has become 120 GB. I have a
>>>>> total of 1,35,600 files whose total size is 19.8 GB only.
>>>>>
>>>>> Why is TDB making such a BIG index? I am confused :( Is there any
>>>>> problem in my code?
>>>>>
>>>>> Please suggest if there can be some improvements.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Samita Bai
>>>>>
>>>>> ________________________________
>>>>> From: ajs6f <[email protected]>
>>>>> Sent: 15 April 2018 03:07:59
>>>>> To: [email protected]
>>>>> Subject: Re: TDB 2 Store Parameters
>>>>>
>>>>> 42 million quads is nothing like so many that either TDB version should
>>>>> have any problem doing normal indexing (assuming very little in the way
>>>>> of hardware-- I ingest datasets like that on my laptop all the time).
>>>>>
>>>>> Do you have some extraordinary hardware limitations?
>>>>>
>>>>> Adam
>>>>>
>>>>>> On Apr 14, 2018, at 11:42 AM, Andy Seaborne <[email protected]> wrote:
>>>>>>
>>>>>> Hi Samita,
>>>>>>
>>>>>> Firstly - as Adam points out - if there are no indexes then access to
>>>>>> the data will be very slow. For a GSPO index, that means queries must
>>>>>> be "GRAPH <uri> { ... }" and probably "GRAPH <uri> { <fixedSubject>.. }".
>>>>>>
>>>>>> GSPO means lookup by G then S within those G and the same for P then O.
>>>>>>
>>>>>> I looked at the data and it seems to be about 42 million quads.
>>>>>>
>>>>>> Using TDB1 (the loader is faster at this scale currently) is likely to
>>>>>> be a better choice.
>>>>>>
>>>>>> Looking at StoreParams in TDB2:
>>>>>>
>>>>>> The code below creates the database at TDB2Factory.connectDataset so any
>>>>>> StoreParams after that do not affect indexing.
>>>>>>
>>>>>> I tried to make it work in the release but the code ignores provided
>>>>>> StoreParams - sorry. Even if it did work, it hits a test to make sure
>>>>>> there is basic indexing (Adam's point).
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> On 13/04/18 13:42, Samita Bai / PhD CS Scholar @ City Campus wrote:
>>>>>>> I wrote the following code to build only one type of triple and quad
>>>>>>> index but it is still creating all indexes 😞
>>>>>>>
>>>>>>> package ldbqPack;
>>>>>>>
>>>>>>> import org.apache.jena.query.Dataset;
>>>>>>> import org.apache.jena.tdb2.TDB2Factory;
>>>>>>> import org.apache.jena.tdb2.setup.StoreParams;
>>>>>>> import org.apache.jena.tdb2.sys.DatabaseConnection;
>>>>>>> import org.apache.jena.dboe.base.block.FileMode;
>>>>>>> import org.apache.jena.dboe.base.file.Location;
>>>>>>> import org.apache.jena.tdb2.setup.StoreParamsFactory;
>>>>>>>
>>>>>>> public class StrPrms {
>>>>>>>     static String[] tindexes = {"SPO"};
>>>>>>>     static String[] qindexes = {"GSPO"};
>>>>>>>     static String[] pindexes = {"GPU"};
>>>>>>>
>>>>>>>     static final StoreParams pApp = StoreParams.builder()
>>>>>>>         .blockSize(12)          // Not dynamic
>>>>>>>         .nodeMissCacheSize(12)  // Dynamic
>>>>>>>         .build();
>>>>>>>
>>>>>>>     static final StoreParams pLoc = StoreParams.builder()
>>>>>>>         .blockSize(0)
>>>>>>>         .nodeMissCacheSize(0).build();
>>>>>>>
>>>>>>>     static final StoreParams pDft = StoreParams.builder()
>>>>>>>         .fileMode(FileMode.mapped)
>>>>>>>         .blockSize(8192)
>>>>>>>         .blockReadCacheSize(5000)
>>>>>>>         .blockWriteCacheSize(1000)
>>>>>>>         .node2NodeIdCacheSize(200000)
>>>>>>>         .nodeId2NodeCacheSize(750000)
>>>>>>>         .nodeMissCacheSize(1000)
>>>>>>>         .nodeTableBaseName("nodes")
>>>>>>>         .primaryIndexTriples("SPO")
>>>>>>>         .tripleIndexes(tindexes)
>>>>>>>         .primaryIndexQuads("GSPO")
>>>>>>>         .quadIndexes(qindexes)
>>>>>>>         .prefixTableBaseName("prefixes")
>>>>>>>         .primaryIndexPrefix("GPU")
>>>>>>>         .prefixIndexes(pindexes)
>>>>>>>         .build();
>>>>>>>
>>>>>>>     public static void main(String[] args) {
>>>>>>>         // TODO Auto-generated method stub
>>>>>>>         final String DATASET_DIR_NAME = "DyLDO100";
>>>>>>>         Dataset dataset = TDB2Factory.connectDataset ( DATASET_DIR_NAME );
>>>>>>>         Location location = Location.create(DATASET_DIR_NAME);
>>>>>>>         StoreParams custom_params =
>>>>>>>             StoreParamsFactory.decideStoreParams(location, true, pApp, pLoc, pDft);
>>>>>>>         DatabaseConnection.connectCreate(location, custom_params);
>>>>>>>         StoreParams params = StoreParams.getSmallStoreParams();
>>>>>>>         System.out.println(params);
>>>>>>>     }
>>>>>>> }
>>>>>>>
>>>>>>> Please help.
>>>>>>> Regards,
>>>>>>> Samita Bai
>>>>>>> ________________________________
>>>>>>> P : Please consider the environment before printing this e-mail
>>>>>>> ________________________________
>>>>>>> CONFIDENTIALITY / DISCLAIMER NOTICE: This e-mail and any attachments
>>>>>>> may contain confidential and privileged information. If you are not the
>>>>>>> intended recipient, please notify the sender immediately by return
>>>>>>> e-mail, delete this e-mail and destroy any copies. Any dissemination or
>>>>>>> use of this information by a person other than the intended recipient
>>>>>>> is unauthorized and may be illegal.
>>>>>>> ________________________________
