I think I have to download the file first, as it is in nq.gz format. I wrote
the following code, which gave me an exception:
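import org.apache.jena.tdb.TDBLoader;
import org.apache.jena.tdb.base.file.Location;
import org.apache.jena.tdb.setup.DatasetBuilderStd;
import org.apache.jena.tdb.store.DatasetGraphTDB;

public class ReadQuadInJena {
    public static void main(String[] args) {
        // Remote, gzip-compressed N-Quads dump
        String URL = "http://swse.deri.org/dyldo/data/2016-03-27/data.nq.gz";
        // Directory of the TDB database to load into
        Location location = Location.create("/home/samita/TDBLoaded");
        DatasetGraphTDB dgtdb = DatasetBuilderStd.create(location);
        TDBLoader.load(dgtdb, URL);
    }
}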
The URL is not valid because it is in the nq.gz format.
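
For reference, a gzip-compressed stream can be decompressed with the JDK's own java.util.zip.GZIPInputStream before it is handed to a parser. The following is only a sketch under stated assumptions: the class name GunzipSketch, its helper methods, and the in-memory sample quad are made up for illustration, and the code does not call Jena at all.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, not part of Jena or of the code in this thread.
public class GunzipSketch {

    // Compress a string to gzip bytes (only used here to build a sample input).
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    // Wrap a gzip-compressed stream and return the decompressed bytes as UTF-8 text.
    static String gunzip(InputStream compressed) throws IOException {
        ByteArrayOutputStream plain = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(compressed)) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                plain.write(chunk, 0, n);
            }
        }
        return new String(plain.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample quad standing in for one line of data.nq.
        String quad = "<http://example.org/s> <http://example.org/p> \"o\" <http://example.org/g> .\n";
        byte[] compressed = gzip(quad);
        System.out.print(gunzip(new ByteArrayInputStream(compressed)));
    }
}
```

For a real dump, the ByteArrayInputStream would be replaced by a stream opened on the downloaded file, and a file of this size would be copied to disk chunk by chunk rather than collected into one String.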
________________________________
From: Andy Seaborne <[email protected]>
Sent: 17 April 2018 01:49:06
To: [email protected]
Subject: Re: TDB 2 Store Parameters
Corrupt input file or the input file has binary in it.
(it means the input is not legal UTF-8)
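
To make "not legal UTF-8" concrete: gzip data starts with the bytes 0x1f 0x8b, and 0x8b cannot begin a UTF-8 character, so compressed bytes fed to a strict UTF-8 decoder fail almost immediately. Whether that is what happened in the run quoted below is an assumption, not something the thread confirms. A minimal JDK-only sketch (the class name Utf8CheckSketch and its methods are hypothetical) of both checks:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Jena.
public class Utf8CheckSketch {

    // True if the bytes start with the two-byte gzip magic number 0x1f 0x8b.
    static boolean looksGzipped(byte[] data) {
        return data.length >= 2
                && (data[0] & 0xff) == 0x1f
                && (data[1] & 0xff) == 0x8b;
    }

    // True if the bytes decode as strict UTF-8; malformed input is reported
    // as an exception instead of being silently replaced.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Made-up inputs: a plain-text IRI and the first bytes of a gzip header.
        byte[] text = "<http://example.org/s>".getBytes(StandardCharsets.UTF_8);
        byte[] gzipHeader = {(byte) 0x1f, (byte) 0x8b, 0x08, 0x00};
        System.out.println(looksGzipped(text) + " " + isValidUtf8(text));
        System.out.println(looksGzipped(gzipHeader) + " " + isValidUtf8(gzipHeader));
    }
}
```

Running the same two checks on the first few bytes of the actual input would show whether the parser was given compressed data or text that is genuinely corrupt.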
On 16/04/18 21:20, Samita Bai / PhD CS Scholar @ City Campus wrote:
>
> Dear Andy,
>
>
>
> I got the following exception with TDBLoader for the latest dataset as you
> said.
>
>
> Exception in thread "main" org.apache.jena.atlas.RuntimeIOException:
> java.nio.charset.MalformedInputException: Input length = 1
> at org.apache.jena.atlas.io.IO.exception(IO.java:233)
> at
> org.apache.jena.atlas.io.CharStreamBuffered$SourceReader.fill(CharStreamBuffered.java:77)
> at
> org.apache.jena.atlas.io.CharStreamBuffered.fillArray(CharStreamBuffered.java:154)
> at
> org.apache.jena.atlas.io.CharStreamBuffered.advance(CharStreamBuffered.java:137)
> at org.apache.jena.atlas.io.PeekReader.advanceAndSet(PeekReader.java:235)
> at org.apache.jena.atlas.io.PeekReader.init(PeekReader.java:229)
> at org.apache.jena.atlas.io.PeekReader.peekChar(PeekReader.java:151)
> at org.apache.jena.atlas.io.PeekReader.makeUTF8(PeekReader.java:92)
> at
> org.apache.jena.riot.tokens.TokenizerFactory.makeTokenizerUTF8(TokenizerFactory.java:48)
> at org.apache.jena.riot.lang.RiotParsers.createParser(RiotParsers.java:57)
> at
> org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:194)
> at org.apache.jena.riot.RDFParser.read(RDFParser.java:334)
> at org.apache.jena.riot.RDFParser.parseURI(RDFParser.java:303)
> at org.apache.jena.riot.RDFParser.parse(RDFParser.java:277)
> at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:498)
> at org.apache.jena.riot.RDFDataMgr.parseFromURI(RDFDataMgr.java:890)
> at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:680)
> at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:649)
> at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:637)
> at
> org.apache.jena.tdb.store.bulkloader.BulkLoader.loadQuads$(BulkLoader.java:143)
> at
> org.apache.jena.tdb.store.bulkloader.BulkLoader.loadDataset(BulkLoader.java:109)
> at org.apache.jena.tdb.TDBLoader.loadDataset$(TDBLoader.java:252)
> at org.apache.jena.tdb.TDBLoader.loadDataset(TDBLoader.java:184)
> at org.apache.jena.tdb.TDBLoader.load(TDBLoader.java:74)
> at org.apache.jena.tdb.TDBLoader.load(TDBLoader.java:53)
> at org.apache.jena.tdb.TDBLoader.load(TDBLoader.java:44)
> at ldbqPack.ReadQuadInJena.main(ReadQuadInJena.java:42)
> Caused by: java.nio.charset.MalformedInputException: Input length = 1
> at java.base/java.nio.charset.CoderResult.throwException(CoderResult.java:281)
> at java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
> at java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
> at java.base/java.io.InputStreamReader.read(InputStreamReader.java:185)
> at java.base/java.io.Reader.read(Reader.java:140)
> ... 26 more
>
>
>
> ________________________________
> From: ajs6f <[email protected]>
> Sent: 17 April 2018 00:24:12
> To: [email protected]
> Subject: Re: TDB 2 Store Parameters
>
> This appears to be a plain problem in the data. The character "|" should be
> %-escaped. Have you talked with the data providers to figure out why the data
> is invalid? You don't show where this triple comes from, but since Andy had
> no problem loading a more recent data set from the same provider, perhaps you
> can just try that.
>
> Parsing in Jena intentionally defaults to rejecting invalid RDF. That's by
> far the safest approach for a library system like Jena. You can catch an
> exception and ignore the invalid data, and if that works for your
> application, good, or you can try to take some more sophisticated approach.
> But in any event you'll generally be well-advised to clean up the data
> _before_ it goes into your application. For one thing, Jena's tools (e.g.
> tdbloader) expect valid data.
>
> As for your code terminating, you don't show your code with a try-catch, so
> we can't help you very well.
>
> Adam
>
>> On Apr 16, 2018, at 1:50 PM, Samita Bai / PhD CS Scholar @ City Campus
>> <[email protected]> wrote:
>>
>> Even though I am using try-catch to catch RiotException, my code still
>> terminates on this exception 😞
>>
>>
>> Regards,
>>
>> Samita Bai
>>
>> ________________________________
>> From: Samita Bai / PhD CS Scholar @ City Campus
>> Sent: 16 April 2018 22:32:26
>> To: [email protected]
>> Subject: Re: TDB 2 Store Parameters
>>
>>
>> Yes, I am using the same data, but from Feb 2018, as I started experimenting
>> at that time. For example, for the following piece of code I am getting the
>> error shown below.
>>
>>
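>> import java.util.Iterator;
>> import org.apache.jena.riot.RDFDataMgr;
>> import org.apache.jena.sparql.core.DatasetGraph;
>> import org.apache.jena.sparql.core.Quad;
>>
>> public class ReadQuadInJena {
>>
>>     public static void main(String[] args) {
>>         String FileName = "/home/samita/Dyldo_DS_4Feb2018/data.nq";
>>         // Parse the whole N-Quads file into an in-memory dataset graph
>>         DatasetGraph dsg = RDFDataMgr.loadDatasetGraph(FileName);
>>         //System.out.println(node);
>>         // Print every quad in the dataset
>>         Iterator<Quad> iterQuad = dsg.find();
>>         while (iterQuad.hasNext()) {
>>             System.out.println(iterQuad.next());
>>         }
>>     }
>> }
>>
>>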
>> Exception in thread "main" org.apache.jena.riot.RiotException: [line: 89841,
>> col: 232] Illegal character in IRI (codepoint 0x7C, '|'):
>> <http://fonts.googleapis.com/css?family=Nunito[|]...>
>> at
>> org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.fatal(ErrorHandlerFactory.java:147)
>> at org.apache.jena.riot.lang.LangEngine.raiseException(LangEngine.java:148)
>> at org.apache.jena.riot.lang.LangEngine.nextToken(LangEngine.java:105)
>> at org.apache.jena.riot.lang.LangNQuads.parseOne(LangNQuads.java:67)
>> at org.apache.jena.riot.lang.LangNQuads.runParser(LangNQuads.java:54)
>> at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:41)
>> at
>> org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:195)
>> at org.apache.jena.riot.RDFParser.read(RDFParser.java:334)
>> at org.apache.jena.riot.RDFParser.parseURI(RDFParser.java:303)
>> at org.apache.jena.riot.RDFParser.parse(RDFParser.java:277)
>> at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:498)
>> at org.apache.jena.riot.RDFDataMgr.parseFromURI(RDFDataMgr.java:890)
>> at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:519)
>> at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:486)
>> at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:439)
>> at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:419)
>> at org.apache.jena.riot.RDFDataMgr.loadDatasetGraph(RDFDataMgr.java:392)
>> at ldbqPack.ReadQuadInJena.main(ReadQuadInJena.java:19)
>>
>>
>>
>> ________________________________
>> From: Andy Seaborne <[email protected]>
>> Sent: 16 April 2018 22:13:36
>> To: [email protected]
>> Subject: Re: TDB 2 Store Parameters
>>
>> I downloaded
>>
>> http://swse.deri.org/dyldo/data/2016-03-27/data.nq.gz
>>
>> (the latest I could find)
>>
>> and used tdbloader.
>>
>> Is that the data you are using?
>>
>> Andy
>>
>> On 16/04/18 17:32, ajs6f wrote:
>>> You should be able to check the validity of any of your files just by
>>> running them through Jena's `riot` command.
>>>
>>> You can try loading them into a TDB1 or TDB2 db by using the `tdbloader` or
>>> `tdb2.tdbloader` commands.
>>>
>>> ajs6f
>>>
>>>> On Apr 16, 2018, at 12:28 PM, Samita Bai / PhD CS Scholar @ City Campus
>>>> <[email protected]> wrote:
>>>>
>>>> OK Andy I got your point. Can you please share the code that you used to
>>>> read the Dynamic Linked Data Observatory dataset?
>>>>
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Samita Bai
>>>>
>>>> ________________________________
>>>> From: Andy Seaborne <[email protected]>
>>>> Sent: 16 April 2018 15:34:07
>>>> To: [email protected]
>>>> Subject: Re: TDB 2 Store Parameters
>>>>
>>>> If you wish to process the data as it is parsed, then see StreamRDF and
>>>> either
>>>>
>>>> NxParser, which is not part of Jena, is not a validating parser.
>>>>
>>>> If the data is not valid, then you will have problems at some point,
>>>> either loading, querying or outputting later.
>>>>
>>>> Adam has explained that TDB2 indexes heavily so that querying is well
>>>> served.
>>>>
>>>> We can't help with the parser errors without knowing what they are.
>>>>
>>>> Which files from Dynamic Linked Data Observatory are you processing?
>>>> Don't the later ones replace the earlier ones?
>>>>
>>>> I found that the last n-quads file was 42 million triples and all valid.
>>>>
>>>> Andy
>>>>
>>>> On 16/04/18 11:05, ajs6f wrote:
>>>>> If there are syntax errors in your RDF (and it sounds like that is why
>>>>> Jena will not read it directly) you are doing yourself no service by
>>>>> taking unusual pains to force TDB to ingest your data.
>>>>>
>>>>> Please show us the errors that Jena is throwing trying to read your data
>>>>> and an appropriate sample of the data in question.
>>>>>
>>>>>
>>>>> ajs6f
>>>>>
>>>>>> On Apr 16, 2018, at 4:42 AM, Samita Bai / PhD CS Scholar @ City Campus
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> In addition to my previous query: it is taking a lot of time to first
>>>>>> parse the dataset using NXParser, then check the object, and then
>>>>>> create the quad again and store it in TDB. It would be much simpler if
>>>>>> we could take each quad, check its object, and insert it into TDB.
>>>>>>
>>>>>>
>>>>>> But Jena is not helping me with this 😞
>>>>>>
>>>>>>
>>>>>> So I have to create the quads again and store them in TDB.
>>>>>>
>>>>>>
>>>>>> Any help is surely appreciated.
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Samita Bai
>>>>>>
>>>>>> ________________________________
>>>>>> From: Samita Bai / PhD CS Scholar @ City Campus
>>>>>> Sent: 16 April 2018 13:33:51
>>>>>> To: [email protected]
>>>>>> Subject: Re: TDB 2 Store Parameters
>>>>>>
>>>>>>
>>>>>> Thank you Andy and Adam for the help. Actually, I am just indexing the
>>>>>> quads where object is either literal or foreign URI (i.e. Object
>>>>>> belonging to different dataset than subject), I am using NXParser (as
>>>>>> Jena is giving various parsing errors) to parse the dataset and then I
>>>>>> am storing in TDB2 in the following manner.
>>>>>>
>>>>>>
>>>>>>
>>>>>> public void SetQuadsList(String sub, String pred, String obj, String context) {
>>>>>>     Node subjects = NodeFactory.createURI(sub);
>>>>>>     Node objects = NodeFactory.createURI(obj);
>>>>>>     Node contexts = NodeFactory.createURI(context);
>>>>>>     //Node rdfSeeAlso = RDFS.seeAlso.asNode();
>>>>>>     Node predicates = NodeFactory.createURI(pred);
>>>>>>     //Quad quads = Quad.create(contexts, objects, rdfSeeAlso, subjects);
>>>>>>     Quad quads = Quad.create(contexts, subjects, predicates, objects);
>>>>>>     QuadList.add(quads);
>>>>>>     //System.out.println("Number of backlinks:" + QuadList.size());
>>>>>>     //System.out.println("quad written");
>>>>>>     //System.out.println("Quad"+quads.toString());
>>>>>> }
>>>>>>
>>>>>> public List<Quad> GetQuadsList() {
>>>>>>     return QuadList;
>>>>>> }
>>>>>>
>>>>>> public void QuadsToTDB(List<Quad> quadList) {
>>>>>>     final String DATASET_DIR_NAME = "DyLDO1000K_Index";
>>>>>>     Dataset dataset = TDB2Factory.connectDataset(DATASET_DIR_NAME);
>>>>>>
>>>>>>     dataset.begin(ReadWrite.WRITE);
>>>>>>     try {
>>>>>>         DatasetGraph dsg = dataset.asDatasetGraph();
>>>>>>         Iterator<Quad> quads = quadList.iterator();
>>>>>>         System.out.println("Size of Quad List: " + quadList.size());
>>>>>>         while (quads.hasNext()) {
>>>>>>             //System.out.println("here");
>>>>>>             Quad quad = quads.next();
>>>>>>             dsg.add(quad);
>>>>>>             //System.out.println(quad.toString()+ "added");
>>>>>>             //RDFDataMgr.writeQuads(System.out, quads);
>>>>>>             // RDFDataMgr.write(System.out, dsg, Lang.NQUADS);
>>>>>>         }
>>>>>>         System.out.println("dsg created of size " + dsg.size());
>>>>>>         //RDFDataMgr.write(System.out, dsg, Lang.NQUADS);
>>>>>>         System.out.println("written dsg using datamgr.");
>>>>>>
>>>>>>         //System.out.println(dataset.isEmpty());
>>>>>>         //RDFDataMgr.write(System.out, dsg, Lang.NQUADS);
>>>>>>         dataset.commit();
>>>>>>         System.out.println("committed dataset.");
>>>>>>     } catch (Exception e) {
>>>>>>         e.printStackTrace(System.err);
>>>>>>         //dataset.abort();
>>>>>>     } finally {
>>>>>>         //RDFDataMgr.write(System.out, dsg, Lang.NQUADS);
>>>>>>         dataset.end();
>>>>>>     }
>>>>>>     System.out.println("end method.");
>>>>>> }
>>>>>> }
>>>>>>
>>>>>>
>>>>>> I have indexed 40,000 files (as I have split the dataset into files
>>>>>> according to context) and the index size has become 120 GB. I have a
>>>>>> total of 135,600 files whose total size is only 19.8 GB.
>>>>>>
>>>>>>
>>>>>> Why is TDB creating such a big index? I am confused :( Is there any
>>>>>> problem in my code?
>>>>>>
>>>>>>
>>>>>> Please suggest any improvements.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Samita Bai
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> ________________________________
>>>>>> From: ajs6f <[email protected]>
>>>>>> Sent: 15 April 2018 03:07:59
>>>>>> To: [email protected]
>>>>>> Subject: Re: TDB 2 Store Parameters
>>>>>>
>>>>>> 42 million quads is nothing like so many that either TDB version should
>>>>>> have any problem doing normal indexing (assuming very little in the way
>>>>>> of hardware-- I ingest datasets like that on my laptop all the time).
>>>>>>
>>>>>> Do you have some extraordinary hardware limitations?
>>>>>>
>>>>>> Adam
>>>>>>
>>>>>>> On Apr 14, 2018, at 11:42 AM, Andy Seaborne <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi Samita,
>>>>>>>
>>>>>>> Firstly - as Adam points out - if there are no indexes then access to
>>>>>>> the data will be very slow. For a GSPO index, that means queries
>>>>>>> must be "GRAPH <uri> { ... }" and probably "GRAPH <uri> {
>>>>>>> <fixedSubject>.. }".
>>>>>>>
>>>>>>> GSPO means lookup by G then S within those G and the same for P then O.
>>>>>>>
>>>>>>> I looked at the data and it seems to be about 42 million quads.
>>>>>>>
>>>>>>> Using TDB1 (the loader is faster at this scale currently) is likely to
>>>>>>> be a better choice.
>>>>>>>
>>>>>>> Looking at StoreParams in TDB2:
>>>>>>>
>>>>>>> The code below creates the database at TDB2Factory.connectDataset so
>>>>>>> any StoreParams after that do not affect indexing.
>>>>>>>
>>>>>>> I tried to make it work in the release but the code ignores provided
>>>>>>> StoreParams - sorry. Even if it did work, it hits a test to make sure
>>>>>>> there is basic indexing (Adam's point).
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>>
>>>>>>> On 13/04/18 13:42, Samita Bai / PhD CS Scholar @ City Campus wrote:
>>>>>>>> I wrote the following code to build only one type of triple and quad
>>>>>>>> index but it is still creating all indexes 😞
>>>>>>>> package ldbqPack;
>>>>>>>>
>>>>>>>> import org.apache.jena.query.Dataset;
>>>>>>>> import org.apache.jena.tdb2.TDB2Factory;
>>>>>>>> import org.apache.jena.tdb2.setup.StoreParams;
>>>>>>>> import org.apache.jena.tdb2.sys.DatabaseConnection;
>>>>>>>> import org.apache.jena.dboe.base.block.FileMode;
>>>>>>>> import org.apache.jena.dboe.base.file.Location;
>>>>>>>> import org.apache.jena.tdb2.setup.StoreParamsFactory;
>>>>>>>>
>>>>>>>> public class StrPrms {
>>>>>>>>     static String[] tindexes = {"SPO"};
>>>>>>>>     static String[] qindexes = {"GSPO"};
>>>>>>>>     static String[] pindexes = {"GPU"};
>>>>>>>>
>>>>>>>>     static final StoreParams pApp = StoreParams.builder()
>>>>>>>>         .blockSize(12)          // Not dynamic
>>>>>>>>         .nodeMissCacheSize(12)  // Dynamic
>>>>>>>>         .build();
>>>>>>>>
>>>>>>>>     static final StoreParams pLoc = StoreParams.builder()
>>>>>>>>         .blockSize(0)
>>>>>>>>         .nodeMissCacheSize(0)
>>>>>>>>         .build();
>>>>>>>>
>>>>>>>>     static final StoreParams pDft = StoreParams.builder()
>>>>>>>>         .fileMode(FileMode.mapped)
>>>>>>>>         .blockSize(8192)
>>>>>>>>         .blockReadCacheSize(5000)
>>>>>>>>         .blockWriteCacheSize(1000)
>>>>>>>>         .node2NodeIdCacheSize(200000)
>>>>>>>>         .nodeId2NodeCacheSize(750000)
>>>>>>>>         .nodeMissCacheSize(1000)
>>>>>>>>         .nodeTableBaseName("nodes")
>>>>>>>>         .primaryIndexTriples("SPO")
>>>>>>>>         .tripleIndexes(tindexes)
>>>>>>>>         .primaryIndexQuads("GSPO")
>>>>>>>>         .quadIndexes(qindexes)
>>>>>>>>         .prefixTableBaseName("prefixes")
>>>>>>>>         .primaryIndexPrefix("GPU")
>>>>>>>>         .prefixIndexes(pindexes)
>>>>>>>>         .build();
>>>>>>>>
>>>>>>>>     public static void main(String[] args) {
>>>>>>>>         // TODO Auto-generated method stub
>>>>>>>>         final String DATASET_DIR_NAME = "DyLDO100";
>>>>>>>>         Dataset dataset = TDB2Factory.connectDataset(DATASET_DIR_NAME);
>>>>>>>>         Location location = Location.create(DATASET_DIR_NAME);
>>>>>>>>         StoreParams custom_params =
>>>>>>>>             StoreParamsFactory.decideStoreParams(location, true, pApp, pLoc, pDft);
>>>>>>>>         DatabaseConnection.connectCreate(location, custom_params);
>>>>>>>>         StoreParams params = StoreParams.getSmallStoreParams();
>>>>>>>>         System.out.println(params);
>>>>>>>>     }
>>>>>>>> }
>>>>>>>> Please help.
>>>>>>>> Regards,
>>>>>>>> Samita Bai
>>>>>>>> ________________________________
>>>>>>>> P : Please consider the environment before printing this e-mail
>>>>>>>> ________________________________
>>>>>>>> CONFIDENTIALITY / DISCLAIMER NOTICE: This e-mail and any attachments
>>>>>>>> may contain confidential and privileged information. If you are not
>>>>>>>> the intended recipient, please notify the sender immediately by return
>>>>>>>> e-mail, delete this e-mail and destroy any copies. Any dissemination
>>>>>>>> or use of this information by a person other than the intended
>>>>>>>> recipient is unauthorized and may be illegal.
>>>>>>>> ________________________________