Re: Apache Jena tdbloader performance and limits

2020-05-21 Thread Dick Murray
I've just finished downloading the Wikidata latest-truthy.nt.gz (39G) and
decompressing it (605G) in ~10 hours on a Raspberry Pi 4 running Ubuntu 19.10,
using a USB3 1TB HDD.

I'll update you on the sort and uniq results (from memory there were not that
many duplicates).

Dick


On Wed, 20 May 2020 at 11:21, Wolfgang Fahl  wrote:

> Thank you Dick for your response.
>
> > Basically, you need hardware!
> That option is very limited with my budget; my current 64 GByte
> servers with up to 12 cores, 4 TB 7200 rpm disks and SSDs of up to 512
> GByte seem reasonable to me. I'd rather wait a bit longer than pay for
> hardware, especially with the risk of things crashing anyway.
>
> The splitting option you mention seems to be a lot of extra hassle and I
> assume this is based on the approach of "import all of WikiData".
> Currently I see that the hurdles for doing such a "full import" are very
> high. For my use case I might be able to put up with some 3-5% of
> Wikidata since I am basically interested in what
> https://www.wikidata.org/wiki/Wikidata:Scholia offers for the
>
> https://projects.tib.eu/confident/ ConfIDent project.
>
> What kind of tuning besides the hardware was effective for you?
>
> Does anybody have experience with partial dumps created by
> https://tools.wmflabs.org/wdumps/?
>
> Cheers
>
>   Wolfgang
>
> Am 20.05.20 um 11:22 schrieb Dick Murray:
> > That's a blast from the past!
> >
> > Not all of the details from that exchange are on the Jena list because
> > Laura and I took the conversation offline...
> >
> > The short story is I imported Wikidata in 3 days using an IBM 24 core
> > 512GB RAM server and 16 1TB SSDs. Swap was configured as a stripe across
> > 1TB SSDs. Any thrashing was absorbed by the 24 cores, i.e. there were
> > plenty of cycles for the OS to be doing housekeeping, and there was a lot
> > of housekeeping!
> >
> > Basically, you need hardware!
> >
> > I managed to reduce this time to a day by performing 4 imports in
> parallel.
> > This was only possible because my server could absorb this amount of
> > throughput.
> >
> > Importing in parallel resulted in 4 TDBs which were queried using a beta
> > Jena extension (known as Mosaic internally). This has its own issues
> such
> > as the requirement to de-duplicate 4 streams of quads to answer COUNT(...)
> > actions, using Java streams. This led to further work whereby
> preprocessing
> > was performed to guarantee that each quad was unique in the 4 TDBs,
> which
> > meant the .distinct() could be skipped in the stream processing.
> >
> > About a year ago I performed that same test on a Ryzen 2950X based
> system,
> > using the same disks plus 3 M.2 drives and received similar results.
> >
> > You also need to consider what bzip2 compression level was used.
> > Wikidata uses bzip2 because of its aggressive compression, i.e. they want to
> > reduce the compressed file size as much as possible.
> >
> >
> > On Wed, 20 May 2020 at 06:56, Wolfgang Fahl  wrote:
> >
> >> Dear Apache Jena users,
> >>
> >> Some 2 years ago Laura Morales and Dick Murray had an exchange on this
> >> list on how to influence the performance of
> >> tdbloader. The issue is currently of interest for me again in the
> context
> >> of trying to load some 15 billion triples from a
> >> copy of wikidata. At
> >> http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData I have
> >> documented what I am trying to accomplish
> >> and a few days ago I placed a question on stackoverflow
> >>
> https://stackoverflow.com/questions/61813248/jena-tdbloader2-performance-and-limits
> >> with the following three questions:
> >>
> >> *What is proven to speed up the import without investing into extra
> >> hardware?*
> >> e.g. splitting the files, changing VM arguments, running multiple
> >> processes ...
> >>
> >> *What explains the decreasing speed at higher numbers of triples and how
> >> can this be avoided?*
> >>
> >> *What successful multi-billion triple imports for Jena do you know of and
> >> what are the circumstances for these?*
> >>
> >> There were some 50 views on the question so far and some comments but
> there
> >> is no real hint yet on what could improve things.
> >>
> >> Especially the Java VM crashes that happened with different Java
> >> environments on the Mac OSX machine are disappointing since even with a
> >> slow speed the import would have been 

Re: Apache Jena tdbloader performance and limits

2020-05-20 Thread Dick Murray
Laura had a very specific requirement to load the whole of WikiData which I
believe is ~100GB in bz2 format.

The split isn't too complex: the uncompressed file was run through sort,
then uniq, and then split. Split was run with both -b and -l, because
some lines are very long!

These files were then loaded into separate TDBs. I have a script somewhere
which will download the bz2 file, apply the above processing, then bulk load
each output into a TDB.
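
A minimal Jena-based sketch of that splitting step (not the actual script,
which isn't shown here): it parses a large N-Triples file with RIOT and rolls
over to a new chunk file every N triples, so each chunk can be bulk loaded
into its own TDB. The chunk size and file naming are illustrative, and the
sort/uniq de-duplication described above still happens separately.

import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.jena.graph.Triple;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFParser;
import org.apache.jena.riot.system.StreamRDF;
import org.apache.jena.riot.system.StreamRDFBase;
import org.apache.jena.riot.system.StreamRDFLib;

/** Splits a large N-Triples file into fixed-size chunks for separate bulk loads. */
public class NtSplitter {
    public static void main(String[] args) {
        final String source = args[0];        // e.g. latest-truthy.nt
        final long chunkSize = 200_000_000L;  // triples per chunk (illustrative)

        StreamRDF splitter = new StreamRDFBase() {
            long count = 0;
            int chunk = 0;
            OutputStream out;
            StreamRDF sink;

            @Override public void start() { roll(); }

            @Override public void triple(Triple t) {
                if (count > 0 && count % chunkSize == 0) roll();  // start a new chunk file
                sink.triple(t);
                count++;
            }

            @Override public void finish() { closeCurrent(); }

            private void roll() {
                closeCurrent();
                try {
                    out = Files.newOutputStream(Paths.get(String.format("chunk-%04d.nt", chunk++)));
                } catch (Exception e) { throw new RuntimeException(e); }
                sink = StreamRDFLib.writer(out);  // plain N-Triples writer
                sink.start();
            }

            private void closeCurrent() {
                try {
                    if (sink != null) sink.finish();
                    if (out != null) out.close();
                } catch (Exception e) { throw new RuntimeException(e); }
            }
        };

        // Parsing also validates the input, so bad lines surface before any load starts.
        RDFParser.create().source(source).lang(Lang.NTRIPLES).parse(splitter);
    }
}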

I no longer work for the company I created this solution for; they were
importing CAD drawings as RDF, which produced billions of triples. That said,
we regularly imported 25B triples...

I do, however, now have access to 128 core 2048GB servers, so I may revisit
what can be achieved when loading triples.

On Wed, 20 May 2020 at 11:21, Wolfgang Fahl  wrote:

> Thank you Dick for your response.
>
> > Basically, you need hardware!
> That option is very limited with my budget; my current 64 GByte
> servers with up to 12 cores, 4 TB 7200 rpm disks and SSDs of up to 512
> GByte seem reasonable to me. I'd rather wait a bit longer than pay for
> hardware, especially with the risk of things crashing anyway.
>
> The splitting option you mention seems to be a lot of extra hassle and I
> assume this is based on the approach of "import all of WikiData".
> Currently I see that the hurdles for doing such a "full import" are very
> high. For my use case I might be able to put up with some 3-5% of
> Wikidata since I am basically interested in what
> https://www.wikidata.org/wiki/Wikidata:Scholia offers for the
>
> https://projects.tib.eu/confident/ ConfIDent project.
>
> What kind of tuning besides the hardware was effective for you?
>
> Does anybody have experience with partial dumps created by
> https://tools.wmflabs.org/wdumps/?
>
> Cheers
>
>   Wolfgang
>
> Am 20.05.20 um 11:22 schrieb Dick Murray:
> > That's a blast from the past!
> >
> > Not all of the details from that exchange are on the Jena list because
> > Laura and I took the conversation offline...
> >
> > The short story is I imported Wikidata in 3 days using an IBM 24 core
> > 512GB RAM server and 16 1TB SSDs. Swap was configured as a stripe across
> > 1TB SSDs. Any thrashing was absorbed by the 24 cores, i.e. there were
> > plenty of cycles for the OS to be doing housekeeping, and there was a lot
> > of housekeeping!
> >
> > Basically, you need hardware!
> >
> > I managed to reduce this time to a day by performing 4 imports in
> parallel.
> > This was only possible because my server could absorb this amount of
> > throughput.
> >
> > Importing in parallel resulted in 4 TDBs which were queried using a beta
> > Jena extension (known as Mosaic internally). This has its own issues
> such
> > as the requirement to de-duplicate 4 streams of quads to answer COUNT(...)
> > actions, using Java streams. This led to further work whereby
> preprocessing
> > was performed to guarantee that each quad was unique in the 4 TDBs,
> which
> > meant the .distinct() could be skipped in the stream processing.
> >
> > About a year ago I performed that same test on a Ryzen 2950X based
> system,
> > using the same disks plus 3 M.2 drives and received similar results.
> >
> > You also need to consider what bzip2 compression level was used.
> > Wikidata uses bzip2 because of its aggressive compression, i.e. they want to
> > reduce the compressed file size as much as possible.
> >
> >
> > On Wed, 20 May 2020 at 06:56, Wolfgang Fahl  wrote:
> >
> >> Dear Apache Jena users,
> >>
> >> Some 2 years ago Laura Morales and Dick Murray had an exchange on this
> >> list on how to influence the performance of
> >> tdbloader. The issue is currently of interest for me again in the
> context
> >> of trying to load some 15 billion triples from a
> >> copy of wikidata. At
> >> http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData I have
> >> documented what I am trying to accomplish
> >> and a few days ago I placed a question on stackoverflow
> >>
> https://stackoverflow.com/questions/61813248/jena-tdbloader2-performance-and-limits
> >> with the following three questions:
> >>
> >> *What is proven to speed up the import without investing into extra
> >> hardware?*
> >> e.g. splitting the files, changing VM arguments, running multiple
> >> processes ...
> >>
> >> *What explains the decreasing speed at higher numbers of triples and how
> >> can this be avoided?*
> >>
> >> *What successful multi-billion triple impo

Re: Apache Jena tdbloader performance and limits

2020-05-20 Thread Dick Murray
That's a blast from the past!

Not all of the details from that exchange are on the Jena list because
Laura and I took the conversation offline...

The short story is I imported Wikidata in 3 days using an IBM 24 core
512GB RAM server and 16 1TB SSDs. Swap was configured as a stripe across
1TB SSDs. Any thrashing was absorbed by the 24 cores, i.e. there were
plenty of cycles for the OS to be doing housekeeping, and there was a lot
of housekeeping!

Basically, you need hardware!

I managed to reduce this time to a day by performing 4 imports in parallel.
This was only possible because my server could absorb this amount of
throughput.

Importing in parallel resulted in 4 TDBs which were queried using a beta
Jena extension (known as Mosaic internally). This has its own issues, such
as the requirement to de-duplicate 4 streams of quads to answer COUNT(...)
actions, using Java streams. This led to further work whereby preprocessing
was performed to guarantee that each quad was unique in the 4 TDBs, which
meant the .distinct() could be skipped in the stream processing.
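
For illustration, that kind of de-duplication can be sketched with plain Java
streams over the child datasets; the class and method names are illustrative,
and transaction handling (each find should run inside a read transaction) is
omitted.

import java.util.List;

import org.apache.jena.atlas.iterator.Iter;
import org.apache.jena.graph.Node;
import org.apache.jena.sparql.core.DatasetGraph;

public class UnionCount {
    /** Count distinct quads matching a pattern across several independently loaded TDB datasets. */
    static long countDistinct(List<DatasetGraph> children, Node g, Node s, Node p, Node o) {
        return children.parallelStream()
                .flatMap(child -> Iter.asStream(child.find(g, s, p, o)))
                .distinct()   // only needed while the same quad may exist in more than one child
                .count();
    }
}

Once the preprocessing guarantees every quad lives in exactly one TDB, the
distinct() step can be dropped, which is exactly the saving described above.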

About a year ago I performed that same test on a Ryzen 2950X based system,
using the same disks plus 3 M.2 drives and received similar results.

You also need to consider what bzip2 compression level was used.
Wikidata uses bzip2 because of its aggressive compression, i.e. they want to
reduce the compressed file size as much as possible.


On Wed, 20 May 2020 at 06:56, Wolfgang Fahl  wrote:

> Dear Apache Jena users,
>
> Some 2 years ago Laura Morales and Dick Murray had an exchange on this
> list on how to influence the performance of
> tdbloader. The issue is currently of interest for me again in the context
> of trying to load some 15 billion triples from a
> copy of wikidata. At
> http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData I have
> documented what I am trying to accomplish
> and a few days ago I placed a question on stackoverflow
> https://stackoverflow.com/questions/61813248/jena-tdbloader2-performance-and-limits
> with the following three questions:
>
> *What is proven to speed up the import without investing into extra
> hardware?*
> e.g. splitting the files, changing VM arguments, running multiple
> processes ...
>
> *What explains the decreasing speed at higher numbers of triples and how
> can this be avoided?*
>
> *What successful multi-billion triple imports for Jena do you know of and
> what are the circumstances for these?*
>
> There were some 50 views on the question so far and some comments but there
> is no real hint yet on what could improve things.
>
> Especially the Java VM crashes that happened with different Java
> environments on the Mac OSX machine are disappointing, since even with a
> slow speed the import would have been finished after a while, but with a
> crash it's a never-ending story.
>
> I am curious to learn what your experience and advice is.
>
> Yours
>
>   Wolfgang
>
> --
>
>
> Wolfgang Fahl
> Pater-Delp-Str. 1, D-47877 Willich Schiefbahn
> Tel. +49 2154 811-480, Fax +49 2154 811-481
> Web: http://www.bitplan.de
>
>


Detecting writes natively to a DatasetGraph since a particular epoch

2019-10-23 Thread Dick Murray
Hi.

Is it possible to natively detect whether a write has occurred to a
DatasetGraph since a particular epoch?

For the purposes of caching if I perform an expensive read from a
DatasetGraph knowing whether I need to invalidate the cache is very useful.
Does TDB or the in-memory dataset natively track whether a write has occurred?

Currently I am wrapping the Transactional, but I am interested in whether this
can be shimmed into the underlying SPI.
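
One possible shape for such a wrapper, as a sketch only (the class is
illustrative, not an existing Jena facility): a DatasetGraphWrapper that bumps
a counter whenever a write transaction commits, which a cache can compare
against the epoch it recorded.

import java.util.concurrent.atomic.AtomicLong;

import org.apache.jena.query.ReadWrite;
import org.apache.jena.query.TxnType;
import org.apache.jena.sparql.core.DatasetGraph;
import org.apache.jena.sparql.core.DatasetGraphWrapper;

/** Wraps a DatasetGraph and bumps a version number every time a write transaction commits. */
public class VersionedDatasetGraph extends DatasetGraphWrapper {
    private final AtomicLong version = new AtomicLong(0);
    private final ThreadLocal<Boolean> writeTxn = ThreadLocal.withInitial(() -> false);

    public VersionedDatasetGraph(DatasetGraph dsg) { super(dsg); }

    /** Cache epoch: compare against a previously recorded value to decide on invalidation. */
    public long version() { return version.get(); }

    @Override public void begin(ReadWrite mode) {
        writeTxn.set(mode == ReadWrite.WRITE);
        super.begin(mode);
    }

    @Override public void begin(TxnType type) {
        // Promotable read transactions are conservatively treated as potential writes.
        writeTxn.set(type != TxnType.READ);
        super.begin(type);
    }

    @Override public void commit() {
        super.commit();
        if (writeTxn.get()) version.incrementAndGet();
    }

    @Override public void end() {
        writeTxn.set(false);
        super.end();
    }
}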

Regards DickM


Re: sparql 1.4 billion triples

2018-12-16 Thread Dick Murray
Be very careful using vmtouch, especially if you call -dl, as you could very
easily and quickly kill a system. I've used this tool on cloud VMs to
mitigate cycle times (think DBAN, due to the public nature of the hardware).
It's a fast way to end up with an irked, thrashing OS.

Dick
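
The MappedByteBuffer#load approach described in the reply quoted below can be
used from inside the JVM instead of an external tool. A rough sketch (the file
path is an assumption, e.g. a TDB index file; this only warms the OS page
cache and does not lock pages the way vmtouch -dl does):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class WarmFile {
    /** Touch every page of a file (e.g. a TDB index such as SPO.dat) to pull it into the page cache. */
    static void warm(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            long pos = 0;
            while (pos < size) {
                long len = Math.min(Integer.MAX_VALUE, size - pos);  // map() is limited to 2GB per region
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                buf.load();  // reads a byte from each page, faulting it into memory
                pos += len;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        warm(Paths.get(args[0]));  // e.g. /data/tdb/SPO.dat
    }
}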

On Sun, 16 Dec 2018 19:57 Siddhesh Rane wrote:

> I'll be happy to document this. I think FAQ would be a good place.
>
> I actually looked further into this and found that the vmtouch
> functionality is provided in the jdk itself.
> The java.nio.MappedByteBuffer#load method will bring file pages into memory [1].
> The way it works is similar to vmtouch, i.e. reading a byte from each page
> to cause page fault and load that page in memory [2].
>
> [1]
>
> https://docs.oracle.com/javase/8/docs/api/java/nio/MappedByteBuffer.html#load--
>
> [2]
>
> http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/java/nio/MappedByteBuffer.java#l156
>
>
> On Sun, 16 Dec 2018, 6:59 pm ajs6f 
> > This seems to be a Linux-only technique that relies on installing and
> > maintaining vmtouch, correct?
> >
> > It doesn't seem that we could support that as a general solution, but
> > would you be interested in writing something that gives the essentials up
> > for someplace in the Jena docs? I'll admit I'm not sure where it would
> best
> > go, but it might be very helpful to users who can take advantage of it.
> >
> > ajs6f
> >
> > > On Dec 16, 2018, at 6:11 AM, Siddhesh Rane 
> wrote:
> > >
> > > In-memory database has following limitations :
> > >
> > > 1) Time to create the database. Not a problem if you have a dedicated
> > > machine which runs 24/7 where you load data once and the process never
> > > exits. But a huge waste of time if you get hardware during certain time
> > > slots and you have to load data from the start.
> > >
> > > 2) In-memory database is all or nothing. If your dataset can't fit in
> > RAM,
> > > you are out of luck. I had tried using this but many times it would go
> > OOM.
> > > With vmtouch, you can load an index partially, until as much free RAM
> is
> > > available. Something is better than nothing.
> > >
> > > Vmtouch is not doing anything magical. Tdb already uses mmap. When run
> on
> > > its own, Linux will bring most of the index in RAM. But think about the
> > > time it will take for that to happen. If one query takes 50 seconds
> (I've
> > > seen it go to 500-1000s as well), then in 1 hour you would have run
> just
> > 72
> > > queries. If instead your speed was 1s/query you would have executed
> 3600
> > > queries and that would bring more of the index in RAM for future
> queries
> > to
> > > run fast as well. So it's also the rate of speedup that matters.
> > > With vmtouch, you vmtouch at the beginning and it gives you a fast head
> > > start and then it's your program maintaining the cache.
> > >
> > > Regards,
> > > Siddhesh
> > >
> > >
> > > On Sat, 15 Dec 2018, 9:15 pm ajs6f wrote:
> > >> What is the advantage to doing that as opposed to using Jena's
> built-in
> > >> in-memory dataset?
> > >>
> > >> ajs6f
> > >>
> > >>> On Dec 15, 2018, at 3:04 AM, Siddhesh Rane 
> > wrote:
> > >>>
> > >>> Bring the entire database in RAM.
> > >>> Use "vmtouch "
> > >>> Get vmtouch from https://hoytech.com/vmtouch/
> > >>>
> > >>> I had used jena for 150M triples and my performance findings are
> > >> documented
> > >>> at
> > >>>
> > >>
> >
> https://lists.apache.org/thread.html/254968eee3cd04370eafa2f9cc586e238f8a7034cf9ab4cbde3dc8e9@%3Cusers.jena.apache.org%3E
> > >>>
> > >>> Regards,
> > >>> Siddhesh
> > >>>
> >>> On Fri, 7 Dec 2018, 8:23 pm y...@zju.edu.cn wrote:
> >  Dear jena,
> >  I have built a graph with 1.4 billion triples and stored it as a
> > dataset
> >  in TDB through the Fuseki upload system.
> >  Now, when I try to run SPARQL searches, the speed is very slow.
> >
> >  For example, when I run the following SPARQL query in Fuseki, it
> > takes
> >  50 seconds.
> >  How can I improve the speed?
> >  --
> >  Best wishes!
> > 
> > 
> >  胡云苹
> >  浙江大学控制科学与工程学院
> >  浙江省杭州市浙大路38号浙大玉泉校区CSC研究所
> >  Institute of Cyber-Systems and Control, College of Control Science
> and
> >  Engineering, Zhejiang University, Hangzhou 310027,P.R.China
> >  Email : y...@zju.edu.cn ;hyphy...@163.com
> > 
> > 
> > >>
> > >>
> >
> >
>


Re: Multiple Fuseki Servers in Distributed Environment

2018-06-01 Thread Dick Murray
Apologies for resurrecting this thread...

Yes, it uses Thrift when distributed, i.e. multi-JVM.

It was on hold because I changed jobs, yay!

I'm starting to look at making it available as a Jena sidecar, i.e.
jena-mosaic.

DickM

On 27 May 2018 at 12:02, ajs6f  wrote:

> There are several systems that distribute SPARQL using Jena.
>
> Dick Murray has written a system called Mosaic that (I believe) uses
> Apache Thrift to distribute the lower-level (DatasetGraph) primitives that
> ARQ uses to execute SPARQL. An advantage over your plan might be that he
> isn't serializing full results over HTTP to pass them around. I don't
> understand that system to be ready for use outside of Dick's deployment,
> but he could say more.
>
> The SANSA project [1] has provided a system that I understand to use ARQ
> to execute queries over Apache Spark or Apache Flink. This sounds similar
> in some ways to what you are doing, and that system is available today. I
> think Jena committer Lorenz Bühmann is involved with that project; if I am
> correct, he may be able to say more.
>
> There are doubtless others about which I don't know.
>
> ajs6f
>
> [1] http://sansa-stack.net/
>
> > On May 26, 2018, at 5:47 AM, Mirko Kämpf  wrote:
> >
> > Hello Fuseki experts,
> >
> > I want to ask you for your experience / thoughts about the following
> > approach:
> >
> >
> >
> > In order to enable semantic queries over "transient data" or on data
> which
> > is persisted in HDFS / HBase I
> > execute a Fuseki Server (standalone or embedded) on each cluster node,
> > which hosts a Spark Executor.
> >
> > Since the data is partitioned I will not have references between the
> > datasets (in this particular case).
> >
> > A simple query broker allows distributing the query and consolidation of
> > results. Next thing would be adding
> > a coordinator with graph statistics for optimization of data set dumps
> and
> > reloading in case of failure.
> >
> > A load balancer is used to balance request and result flows towards
> > clients; eventually, the query broker will run in Docker.
> >
> > A sketch is available here:
> > https://raw.githubusercontent.com/kamir/fuseki-cloud/master/
> > Fuseki%20Cloud.png
> >
> >
> >
> > My initial prototype works well. Now I want to go deeper. But I wonder if
> > such an activity has already been started or if
> > you know reasons, why this is not a good approach.
> >
> > In any case, if there is no reason for not implementing such a
> > "Fuseki-Cloud" approach - I continue on that route and
> > I want to contribute the results to the existing project.
> >
> > Thanks for any hint or recommendation.
> >
> > Best wishes,
> > Mirko
>
>


Re: TDB2 and bulk loading

2018-03-19 Thread Dick Murray
Slow needs to be qualified. Slow because you need to load 1MT in 10s? What
hardware? What environment? Are you loading a line based serialization? Are
you loading from scratch or appending?

D

On Mon, 19 Mar 2018, 10:51 Davide,  wrote:

> Hi,
> What is the best way to perform the bulk loading with TDB2 and Java API?
> Because I used the bulkloader with TDB1, but when I store data, it's too
> slow.
>


Re: client/server communication protocol

2018-03-13 Thread Dick Murray
From an enterprise perspective HTTP is well supported, with years of
development in associated stacks such as load balancing etc. It also
allows devs to use different languages. That said, we also employ Thrift-
based DGs which allow direct access from Python etc. It doesn't remove the
overhead, it just replaces HTTP with Thrift, plus the dev needs to know the
Jena API...
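
For reference, the standard HTTP route discussed below looks like this from
Java using Jena's RDFConnection. This is only a sketch of the SPARQL-protocol-
over-HTTP path (not the Thrift variant), and the endpoint URL is an assumption
(a local Fuseki dataset).

import org.apache.jena.query.QuerySolution;
import org.apache.jena.rdfconnection.RDFConnection;
import org.apache.jena.rdfconnection.RDFConnectionFactory;

public class HttpClientExample {
    public static void main(String[] args) {
        // Talk to a Fuseki dataset over the standard SPARQL protocol (HTTP).
        try (RDFConnection conn = RDFConnectionFactory.connect("http://localhost:3030/ds")) {
            conn.querySelect("SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }",
                    (QuerySolution row) -> System.out.println("triples: " + row.getLiteral("n")));
        }
    }
}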

On Tue, 13 Mar 2018, 07:35 Laura Morales,  wrote:

> I forgot to mention that I'm not looking at this from the perspective of a
> user who wants to use a public endpoint. I'm looking at this from the
> perspective of a developer making a website and using Jena/Fuseki as a
> non-public backend database.
>
>
>
>
> Sent: Tuesday, March 13, 2018 at 8:29 AM
> From: "Laura Morales" 
> To: users@jena.apache.org
> Cc: users@jena.apache.org
> Subject: Re: client/server communication protocol
> I'm not saying one is better or worse than the other; I'm merely trying to
> understand. If I understand correctly Fuseki is responsible for handling
> connections, after then it passes my query to Jena which essentially will
> parse my query and retrieve the data from a memory mapped file (TDB).
> Since MySQL/Postgres use a custom binary protocol, I'm simply asking
> myself if HTTP adds too much overhead and latency (and therefore is
> significantly slower when dealing with a lot of requests) compared to a
> custom protocol programmed on a lower level socket.
>
>
>
>
> Sent: Tuesday, March 13, 2018 at 8:11 AM
> From: "Lorenz Buehmann" 
> To: users@jena.apache.org
> Subject: Re: client/server communication protocol
> Well, Fuseki is exactly the HTTP layer on top of Jena. Without Fuseki,
> which protocol do you want to use to communicate with Jena? The SPARQL
> protocol [1] perfectly standardizes the communication via HTTP. Without
> Fuseki, who should do the HTTP handling? Clearly, you could setup your
> own Java server and do all the communication by yourself, e.g. using low
> level sockets etc. - whether this makes sense, I don't know. I'd always
> prefer standards, especially if you already have something like Fuseki
> which does all the connection handling.
>
>
> [1] https://www.w3.org/TR/sparql11-http-rdf-update/
>


Re: Best way to save a large amount of triples in TDB

2018-03-12 Thread Dick Murray
On Mon, 12 Mar 2018, 09:27 Davide Curcio,  wrote:

> Hi,
> I want to store a large amount of data inside the TDB server with the
>

Quantity or size on disk?

Jena API. In my code, I retrieve data for each iteration, and so I need
> to store these data in TDB, but if I create all the statements with the Jena
> API for each iteration before loading the data into the server, I obviously
> have problems with RAM. But if I try to commit the data for each iteration in
> the server, and so open and close a write transaction each time, it's
> obviously too slow. What's the best way to do this?
>

Standard bulk load as per any storage system...
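
The usual middle ground for the question quoted above, sketched below, is to
batch the adds: one write transaction per N statements rather than per
statement or one huge transaction. The batch size and TDB location are
illustrative.

import java.util.Iterator;

import org.apache.jena.query.Dataset;
import org.apache.jena.query.ReadWrite;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.Statement;
import org.apache.jena.tdb.TDBFactory;

public class BatchedLoad {
    /** Add statements in batches so neither RAM nor per-statement transactions become the bottleneck. */
    static void load(Dataset dataset, Iterator<Statement> statements, int batchSize) {
        while (statements.hasNext()) {
            dataset.begin(ReadWrite.WRITE);
            try {
                Model model = dataset.getDefaultModel();
                for (int i = 0; i < batchSize && statements.hasNext(); i++) {
                    model.add(statements.next());
                }
                dataset.commit();
            } finally {
                dataset.end();
            }
        }
    }

    public static void main(String[] args) {
        Dataset dataset = TDBFactory.createDataset("/data/tdb");  // path is illustrative
        // load(dataset, yourStatementIterator, 100_000);
    }
}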


> Thanks
>
>


Re: PrefixMapStd abbrev call to strSafeFor understanding

2018-01-19 Thread Dick Murray
Cheers Andy.

I need to perform some prefix compression and was hoping to hook into the
existing functionality of the prefix map from within a StreamRDF. As it's
a protected method, I've overridden it to return true, because I write the
pair similarly to the Thrift code, so there's no ambiguity when I recreate the
node.

On 19 Jan 2018 13:56, "Andy Seaborne" <a...@apache.org> wrote:



On 18/01/18 16:48, Dick Murray wrote:

> Is it possible to get a Pair<String, String> lexvo (left) code/002 (right)
> from abbrev  given the prefix map entry;
>

In Turtle the "/"  would need escaping as "\/".



> lexvo http://lexvo.org/id/
>
> and the URI;
>
> http://lexvo.org/id/code/002
>
> PrefixMapStd (actually base call) returns null because the call to;
>
> protected Pair<String, String> abbrev(Map<String, IRI> prefixes, String
> uriStr, boolean checkLocalPart)
>
> has checkLocalPart as true and the call to strSafeFor fails the "/" test.
>

That's a quick, partial check and there is a later test as well in
NodeFormatterTTL.

Neither allow for rewriting the local part (at the moment).

Andy


PrefixMapStd abbrev call to strSafeFor understanding

2018-01-18 Thread Dick Murray
Is it possible to get a Pair<String, String> lexvo (left) code/002 (right)
from abbrev  given the prefix map entry;

lexvo http://lexvo.org/id/

and the URI;

http://lexvo.org/id/code/002

PrefixMapStd (actually base call) returns null because the call to;

protected Pair<String, String> abbrev(Map<String, IRI> prefixes, String
uriStr, boolean checkLocalPart)

has checkLocalPart as true and the call to strSafeFor fails the "/" test.
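
What the thread is after, the (prefix, local) pair without the Turtle-safety
check, can be sketched independently of PrefixMapStd; this illustrates the
desired behaviour, not the Jena internals discussed here, and the method name
is made up.

import java.util.Map;

import org.apache.jena.atlas.lib.Pair;

public class PrefixSplit {
    /**
     * Return (prefix, localName) for the longest matching namespace,
     * deliberately skipping any check that the local part is Turtle-safe,
     * so "lexvo" / "code/002" is an acceptable answer.
     */
    static Pair<String, String> abbrevUnchecked(Map<String, String> prefixToNs, String uriStr) {
        Pair<String, String> best = null;
        int bestNsLength = -1;
        for (Map.Entry<String, String> e : prefixToNs.entrySet()) {
            String ns = e.getValue();
            if (uriStr.startsWith(ns) && ns.length() > bestNsLength) {
                best = new Pair<>(e.getKey(), uriStr.substring(ns.length()));
                bestNsLength = ns.length();
            }
        }
        return best;  // null when no registered namespace matches
    }
}

With the mapping lexvo -> http://lexvo.org/id/ and the URI
http://lexvo.org/id/code/002 this returns ("lexvo", "code/002").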


Re: Is There Any Way to Shorten The Waiting Time After Upload Triples in Jena?

2017-12-26 Thread Dick Murray
That's one graph in many pieces and the owner of the graph should clearly
state what is what!

On 26 Dec 2017 20:28, "Laura Morales"  wrote:

> Blank node identifiers are only limited in scope to a serialization of a
> particular RDF graph, i.e. the node _:b does not represent the same node
as
> a node named _:b in any other graph.

Yes I understand this, but I've seen some projects distribute their data as
one graph split into multiple files (eg one file per item).


Re: Is There Any Way to Shorten The Waiting Time After Upload Triples in Jena?

2017-12-26 Thread Dick Murray
On 26 Dec 2017 19:10, "Laura Morales"  wrote:

> What is more, it gets bNode labels across files right (so using _:a in
> two files is two bNodes).

Thinking about this...

- if the files contain anonymous blank nodes (for example in Turtle), each
node (converted with RIOT) should be assigned a random name (this is where
rapper fails, and RIOT works)
- if the files already contain named blank nodes (eg _:node1 
) then I guess these nodes should probably keep their names and not
be reassigned a random ID, because they are probably intended to mean the
same blank node


Blank node identifiers are only limited in scope to a serialization of a
particular RDF graph, i.e. the node _:b does not represent the same node as
a node named _:b in any other graph.


Re: Is There Any Way to Shorten The Waiting Time After Upload Triples in Jena?

2017-12-25 Thread Dick Murray
That seems slow for the size.

We bulk load triples on Windows and get similar times to CentOS/Fedora on
the same hardware.

You can hack tdbloader2 to run on Windows, as basically you're
exploiting the OS sort, which on Windows is:

sort [/r] [/+n] [/m kilobytes] [/l locale] [/rec
characters] [[drive1:][path1]filename1] [/t [drive2:][
path2]] [/o [drive3:][path3]filename3]

Merge all the files together using copy *.txt newfile.txt. This assumes you
understand the blank nodes...?

Use uniq from GNU utils for Windows, or the following native script:

@ECHO ON

SET InputFile=C:\folder\path\Input.txt
::SET InputFile=%~1
SET OutputFile=C:\folder\path\Output.txt

SET PSScript=%Temp%\~tmpRemoveDupe.ps1
IF EXIST "%PSScript%" DEL /Q /F "%PSScript%"
ECHO Get-Content "%InputFile%" ^| Sort-Object ^| Get-Unique ^>
"%OutputFile%">>"%PSScript%"

SET PowerShellDir=C:\Windows\System32\WindowsPowerShell\v1.0
CD /D "%PowerShellDir%"
Powershell -ExecutionPolicy Bypass -Command "& '%PSScript%'"

GOTO EOF



If you use SET InputFile=%~1, Windows will allow you to drag and drop
the source file onto the CMD window... Got to be some advantage to using Windows!?

Dick
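
A cross-platform alternative for the de-duplication step, for what it's worth:
plain Java can do the sort | uniq equivalent in a few lines. Note this version
sorts in memory, so it only suits files that fit comfortably in the heap; for
the dump sizes discussed in this thread the external-sort approaches above are
still the way to go.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class SortUnique {
    // In-memory equivalent of "sort | uniq" for a line-based file such as N-Triples.
    public static void main(String[] args) throws IOException {
        Path in = Paths.get(args[0]);
        Path out = Paths.get(args[1]);
        try (Stream<String> lines = Files.lines(in)) {
            List<String> unique = lines.sorted().distinct().collect(Collectors.toList());
            Files.write(out, unique);
        }
    }
}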

On 25 Dec 2017 4:51 am, "Shengyu Li" wrote:

Hello,

I am uploading my .ttl data to my database; there are about 10,000
files in total and each file is about 4M. My new data is about 40GB in total.
My original DB is also about 40GB. The server is on my local computer.

I use tdbloader.bat --loc to upload the data. After "Finish quads load", it
will pause at this status for a long time (about half an hour for one file
(4M), but for 200 files at one time (200*4M), the pause time will be 2 hours).
After the pause, control returns to the cmd prompt.

I guess the pause means the DB is organizing the data I
uploaded just now, so it won't return for a long time; am I right? Is there
any way to shorten the waiting time?

Thank you very much! Jena is really a useful thing!

Best,
Shengyu


Re: Operational issues with TDB

2017-12-22 Thread Dick Murray
How big? How many?

On 22 Dec 2017 8:37 pm, "Dimov, Stefan"  wrote:

> Hi all,
>
> We have a project which we’re trying to productize, and we’re facing
> certain operational issues with large files, especially with copying and
> maintaining them on the production cloud hardware (application nodes).
>
> Did anybody have similar issues? How did you resolve them?
>
> I will appreciate if someone shares their experience/problems/solutions.
>
> Regards,
> Stefan
>


Re: Very very slow query when using a high OFFSET

2017-12-18 Thread Dick Murray
On 18 December 2017 at 08:07, Laura Morales  wrote:

> > The don't have index permutations spo, ops, pos, etc.
>
> Yes they have, what you're saying is wrong. See http://www.rdfhdt.org/hdt-
> binary-format/#triples That's what the .hdt.index file is about, to store
> more index permutations.
>

This is going off the Jena list, but do we know how the wiki HDT was compiled?
Having read the technical material, including the link above, the
$$streamsOrder property (which defaults to SPO) sets the triple index
order. Can you query the HDT header and see what this is set to? 0 = SPO,
>=1 SOP, etc. Also check $$IDCodificationBits, because Wikidata blew the
original HDT code as it exceeded 2^32 triples and there was a new 64-bit id
code base in dev. Also, how big is the generated .hdt.index file (it's in
the same folder as the .hdt file)? This file is autogenerated as soon as you
try to search the HDT.

As previously mentioned this is best off this list, so dick-twocows on
github.


>
>
> > To bring this thread to an end, I guess we finally answered your
> > question? Or are the any open issues?
>
> I think the only remaining open questions are:
>
> - since the problem was not with the OFFSET, would the query "SELECT ?s
> FROM  WHERE ..." also fail to terminate with a TDB-backed
> namedGraph (instead of HDT)?
>
> - is there any improvement that can be added to Jena to solve these type
> of queries faster, or is it just the way it is and nothing can be done
> about it?
>


Re: Report on loading wikidata

2017-12-12 Thread Dick Murray
On 12 Dec 2017 21:06, "Laura Morales" <laure...@mail.com> wrote:

Ok so, this is very interesting. You got 400K TPS which is a very
interesting number, and on 5400rpm disks nonetheless! A couple of points:

1) your test seems to suggest that it's possible to load a huge dataset
quickly by running multiple loaders in parallel.


Yes.

This

means, if I'm right, that I can just add more SATA disks to my computer, or
even mount remote disks from other nodes in my network. Pretty cool but...


Up to a limit.

I don't understand how you merge the result into a single dataset? You've
loaded 4 stores in parallel... did you also merge them then?

2) from my tests, tdbloader2 starts by parsing triples rather quickly (130K
TPS) but then it quickly slows down *a lot* over time, like 4K TPS or less.


That slow down is your split point. It's also possible that your hardware
just won't be able to load in parallel.

And I'm not convinced it's a problem of disk cache either, because I tried
to flush it several times, but the disk was always getting slower and
slower as more triples were added (1MB/s writes!!!). So, didn't you
experience the same issue with your 5400rpm disks?


Your IO is saturated. Not yet but at some point above 200M lines I will
too, so I either reduce the lines or increase the IO (currently on X399
chipset which supports 8 6G SATA devices). You are looking for the sweet
spot and that's different between hardware. My IO is huge because I'm using
a Ryzen but eventually I'll saturate it.




Sent: Tuesday, December 12, 2017 at 9:20 PM
From: "Dick Murray" <dandh...@gmail.com>
To: users@jena.apache.org
Subject: Re: Report on loading wikidata
tdbloader2

For anyone still following this thread ;-)

latest-truthy supposedly contains just the verified facts, this is
Wikipedia...

latest-truthy is unsorted and contains duplicates; running sort then uniq
yields 61K+ duplicates ranging from 2 to 100+. Running sort takes a while!
Whilst it's not going to reduce the line count hugely (-3M lines), it's
worth considering when doing any import.

Fastest elapsed load TPS currently is ~405K (got there, Andy), which was
achieved by splitting the file into 200M-line files using split and running
four concurrent loads into four TDBs, each TDB on a separate 5400rpm 6G
drive; the tdbloader2 script was hacked to run sort with parallel 6, an 8G
buffer and a temporary directory on an appropriate drive. Repeat 3 times to
give 12 TDB instances, queried via my Mosaic extension. I'll up the file size
until the drive saturates and stalls, then drop it back and run it
concurrently, as the stall appears to occur on the drive write. Currently I
perform the index data-triple sort in parallel but write the indexes
sequentially. The drives were stolen from some old laptops, so not exactly
bling hardware.

On the subject of performance, it's possible to cascade the split if you
have enough drives: split the file in half, and when the second file is created
split the first one in half, and so on, using inotify in a script.

While I aim to "load" truthy in under an hour, that won't account for
getting the file, uncompressing the file (non-parallel bzip2!!!), splitting the
file, etc., but for marketing purposes who cares... ;-)



On 11 Dec 2017 18:43, "Laura Morales" <laure...@mail.com> wrote:

Did you run your Threadripper test using tdbloader, tdbloader2, or
tdb2.tdbloader?

@Andy where can I find a description of TDB1/2 binary format (how stuff is
stored in the files)?



Sent: Monday, December 11, 2017 at 11:31 AM
From: "Dick Murray" <dandh...@gmail.com>
To: users@jena.apache.org
Subject: Re: Report on loading wikidata
Inline...

On 10 December 2017 at 23:03, Laura Morales <laure...@mail.com> wrote:

> Thank you a lot Dick! Is this test for tdbloader, tdbloader2, or
> tdb2.tdbloader?
>
> > 32GB DDR4 quad channel
>
> 2133 or higher?
>

2133


> > 3 x M.2 Samsung 960 EVO
>
> Are these PCI-e disks? Or SATA? Also, what size and configuration?


PCIe Turbo


> > Is it possible to split the index files into separate folders?
> > Or sym link the files, if I run the data phase, sym link, then run the
> index phase?
>
> What would you gain from this?
>

n index files need to be written so split the load across multiple devices,
be that cores/controllers/storage. Potentially use a fast/expensive device
to perform the load and copy the files over to a production grade device.
Load device would have no redundancy as who cares if it throws a drive?
Production devices are redundant as 5 9's requirement.


>
> > 172K/sec 3h45m for truthy.
>
> It still feels slow considering that you throw such a powerful machine to
> it, but it's very interesting nonetheless! What I think after these tests,
> is that the larger impact here is given by the M.2 disks


It's also got 2 x SATAIII 6G drives and the load time doesn't increase by
much using these. There's a

Re: Report on loading wikidata

2017-12-12 Thread Dick Murray
Correct, Mosaic federates multiple datasets as one. At some point in a
query find [G]SPO will get called, and Mosaic will concurrently call find on
each child dataset and return the set of results. The dataset can be memory
or TDB or Thrift (this one's another discussion); Mosaic doesn't care as
long as it implements DatasetGraph. The child calls use parallel streams
and distinct or findFirst as appropriate. Transactions are supported via
ThreadProxy and delayed until needed, because parallel streams use fork-join
pools which create threads whenever, and certain stream actions such as
findFirst will short-circuit and may never get past reading the first child.
Mosaic exists because I needed to bulk load fast and perform multiple loads
after the bulk loads, i.e. MRMW, which Mosaic can do/spoof because it extends
Transactional with tryBegin(ReadWrite). Also we needed to access TDBs from
multiple JVMs because... (this one's another discussion too).
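
As an illustration of the short-circuiting point, a contains() over the
children can stop as soon as any child answers. This is only a sketch (names
are illustrative; Mosaic's transaction handling via ThreadProxy is omitted):

import java.util.List;

import org.apache.jena.graph.Node;
import org.apache.jena.sparql.core.DatasetGraph;

public class FederatedContains {
    /**
     * contains(G,S,P,O) across several child datasets: anyMatch short-circuits,
     * so once one child matches, the remaining children may never be consulted.
     */
    static boolean contains(List<DatasetGraph> children, Node g, Node s, Node p, Node o) {
        return children.parallelStream()
                .anyMatch(child -> child.contains(g, s, p, o));
    }
}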

There was a PR but work got in the way of me testing with sufficient data
to stress it. It's now being stressed. Ideally I'd like to provide Mosaic
as a separate group, e.g. jena-mosaic, which takes the load off maintaining
yet another add-on.

Back on the thread topic: IMHO splitting the bulk load is the way to go, as you
can always use SERVICE in your SPARQL; plus, manipulating a 250GB+ file is a
PITA!!! ;-)

On 12 Dec 2017 21:52, "ajs6f"  wrote:

That's not what Mosaic is doing at all. I'll leave it to Dick to explain
after this, because I am not the expert here, he is, but it's federating
multiple datasets so that they appear as one to SPARQL. It's got nothing to
do with individual graphs within a dataset.

ajs6f

> On Dec 12, 2017, at 4:36 PM, Laura Morales  wrote:
>
>> He can correct me as needed, but it seems that Dick is using (and
getting great results from)
>> an extension to Jena ("Mosaic") that federates different datasets (in
this cases from
>> independent TDB instances) and runs queries over them in parallel. We've
had some discussions
>> (all the way to a PR: https://github.com/apache/jena/pull/233) about
getting Mosaic into Jena's
>> codebase, but we haven't quite managed to do it. I would love to move
that process forward.
>
>
> I think his approach of splitting and running multiple tdbloaders works
if every TDB is loaded into the default graph (using
tdb:unionDefaultGraph). However I'm not sure if I want to maintain graph
labels. Is there any way to tell Jena that one particular graph is
"composed" of more than one TDB store? For example if I split Wikidata into
smaller stores of 100M triples each, I could "SELECT FROM "
instead of "SELECT FROM  
 ..."


Re: Avoid exception In the middle of an alloc-write

2017-12-12 Thread Dick Murray
We "hand" a transaction around using a ThreadProxy, which is basically a
wrapper around an ExecutorService which does one thing at a time. You
create it then give it to one or more threads which submit things to do and
it returns Future's. We extend it to implement Transactional so it works
with Txn. We require it because we use custom ForkJoinPool's to parallel
stream which create threads as and when which plays havoc with Transaction
control!
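
The ThreadProxy described here isn't public, but the core idea can be sketched
as a single-threaded ExecutorService that all transaction work is funnelled
through, so thread-local transaction state always sees the same thread. The
class is illustrative and the Transactional integration mentioned above is
omitted.

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/**
 * Funnels work from many threads onto one worker thread, one task at a time,
 * so thread-local transaction state always runs on the same thread.
 */
public class ThreadProxy implements AutoCloseable {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    /** Submit a task; it runs on the proxy's single thread in submission order. */
    public <T> Future<T> submit(Callable<T> task) {
        return worker.submit(task);
    }

    @Override public void close() {
        worker.shutdown();
    }
}

Everything that touches the dataset inside the transaction, including work
triggered from parallel-stream worker threads, is then routed through submit().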

On 12 Dec 2017 19:51, "George News"  wrote:

On 2017-12-12 18:43, ajs6f wrote:
>>> Is there any option to link a transaction pointer to a class?
>>
>> Sorry - don't understand that question.
>
> If this means, "Is there a way to have a reference to a transaction
> that I can hand around?" then the answer is generally, no,
> transactions in Jena are normally thread-local and internal to the
> class from which the transaction was opened. I believe TDB2 takes
> some steps towards going beyond that, but generally, no.

this was what I was suggesting. Sorry for my English ;)

> ajs6f
>
>> On Dec 12, 2017, at 12:39 PM, Andy Seaborne 
>> wrote:
>>
>>
>>
>> On 12/12/17 10:11, George News wrote:
>>> On 2017-12-12 10:45, Andy Seaborne wrote:


 On 11/12/17 09:38, George News wrote:
> Hi,
>
> I'm facing the exception that I include below. I guess this
> is because I'm not properly opening a transaction or so.

 Yes - and also not using the datasets MRSW lock (multiple
 reader / single writer).

 Concurrent access must be controlled by a transaction or the
 datasets lock.  Ideally, transactions.

>
> Let's try to explain a bit to guess if this is the problem: -
> I have multiple graphs which I merge using MultiUnion - I
> generate the MultiUnion in one transaction, but the use of
> the joined graph is done in another transaction.

 Is the MultiUnion over graphs in the same dataset?
>>> Yes. (new question: is it possible to merge graphs from different
>>> datasets? Are they copied or just referenced?)
>>
>> MultiUnion is bunch of references and one graph distinguished for
>> update.
>>
>>> Actually I'm now checking in the code and I have multiple
>>> read-transactions, one inside another: 1 Read Transaction for
>>> SPARQL Select execution (using dataset.begin()) 1.1 Read
>>> transaction fro creating a big multiunion (using
>>> Txn.calculateRead()) 1.1.1 Read transaction for creating the
>>> multiunion (using Txn.calculateRead()) 1.1.2 Read Transaction for
>>> creating another multiunion (using Txn.calculateRead())
>>
>> Txn does cope with nesting but it's not nested transactions - it's
>> within the outer transaction.
>>
>>> 2 Do some stuff over the resultset 3 Close main one with if
>>> (dataset.isInTransaction()) { // Maybe it's better to use
>>> abort() // but as it is a read transaction // I think it doesn't
>>> matter dataset.end();
>>
>> If that is withing a Txn then its bad.
>>
>> Txn does the trasnaction management of begin-commit/abort-end.
>>
>>> } Now I'm thinking that maybe this 3) is closing something on the
>>> dataset that could be writing?
>>
>>
>>> Is there any option to link a transaction pointer to a class?
>>
>> Sorry - don't understand that question.
>>
>> Andy
>>
>>> Regards, Jorge
>>>
> - I use a single static final Dataset from
> TDBFactory.createDataset(TRIPLE_STORE_PATH); - Read and write
> operations are done in different threads, so maybe we have
> started a join for read and in parallel we are writing on one
> of the graphs included in the union.
>
>
> Any hint is welcome.
>
> Regards, Jorge

 Andy

>
>
> org.apache.jena.tdb.base.file.FileException: In the middle of
> an alloc-write at
> org.apache.jena.tdb.base.objectfile.ObjectFileStorage.
read(ObjectFileStorage.java:311)
>
>
~[jena-tdb-3.5.0.jar:3.5.0]
> at
> org.apache.jena.tdb.base.objectfile.ObjectFileWrapper.
read(ObjectFileWrapper.java:57)
>
>
~[jena-tdb-3.5.0.jar:3.5.0]
> at
> org.apache.jena.tdb.lib.NodeLib.fetchDecode(NodeLib.java:78)
> ~[jena-tdb-3.5.0.jar:3.5.0] at
> org.apache.jena.tdb.store.nodetable.NodeTableNative.readNodeFromTable(
NodeTableNative.java:186)
>
>
~[jena-tdb-3.5.0.jar:3.5.0]
> at
> org.apache.jena.tdb.store.nodetable.NodeTableNative._
retrieveNodeByNodeId(NodeTableNative.java:111)
>
>
~[jena-tdb-3.5.0.jar:3.5.0]
> at
> org.apache.jena.tdb.store.nodetable.NodeTableNative.getNodeForNodeId(
NodeTableNative.java:70)
>
>
~[jena-tdb-3.5.0.jar:3.5.0]
> at
> org.apache.jena.tdb.store.nodetable.NodeTableCache._
retrieveNodeByNodeId(NodeTableCache.java:128)
>
>
~[jena-tdb-3.5.0.jar:3.5.0]
> at
> org.apache.jena.tdb.store.nodetable.NodeTableCache.getNodeForNodeId(
NodeTableCache.java:82)
>
>
~[jena-tdb-3.5.0.jar:3.5.0]
> at
> 

Re: Report on loading wikidata

2017-12-12 Thread Dick Murray
tdbloader2

For anyone still following this thread ;-)

latest-truthy supposedly contains just the verified facts, this is
Wikipedia...

latest-truthy is unsorted and contains duplicates; running sort then uniq
yields 61K+ duplicates ranging from 2 to 100+. Running sort takes a while!
Whilst it's not going to reduce the line count hugely (-3M lines), it's
worth considering when doing any import.

Fastest elapsed load TPS currently is ~405K (got there, Andy), which was
achieved by splitting the file into 200M-line files using split and running
four concurrent loads into four TDBs, each TDB on a separate 5400rpm 6G
drive; the tdbloader2 script was hacked to run sort with parallel 6, an 8G
buffer and a temporary directory on an appropriate drive. Repeat 3 times to
give 12 TDB instances, queried via my Mosaic extension. I'll up the file size
until the drive saturates and stalls, then drop it back and run it
concurrently, as the stall appears to occur on the drive write. Currently I
perform the index data-triple sort in parallel but write the indexes
sequentially. The drives were stolen from some old laptops, so not exactly
bling hardware.

On the subject of performance, it's possible to cascade the split if you
have enough drives: split the file in half, and when the second file is created
split the first one in half, and so on, using inotify in a script.

While I aim to "load" truthy in under an hour, that won't account for
getting the file, uncompressing the file (non-parallel bzip2!!!), splitting the
file, etc., but for marketing purposes who cares... ;-)



On 11 Dec 2017 18:43, "Laura Morales" <laure...@mail.com> wrote:

Did you run your Threadripper test using tdbloader, tdbloader2, or
tdb2.tdbloader?

@Andy where can I find a description of TDB1/2 binary format (how stuff is
stored in the files)?



Sent: Monday, December 11, 2017 at 11:31 AM
From: "Dick Murray" <dandh...@gmail.com>
To: users@jena.apache.org
Subject: Re: Report on loading wikidata
Inline...

On 10 December 2017 at 23:03, Laura Morales <laure...@mail.com> wrote:

> Thank you a lot Dick! Is this test for tdbloader, tdbloader2, or
> tdb2.tdbloader?
>
> > 32GB DDR4 quad channel
>
> 2133 or higher?
>

2133


> > 3 x M.2 Samsung 960 EVO
>
> Are these PCI-e disks? Or SATA? Also, what size and configuration?


PCIe Turbo


> > Is it possible to split the index files into separate folders?
> > Or sym link the files, if I run the data phase, sym link, then run the
> index phase?
>
> What would you gain from this?
>

n index files need to be written so split the load across multiple devices,
be that cores/controllers/storage. Potentially use a fast/expensive device
to perform the load and copy the files over to a production grade device.
Load device would have no redundancy as who cares if it throws a drive?
Production devices are redundant as 5 9's requirement.


>
> > 172K/sec 3h45m for truthy.
>
> It still feels slow considering that you throw such a powerful machine to
> it, but it's very interesting nonetheless! What I think after these tests,
> is that the larger impact here is given by the M.2 disks


It's also got 2 x SATAIII 6G drives and the load time doesn't increase by
much using these. There's a fundamental limit at which degradation occurs
as eventually stuff has to be swapped or committed which then cascades into
stalls. As an ex DBA bulk loads always involved, dropping or disabling
indexes, running overnight so users were asleep, building indexes, updating
stats, present DB in TP mode to make users happy! Things have moved on but
the same problems exists.


> , and perhaps to a smaller scale by the DDR4 modules. When I tested with a
> xeon+ddr3-1600, it didn't seem to make any difference. It would be
> interesting to test with a more "mid-range setup" (iCore/xeon + DDR3) and
> M.2 disks. Is this something that you can try as well?
>

IMHO it's not, our SLA equates to 50K/sec or 180M/hr quads an hour, so
anything over this a bonus. But we don't work on getting 500M quads into a
store at 150K/sec because this will eventually hit a ceiling. We work on
getting concurrent 500M quads into stores at 75K/sec. Production
environments are a completely different beast to having fun with a test
setup.

Consider the simplified steps involved in getting a single quad into a
store (please correct me Andy);

Read quad from source.
Verify GSPO lexical and type.
Check GSPO for uniqueness (read and compare) possibly x4 write to node->id
lookup.
Write indexes.
Repeat n times.

Understand binary format and tweak appropriately for tdbloader2 ;-)

Broadly speaking you can affect the overall time and the elapsed time. What
we refer to as the fast or clever problem. Simplistically, reduce the
overall by loading more per second and reduce the elapsed time by loading
more concurrently. I prefer going after the elapsed time with the divide
and conquer approach bec

Re: Report on loading wikidata

2017-12-12 Thread Dick Murray
Understand, I'm running sort and uniq on truthy out of interest...

On 12 December 2017 at 10:31, Andy Seaborne <a...@apache.org> wrote:

>
>
> On 12/12/17 10:06, Dick Murray wrote:
> ...
>
>> As an aside there are duplicate entries in the data-triples.tmp file, is
>> this by design? if you sort data-triples.tmp | uniq > it returns a smaller
>> file and I've checked visually and there are duplicate entries...
>>
> ...
>
> It's expected.
>
> data-triples.tmp is a stream of triples from the parser.  If the data
> contains duplicates (in syntax), data-triples.tmp contain duplicates.
>
> Andy
>


Re: Report on loading wikidata

2017-12-12 Thread Dick Murray
Similar here.

I hacked (i.e. no checking/setup/params) the data/index scripts to create
s, p, o folders soft-linked to three separate devices, moved in the
respective .dat and .idn files, hard-linked back to data-triples.tmp,
and ran the three triple indexes in parallel. sort was run with parallel 8 and
an 8GB buffer. It built the three indexes in the time taken to build one.

As an aside, there are duplicate entries in the data-triples.tmp file; is
this by design? If you sort data-triples.tmp | uniq > it returns a smaller
file, and I've checked visually that there are duplicate entries...

I'll tidy the script and make it available if anyone wants to perform a
tweaked load; it's only really useful for large datasets.

On 11 December 2017 at 15:32, Andy Seaborne <a...@apache.org> wrote:

> This is for the large amount of temporary space that tdbloader2 uses?
>
> I got "latest-all" to load but I had to do some things with tdbloader2 to
> work with a compresses data-triples.tmp.gz and also have sort write
> comprssed temporary files (I messed up a bit and set the gzip compression
> too high so it slowed things down).
>
> There are some small problems with tdbloader2 with complex --sort-args (it
> only handles one single arg/value correctly).  My main trick was to put in
> a script for "sort" that had the required settings built-in. I wanted to
> set --compress, -T and the buffer size.
>
> On 10/12/17 21:18, Dick Murray wrote:
>
>> Ryzen 1920X 3.5GHz, 32GB DDR4 quad channel, 3 x M.2 Samsung 960 EVO,
>> 172K/sec 3h45m for truthy.
>>
>> Is it possible to split the index files into separate folders?
>>
>
> Not built-in.  Symbolic links will work.
>
> I'm keen on symbolic links here because built-in support would hard to
> keep all cases covered.
>
>
>> Or sym link the files, if I run the data phase, sym link, then run the
>> index phase?
>>
>
> Symbolic links will work.
>
> "sort" can be configured to use a temporary folder as well.
>
> The only place symbolic links will not work is for data-triples.tmp. It
> must not exist at all - we ought to change that to make it OK to have a
> zero-length file in place so it can be redirected ahead of time.
>
> Andy
>
>
>
>> Point me in the right direction and I'll extend the TDB file open code.
>>
>> Dick
>>
>>
>> On 7 Dec 2017 22:21, "Andy Seaborne" <a...@apache.org> wrote:
>>
>>
>>
>> On 07/12/17 19:01, Laura Morales wrote:
>>
>> Thank you a lot Andy, very informative (special thanks for specifying the
>>> hardware).
>>> For anybody reading this, I'd like to highlight the fact that the data
>>> source is "latest-truthy" and not "latest-all".
>>>  From what I understand, truthy leaves out a lot of data (50% ??) and
>>> "all"
>>> is more than 4 billion triples.
>>>
>>>
>> 4,787,194,669 Triples
>>
>> Dick reported figures for truthy as well.
>>
> I used a *16G* machine, and it is a portable with all its memory
>> architecture tradeoffs.
>>
>> "all" is running ATM - it will be much slower due to RAM needs of
>> tdbloader2 for the data phase.  Not sure the figures will mean anything
>> for
>> you.
>>
>> I'd need a machine with (guess) 32G RAM which is still a small server
>> these
>> days.
>>
>> (A similar tree builder technique could be applied to the node index and
>> reduce the max RAM needs but - hey, ho - that's free software for you.)
>>
>>  Andy
>>
>>


Re: Report on loading wikidata

2017-12-11 Thread Dick Murray
Inline...

On 10 December 2017 at 23:03, Laura Morales  wrote:

> Thank you a lot Dick! Is this test for tdbloader, tdbloader2, or
> tdb2.tdbloader?
>
> > 32GB DDR4 quad channel
>
> 2133 or higher?
>

2133


> > 3 x M.2 Samsung 960 EVO
>
> Are these PCI-e disks? Or SATA? Also, what size and configuration?


PCIe Turbo


> > Is it possible to split the index files into separate folders?
> > Or sym link the files, if I run the data phase, sym link, then run the
> index phase?
>
> What would you gain from this?
>

n index files need to be written, so split the load across multiple devices,
be that cores/controllers/storage. Potentially use a fast/expensive device
to perform the load and copy the files over to a production-grade device.
The load device would have no redundancy, as who cares if it throws a drive?
Production devices are redundant, as there is a five-nines requirement.


>
> > 172K/sec 3h45m for truthy.
>
> It still feels slow considering that you throw such a powerful machine to
> it, but it's very interesting nonetheless! What I think after these tests,
> is that the larger impact here is given by the M.2 disks


It's also got 2 x SATA III 6G drives and the load time doesn't increase by
much using these. There's a fundamental limit at which degradation occurs,
as eventually stuff has to be swapped or committed, which then cascades into
stalls. As an ex-DBA, bulk loads always involved dropping or disabling
indexes, running overnight so users were asleep, building indexes, updating
stats, and presenting the DB in TP mode to make users happy! Things have
moved on, but the same problems exist.


> , and perhaps to a smaller scale by the DDR4 modules. When I tested with a
> xeon+ddr3-1600, it didn't seem to make any difference. It would be
> interesting to test with a more "mid-range setup" (iCore/xeon + DDR3) and
> M.2 disks. Is this something that you can try as well?
>

IMHO it's not; our SLA equates to 50K/sec, or 180M quads an hour, so
anything over this is a bonus. But we don't work on getting 500M quads into a
store at 150K/sec, because this will eventually hit a ceiling. We work on
getting concurrent 500M-quad loads into stores at 75K/sec. Production
environments are a completely different beast to having fun with a test
setup.

Consider the simplified steps involved in getting a single quad into a
store (please correct me, Andy):

Read quad from source.
Verify GSPO lexical and type.
Check GSPO for uniqueness (read and compare) possibly x4 write to node->id
lookup.
Write indexes.
Repeat n times.

Understand binary format and tweak appropriately for tdbloader2 ;-)

Broadly speaking, you can affect the overall time and the elapsed time: what
we refer to as the fast-or-clever problem. Simplistically, reduce the
overall time by loading more per second, and reduce the elapsed time by loading
more concurrently. I prefer going after the elapsed time with the
divide-and-conquer approach because it yields more scalable results. This is
why we run multiple stores (not just TDB) and query over them. This in itself
is a trade-off, because we need to use distinct when merging streams, which can
be RAM intensive. And we're really tight on the number of quads you can
return! :-)


Re: Report on loading wikidata

2017-12-10 Thread Dick Murray
Ryzen 1920X 3.5GHz, 32GB DDR4 quad channel, 3 x M.2 Samsung 960 EVO,
172K/sec 3h45m for truthy.

Is it possible to split the index files into separate folders?

Or sym link the files, if I run the data phase, sym link, then run the
index phase?

Point me in the right direction and I'll extend the TDB file open code.

Dick


On 7 Dec 2017 22:21, "Andy Seaborne"  wrote:



On 07/12/17 19:01, Laura Morales wrote:

> Thank you a lot Andy, very informative (special thanks for specifying the
> hardware).
> For anybody reading this, I'd like to highlight the fact that the data
> source is "latest-truthy" and not "latest-all".
> From what I understand, truthy leaves out a lot of data (50% ??) and "all"
> is more than 4 billion triples.
>

4,787,194,669 Triples

Dick reported figures for truthy as well.

I used a *16G* machine, and it is a portable with all its memory
architecture tradeoffs.

"all" is running ATM - it will be much slower due to RAM needs of
tdbloader2 for the data phase.  Not sure the figures will mean anything for
you.

I'd need a machine with (guess) 32G RAM which is still a small server these
days.

(A similar tree builder technique could be applied to the node index and
reduce the max RAM needs but - hey, ho - that's free software for you.)

Andy


Re: TDB Loader 2 and TDB2 Loader

2017-12-06 Thread Dick Murray
Thank you! I was way off...

If it's already sorted that step would be quick...

On 6 Dec 2017 19:54, "ajs6f" <aj...@apache.org> wrote:

> https://github.com/apache/jena/blob/master/apache-jena/
> bin/tdbloader2index#L363
>
> tdbloader2 calls tdbloader2index and tdbloader2data.
>
> ajs6f
>
> > On Dec 6, 2017, at 2:50 PM, Dick Murray <dandh...@gmail.com> wrote:
> >
> > TDB Loader 2, where does it call the Unix sort please? I'm obviously
> > looking too hard!
> >
> > TDB2 Loader does a simple .add(Quad)? I'm not missing something?
> >
> > Dick.
>
>


TDB Loader 2 and TDB2 Loader

2017-12-06 Thread Dick Murray
TDB Loader 2, where does it call the Unix sort please? I'm obviously
looking too hard!

TDB2 Loader does a simple .add(Quad)? I'm not missing something?

Dick.


Re: tdb2.tdbloader performance

2017-12-02 Thread Dick Murray
Hello.

On 2 Dec 2017 8:55 pm, "Andy Seaborne"  wrote:


Short story I used the following "reasonable" device
>
>  Dell M3800
>  Fedora 27
>  16GB SODIMM DDR3 Synchronous 1600 MHz
>  CPU cache L1/256KB,L2/1MB,L3/6MB
>  Intel(R) Core(TM) i7-4702HQ CPU @ 2.20GHz 4 cores 8 threads
>
> to load part of the latest-truthy.nt from a USB3.0 1TB drive to a 6GB RAM
> disk and;
>
> @800%60K/Sec
> @100%40K/Sec
> @50%20K/Sec
>
> The full source file contains 2.2G of triples in 10GB bz2 which
> decompresses to 250GB nt, which I split into 10M triple chunks and used the
> first one to test.
>

Which tdb loader?


TDB2


For TDB1, the two loaders behave very differently.

I loaded truthy, 2.199 billion triples, on a 16G Dell XPS with SSD in 8
hours (76K triples/s) using TDB1 tdbloader2.

I'll write it up soon.


Loaded truthy on the server in 9 hours using RAID 5 with 10 x 10k 1TB SAS
drives. Loaded 4 truthys concurrently in 9.5 hours. I think that's the
biggest concurrent source the server has handled. Fans work!



Check with Andy but I think it's limited by CPU, which is why my 24 core (4
> x Xeon 6 Core @2.5GHz) 128GB server is able to run concurrent loads with no
> performance hit.
>

The limit at scale is the I/O handling and disk cache. 128G RAM gives a
better disk cache and that server machine probably has better I/O.  It's
big enough to fit one whole index (if all RAM is available - and that
depends on the swappiness setting which should be set to zero ideally).

CPU is a limit for a while, but you'll see the load speed slow down, so it
is not purely CPU that is the limit. (As the indexes are 200-way trees,
they don't get very deep.)

tdbloader (loader1) does one index at a time so that the I/O is
constrained, unlike simply adding triples to all 3 indexes together (which
is what TDB2 loader does currently).

loader1 degrades at large scale due to random I/O write patterns on
secondary indexes.  Hence an SSD makes a big difference.

loader2 (which has high overhead) avoids those problems and only writes
indexes from sorted input, so there is no random access to the indexes. An
SSD makes less difference.


I might have access to an AMD ThreadRipper 12 core 24 thread 5GHz in the
> next few days and I will try and test against it.
>
> I haven't run the full import because a: i'm guessing the resulting TDB2
> will be "large" b: my servers are currently importing other "large"
> TDB2's!!!
>

The TDB2 database for a single graph will be the same size as TDB1 using
tdbloader (not tdbloader2).


> Long story follows...
>




Re: tdb2.tdbloader performance

2017-12-01 Thread Dick Murray
Hi.

Sorry for the delay :-)

Short story I used the following "reasonable" device

Dell M3800
Fedora 27
16GB SODIMM DDR3 Synchronous 1600 MHz
CPU cache L1/256KB,L2/1MB,L3/6MB
Intel(R) Core(TM) i7-4702HQ CPU @ 2.20GHz 4 cores 8 threads

to load part of the latest-truthy.nt from a USB3.0 1TB drive to a 6GB RAM
disk and;

@800%60K/Sec
@100%40K/Sec
@50%20K/Sec

The full source file contains 2.2G of triples in 10GB bz2 which
decompresses to 250GB nt, which I split into 10M triple chunks and used the
first one to test.

Check with Andy but I think it's limited by CPU, which is why my 24 core (4
x Xeon 6 Core @2.5GHz) 128GB server is able to run concurrent loads with no
performance hit.

I might have access to an AMD ThreadRipper 12 core 24 thread 5GHz in the
next few days and I will try and test against it.

I haven't run the full import because (a) I'm guessing the resulting TDB2
will be "large" and (b) my servers are currently importing other "large"
TDB2's!!!

Long story follows...

decompress the file;

pbzip2 -dv -p4 -m1024 latest-truthy.nt.bz2
Parallel BZIP2 v1.1.12 [Dec 21, 2014]
By: Jeff Gilchrist [http://compression.ca]
Major contributions: Yavor Nikolov [http://javornikolov.wordpress.com]
Uses libbzip2 by Julian Seward

 # CPUs: 4
 Maximum Memory: 1024 MB
 Ignore Trailing Garbage: off
---
 File #: 1 of 1
 Input Name: latest-truthy.nt.bz2
Output Name: latest-truthy.nt

 BWT Block Size: 900k
 Input Size: 9965955258 bytes
Decompressing data...
Output Size: 277563574685 bytes
---

 Wall Clock: 5871.550948 seconds

count the lines;

wc -l latest-truthy.nt
2199382887 latest-truthy.nt

Just short of 2200M...

split the file into 10M chunks;

split -d -l 10485760 -a 3 --verbose latest-truthy.nt latest-truthy.nt.
creating file 'latest-truthy.nt.000'
creating file 'latest-truthy.nt.001'
creating file 'latest-truthy.nt.002'
creating file 'latest-truthy.nt.003'
creating file 'latest-truthy.nt.004'
creating file 'latest-truthy.nt.005'
...

Restart!

sudo cpulimit -v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v
--loc /media/ramdisk/ latest-truthy.000.nt

ps aux | grep tdb2
root  3358  0.0  0.0 222844  5756 pts/0S+   19:22   0:00 sudo
cpulimit -v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root  3359  0.0  0.0   4500   776 pts/0S+   19:22   0:00 cpulimit
-v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root  3360  0.0  0.0 120304  3288 pts/0S+   19:22   0:00 sh
./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/
latest-truthy.000.nt
root  3361  4.9  0.0   450092 pts/0S<+  19:22   0:05 cpulimit
-v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root  3366 95.7 14.8 7866116 2418768 pts/0 Sl+  19:22   1:42 java
-Dlog4j.configuration=file:/run/media/dick/KVM/jena/apache-jena-3.5.0/jena-log4j.properties
-cp /run/media/dick/KVM/jena/apache-jena-3.5.0/lib/* tdb2.tdbloader -v
--loc /media/ramdisk/ latest-truthy.000.nt
dick  3477  0.0  0.0 119728   972 pts/1S+   19:24   0:00 grep
--color=auto tdb2

Notice PID 3366 is running with the default -Xmx2G.

19:26:49 INFO  TDB2 :: Finished: 10,485,760
latest-truthy.000.nt 247.28s (Avg: 42,404)

After the first pass there is no read from the 1TB source as the OS has
cached the 1.2G source.

19:33:50 INFO  TDB2 :: Finished: 10,485,760
latest-truthy.000.nt 245.70s (Avg: 42,677)

export JVM_ARGS="-Xmx4G" i.e. increase the max heap and help the GC

sudo ps aux | grep tdb2
root  4317  0.0  0.0 222848  6236 pts/0S+   19:35   0:00 sudo
cpulimit -v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root  4321  0.0  0.0   4500   924 pts/0S+   19:35   0:00 cpulimit
-v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root  4322  0.0  0.0 120304  3356 pts/0S+   19:35   0:00 sh
./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/
latest-truthy.000.nt
root  4323  4.8  0.0   450088 pts/0S<+  19:35   0:09 cpulimit
-v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root  4328 94.8 18.5 8406788 3036188 pts/0 Sl+  19:35   3:01 java
-Dlog4j.configuration=file:/run/media/dick/KVM/jena/apache-jena-3.5.0/jena-log4j.properties
-cp /run/media/dick/KVM/jena/apache-jena-3.5.0/lib/* tdb2.tdbloader -v
--loc /media/ramdisk/ latest-truthy.000.nt
dick  4594  0.0  0.0 119728  1024 pts/1S+   19:38   0:00 grep
--color=auto tdb2

At 800K PID was 3GB and peaked at 3.4GB just prior to completion.

19:39:23 INFO  TDB2 :: Finished: 10,485,760
latest-truthy.000.nt 247.65s (Avg: 42,340)

Throw all CPU resources at it, i.e. cpulimit -l 800

sudo 

Re: tdb2.tdbloader performance

2017-11-28 Thread Dick Murray
LOL, there are lots of things where I'd like to "move the problem
elsewhere".

I've achieved concurrent 120K on the server hardware but it depends on the
input. There's another recent Jena thread regarding sizing, and that's tied
up with what's in the input. I see the same thing with loading data: some
files fly, others seem to drag, and it's not just the size. What the server
hardware does do is allow me to run multiple processes and average 60K.
Also, up to a certain size, I have an overclocked AMD (4.5GHz) which will
outperform everything until it hits its cache limit.

We tend towards running multiple TDBs and presenting them as one, a legacy
of overcoming the single writer in TDB1. This brings its own issues, such
as distinct being high cost, which we mitigate with a few tricks.

On the minefield subject of hardware, do you have DDR3 or DDR4? What
chipset is driving it? Haswell's dual-channel memory controller is going to
have a hard time keeping up with the quad-channel memory controllers on Ivy
Bridge-E and Haswell-E. And yes, Corsair quote 47GB/s for DDR4, but you
still need to write that somewhere: an M.2 on PCI-E 2.0 x4 at 1.6GB/s is
almost 3x the throughput of SATA III at 600MB/s, PCI-E 3.0 x4 is 3.9GB/s,
plus you now have Optane or 3D XPoint, depending on what sounds better.

What files are you trying to import and i'll run them through?

Regards Dick

On 28 November 2017 at 15:30, Laura Morales  wrote:

> > Eventually something will give and you'll get a wait as something is
> spilled to something, ie cache to physical drive.
> > Also different settings suit different work loads. I have a number of
> +128GB units configured differently depending on what they need to do. The
> ETL setting only gives Java 8GB but the OS will consume close to 90GB
> virtual for the process as it basically dumps into file cache. At some
> point though that cache is written out to noon volatile storage. As the
> units have 24 cores I can actually run close to 12 processes before things
> start to effect each other. If you consider server class hardware there's a
> lot of thought to cache levels and how they cascade.
> > Switch the SATA for M.2 and you'll move the issue somewhere else...
>
> Well yeah, but having a problem at 10K triples/seconds is not the same
> problem as 1M triples/seconds. I'll gladly "move the problem elsewhere" if
> I knew how to get to 1M triples/seconds.
> Moving from SATA to M.2 I don't know if it's worth the trouble (and money)
> given that on my computer running from SATA3 disks or RAMdisk doesn't seem
> like it's making any difference. And RAM is much faster than M.2 too.
> Just out of curiosity, how many "AVG triples/seconds" can you get with
> your server-class hardware when converting a .nt to TDB2 using
> tdb2.tdbloader?
>


Jena 3.2.0-rc1 issue

2017-05-15 Thread Dick Murray
This is probably me but...

I've got a collection of import errors in my Jena 3.2.0-rc1 fork, the
common issue being the import prefix "org.apache.jena.ext"...

i.e. import org.apache.jena.ext.com.google.common.cache.Cache ; in jena-arq
FactoryRDFCaching

I've checked the github apache jena repository and it has these imports...

Am I missing something?

D


Re: Materialize query

2017-04-26 Thread Dick Murray
I've seen this type of statement in regard to Oracle, whereby a
materialized query is disk based and updated periodically based on the
query. It's useful in BI where you don't require the latest data. The
closest parallel I can draw for RDF is persisting inference (think RDFS
subClassOf, i.e. A -> B -> C) so as not to incur the overhead at query
time. But others would call that pre-computing, or caching, or any one of a
number of similar terms...
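
For example, a rough sketch of persisting RDFS inference up front (the file
names are placeholders):

import org.apache.jena.rdf.model.InfModel;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;

// Compute the RDFS closure once and write it out; later queries then read
// plain triples with no reasoner attached.
public class MaterialiseRDFS {
    public static void main(String[] args) {
        Model schema = RDFDataMgr.loadModel("schema.ttl");
        Model data = RDFDataMgr.loadModel("data.ttl");
        InfModel inf = ModelFactory.createRDFSModel(schema, data);
        RDFDataMgr.write(System.out, inf, Lang.TURTLE); // base + entailed triples
    }
}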

On 26 April 2017 at 13:30, javed khan  wrote:

> Lorenz, I have seen it in a statement like " Materialize queries are used
> to reduce the number of necessary joins and processing time".
>
> On Wed, Apr 26, 2017 at 8:52 AM, Lorenz B. <
> buehm...@informatik.uni-leipzig.de> wrote:
>
> > Please understand that the term "materialised SPARQL queries" is not a
> > common one, thus, probably nobody will be able to answer your question.
> >
> > So let me ask you, WHAT is a materialised SPARQL query and WHERE have
> > you see this expression?
> >
> > > Hello
> > >
> > > What is materialized SPARQL queries and how it differs from other
> > queries?
> > >
> > > Regards
> > >
> > --
> > Lorenz Bühmann
> > AKSW group, University of Leipzig
> > Group: http://aksw.org - semantic web research center
> >
> >
>


Re: Predicates with no vocabulary

2017-04-12 Thread Dick Murray
It is for this reason that I use  and as a nod to my Cisco
engineer days and example.org... :-)

As Martynas Jusevičius said give it a little thought.

On 12 April 2017 at 17:37, Martynas Jusevičius 
wrote:

> It would not be an error as long it is a valid URI.
>
> Conceptually non-HTTP URIs go against Linked Data principles because
> normally they cannot be dereferenced.
>
> Therefore it makes sense to give it a little thought and choose an
> http:// namespace that you control.
>
> On Wed, Apr 12, 2017 at 6:31 PM, Laura Morales  wrote:
> >> I use "urn:ex:..." in a lot of my test code (short for "urn:example:").
> >>
> >> Then the predicate is "urn:ex:time/now" or "urn:ex:time/duration" or
> >> whatever you need...
> >
> > would it be an error (perhaps conceptually) to use "ex:...", essentially
> removing the "urn:" scheme?
>


Re: Predicates with no vocabulary

2017-04-12 Thread Dick Murray
I use "urn:ex:..." in a lot of my test code (short for "urn:example:").

Then the predicate is "urn:ex:time/now" or "urn:ex:time/duration" or
whatever you need...
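
For instance (a minimal sketch):

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;

// Throwaway "urn:ex:" predicates for test data; nothing here is meant to be
// dereferenceable or shared.
public class UrnExExample {
    public static void main(String[] args) {
        Model m = ModelFactory.createDefaultModel();
        Property now = m.createProperty("urn:ex:time/now");
        Resource thing = m.createResource("urn:ex:thing/1");
        thing.addLiteral(now, System.currentTimeMillis());
        m.write(System.out, "TURTLE");
    }
}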

On 12 April 2017 at 09:49, Laura Morales  wrote:

> > The question is a bit unclear. If there is no existing vocabulary that
> > you can resp. want to reuse, then you have to use your own vocabulary
> > which basically just means to use your own URIs for the predicates.
>
> Right, so let's say I don't want to define any new vocabulary, but I just
> want to use some predicates. For example a predicate called "predicate1"
> and "predicate2". These are not meant to be shared, I use them for whatever
> reason and I take full responsibility to shooting myself in the foot. Is
> there any "catch-all" or "default/undefined" vocabulary that I can use? I
> mean something like a default vocabulary that parses as valid URIs, but
> whose meaning is undefined (= the interpretation is left to the user)?
> Something like "  " and "
>  "... I wonder if I should use "
> <_:predicate1> " but I'm not sure?!
>


Re: Binary protocol

2017-04-05 Thread Dick Murray
I think that worked;

wants to merge 1 commit into apache:master from dick-twocows:master

if so I'll compile the Thrift file and commit that too...

On 5 April 2017 at 19:28, Andy Seaborne <a...@apache.org> wrote:

> Should be - let's try it!
>
> Andy
>
>
> On 05/04/17 19:25, Dick Murray wrote:
>
>> Ok, I'm forked from apache/jena on GitHub at
>> https://github.com/dick-twocows/jena.git, is it sufficient for me to
>> issue
>> a PR from here?
>>
>> I'm hoping it's fairly agnostic... :-)
>>
>> The transform code is not in the Jena fork because, I don't really know
>> why! Concept is it's not a TTL but a CSV so use the appropriate handler to
>> transform and load.
>>
>> On 5 April 2017 at 12:54, Andy Seaborne <a...@apache.org> wrote:
>>
>>
>>>
>>> On 04/04/17 20:26, Dick Murray wrote:
>>>
>>> I'd be happy to supply the current code we have, just need to get the
>>>> current project delivered (classic spec delivered after the code due
>>>> date!)
>>>> Will tidy and do a pull request if anyone is interested..?
>>>>
>>>>
>>> Interested.
>>>
>>> Because it was written for your needs, I'd guess it can be a basis for
>>> general binary.  So a PR sooner would be my preference, less worrying
>>> about
>>> fit to Jena at this stage.
>>>
>>> Without knowing what you have ...
>>>
>>> What would be good is to add to the existing collection of
>>> readers/writers
>>> for RDF+SPARQL results and drive off the MIME type.
>>>
>>> And a "remote graph".find(G,S,P,O) for those bnode things.
>>>
>>>
>>> A PR is an easy way to view the code even if it isn't ready for
>>> inclusion.  A PR can be evolved as discussion happens.
>>>
>>>
>>> We also check the file type and possibly transform before the load. But
>>>
>>>> this is a simple map lookup so shouldn't cause any issues.
>>>>
>>>> Our bulk load isn't RDF patch, it's a load of one or more data files
>>>> into
>>>> a
>>>> new graph possibly with a transform performed at some point in the
>>>> future,
>>>> which you check by querying the load ID...
>>>>
>>>>
>>> I'm interested in seeing the transform process.
>>>
>>> The base line is a binary version of all the current Fuseki interactions
>>> -
>>> to me, a transform framework is a separate, related thing. (Break the
>>> problem into smaller steps in order for "volunteers" to make progress
>>> ...)
>>>
>>> Andy
>>>
>>>
>>> Dick
>>>>
>>>>
>>>> On 4 Apr 2017 8:15 pm, "Andy Seaborne" <a...@apache.org> wrote:
>>>>
>>>>
>>>>
>>>> On 04/04/17 19:02, Dick Murray wrote:
>>>>
>>>> Slightly lateral on the topic but we use a Thrift endpoint compiled
>>>>
>>>>> against
>>>>> Jena to allow multiple languages to use Jena. Think interface
>>>>> supporting
>>>>> sparql, sparul and bulk load...
>>>>>
>>>>>
>>>>> I'd like to put in binary versions of the protocols behind a
>>>> RDFConnection.
>>>>
>>>> Bulk load would be RDF patch -- "bulk changes".
>>>>
>>>> Andy
>>>>
>>>>
>>>>
>>>> On 3 Apr 2017 6:36 pm, "Martynas Jusevičius" <marty...@graphity.org>
>>>>
>>>>> wrote:
>>>>>
>>>>> By using uniform protocols such as HTTP and SPARQL (over HTTP), you
>>>>>
>>>>> decouple server implementation from client implementation.
>>>>>>
>>>>>> You can execute SPARQL commands on Fuseki using PHP, C#, JavaScript or
>>>>>> any other language. But that involves networking.
>>>>>>
>>>>>> You can only use the Jena API from Java (and then some JVM-compatible
>>>>>> languages).
>>>>>>
>>>>>> On Mon, Apr 3, 2017 at 5:02 PM, javed khan <javedbtk...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Thank you Lorenz, I have read that website but unfortunately did not
>>>>>> get
>>>>>>
>>>>>>> the concept. Let me try to read it again.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Apr 3, 2017 at 4:35 PM, Lorenz Buehmann <
>>>>>>> buehm...@informatik.uni-leipzig.de> wrote:
>>>>>>>
>>>>>>> Javed ...
>>>>>>>
>>>>>>>
>>>>>>>> I'll simply cite the "slogan" from the web page [1] and recommend to
>>>>>>>> read [2]
>>>>>>>>
>>>>>>>> "Fuseki: serving RDF data over HTTP"
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> [1] https://jena.apache.org/documentation/serving_data/
>>>>>>>>
>>>>>>>> [2] https://jena.apache.org/documentation/fuseki2/
>>>>>>>>
>>>>>>>>
>>>>>>>> On 03.04.2017 14:54, javed khan wrote:
>>>>>>>>
>>>>>>>> Hi
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Why we need fuseki server in semantic web applications. We can run
>>>>>>>>>
>>>>>>>>> SPARQL
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> queries without it, like we do using Jena syntax.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>


Re: Binary protocol

2017-04-05 Thread Dick Murray
Ok, I'm forked from apache/jena on GitHub at
https://github.com/dick-twocows/jena.git, is it sufficient for me to issue
a PR from here?

I'm hoping it's fairly agnostic... :-)

The transform code is not in the Jena fork because... I don't really know
why! The concept is that it's not a TTL but a CSV, so use the appropriate
handler to transform and load.

On 5 April 2017 at 12:54, Andy Seaborne <a...@apache.org> wrote:

>
>
> On 04/04/17 20:26, Dick Murray wrote:
>
>> I'd be happy to supply the current code we have, just need to get the
>> current project delivered (classic spec delivered after the code due
>> date!)
>> Will tidy and do a pull request if anyone is interested..?
>>
>
> Interested.
>
> Because it was written for your needs, I'd guess it can be a basis for
> general binary.  So a PR sooner would be my preference, less worrying about
> fit to Jena at this stage.
>
> Without knowing what you have ...
>
> What would be good is to add to the existing collection of readers/writers
> for RDF+SPARQL results and drive off the MIME type.
>
> And a "remote graph".find(G,S,P,O) for those bnode things.
>
>
> A PR is an easy way to view the code even if it isn't ready for
> inclusion.  A PR can be evolved as discussion happens.
>
>
> We also check the file type and possibly transform before the load. But
>> this is a simple map lookup so shouldn't cause any issues.
>>
>> Our bulk load isn't RDF patch, it's a load of one or more data files into
>> a
>> new graph possibly with a transform performed at some point in the future,
>> which you check by querying the load ID...
>>
>
> I'm interested in seeing the transform process.
>
> The base line is a binary version of all the current Fuseki interactions -
> to me, a transform framework is a separate, related thing. (Break the
> problem into smaller steps in order for "volunteers" to make progress ...)
>
> Andy
>
>
>> Dick
>>
>>
>> On 4 Apr 2017 8:15 pm, "Andy Seaborne" <a...@apache.org> wrote:
>>
>>
>>
>> On 04/04/17 19:02, Dick Murray wrote:
>>
>> Slightly lateral on the topic but we use a Thrift endpoint compiled
>>> against
>>> Jena to allow multiple languages to use Jena. Think interface supporting
>>> sparql, sparul and bulk load...
>>>
>>>
>> I'd like to put in binary versions of the protocols behind a
>> RDFConnection.
>>
>> Bulk load would be RDF patch -- "bulk changes".
>>
>> Andy
>>
>>
>>
>> On 3 Apr 2017 6:36 pm, "Martynas Jusevičius" <marty...@graphity.org>
>>> wrote:
>>>
>>> By using uniform protocols such as HTTP and SPARQL (over HTTP), you
>>>
>>>> decouple server implementation from client implementation.
>>>>
>>>> You can execute SPARQL commands on Fuseki using PHP, C#, JavaScript or
>>>> any other language. But that involves networking.
>>>>
>>>> You can only use the Jena API from Java (and then some JVM-compatible
>>>> languages).
>>>>
>>>> On Mon, Apr 3, 2017 at 5:02 PM, javed khan <javedbtk...@gmail.com>
>>>> wrote:
>>>>
>>>> Thank you Lorenz, I have read that website but unfortunately did not get
>>>>> the concept. Let me try to read it again.
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Apr 3, 2017 at 4:35 PM, Lorenz Buehmann <
>>>>> buehm...@informatik.uni-leipzig.de> wrote:
>>>>>
>>>>> Javed ...
>>>>>
>>>>>>
>>>>>> I'll simply cite the "slogan" from the web page [1] and recommend to
>>>>>> read [2]
>>>>>>
>>>>>> "Fuseki: serving RDF data over HTTP"
>>>>>>
>>>>>>
>>>>>>
>>>>>> [1] https://jena.apache.org/documentation/serving_data/
>>>>>>
>>>>>> [2] https://jena.apache.org/documentation/fuseki2/
>>>>>>
>>>>>>
>>>>>> On 03.04.2017 14:54, javed khan wrote:
>>>>>>
>>>>>> Hi
>>>>>>>
>>>>>>> Why we need fuseki server in semantic web applications. We can run
>>>>>>>
>>>>>>> SPARQL
>>>>>>
>>>>>
>>>> queries without it, like we do using Jena syntax.
>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>
>>


Re: Why we need Fuseki

2017-04-04 Thread Dick Murray
I'd be happy to supply the current code we have, just need to get the
current project delivered (classic spec delivered after the code due date!)
Will tidy and do a pull request if anyone is interested..?

We also check the file type and possibly transform before the load. But
this is a simple map lookup so shouldn't cause any issues.

Our bulk load isn't RDF patch, it's a load of one or more data files into a
new graph possibly with a transform performed at some point in the future,
which you check by querying the load ID...

Dick


On 4 Apr 2017 8:15 pm, "Andy Seaborne" <a...@apache.org> wrote:



On 04/04/17 19:02, Dick Murray wrote:

> Slightly lateral on the topic but we use a Thrift endpoint compiled against
> Jena to allow multiple languages to use Jena. Think interface supporting
> sparql, sparul and bulk load...
>

I'd like to put in binary versions of the protocols behind a RDFConnection.

Bulk load would be RDF patch -- "bulk changes".

Andy



> On 3 Apr 2017 6:36 pm, "Martynas Jusevičius" <marty...@graphity.org>
> wrote:
>
> By using uniform protocols such as HTTP and SPARQL (over HTTP), you
>> decouple server implementation from client implementation.
>>
>> You can execute SPARQL commands on Fuseki using PHP, C#, JavaScript or
>> any other language. But that involves networking.
>>
>> You can only use the Jena API from Java (and then some JVM-compatible
>> languages).
>>
>> On Mon, Apr 3, 2017 at 5:02 PM, javed khan <javedbtk...@gmail.com> wrote:
>>
>>> Thank you Lorenz, I have read that website but unfortunately did not get
>>> the concept. Let me try to read it again.
>>>
>>>
>>>
>>> On Mon, Apr 3, 2017 at 4:35 PM, Lorenz Buehmann <
>>> buehm...@informatik.uni-leipzig.de> wrote:
>>>
>>> Javed ...
>>>>
>>>> I'll simply cite the "slogan" from the web page [1] and recommend to
>>>> read [2]
>>>>
>>>> "Fuseki: serving RDF data over HTTP"
>>>>
>>>>
>>>>
>>>> [1] https://jena.apache.org/documentation/serving_data/
>>>>
>>>> [2] https://jena.apache.org/documentation/fuseki2/
>>>>
>>>>
>>>> On 03.04.2017 14:54, javed khan wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> Why we need fuseki server in semantic web applications. We can run
>>>>>
>>>> SPARQL
>>
>>> queries without it, like we do using Jena syntax.
>>>>>
>>>>>
>>>>
>>>>
>>
>


Re: Why we need Fuseki

2017-04-04 Thread Dick Murray
Slightly lateral on the topic but we use a Thrift endpoint compiled against
Jena to allow multiple languages to use Jena. Think interface supporting
sparql, sparul and bulk load...

On 3 Apr 2017 6:36 pm, "Martynas Jusevičius"  wrote:

> By using uniform protocols such as HTTP and SPARQL (over HTTP), you
> decouple server implementation from client implementation.
>
> You can execute SPARQL commands on Fuseki using PHP, C#, JavaScript or
> any other language. But that involves networking.
>
> You can only use the Jena API from Java (and then some JVM-compatible
> languages).
>
> On Mon, Apr 3, 2017 at 5:02 PM, javed khan  wrote:
> > Thank you Lorenz, I have read that website but unfortunately did not get
> > the concept. Let me try to read it again.
> >
> >
> >
> > On Mon, Apr 3, 2017 at 4:35 PM, Lorenz Buehmann <
> > buehm...@informatik.uni-leipzig.de> wrote:
> >
> >> Javed ...
> >>
> >> I'll simply cite the "slogan" from the web page [1] and recommend to
> >> read [2]
> >>
> >> "Fuseki: serving RDF data over HTTP"
> >>
> >>
> >>
> >> [1] https://jena.apache.org/documentation/serving_data/
> >>
> >> [2] https://jena.apache.org/documentation/fuseki2/
> >>
> >>
> >> On 03.04.2017 14:54, javed khan wrote:
> >> > Hi
> >> >
> >> > Why we need fuseki server in semantic web applications. We can run
> SPARQL
> >> > queries without it, like we do using Jena syntax.
> >> >
> >>
> >>
>


Re: Jena scalability

2017-03-26 Thread Dick Murray
On 26 Mar 2017 5:20 pm, "Laura Morales"  wrote:

- Is Jena a "native" store? Or does it use some other RDBMS/NoSQL backends?


It has memory, TDB and SDB (I'm not sure of the current state)
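
TDB is Jena's own disk-based storage; a minimal sketch (the directory path
is a placeholder):

import org.apache.jena.query.Dataset;
import org.apache.jena.query.ReadWrite;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.tdb.TDBFactory;

// A TDB dataset is created directly on a directory; no external RDBMS or
// NoSQL backend is involved.
public class NativeStore {
    public static void main(String[] args) {
        Dataset ds = TDBFactory.createDataset("/data/tdb-example");
        ds.begin(ReadWrite.WRITE);
        try {
            Model m = ds.getDefaultModel();
            m.createResource("urn:ex:s").addProperty(m.createProperty("urn:ex:p"), "o");
            ds.commit();
        } finally {
            ds.end();
        }
    }
}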

- Has anybody ever done tests/benchmarks to see how well Jena scales with
large datasets (billions or trillions of n-quads)?


We have several 650GB TDBs and some Mem instances at 128GB. What queries
are being performed? How many graphs do you have? Are you just querying, or
updating as well?

- Is it possible to start with a single machine, and later distribute the
database over multiple machines as the graph grows?


Not currently with TDB, but I have code in production which aggregates
across multiple DatasetGraphs. We create a DatasetGraphMosaic and add
DatasetGraphs to it. TDBs in other JVMs are supported via a Thrift-based
proxy. This allows simple SPARQL; otherwise use the SERVICE keyword in your
query...
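
For the SERVICE route, something along these lines (a sketch; the endpoint
URLs are placeholders):

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSetFormatter;

// Federate part of a query out to a second store over SPARQL instead of
// physically merging the datasets.
public class FederatedQuery {
    public static void main(String[] args) {
        String q =
            "SELECT * WHERE {\n" +
            "  ?s ?p ?o .\n" +
            "  SERVICE <http://other-host:3030/ds/sparql> { ?s ?p2 ?o2 }\n" +
            "}";
        try (QueryExecution qe = QueryExecutionFactory.sparqlService(
                "http://local-host:3030/ds/sparql", q)) {
            ResultSetFormatter.out(qe.execSelect());
        }
    }
}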


Re: Understanding DatasetGraph getLock() (DatasetGraphInMem throwing a curve ball)...

2017-03-24 Thread Dick Murray
Thanks for the background. I'll map to TDB and Mem and throw an UOE if
"another" DG is encountered.

Same here, I drew a blank on a Jena optimistic lock and try lock. So I've
created a LockMRAndMW (effectively lazy) which is used to control the
DatasetGraphDistributed i.e. no blocking via the begin(ReadWrite). Then
when the streams are called (e.g. find(...)) the actual DG's have the read
transaction started. Also a LockMRSWTry and LockMRPlusSWTry which wrap the
TDB and Mem lock semantics.

It was REALLY important for us that we don't block on the begin(ReadWrite)
call, as we are currently aggregating 18 separate JVM TDB/Mem instances
into one DG (via a Thrift DG implementation). Specifically, when we perform
an ETL we try each remote DG until we acquire a write lock, then the quads
are loaded. This way we can support multiple writes, as we effectively
shard the TDB, and we reduced bulk ETL load times from the sum of all load
times to, simplistically, the longest load time (assuming we have enough
shards...)

Internally the sharded DG's are only locked when they are touched.

The majority of DG's are TDB backed, but the system recognises certain
"things" and will spin up a Mem-backed DG in another JVM to perform ad hoc
work, then tear it down.

On 24 March 2017 at 11:41, A. Soroka <aj...@virginia.edu> wrote:

> The lock from getLock is always the same semantics for every impl--
> currently MRSW, with no expectation for changing. It's a kind of "system
> lock" to keep the internal state of that class consistent. That's distinct
> from the transactional semantics of a given impl. In some cases, the
> semantics happen to coincide, when the actual transactional semantics are
> also MRSW. But sometimes they don't (actually, I think DatasetGraphInMem is
> the only example where they don't right now, but I am myself tinkering with
> another example and I am confident that we will have more). When they
> don't, you need to rely on the impl to manage its own transactionality, via
> the methods for that purpose.  I'm not actually sure we have a good
> non-blocking method for your use right now. We have inTransaction(), but
> that's not too helpful here.
>
> But someone else can hopefully point to a technique that I am missing.
>
>
> ---
> A. Soroka
> The University of Virginia Library
>
> > On Mar 24, 2017, at 6:51 AM, Dick Murray <dandh...@gmail.com> wrote:
> >
> > Hi.
> >
> > Is there a way to get what Transactional a DatasetGraph is using and
> > specifically what Lock semantics are in force?
> >
> > As part of a distributed DatasetGraph implementation I have a
> > DatasetGraphTry wrapper which adds Boolean tryBegin(ReadWrite) and as the
> > name suggests it will try to lock the given DatasetGraph and return
> > immediately, i.e. not block. Internally if it acquires the lock it will
> > call the wrapped void begin(ReadWrite) which "should" not block. This is
> > useful because I can round robin the DatasetGraph's which constitute the
> > distribution without blocking. Especially useful as some of the
> > DatasetGraph's are running in other JVM's.
> >
> > Currently I've reverted the mapping to the DatasetGraph class (requires I
> > manually check the Jena code) but I'd like to understand why and possibly
> > make the code neater...
> >
> > To automate the wrapping I pulled the Lock via getLock() and used the
> class
> > to lookup the appropriate wrapper. But after digging I noticed that the
> > Lock from getLock() doesn't always match the Transactional locking
> > semantics.
> >
> > DatasetGraphInMem getLock() returns org.apache.jena.shared.LockMRSW but
> > internally its Transactional implementation is
> > using org.apache.jena.shared.LockMRPlusSW which is subtly different.
> This
> > is noticeable because getLock() isn't overridden but inherits from
> > DatasetGraphBase which declares LockMRSW.
> >
> > A TDB backed DatasetGraph masquerades as a;
> >
> > DatasetGraphTransaction
> >
> > DatasetGraphTrackActive
> >
> > DatasetGraphWrapper
> >
> > which wraps the DatasetGraphTDB
> >
> > DatasetGraphTripleQuads
> >
> > DatasetGraphBaseFind
> >
> > DatasetGraphBase where the getLock() returns
> >
> >
> >
> > INFO Thread[main,5,main] [class
> > org.apache.jena.sparql.core.mem.DatasetGraphInMemory]
> > INFO Thread[main,5,main] [class org.apache.jena.shared.LockMRSW]
> >
> > INFO Thread[main,5,main] [class
> > org.apache.jena.tdb.transaction.DatasetGraphTransaction]
> > INFO Thread[main,5,main] [class org.apache.jena.shared.LockMRSW]
> > INFO Thread[main,5,main] [class org.apache.jena.tdb.store.
> DatasetGraphTDB]
> > INFO Thread[main,5,main] [class org.apache.jena.shared.LockMRSW]
> >
> > Regards Dick.
>
>


Understanding DatasetGraph getLock() (DatasetGraphInMem throwing a curve ball)...

2017-03-24 Thread Dick Murray
Hi.

Is there a way to get what Transactional a DatasetGraph is using and
specifically what Lock semantics are in force?

As part of a distributed DatasetGraph implementation I have a
DatasetGraphTry wrapper which adds Boolean tryBegin(ReadWrite) and as the
name suggests it will try to lock the given DatasetGraph and return
immediately, i.e. not block. Internally if it acquires the lock it will
call the wrapped void begin(ReadWrite) which "should" not block. This is
useful because I can round robin the DatasetGraph's which constitute the
distribution without blocking. Especially useful as some of the
DatasetGraph's are running in other JVM's.
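
A trimmed-down sketch of the wrapper (my own class, not part of Jena; error
handling and the read path omitted):

import java.util.concurrent.locks.ReentrantLock;

import org.apache.jena.query.ReadWrite;
import org.apache.jena.sparql.core.DatasetGraph;
import org.apache.jena.sparql.core.DatasetGraphWrapper;

// tryBegin() returns immediately instead of blocking; the wrapped begin() is
// only called once the local lock has been acquired.
public class DatasetGraphTry extends DatasetGraphWrapper {
    private final DatasetGraph inner;
    private final ReentrantLock local = new ReentrantLock();

    public DatasetGraphTry(DatasetGraph dsg) {
        super(dsg);
        this.inner = dsg;
    }

    public boolean tryBegin(ReadWrite mode) {
        if (!local.tryLock())      // do not block if another writer holds this DG
            return false;
        inner.begin(mode);         // should not block now we hold the local lock
        return true;
    }

    @Override
    public void commit() {
        try { inner.commit(); } finally { local.unlock(); }
    }

    @Override
    public void abort() {
        try { inner.abort(); } finally { local.unlock(); }
    }
}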

Currently I've reverted the mapping to the DatasetGraph class (requires I
manually check the Jena code) but I'd like to understand why and possibly
make the code neater...

To automate the wrapping I pulled the Lock via getLock() and used the class
to lookup the appropriate wrapper. But after digging I noticed that the
Lock from getLock() doesn't always match the Transactional locking
semantics.

DatasetGraphInMem getLock() returns org.apache.jena.shared.LockMRSW but
internally its Transactional implementation is
using org.apache.jena.shared.LockMRPlusSW which is subtly different. This
is noticeable because getLock() isn't overridden but inherits from
DatasetGraphBase which declares LockMRSW.

A TDB backed DatasetGraph masquerades as a;

DatasetGraphTransaction

DatasetGraphTrackActive

DatasetGraphWrapper

which wraps the DatasetGraphTDB

DatasetGraphTripleQuads

DatasetGraphBaseFind

DatasetGraphBase where the getLock() returns



INFO Thread[main,5,main] [class
org.apache.jena.sparql.core.mem.DatasetGraphInMemory]
INFO Thread[main,5,main] [class org.apache.jena.shared.LockMRSW]

INFO Thread[main,5,main] [class
org.apache.jena.tdb.transaction.DatasetGraphTransaction]
INFO Thread[main,5,main] [class org.apache.jena.shared.LockMRSW]
INFO Thread[main,5,main] [class org.apache.jena.tdb.store.DatasetGraphTDB]
INFO Thread[main,5,main] [class org.apache.jena.shared.LockMRSW]

Regards Dick.


DatasetGraph, Context serialization and thrift implementation, BNode distribution/collision.

2017-03-03 Thread Dick Murray
Hi.

Question regarding the design thoughts behind Context and the callbacks.
Also merging BNodes...

I have implemented a Thrift based RPC DatasetGraph consisting of a Client
(implements DatasetGraph) which forwards calls to an IFace (generated from
a Thrift file which closely mimics the DatasetGraph interface with some
method name tweaks to handle thrift nuances such as not supporting method
overloading). The IFace wraps a DatasetGraph.

The IFace supports all of the DatasetGraph interface (including RPC lock
and transaction support) with the exception of getContext(); currently it
returns Context.emptyContext. Context and Symbol don't implement
Serializable, which in itself can be overcome. But I'm stumped at the
callback part of Context. What is it? What does it do (besides the
obvious!)?

In the bigger picture I implement a distributed DatasetGraph which contains
a set of IFace endpoints. Thus when find(Quad) is called it makes a set of
RPC calls to the IFace endpoints and aggregates the results. Internally
locks are applied when needed, in particular write locks are weighted e.g.
add(x, s, p, o) will attempt to lock the IFace which has graph x (i.e. it
checks for the graph before doing the write lock and add). Basically
beginTransaction(ReadWrite) on the DatasetGraphDistributed won't actually
call beginTransaction(ReadWrite) on an IFace until it needs to. This allows
multiple IFace endpoints to be in write transactions. To support the thread
affinity of DatasetGraph the IFace endpoints use a UUID to delegate from
the Thrift thread pool thread (i.e. the one servicing the RPC call) to the
same thread which actually performs the wrapped DatasetGraph action.
Additionally the underlying DatasetGraph can be accessed as usual whilst
being wrapped by the IFace which supports RPC calls into the same
DatasetGraph.

Anyway...

What was the Context callback designed for? Is it ever used?

If I have a central Context which I push to the IFace endpoints would that
cause me any issues? Similar idea to a central config...

Will BNodes in two DatasetGraph's ever collide?

Dick.


Re: Release vote : 3.2.0

2017-02-01 Thread Dick Murray
;-) Nothing implied from me and as I thought re Linux/Dev. Thanks (devs)
for the work.

On 1 Feb 2017 19:33, "A. Soroka" <aj...@virginia.edu> wrote:

> No, I should say that that exclusion is just a nod to the fact that so
> many of the Jena devs use Linux that it's just much less of an issue to
> find Linux testers. Windows seems to be generally the hardest platform to
> get results for. I certainly didn't intend any more than that, but I copied
> that list from earlier release vote announcements. (!)
>
> But maybe I am missing some history?
>
> ajs6f
>
> > On Feb 1, 2017, at 2:30 PM, Dick Murray <dandh...@gmail.com> wrote:
> >
> > Hi.
> >
> > Under checking Windows and Mac OS's are listed but not Linux. Is Jena
> > assumed to pass? I'mean running Jena 3.2 snapshot on Ubuntu 16.04 and
> > Centos 7.
> >
> > If you haven't broken anything in the snapshot then I vote release. ;-)
> >
> > On 1 Feb 2017 16:09, "A. Soroka" <aj...@virginia.edu> wrote:
> >
> >> Hello, Jena-folks!
> >>
> >> Let's vote on a release of Jena 3.2.0.
> >>
> >> Everyone, not just committers, is invited to test and vote. Three +1's
> >> from PMC members permit a release, but everyone is not just welcome but
> >> _needed_ to do really good full testing. If a non-committer turns up an
> >> issue, you can bet I will investigate fast.
> >>
> >> This is a distribution of Jena and also of Fuseki 1 and 2.
> >>
> >> Versions being released include: Jena @ 3.2.0 (RDF libraries, database
> >> gear, and utilities), Fuseki 1 @ 1.5.0 and Fuseki 2 @ 2.5.0 (SPARQL
> >> servers).
> >>
> >> Staging repository:
> >> https://repository.apache.org/content/repositories/orgapachejena-1016/
> >>
> >> Proposed distributions:
> >> https://dist.apache.org/repos/dist/dev/jena/binaries/
> >>
> >> Keys:
> >> https://svn.apache.org/repos/asf/jena/dist/KEYS
> >>
> >> Git tag:
> >> jena-3.2.0-rc1
> >> 4bdc528c788681b90acf341de0989ca7686bae8c
> >> https://git-wip-us.apache.org/repos/asf?p=jena.git;a=commit;h=
> >> 4bdc528c788681b90acf341de0989ca7686bae8c
> >>
> >>
> >> Please vote to approve this release:
> >>
> >>[ ] +1 Approve the release
> >>[ ]  0 Don't care
> >>[ ] -1 Don't release, because ...
> >>
> >> This vote will be open to the end of
> >>
> >>   Monday, 6 February, 23:59 UTC
> >>
> >> Thanks to everyone who can help test and give feedback of every kind!
> >>
> >>  ajs6f (A. Soroka)
> >>
> >>
> >> Checking needed:
> >>
> >> • Does everything work on MS Windows?
> >> • Does everything work on OS X?
> >> • Is the GPG signature okay?
> >> • Is there a source archive?
> >> • Can the source archive really be built?
> >> • Is there a correct LICENSE and NOTICE file in each artifact (both
> source
> >> and binary artifacts)?
> >> • Does the NOTICE file contain all necessary attributions?
> >> • Does the tag in the SCM contain reproducible sources?
> >>
> >>
> >>
> >>
> >>
>
>


Re: Release vote : 3.2.0

2017-02-01 Thread Dick Murray
Hi.

Under checking, Windows and Mac OS are listed but not Linux. Is Jena
assumed to pass? I mean, I'm running the Jena 3.2 snapshot on Ubuntu 16.04
and CentOS 7.

If you haven't broken anything in the snapshot then I vote release. ;-)

On 1 Feb 2017 16:09, "A. Soroka"  wrote:

> Hello, Jena-folks!
>
> Let's vote on a release of Jena 3.2.0.
>
> Everyone, not just committers, is invited to test and vote. Three +1's
> from PMC members permit a release, but everyone is not just welcome but
> _needed_ to do really good full testing. If a non-committer turns up an
> issue, you can bet I will investigate fast.
>
> This is a distribution of Jena and also of Fuseki 1 and 2.
>
> Versions being released include: Jena @ 3.2.0 (RDF libraries, database
> gear, and utilities), Fuseki 1 @ 1.5.0 and Fuseki 2 @ 2.5.0 (SPARQL
> servers).
>
> Staging repository:
> https://repository.apache.org/content/repositories/orgapachejena-1016/
>
> Proposed distributions:
> https://dist.apache.org/repos/dist/dev/jena/binaries/
>
> Keys:
> https://svn.apache.org/repos/asf/jena/dist/KEYS
>
> Git tag:
> jena-3.2.0-rc1
> 4bdc528c788681b90acf341de0989ca7686bae8c
> https://git-wip-us.apache.org/repos/asf?p=jena.git;a=commit;h=
> 4bdc528c788681b90acf341de0989ca7686bae8c
>
>
> Please vote to approve this release:
>
> [ ] +1 Approve the release
> [ ]  0 Don't care
> [ ] -1 Don't release, because ...
>
> This vote will be open to the end of
>
>Monday, 6 February, 23:59 UTC
>
> Thanks to everyone who can help test and give feedback of every kind!
>
>   ajs6f (A. Soroka)
>
>
> Checking needed:
>
> • Does everything work on MS Windows?
> • Does everything work on OS X?
> • Is the GPG signature okay?
> • Is there a source archive?
> • Can the source archive really be built?
> • Is there a correct LICENSE and NOTICE file in each artifact (both source
> and binary artifacts)?
> • Does the NOTICE file contain all necessary attributions?
> • Does the tag in the SCM contain reproducible sources?
>
>
>
>
>


Re: Jena 3.2.0-SNAPSHOT Node.ANY serialization causes StackOverFlow when called from Kryo JavaSerialization.

2017-01-23 Thread Dick Murray
Hi. Thanks for the detail. I think the problem is in Kryo, as it seems to
be using FieldSerializer no matter what you tell it. Thus Node_URI is
probably working by accident!

I'll get to the bottom of it and update for posterity...

On 22 January 2017 at 11:02, Andy Seaborne <a...@apache.org> wrote:

> Hi,
>
> Test TestSerializable.serialize_node_04 serializes and deserializes a
> Node.ANY.
>
> All serialization is handled via Node.writeReplace() indirecting to SNode
> and SNode.readResolve() to deserialize.
>
> The provided serializer is Jena's RDF Thrift.  ThriftConvert.toThrift
> handles the conversion for Thrift.
>
> There is nothing in Node.ANY.  Node.writeReplace points to
> oaj.system.SerializerRDF.
>
> Does JavaSerializer handle writeReplace?  Default serialization of
> Node_URI may work by accident because it has a field.
>
> Initialization is in org.apache.jena.system.SerializerRDF, which is
> called from InitRIOT which is called by by system initialization based on
> ServiceLoader.
>
> Andy
>
>
> On 20/01/17 18:35, Dick Murray wrote:
>
>> Whilst this issue is reported and possibly caused by Kryo I think it's my
>> understanding of how Jena is or is not serializing...
>>
>> I'm using Jena 3.2.0-SNAPSHOT and Kryo(Net) to serialize Jena nodes but
>> Kryo baulks when asked to handle a (the) Node_ANY;
>>
>> Exception in thread "Server" java.lang.StackOverflowError
>> at
>> com.esotericsoftware.kryo.io.ByteBufferOutput.writeVarInt(By
>> teBufferOutput.java:323)
>> at
>> com.esotericsoftware.kryo.util.DefaultClassResolver.writeCla
>> ss(DefaultClassResolver.java:102)
>> at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517)
>> at
>> com.esotericsoftware.kryo.serializers.ObjectField.write(Obje
>> ctField.java:76)
>> at
>> com.esotericsoftware.kryo.serializers.FieldSerializer.write(
>> FieldSerializer.java:518)
>> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>> at
>> com.esotericsoftware.kryo.serializers.ObjectField.write(Obje
>> ctField.java:80)
>> at
>> com.esotericsoftware.kryo.serializers.FieldSerializer.write(
>> FieldSerializer.java:518)
>> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>> at
>> com.esotericsoftware.kryo.serializers.ObjectField.write(Obje
>> ctField.java:80)
>> at
>> com.esotericsoftware.kryo.serializers.FieldSerializer.write(
>> FieldSerializer.java:518)
>>
>> Basically the server side dies horribly!
>>
>> I've looked at SerializerRDF and Node and Node.ANY (Node_Fluid) and I
>> cannot see where it goes wrong.
>>
>> Node_URI works without issue and a breakpoint on protected Object
>> writeReplace() throws ObjectStreamException {...} will break. However
>> Node_ANY never triggers a break...
>>
>> Does Node_ANY have the code to actually serialize itself? Should I be
>> overriding something and returning the Node_ANY?
>>
>> NB: I need to register the Node_URI class a JavaSerialization because Kryo
>> was trying to FieldSerialize and that was not working!
>>
>> Just the shared Network class as follows;
>>
>> package org.twocows.jena.mosaic.kryo;
>>
>> import java.util.Iterator;
>>
>> import org.apache.jena.graph.Node;
>> import org.apache.jena.graph.Node_ANY;
>> import org.apache.jena.graph.Node_URI;
>> import org.twocows.jena.mosaic.MosaicDatasetGraph;
>>
>> import com.esotericsoftware.kryo.Kryo;
>> import com.esotericsoftware.kryo.Serializer;
>> import com.esotericsoftware.kryo.io.Input;
>> import com.esotericsoftware.kryo.io.Output;
>> import com.esotericsoftware.kryo.serializers.JavaSerializer;
>> import com.esotericsoftware.kryonet.EndPoint;
>> import com.esotericsoftware.kryonet.rmi.ObjectSpace;
>> import com.github.jsonldjava.core.RDFDataset.Quad;
>>
>> public class Network {
>> public static final int port = 1972;
>>
>> public static final int MOSAIC_DATASET_GRAPH = 1;
>> // This registers objects that are going to be sent over the network.
>> static public void register (final EndPoint endPoint) {
>> Kryo kryo = endPoint.getKryo();
>> // This must be called in order to use ObjectSpaces.
>> ObjectSpace.registerClasses(kryo);
>> // The interfaces that will be used as remote objects must be registered.
>> kryo.register(MosaicDatasetGraph.class);
>> // The classes of all method parameters and return values
>> // for remote objects must also be registered.
>> kryo.register(ExceptionInInitializerError.class);
>> kryo.register(UnsupportedOperationException.class);
>> kryo.register(String.class);
>> kryo.register(Node.class, new JavaSerializer());
>> kryo.register(Node_URI.class, new JavaSerializer());
>> kryo.register(Node_ANY.class, new JavaSerializer());
>> kryo.register(Quad.class);
>> kryo.register(Iterator.class);
>> }
>> }
>>
>>


Jena 3.2.0-SNAPSHOT Node.ANY serialization causes StackOverFlow when called from Kryo JavaSerialization.

2017-01-20 Thread Dick Murray
Whilst this issue is reported and possibly caused by Kryo I think it's my
understanding of how Jena is or is not serializing...

I'm using Jena 3.2.0-SNAPSHOT and Kryo(Net) to serialize Jena nodes but
Kryo baulks when asked to handle a (the) Node_ANY;

Exception in thread "Server" java.lang.StackOverflowError
at
com.esotericsoftware.kryo.io.ByteBufferOutput.writeVarInt(ByteBufferOutput.java:323)
at
com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:102)
at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517)
at
com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:76)
at
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
at
com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
at
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
at
com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
at
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)

Basically the server side dies horribly!

I've looked at SerializerRDF and Node and Node.ANY (Node_Fluid) and I
cannot see where it goes wrong.

Node_URI works without issue and a breakpoint on protected Object
writeReplace() throws ObjectStreamException {...} will break. However
Node_ANY never triggers a break...

Does Node_ANY have the code to actually serialize itself? Should I be
overriding something and returning the Node_ANY?

NB: I need to register the Node_URI class as a JavaSerialization because
Kryo was trying to FieldSerialize and that was not working!

Just the shared Network class as follows;

package org.twocows.jena.mosaic.kryo;

import java.util.Iterator;

import org.apache.jena.graph.Node;
import org.apache.jena.graph.Node_ANY;
import org.apache.jena.graph.Node_URI;
import org.twocows.jena.mosaic.MosaicDatasetGraph;

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.Serializer;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import com.esotericsoftware.kryo.serializers.JavaSerializer;
import com.esotericsoftware.kryonet.EndPoint;
import com.esotericsoftware.kryonet.rmi.ObjectSpace;
import com.github.jsonldjava.core.RDFDataset.Quad;

public class Network {
public static final int port = 1972;

public static final int MOSAIC_DATASET_GRAPH = 1;
// This registers objects that are going to be sent over the network.
static public void register (final EndPoint endPoint) {
Kryo kryo = endPoint.getKryo();
// This must be called in order to use ObjectSpaces.
ObjectSpace.registerClasses(kryo);
// The interfaces that will be used as remote objects must be registered.
kryo.register(MosaicDatasetGraph.class);
// The classes of all method parameters and return values
// for remote objects must also be registered.
kryo.register(ExceptionInInitializerError.class);
kryo.register(UnsupportedOperationException.class);
kryo.register(String.class);
kryo.register(Node.class, new JavaSerializer());
kryo.register(Node_URI.class, new JavaSerializer());
kryo.register(Node_ANY.class, new JavaSerializer());
kryo.register(Quad.class);
kryo.register(Iterator.class);
}
}


Re: How I handle "Null Pointer Exception"

2017-01-18 Thread Dick Murray
Sorry, that should have been "not" asked on the Jena user group...


On 18 Jan 2017 7:09 pm, "Dick Murray" <dandh...@gmail.com> wrote:

You need to learn the difference between == and .equals().

Please read up on basic Java skills! These questions should be asked on the
Jena user group...



On 18 Jan 2017 1:14 pm, "Sidra shah" <s.shahcyp...@gmail.com> wrote:

Hello Lorenz, its not giving me the exception now but it does not display
the message JOption,. It does not read the  * if (s1=="CatPhysics")*


RDFNode phFav=indiv.getPropertyValue(favcat);
 if (phFav!=null){
  RDFNode l1=phFav.asResource();


  String s1=l1.toString();
  }

if (s1=="CatPhysics"){
 JOptionPane.showMessageDialog(null, "Phyics category");
  }



On Wed, Jan 18, 2017 at 2:12 PM, Sidra shah <s.shahcyp...@gmail.com> wrote:

> Thank you Lorenz, let me read the document you mention here. I will come
> back after reading and applying.
>
> Best regards
>
> On Wed, Jan 18, 2017 at 1:46 PM, Lorenz B. <buehm...@informatik.uni-
> leipzig.de> wrote:
>
>> What is for you the "value of a resource"? The URI?
>>
>> There is only one good source for developers, and that's Javadoc [1] -
>> that's why we always refer to if people have questions.
>>
>> asResource() converts the RDFNode object to a resource
>>
>> [1]
>> https://jena.apache.org/documentation/javadoc/jena/org/
>> apache/jena/rdf/model/RDFNode.html
>>
>> > Hello Lorenz, this was the question I was expected to ask? I mean
>> values of
>> > BestCategory are resources.
>> > Kindly if you can guide me how to get the value, I searched it on the
>> web
>> > but could not found any related resources.
>> >
>> > Kind regards
>> >
>> > On Wed, Jan 18, 2017 at 10:57 AM, Lorenz B. <
>> > buehm...@informatik.uni-leipzig.de> wrote:
>> >
>> >>> OntModel model2=ModelFactory.createOntologyModel(
>> >> OntModelSpec.OWL_DL_MEM);
>> >>>InputStream in =FileManager.get().open("F://20-8.owl");
>> >>> if (in==null) {
>> >>> throw new IllegalArgumentException( "File: " +  " not
>> >>> found");
>> >>> }  model2.read(in,"");
>> >>>
>> >>>  String ns="
>> >>> http://www.semanticweb.org/t/ontologies/2016/7/myOWL#;;
>> >>>
>> >>>OntProperty favcat=model2.getOntProperty(ns+ "BestCategory");
>> >>> String  name=jTextField1.getText();
>> >>> Individual indiv = user1.createIndividual(ns + name);
>> >>>   RDFNode phFav=indiv.getPropertyValue(favcat);
>> >>>  if (phFav!=null){
>> >>>   Literal l1=phFav.asLiteral();
>> >> If BestCategory is an object property, why do you cast the value as
>> >> literal?!
>> >>>s1=l1.toString(); }
>> >>>  }
>> >>> if (s1=="CatPhysics"){
>> >>>  JOptionPane.showMessageDialog(null, "Physics");
>> >>>   }
>> >>>
>> >>> The rule itself is
>> >>>
>> >>> String rule ="[rule1: ( ?x http://www.semanticweb.org/
>> >>> t/ontologies/2016/7/myOWL#Physics_Preferred_Category  ?cat1 )" +
>> >>>  "( ?x http://www.semanticweb.org/t/ontologies/2016/7/myOWL#
>> >>> Chem_Preferred_Category  ?cat2 )" +
>> >>> "( ?x http://www.semanticweb.org/t/o
>> ntologies/2016/7/myOWL#Geo_
>> >>> Preferred_Category  ?cat3 )" +
>> >>> "greaterThan(?cat1,?cat2), greaterThan(?cat1,?cat3)"
>> >>>  + " ->  (?x  http://www.semanticweb.org/t/
>> >>> ontologies/2016/7/myOWL#BestCategory   http://www.semanticweb.org/t/
>> >>> ontologies/2016/7/myOWL#BestCategory#Physics   )]";
>> >>>
>> >>>
>> >>> The BestCategory is object property.
>> >>>
>> >>> Regards
>> >>>
>> >>>
>> >>> On Tue, Jan 17, 2017 at 8:16 PM, Andy Seaborne <a...@apache.org>
>> wrote:
>> >>>
>> >>>> A Complete, Minimal Example please.
>> >>>>
>> >>>>
>>

Re: How I handle "Null Pointer Exception"

2017-01-18 Thread Dick Murray
You need to learn the difference between == and .equals().

Please read up on basic Java skills! These questions should be asked on the
Jena user group...
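
For the record, the difference in one small example:

import javax.swing.JOptionPane;

// == compares object references; equals() compares string contents.
public class EqualsCheck {
    public static void main(String[] args) {
        String s1 = new String("CatPhysics");   // built at runtime, not interned
        System.out.println(s1 == "CatPhysics"); // false: different objects
        if ("CatPhysics".equals(s1)) {          // true, and null-safe this way round
            JOptionPane.showMessageDialog(null, "Physics category");
        }
    }
}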


On 18 Jan 2017 1:14 pm, "Sidra shah"  wrote:

Hello Lorenz, its not giving me the exception now but it does not display
the message JOption,. It does not read the  * if (s1=="CatPhysics")*


RDFNode phFav=indiv.getPropertyValue(favcat);
 if (phFav!=null){
  RDFNode l1=phFav.asResource();


  String s1=l1.toString();
  }

if (s1=="CatPhysics"){
 JOptionPane.showMessageDialog(null, "Phyics category");
  }



On Wed, Jan 18, 2017 at 2:12 PM, Sidra shah  wrote:

> Thank you Lorenz, let me read the document you mention here. I will come
> back after reading and applying.
>
> Best regards
>
> On Wed, Jan 18, 2017 at 1:46 PM, Lorenz B.  leipzig.de> wrote:
>
>> What is for you the "value of a resource"? The URI?
>>
>> There is only one good source for developers, and that's Javadoc [1] -
>> that's why we always refer to if people have questions.
>>
>> asResource() converts the RDFNode object to a resource
>>
>> [1]
>> https://jena.apache.org/documentation/javadoc/jena/org/
>> apache/jena/rdf/model/RDFNode.html
>>
>> > Hello Lorenz, this was the question I was expected to ask? I mean
>> values of
>> > BestCategory are resources.
>> > Kindly if you can guide me how to get the value, I searched it on the
>> web
>> > but could not found any related resources.
>> >
>> > Kind regards
>> >
>> > On Wed, Jan 18, 2017 at 10:57 AM, Lorenz B. <
>> > buehm...@informatik.uni-leipzig.de> wrote:
>> >
>> >>> OntModel model2=ModelFactory.createOntologyModel(
>> >> OntModelSpec.OWL_DL_MEM);
>> >>>InputStream in =FileManager.get().open("F://20-8.owl");
>> >>> if (in==null) {
>> >>> throw new IllegalArgumentException( "File: " +  " not
>> >>> found");
>> >>> }  model2.read(in,"");
>> >>>
>> >>>  String ns="
>> >>> http://www.semanticweb.org/t/ontologies/2016/7/myOWL#;;
>> >>>
>> >>>OntProperty favcat=model2.getOntProperty(ns+ "BestCategory");
>> >>> String  name=jTextField1.getText();
>> >>> Individual indiv = user1.createIndividual(ns + name);
>> >>>   RDFNode phFav=indiv.getPropertyValue(favcat);
>> >>>  if (phFav!=null){
>> >>>   Literal l1=phFav.asLiteral();
>> >> If BestCategory is an object property, why do you cast the value as
>> >> literal?!
>> >>>s1=l1.toString(); }
>> >>>  }
>> >>> if (s1=="CatPhysics"){
>> >>>  JOptionPane.showMessageDialog(null, "Physics");
>> >>>   }
>> >>>
>> >>> The rule itself is
>> >>>
>> >>> String rule ="[rule1: ( ?x http://www.semanticweb.org/
>> >>> t/ontologies/2016/7/myOWL#Physics_Preferred_Category  ?cat1 )" +
>> >>>  "( ?x http://www.semanticweb.org/t/ontologies/2016/7/myOWL#
>> >>> Chem_Preferred_Category  ?cat2 )" +
>> >>> "( ?x http://www.semanticweb.org/t/o
>> ntologies/2016/7/myOWL#Geo_
>> >>> Preferred_Category  ?cat3 )" +
>> >>> "greaterThan(?cat1,?cat2), greaterThan(?cat1,?cat3)"
>> >>>  + " ->  (?x  http://www.semanticweb.org/t/
>> >>> ontologies/2016/7/myOWL#BestCategory   http://www.semanticweb.org/t/
>> >>> ontologies/2016/7/myOWL#BestCategory#Physics   )]";
>> >>>
>> >>>
>> >>> The BestCategory is object property.
>> >>>
>> >>> Regards
>> >>>
>> >>>
>> >>> On Tue, Jan 17, 2017 at 8:16 PM, Andy Seaborne 
>> wrote:
>> >>>
>>  A Complete, Minimal Example please.
>> 
>> 
>>  Partial code, no data is not complete.
>>  It must compile and run to be complete.
>> 
>>  Minimal means only what is necessary to ask the question not the
>> whole
>>  data or whole application.
>> 
>>  Andy
>> 
>>  On 17/01/17 17:14, Sidra shah wrote:
>> 
>> > I am surprised that when there is no value in BestCategory it gives me no
>> > error, but when the rule executes and a value comes into BestCategory, it
>> > now gives me a *"RequiredLiteralException"*.
>> >
>> > The code I used here is
>> >
>> >  OntProperty favcat=model2.getOntProperty(ns+ "BestCategory");
>> >
>> > RDFNode phFav=indiv.getPropertyValue(favcat);
>> >  if (phFav!=null){
>> >   Literal l1=phFav.asLiteral();
>> >
>> >s1=l1.toString();}
>> >
>> > if (s1=="CatPhysics"){
>> >  JOptionPane.showMessageDialog(null, "Physics");
>> >   }
>> >
>> > Best regards
>> >
>> > On Tue, Jan 17, 2017 at 5:53 PM, Sidra shah > >
>> > wrote:
>> >
>> > Hello Chris, thanks a lot for your suggestion.
>> >> Best regards.
>> >>
>> >> On Tue, Jan 17, 2017 at 5:37 PM, Chris Dollin <
>> >> 

Re: What are the Alternatives of DBpedia

2017-01-15 Thread Dick Murray
Google for example RDF datasets in a serialisation supported by Jena.

A web search really is your best friend for this...
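
A hedged sketch of what "a serialisation supported by Jena" means in practice; the file name is illustrative and assumes an N-Triples dump downloaded separately:

import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.RDFDataMgr;

public class LoadDump {
    public static void main(String[] args) {
        // RIOT picks the parser from the file extension (.nt, .ttl, .rdf, ...).
        Model model = RDFDataMgr.loadModel("dump.nt");
        System.out.println("Triples loaded: " + model.size());
    }
}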


On 15 Jan 2017 3:13 pm, "kumar rohit"  wrote:

I want to know what other sources there are, like DBpedia, from which we can
get data and which are also supported by Jena. I have read about Wikidata but
don't know how to process it from Jena the way we do DBpedia.


Re: Semantic Of Jena rule

2017-01-12 Thread Dick Murray
An example rule which you can test and then expand on is;

[Manager: (?E rdf:type NS:Employee), (?E NS:netSalary ?S), greaterThan(?S,
5000) -> (?E rdf:type NS:Manager)]

Also see https://jena.apache.org/documentation/inference/


On 12 Jan 2017 19:15, "tina sani"  wrote:

Well, I am not sure about the greaterThan and lessThan keywords. Will this rule
execute if it encounters a salary between 5000 and 10,000?

On Thu, Jan 12, 2017 at 8:53 PM, Joint  wrote:

>
>
> Have you tried it? What happened?
> "Of course this will not execute because I skip proper syntax"
> So you know the syntax isn't correct but still ask if it is correct..
> Dick
>
>  Original message 
> From: tina sani 
> Date: 12/01/2017  14:47  (GMT+00:00)
> To: users@jena.apache.org
> Subject: Semantic Of Jena rule
>
> Is the syntax and semantics of this rule correct?
>
> ?emp rdf:type URI:Employee + ?emp URI:NetSalary ?salary +
> greaterThan(?salary, 5000), lessThan(?salary, 10000) -> ?emp rdf:type
> URI:Manager
>
> Of course this will not execute because I skip proper syntax, but I wonder
> whether this rule will work if some employee has a salary between 5000 and
> 10,000.
> I am confused by the greaterThan and lessThan part of the rule, whether it
> will work or not?
>


Re: Jena and Spark and Elephas

2016-12-22 Thread Dick Murray
On 22 Dec 2016 8:14 pm, "Andy Seaborne" <a...@apache.org> wrote:



On 22/12/16 14:48, Joint wrote:

>
>
> Hi Andy.
> I noticed the WIP status. How does the data get separated from the
> prefixes, it's in the same file! Unless the file gets split...
>

Exactly!

In some systems splitting is not controlled by the app but by the
distributor of data.  It sounds like in yours that you are writing large
files to the persistent layer, correct?


Yes, we have currently written ~1500 triple files, each between 1-4M triples
with a few pushing 9M triples. The file name is the graph name. We have
~100k unique properties per file; large files simply have more objects. The
equivalent TDB is at 1.5TB... Our ETL can process asynchronously, which is
helping, and Spark seems to be working :-)



> My reader requires a prefix file when the triple file is opened, otherwise
> it throws an exception. Thus the data file can be split as long as the
> prefix file is along for the ride.
> Is the Elephas compression in memory, on disk, or both? Our compression
> requirement is driven by the deployment environment, which charges for the
> disk writes to the SLA storage.
>

Not sure.


The RDDs are free as they go into memory or local disk. Effectively I'm
> persisting a dataset as a group of triple files which get loaded into RDDs
> and processed as a read-only dataset. This also allows us to perform
> multiple graph writes so we can load data in parallel.
>
> Dick
>
>  Original message 
> From: Andy Seaborne <a...@apache.org>
> Date: 22/12/2016  11:30  (GMT+00:00)
> To: users@jena.apache.org
> Subject: Re: Jena and Spark and Elephas
>
> RDF-Patch update is WIP - in fact one of the potential things is
> removing the prefix stuff. (Reasons: data gets separated from its
> prefixes/; want to add prefix management of the data to RDF Patch.)
>
> Elephas has various compression options. Would any of these work for you?
>
> I find that compressing n-triples gives x5 to x10 compression so applied
> to RDD data I'd expect that or more.
>
> There are line based output formats (I don't know if they work with
> Elephas - no reason why not in principle).
>
> http://jena.apache.org/documentation/io/rdf-output.html#
> line-printed-formats
>
> See RDFFormat TURTLE_FLAT.
>
> Just don't lose the prefixes!
>
>  Andy
>
>
>
> On 21/12/16 20:46, Dick Murray wrote:
>
>> So basically I've got RDF Patch with a default A which I use to build the
>> Apache Spark RDD...
>>
>> A quick Google got me a git master updated 4 years ago, but no code, but
>> the thread says Andy is using the code..?
>>
>> Like you said probably one for Andy.
>>
>> Thanks for the pointer.
>>
>> On 21 Dec 2016 19:59, "A. Soroka" <aj...@virginia.edu> wrote:
>>
>> Andy can say more, but RDF Patch may be heading in a direction where it
>> could be used for such a purpose:
>>
>> https://lists.apache.org/thread.html/79e0fbd41126a1d8d0b2fb3b7b837d
>> 0d1d58d568a3583701b366cfcc@%3Cdev.jena.apache.org%3E
>>
>> ---
>> A. Soroka
>> The University of Virginia Library
>>
>> On Dec 21, 2016, at 2:17 PM, Dick Murray <dandh...@gmail.com> wrote:
>>>
>>> Hi, on a similar vein I have a modified NTriple reader which uses a
>>> prefix
>>> file to reduce the file size. Whilst the serialisation allows parallel
>>> processing in spark the file sizes were large and this has reduced them
>>> to
>>> 1/10 the original size on average.
>>>
>>> There is not an existing line-based serialisation with some form of
>>> prefixing, is there?
>>>
>>> On 17 Dec 2016 20:03, "Andy Seaborne" <a...@apache.org> wrote:
>>>
>>> Related:
>>>>
>>>> Jena now provides "Serializable" for Triple/Quad/Node
>>>>
>>>> It did not make 3.1.1, it's in development snapshots and in the next
>>>> release.
>>>>
>>>> Use with spark was the original motivation.
>>>>
>>>> Andy
>>>>
>>>> https://issues.apache.org/jira/browse/JENA-1233
>>>>
>>>> On 17/12/16 09:14, Joint wrote:
>>>>
>>>>
>>>>>
>>>>> Hi.
>>>>> I was about to use the above to wrap some quads and spoof the RDDs as
>>>>> graphs from within a dataset but before I do has this been done before? I
>>>>> have some code which calls the RDD methods from the graph base find. Not
>>>>> wanting to reinvent the wheel and such...
>>>>>
>>>>>
>>>>> Dick
>>>>>
>>>>>
>>>>>
>>
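
A minimal sketch of the line-printed output mentioned in the quoted message above (RDFFormat.TURTLE_FLAT), assuming Jena 3.x; the file names are illustrative:

import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.riot.RDFFormat;

public class WriteFlatTurtle {
    public static void main(String[] args) throws Exception {
        Model model = RDFDataMgr.loadModel("data.ttl");      // any readable source
        try (OutputStream out = new FileOutputStream("data-flat.ttl")) {
            // Prefixes are written once at the top, then one triple per line.
            RDFDataMgr.write(out, model, RDFFormat.TURTLE_FLAT);
        }
    }
}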


Re: Jena and Spark and Elephas

2016-12-21 Thread Dick Murray
So basically I've got RDF Patch with a default A which I use to build the
Apache Spark RDD...

A quick Google got me a git master updated 4 years ago, but no code, but
the thread says Andy is using the code..?

Like you said probably one for Andy.

Thanks for the pointer.

On 21 Dec 2016 19:59, "A. Soroka" <aj...@virginia.edu> wrote:

Andy can say more, but RDF Patch may be heading in a direction where it
could be used for such a purpose:

https://lists.apache.org/thread.html/79e0fbd41126a1d8d0b2fb3b7b837d
0d1d58d568a3583701b366cfcc@%3Cdev.jena.apache.org%3E

---
A. Soroka
The University of Virginia Library

> On Dec 21, 2016, at 2:17 PM, Dick Murray <dandh...@gmail.com> wrote:
>
> Hi, on a similar vein I have a modified NTriple reader which uses a prefix
> file to reduce the file size. Whilst the serialisation allows parallel
> processing in spark the file sizes were large and this has reduced them to
> 1/10 the original size on average.
>
> There is not an existing line-based serialisation with some form of
> prefixing, is there?
>
> On 17 Dec 2016 20:03, "Andy Seaborne" <a...@apache.org> wrote:
>
>> Related:
>>
>> Jena now provides "Serializable" for Triple/Quad/Node
>>
>> It did not make 3.1.1, it's in development snapshots and in the next
>> release.
>>
>> Use with spark was the original motivation.
>>
>>Andy
>>
>> https://issues.apache.org/jira/browse/JENA-1233
>>
>> On 17/12/16 09:14, Joint wrote:
>>
>>>
>>>
>>> Hi.
>>> I was about to use the above to wrap some quads and spoof the RDDs as
>>> graphs from within a dataset but before I do has this been done before? I
>>> have some code which calls the RDD methods from the graph base find. Not
>>> wanting to reinvent the wheel and such...
>>>
>>>
>>> Dick
>>>
>>>


Re: Jena and Spark and Elephas

2016-12-21 Thread Dick Murray
Hi, on a similar vein I have a modified NTriple reader which uses a prefix
file to reduce the file size. Whilst the serialisation allows parallel
processing in spark the file sizes were large and this has reduced them to
1/10 the original size on average.

There is not an existing line-based serialisation with some form of
prefixing, is there?

On 17 Dec 2016 20:03, "Andy Seaborne"  wrote:

> Related:
>
> Jena now provides "Serializable" for Triple/Quad/Node
>
> It did not make 3.1.1, it's in development snapshots and in the next
> release.
>
> Use with spark was the original motivation.
>
> Andy
>
> https://issues.apache.org/jira/browse/JENA-1233
>
> On 17/12/16 09:14, Joint wrote:
>
>>
>>
>> Hi.
>> I was about to use the above to wrap some quads and spoof the RDDs as
>> graphs from within a dataset but before I do has this been done before? I
>> have some code which calls the RDD methods from the graph base find. Not
>> wanting to reinvent the wheel and such...
>>
>>
>> Dick
>>
>>


Re: Jena and Spark and Elephas

2016-12-17 Thread Dick Murray
Excellent. I was currently wrapping and unwrapping as Strings, which fixed
another issue, along with prefixing bnodes to remove clashes between TDBs.
I'll pull and refactor my code...
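
A hedged sketch of what the Triple/Quad/Node "Serializable" change (JENA-1233) allows, assuming a Jena build that includes it; the URIs are illustrative:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import org.apache.jena.graph.NodeFactory;
import org.apache.jena.graph.Triple;

public class TripleRoundTrip {
    public static void main(String[] args) throws Exception {
        Triple t = Triple.create(
                NodeFactory.createURI("http://example.org/s"),
                NodeFactory.createURI("http://example.org/p"),
                NodeFactory.createLiteral("o"));

        // Plain Java serialisation, which is what Spark falls back to by default.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) {
            oos.writeObject(t);
        }
        try (ObjectInputStream ois =
                new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            Triple copy = (Triple) ois.readObject();
            System.out.println(copy.equals(t));   // true: round-trips intact
        }
    }
}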

On 17 Dec 2016 20:03, "Andy Seaborne"  wrote:

Related:

Jena now provides "Serializable" for Triple/Quad/Node

It did not make 3.1.1, it's in development snapshots and in the next
release.

Use with spark was the original motivation.

Andy

https://issues.apache.org/jira/browse/JENA-1233


On 17/12/16 09:14, Joint wrote:

>
>
> Hi.
> I was about to use the above to wrap some quads and spoof the RDDs as
> graphs from within a dataset but before I do has this been done before? I
> have some code which calls the RDD methods from the graph base find. Not
> wanting to reinvent the wheel and such...
>
>
> Dick
>
>


Re: Queries as functions

2016-12-17 Thread Dick Murray
Just posted a question regarding Spark because I'm heading down the
streaming route as we're aggregating multiple large datasets together and
our 1.5TB TDB was causing us some issues. We have many large graph writes
of between 1-4M triples which I currently write to a number of TDBs and
use a set of streaming utility methods to aggregate the TDBs for the find
methods. This lends itself to RDD filter calls.


On 16 Dec 2016 21:22, "Andy Seaborne"  wrote:

There are elements of that - see CommonsRDF - though here the operations
are whole objects (dataset - to query is as a Stream would to
collect the tuples).

It is also like building up an executable pipeline of operations but not
doing it until the final step which allows optimization of the pipeline.

c.f. Apache spark.


On 16/12/16 15:43, A. Soroka wrote:

> It seems to me that these ideas begin to border on the Stream API, with
> something like Stream at work.
>
> ---
> A. Soroka
> The University of Virginia Library
>
> On Dec 15, 2016, at 3:46 PM, Andy Seaborne  wrote:
>>
>>
>> A more considered solution:
>>
>> https://gist.github.com/afs/2b8773d10cbe4bc1161e9851de02b3eb
>>
>> Andy
>>
>> On 14/12/16 12:52, Andy Seaborne wrote:
>>
>>>
>>>
>>> On 14/12/16 11:23, Martynas Jusevičius wrote:
>>>
 But that would still require the functional subclasses of Query?

>>>
>>> Yes, but it required no changes to the jena code.  There could be a
>>> library of such utilities.
>>>
>>>static <X, T> T apply(X object, Function<X, T> f) {
>>>return f.apply(object);
>>>}
>>>
>>>static <X> void apply(X object, Consumer<X> c) {
>>>c.accept(object);
>>>}
>>>
>>>
>>>
>>>Dataset dataset = ... ;
>>>Select selectQuery = new Select("SELECT * { ?s ?p ?o}");
>>>ResultSet rs = selectQuery.apply(dataset);
>>>Consumer<ResultSet> rsp = (t)->ResultSetFormatter.out(t);
>>>apply(apply(dataset, selectQuery), rsp);
>>>
>>> (because Consumer<X> isn't a Function<X,Void> :-()
>>>
>>>Andy
>>>

 On Wed, Dec 14, 2016 at 11:37 AM, Andy Seaborne 
 wrote:

>
>
> On 12/12/16 21:45, Martynas Jusevičius wrote:
>
>>
>> Well, this probably requires some generic method(s) in Dataset/Model
>> as well, something like:
>>
>>  T apply(Function f);
>>
>> This would allow nice chaining of multiple queries, e.g DESCRIBE and
>> SELECT:
>>
>>  ResultSet results = dataset.apply(describe).apply(select);
>>
>
>
> No need to extend dataset and model and the rest to get experimenting:
>
> static <X, T> T apply(X object, Function<X, T> f) {
>   return f.apply(object);
> }
> // BiFunction<X, Function<X, T>, T>
>
>
> then
>
> ResultSet results = apply(
>   apply(dataset, describe),
>   select);
>
>
> The function f does not have any access to the internals of a specific
> dataset
> so it does not need to be a method of Dataset.
>
> There is a style thing about how it looks if you are not used to
> reading
> functional application (i.e. backwards!).
>
>Andy
>
>
>
>> Seems more elegant to me than all the QueryExecution boilerplate.
>>
>> On Mon, Dec 12, 2016 at 9:00 PM, A. Soroka 
>> wrote:
>>
>>>
>>> What are the kinds of usages to which you are imagining these kind of
>>> types being put?
>>>
>>> ---
>>> A. Soroka
>>> The University of Virginia Library
>>>
>>> On Dec 12, 2016, at 2:03 PM, Martynas Jusevičius
 
 wrote:

 Hey,

 has Jena considered taking advantage of the functional features in
 Java
 8?

 What I have in mind is interfaces like:

 Construct extends Query implements Function<Dataset, Model>

 Describe extends Query implements Function<Dataset, Model>

 Select extends Query implements Function<Dataset, ResultSet>

 Ask extends Query implements Function<Dataset, Boolean>


 Martynas
 atomgraph.com

>>>
>>>
>>>
>
>
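
A self-contained sketch of the pattern discussed in this thread, written against the ordinary QueryExecution API; the Select class below is hypothetical, not part of Jena:

import java.util.function.Function;
import org.apache.jena.query.Dataset;
import org.apache.jena.query.DatasetFactory;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSet;
import org.apache.jena.query.ResultSetFactory;
import org.apache.jena.query.ResultSetFormatter;
import org.apache.jena.rdf.model.ModelFactory;

public class Select implements Function<Dataset, ResultSet> {
    private final String queryString;

    public Select(String queryString) { this.queryString = queryString; }

    @Override
    public ResultSet apply(Dataset dataset) {
        // Copy the results so the QueryExecution can be closed here.
        try (QueryExecution qe = QueryExecutionFactory.create(queryString, dataset)) {
            return ResultSetFactory.copyResults(qe.execSelect());
        }
    }

    public static void main(String[] args) {
        Dataset dataset = DatasetFactory.create(ModelFactory.createDefaultModel()); // empty, just for shape
        ResultSet rs = new Select("SELECT * { ?s ?p ?o }").apply(dataset);
        ResultSetFormatter.out(rs);
    }
}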


Re: Unsupported major

2016-10-19 Thread Dick Murray
On 19 October 2016 at 16:39, Sandor Kopacsi <sandor.kopa...@univie.ac.at>
wrote:

> Thank you, Dick.
>
>>> How can I avoid this problem next time? Does Fuseki take the JAVA_HOME
>>> system variable into consideration? If yes, how and where should I set it?
>>> In a starting script of Fuseki?
>>>
>> You can install multiple JRE's and "point" your application at the
>> required
>> one.
>>
>> On windows via the command set JAVA_HOME=c:\jre8
>>
>> And how can I set it in Linux? Is it enough like that?
>

It depends on how you are running Fuseki. You have set, env, declare and
export.

dick@Dick-M3800:~$ name=a
dick@Dick-M3800:~$ echo ${name}
a
dick@Dick-M3800:~$ declare name=b
dick@Dick-M3800:~$ echo ${name}
b
dick@Dick-M3800:~$ export name=c
dick@Dick-M3800:~$ echo ${name}
c

With export the variable is available in sub shells.


>
> export JAVA_HOME= /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java
>
> I have read an issue (JENA-1035), that fuseki-server script ignores
> JAVA_HOME variable while it executes the "java" command.
> Has it been fixed?
>

Unknown, as I don't use Fuseki (we rolled our own similar server).


>
> Thanks,
> Sandor
>
>
> Am 19.10.2016 um 17:08 schrieb Dick Murray:
>
>> On 19 October 2016 at 15:10, Sandor Kopacsi <sandor.kopa...@univie.ac.at>
>> wrote:
>>
>> Dear Dick,
>>>
>>> You are right. I have Java version 1.6.0_37 and the exception says:
>>>
>>> "minor version 52.0"
>>>
>>> Am I right, that Fuseki 1.3 requires Java 8 (that I used previously)?
>>>
>>> You'll probably want to upgrade to the latest version...
>>
>>
>>> I am afraid that the administrators or the system itself downgraded the
>>> Java version to Java 6, which is the preset or automatic version.
>>>
>>> Now I switched to Java 8, and now it works.
>>>
>>> How can I avoid this problem next time? Does Fuseki take the JAVA_HOME
>>> system variable into consideration? If yes, how and where should I set it?
>>> In a starting script of Fuseki?
>>>
>>> You can install multiple JRE's and "point" your application at the
>> required
>> one.
>>
>> On windows via the command set JAVA_HOME=c:\jre8
>>
>>
>> Thanks and best regards,
>>> Sandor
>>>
>>>
>>> Am 19.10.2016 um 15:36 schrieb Dick Murray:
>>>
>>> Hi.
>>>>
>>>> Check what version of JRE you have with java -version
>>>>
>>>> dick@Dick-M3800:~$ java -version
>>>> java version "1.8.0_101"
>>>> Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
>>>> Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)
>>>>
>>>> Your exception should say what version it is having trouble with...
>>>>
>>>> Java SE 9 = 53,
>>>> Java SE 8 = 52,
>>>> Java SE 7 = 51,
>>>> Java SE 6.0 = 50,
>>>> Java SE 5.0 = 49,
>>>> JDK 1.4 = 48,
>>>> JDK 1.3 = 47,
>>>> JDK 1.2 = 46,
>>>> JDK 1.1 = 45
>>>>
>>>>
>>>> On 19 October 2016 at 14:32, Sandor Kopacsi <
>>>> sandor.kopa...@univie.ac.at>
>>>> wrote:
>>>>
>>>> Dear List Members,
>>>>
>>>>> I wanted to start Fuseki 1.3.0 for test purposes, but I got an
>>>>> exception
>>>>> in thread "main" java.lang.UnsupportedClassVersionError:
>>>>> org/apache/jena/fuseki/FusekiCmd : Unsupported major.
>>>>>
>>>>> I do not want to update Fuseki by all means, I just wanted to try
>>>>> something in this test environment.
>>>>> It has worked well so far, and I did not change (deliberately) anything.
>>>>>
>>>>> What can be the reason for that, and what should I do?
>>>>>
>>>>> Thank you in advance and best regards,
>>>>> Sandor
>>>>>
>>>>> --
>>>>> Dr. Sandor Kopacsi
>>>>> IT Software Designer
>>>>>
>>>>> Vienna University Computer Center
>>>>>
>>>>>
>>>>>
>>>>>
> --
> Dr. Sandor Kopacsi
> IT Software Designer
>
> Vienna University Computer Center
> Universitätsstraße 7 (NIG)
> A-1010 Vienna
>
> Phone:  +43-1-4277-14176
> Mobile: +43-664-60277-14176
>
>


Re: Unsupported major

2016-10-19 Thread Dick Murray
On 19 October 2016 at 15:10, Sandor Kopacsi <sandor.kopa...@univie.ac.at>
wrote:

> Dear Dick,
>
> You are right. I have Java version 1.6.0_37 and the exception says:
>
> "minor version 52.0"
>
> Am I right, that Fuseki 1.3 requires Java 8 (that I used previously)?
>

You'll probably want to upgrade to the latest version...


>
> I am afraid that the administrators or the system itself downgraded the
> Java version to Java 6, which is the preset or automatic version.
>
> Now I switched to Java 8, and now it works.
>
> How can I avoid this problem next time? Does Fuseki take the JAVA_HOME
> system variable into consideration? If yes, how and where should I set it?
> In a starting script of Fuseki?
>

You can install multiple JRE's and "point" your application at the required
one.

On windows via the command set JAVA_HOME=c:\jre8


> Thanks and best regards,
> Sandor
>
>
> Am 19.10.2016 um 15:36 schrieb Dick Murray:
>
>> Hi.
>>
>> Check what version of JRE you have with java -version
>>
>> dick@Dick-M3800:~$ java -version
>> java version "1.8.0_101"
>> Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
>> Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)
>>
>> Your exception should say what version it is having trouble with...
>>
>> Java SE 9 = 53,
>> Java SE 8 = 52,
>> Java SE 7 = 51,
>> Java SE 6.0 = 50,
>> Java SE 5.0 = 49,
>> JDK 1.4 = 48,
>> JDK 1.3 = 47,
>> JDK 1.2 = 46,
>> JDK 1.1 = 45
>>
>>
>> On 19 October 2016 at 14:32, Sandor Kopacsi <sandor.kopa...@univie.ac.at>
>> wrote:
>>
>> Dear List Members,
>>>
>>> I wanted to start Fuseki 1.3.0 for test purposes, but I got an exception
>>> in thread "main" java.lang.UnsupportedClassVersionError:
>>> org/apache/jena/fuseki/FusekiCmd : Unsupported major.
>>>
>>> I do not want to update Fuseki by all means, I just wanted to try
>>> something in this test environment.
>>> It has worked well so far, and I did not change (deliberately) anything.
>>>
>>> What can be the reason for that, and what should I do?
>>>
>>> Thank you in advance and best regards,
>>> Sandor
>>>
>>> --
>>> Dr. Sandor Kopacsi
>>> IT Software Designer
>>>
>>> Vienna University Computer Center
>>>
>>>
>>>
>


Re: Unsupported major

2016-10-19 Thread Dick Murray
Hi.

Check what version of JRE you have with java -version

dick@Dick-M3800:~$ java -version
java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)

Your exception should say what version it is having trouble with...

Java SE 9 = 53,
Java SE 8 = 52,
Java SE 7 = 51,
Java SE 6.0 = 50,
Java SE 5.0 = 49,
JDK 1.4 = 48,
JDK 1.3 = 47,
JDK 1.2 = 46,
JDK 1.1 = 45
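
A trivial sketch (not from the thread) that prints the properties behind the table above:

public class ShowJavaVersion {
    public static void main(String[] args) {
        System.out.println("java.version       = " + System.getProperty("java.version"));
        System.out.println("java.home          = " + System.getProperty("java.home"));
        // 52.0 corresponds to Java SE 8, 50.0 to Java SE 6, as listed above.
        System.out.println("java.class.version = " + System.getProperty("java.class.version"));
    }
}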


On 19 October 2016 at 14:32, Sandor Kopacsi 
wrote:

> Dear List Members,
>
> I wanted to start Fuseki 1.3.0 for test purposes, but I got an exception
> in thread "main" java.lang.UnsupportedClassVersionError:
> org/apache/jena/fuseki/FusekiCmd : Unsupported major.
>
> I do not want to update Fuseki by all means, I just wanted to try
> something in this test environment.
> It has worked well so far, and I did not change (deliberately) anything.
>
> What can be the reason for that, and what should I do?
>
> Thank you in advance and best regards,
> Sandor
>
> --
> Dr. Sandor Kopacsi
> IT Software Designer
>
> Vienna University Computer Center
>
>


Re: Concurrent read with unmatched optional clause causes exception in Jena 3.1.

2016-10-18 Thread Dick Murray
Found the cause and I hate log4j...

Log4j was unable to find it's configuration file so was quietly defaulting
the RootLogger to debug. So when static public final Logger log =
LoggerFactory.getLogger(ReorderTransformationSubstitution.class) was called
it's parent is RootLogger and it's level is debug. So private final boolean
DEBUG = log.isDebugEnabled() sets DEBUG to true and under concurrent load
the exception is thrown.

I only found this because I set -Dlog4j.debug to see what was happening
because the logger level gets set "magically" (i.e. Eclipse IDE triggers a
breakpoint on the variable change seemingly randomly) after initialisation,
presumably because there's some delayed code run when the log4j framework
is touched.

The issue is repeatable without the log4j configuration but "fixed"
with -Dlog4j.configuration=file:///{path to configuration
file}/log4j.properties

Is it worth noting this behaviour somewhere?

Thanks for the point in the right direction.

Dick.
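
A small diagnostic sketch along the lines of what was found here, assuming slf4j over log4j on the classpath; it only reports whether the reorder logger ended up at DEBUG:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class CheckReorderDebug {
    public static void main(String[] args) {
        // With no log4j configuration found, the root logger defaults to DEBUG,
        // which is what flips the DEBUG flag in ReorderTransformationSubstitution.
        Logger log = LoggerFactory.getLogger(
            "org.apache.jena.sparql.engine.optimizer.reorder.ReorderTransformationSubstitution");
        System.out.println("debug enabled: " + log.isDebugEnabled());
    }
}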

On 17 October 2016 at 21:33, Andy Seaborne <a...@apache.org> wrote:

>
>
> On 17/10/16 21:13, Dick Murray wrote:
>
>> Hi.
>>
>> On 17 Oct 2016 18:16, "Andy Seaborne" <a...@apache.org> wrote:
>>
>>>
>>> Are you running with debug tracing on?
>>>
>>
>> No, should I and what should I look for..?
>>
>
> I asked because that point in the code ...
>
>
> >>>  [java] org.apache.jena.atlas.iterator
> .Iter.asString(Iter.java:479)
> >>>  [java]
> >>>
> > org.apache.jena.sparql.engine.optimizer.reorder.ReorderTrans
> formationSubstitution.reorder(ReorderTransformationSubstitution.java:85)
> >>>
>
> Object field:
> private final boolean DEBUG = log.isDebugEnabled() ;
>
> then line 85:
>
>  if ( DEBUG )
>log.debug("Reorder: "+Iter.asString(components, formatter)) ;
>
> The formatter is, mistakenly, shared because it is an object field.
>
> But to get there, DEBUG has to be true.
>
> So it seems log.isDebugEnabled() is true.
>
> Andy
>
>
>
>
>
>>
>>> Andy
>>>
>>>
>>> On 17/10/16 11:30, Dick Murray wrote:
>>>
>>>>
>>>> Hi.
>>>>
>>>> I'm getting odd behaviour in Jena when I execute the same query
>>>> concurrently.
>>>>
>>>> The query has an optional which is unmatched but which appears to cause
>>>> a
>>>> java.lang.String exception from the atlas code.
>>>>
>>>> This only happens if multiple queries are submitted concurrently and
>>>> closely. On a "fast" host both queries will typically fail but on a
>>>>
>>> slower
>>
>>> host (VirtualBox with throttled cpu) the first query will return but the
>>>> second will fail.
>>>>
>>>> If I siege with 10 threads and a random delay of 2 seconds some queries
>>>> will succeed. Which makes it seem like a timing issue.
>>>>
>>>> The crux appears to be the optional, remove it and I can siege (10
>>>>
>>> threads,
>>
>>> 50 iterations) without any issues, with the optional I get exceptions.
>>>>
>>>> I originally thought it was the application/json format but returning
>>>> text/plain causes the exception and the root cause appears to be
>>>>
>>> something
>>
>>> in the atlas code passing in the wrong index to create a new String.
>>>>
>>>> To test if I run (Ubuntu 16.04 desktop) i.e. the same query concurrently
>>>> twice;
>>>>
>>>> wget -O q1 --header "accept: text/plain" "
>>>> http://localhost:8080/tdb/sparql?query=PREFIX dxf: 
>>>>
>>> SELECT
>>
>>> * WHERE { GRAPH
>>>>
>>> 
>>
>>> { { ?equipment dxf:attribute ?attribute  } . OPTIONAL {
>>>> ?tier_two_containment dxf:no_such_property ?equipment ;
>>>> dxf:still_no_such_property ?department_area }}}" & wget -O q2 --header
>>>> "accept: text/plain" "http://localhost:8080/tdb/sparql?query=PREFIX
>>>> dxf:
>>>>  SELECT * WHERE { GRAPH
>>>>  { {
>>>> ?equipment
>>>> dxf:attribute ?attribute  } . OPTIONAL { ?tier_two_containment
>>>> dxf:no_such_property ?equipment ; dxf:still_no_such_property
>>>> ?department_area }}}"
>>>>
>>>>
>>>> the q1 output will show the correct result but q2 will either be 0 bytes
>>>>
>

Re: Concurrent read with unmatched optional clause causes exception in Jena 3.1.

2016-10-17 Thread Dick Murray
On 17 Oct 2016 21:33, "Andy Seaborne" <a...@apache.org> wrote:
>
>
>
> On 17/10/16 21:13, Dick Murray wrote:
>>
>> Hi.
>>
>> On 17 Oct 2016 18:16, "Andy Seaborne" <a...@apache.org> wrote:
>>>
>>>
>>> Are you running with debug tracing on?
>>
>>
>> No, should I and what should I look for..?
>
>
> I asked because that point in the code ...
>
>
>
> >>>  [java]
org.apache.jena.atlas.iterator.Iter.asString(Iter.java:479)
> >>>  [java]
> >>>
> >
org.apache.jena.sparql.engine.optimizer.reorder.ReorderTransformationSubstitution.reorder(ReorderTransformationSubstitution.java:85)
> >>>
>
> Object field:
> private final boolean DEBUG = log.isDebugEnabled() ;

Based on the static, it must be set explicitly (and I'm pretty sure I'm not
setting that) or picking up a hierarchical debug...

static public final Logger log =
LoggerFactory.getLogger(ReorderTransformationSubstitution.class) ;
private final boolean DEBUG = log.isDebugEnabled() ;

>
> then line 85:
>
>  if ( DEBUG )
>log.debug("Reorder: "+Iter.asString(components, formatter)) ;
>
> The formatter is, mistakenly, shared because it is an object field.
>
> But to get there, DEBUG has to be true.
>
> So it seems log.isDebugEnabled() is true.
>
> Andy
>
>
>
>
>>
>>>
>>> Andy
>>>
>>>
>>> On 17/10/16 11:30, Dick Murray wrote:
>>>>
>>>>
>>>> Hi.
>>>>
>>>> I'm getting odd behaviour in Jena when I execute the same query
>>>> concurrently.
>>>>
>>>> The query has an optional which is unmatched but which appears to
cause a
>>>> java.lang.String exception from the atlas code.
>>>>
>>>> This only happens if multiple queries are submitted concurrently and
>>>> closely. On a "fast" host both queries will typically fail but on a
>>
>> slower
>>>>
>>>> host (VirtualBox with throttled cpu) the first query will return but
the
>>>> second will fail.
>>>>
>>>> If I siege with 10 threads and a random delay of 2 seconds some queries
>>>> will succeed. Which makes it seem like a timing issue.
>>>>
>>>> The crux appears to be the optional, remove it and I can siege (10
>>
>> threads,
>>>>
>>>> 50 iterations) without any issues, with the optional I get exceptions.
>>>>
>>>> I originally thought it was the application/json format but returning
>>>> text/plain causes the exception and the root cause appears to be
>>
>> something
>>>>
>>>> in the atlas code passing in the wrong index to create a new String.
>>>>
>>>> To test if I run (Ubuntu 16.04 desktop) i.e. the same query
concurrently
>>>> twice;
>>>>
>>>> wget -O q1 --header "accept: text/plain" "
>>>> http://localhost:8080/tdb/sparql?query=PREFIX dxf: 
>>
>> SELECT
>>>>
>>>> * WHERE { GRAPH
>>
>> 
>>>>
>>>> { { ?equipment dxf:attribute ?attribute  } . OPTIONAL {
>>>> ?tier_two_containment dxf:no_such_property ?equipment ;
>>>> dxf:still_no_such_property ?department_area }}}" & wget -O q2 --header
>>>> "accept: text/plain" "http://localhost:8080/tdb/sparql?query=PREFIX
dxf:
>>>>  SELECT * WHERE { GRAPH
>>>>  { {
?equipment
>>>> dxf:attribute ?attribute  } . OPTIONAL { ?tier_two_containment
>>>> dxf:no_such_property ?equipment ; dxf:still_no_such_property
>>>> ?department_area }}}"
>>>>
>>>>
>>>> the q1 output will show the correct result but q2 will either be 0
bytes
>>
>> or
>>>>
>>>> truncated and the following exception is show.
>>>>
>>>> dick@Dick-M3800:/media/dick/Backup1/bc$ wget -O q1 --header "accept:
>>>> text/plain" "http://localhost:8080/tdb/sparql?query=PREFIX dxf:
>>>>  SELECT * WHERE { GRAPH
>>>>  { {
?equipment
>>>> dxf:attribute ?attribute  } . OPTIONAL { ?tier_two_containment
>>>> dxf:no_such_property ?equipment ; dxf:still_no_such_property
>>>> ?department_area }}}" & wget -O q2 --header "accept: text/plain" "
>>>> http://localhost:8080/tdb/sparql?query=PREFIX dxf: 
>>
>> SELECT
>>>>
>>>> * WHERE { 

Re: Concurrent read with unmatched optional clause causes exception in Jena 3.1.

2016-10-17 Thread Dick Murray
On 17 Oct 2016 21:33, "Andy Seaborne" <a...@apache.org> wrote:
>
>
>
> On 17/10/16 21:13, Dick Murray wrote:
>>
>> Hi.
>>
>> On 17 Oct 2016 18:16, "Andy Seaborne" <a...@apache.org> wrote:
>>>
>>>
>>> Are you running with debug tracing on?
>>
>>
>> No, should I and what should I look for..?
>
>
> I asked because that point in the code ...
>
>
>
> >>>  [java]
org.apache.jena.atlas.iterator.Iter.asString(Iter.java:479)
> >>>  [java]
> >>>
> >
org.apache.jena.sparql.engine.optimizer.reorder.ReorderTransformationSubstitution.reorder(ReorderTransformationSubstitution.java:85)
> >>>
>
> Object field:
> private final boolean DEBUG = log.isDebugEnabled() ;
>
> then line 85:
>
>  if ( DEBUG )
>log.debug("Reorder: "+Iter.asString(components, formatter)) ;
>
> The formatter is, mistakenly, shared because it is an object field.
>
> But to get there, DEBUG has to be true.
>
> So it seems log.isDebugEnabled() is true.

I'm not setting it, OK, I'm not knowingly setting it.

I'll put a breakpoint in Eclipse and try and see where it is being set...

>
> Andy
>
>
>
>
>>
>>>
>>> Andy
>>>
>>>
>>> On 17/10/16 11:30, Dick Murray wrote:
>>>>
>>>>
>>>> Hi.
>>>>
>>>> I'm getting odd behaviour in Jena when I execute the same query
>>>> concurrently.
>>>>
>>>> The query has an optional which is unmatched but which appears to
cause a
>>>> java.lang.String exception from the atlas code.
>>>>
>>>> This only happens if multiple queries are submitted concurrently and
>>>> closely. On a "fast" host both queries will typically fail but on a
>>
>> slower
>>>>
>>>> host (VirtualBox with throttled cpu) the first query will return but
the
>>>> second will fail.
>>>>
>>>> If I siege with 10 threads and a random delay of 2 seconds some queries
>>>> will succeed. Which makes it seem like a timing issue.
>>>>
>>>> The crux appears to be the optional, remove it and I can siege (10
>>
>> threads,
>>>>
>>>> 50 iterations) without any issues, with the optional I get exceptions.
>>>>
>>>> I originally thought it was the application/json format but returning
>>>> text/plain causes the exception and the root cause appears to be
>>
>> something
>>>>
>>>> in the atlas code passing in the wrong index to create a new String.
>>>>
>>>> To test if I run (Ubuntu 16.04 desktop) i.e. the same query
concurrently
>>>> twice;
>>>>
>>>> wget -O q1 --header "accept: text/plain" "
>>>> http://localhost:8080/tdb/sparql?query=PREFIX dxf: 
>>
>> SELECT
>>>>
>>>> * WHERE { GRAPH
>>
>> 
>>>>
>>>> { { ?equipment dxf:attribute ?attribute  } . OPTIONAL {
>>>> ?tier_two_containment dxf:no_such_property ?equipment ;
>>>> dxf:still_no_such_property ?department_area }}}" & wget -O q2 --header
>>>> "accept: text/plain" "http://localhost:8080/tdb/sparql?query=PREFIX
dxf:
>>>>  SELECT * WHERE { GRAPH
>>>>  { {
?equipment
>>>> dxf:attribute ?attribute  } . OPTIONAL { ?tier_two_containment
>>>> dxf:no_such_property ?equipment ; dxf:still_no_such_property
>>>> ?department_area }}}"
>>>>
>>>>
>>>> the q1 output will show the correct result but q2 will either be 0
bytes
>>
>> or
>>>>
>>>> truncated and the following exception is show.
>>>>
>>>> dick@Dick-M3800:/media/dick/Backup1/bc$ wget -O q1 --header "accept:
>>>> text/plain" "http://localhost:8080/tdb/sparql?query=PREFIX dxf:
>>>>  SELECT * WHERE { GRAPH
>>>>  { {
?equipment
>>>> dxf:attribute ?attribute  } . OPTIONAL { ?tier_two_containment
>>>> dxf:no_such_property ?equipment ; dxf:still_no_such_property
>>>> ?department_area }}}" & wget -O q2 --header "accept: text/plain" "
>>>> http://localhost:8080/tdb/sparql?query=PREFIX dxf: 
>>
>> SELECT
>>>>
>>>> * WHERE { GRAPH
>>
>> 
>>>>
>>>> { { ?equipment dxf:attribute ?attribute  } . OPTIONAL {
>>>> ?tier_two_containment

Concurrent read with unmatched optional clause causes exception in Jena 3.1.

2016-10-17 Thread Dick Murray
Hi.

I'm getting odd behaviour in Jena when I execute the same query
concurrently.

The query has an optional which is unmatched but which appears to cause a
java.lang.String exception from the atlas code.

This only happens if multiple queries are submitted concurrently and
closely. On a "fast" host both queries will typically fail but on a slower
host (VirtualBox with throttled cpu) the first query will return but the
second will fail.

If I siege with 10 threads and a random delay of 2 seconds some queries
will succeed. Which makes it seem like a timing issue.

The crux appears to be the optional, remove it and I can siege (10 threads,
50 iterations) without any issues, with the optional I get exceptions.

I originally thought it was the application/json format but returning
text/plain causes the exception and the root cause appears to be something
in the atlas code passing in the wrong index to create a new String.

To test if I run (Ubuntu 16.04 desktop) i.e. the same query concurrently
twice;

wget -O q1 --header "accept: text/plain" "
http://localhost:8080/tdb/sparql?query=PREFIX dxf:  SELECT
* WHERE { GRAPH 
{ { ?equipment dxf:attribute ?attribute  } . OPTIONAL {
?tier_two_containment dxf:no_such_property ?equipment ;
dxf:still_no_such_property ?department_area }}}" & wget -O q2 --header
"accept: text/plain" "http://localhost:8080/tdb/sparql?query=PREFIX dxf:
 SELECT * WHERE { GRAPH
 { { ?equipment
dxf:attribute ?attribute  } . OPTIONAL { ?tier_two_containment
dxf:no_such_property ?equipment ; dxf:still_no_such_property
?department_area }}}"


the q1 output will show the correct result but q2 will either be 0 bytes or
truncated and the following exception is show.

dick@Dick-M3800:/media/dick/Backup1/bc$ wget -O q1 --header "accept:
text/plain" "http://localhost:8080/tdb/sparql?query=PREFIX dxf:
 SELECT * WHERE { GRAPH
 { { ?equipment
dxf:attribute ?attribute  } . OPTIONAL { ?tier_two_containment
dxf:no_such_property ?equipment ; dxf:still_no_such_property
?department_area }}}" & wget -O q2 --header "accept: text/plain" "
http://localhost:8080/tdb/sparql?query=PREFIX dxf:  SELECT
* WHERE { GRAPH 
{ { ?equipment dxf:attribute ?attribute  } . OPTIONAL {
?tier_two_containment dxf:no_such_property ?equipment ;
dxf:still_no_such_property ?department_area }}}"
[1] 7594
--2016-10-17 11:03:21--
http://localhost:8080/tdb/sparql?query=PREFIX%20dxf:%20%3Curn:iungo:dxf/%3E%20SELECT%20*%20WHERE%20%7B%20GRAPH%20%3Curn:iungo:dxf/graph/309ea4ce-dbdf-4d92-9828-1d99d35d0bb4%3E%20%7B%20%7B%20?equipment%20dxf:attribute%20?attribute%20%20%7D%20.%20OPTIONAL%20%7B%20?tier_two_containment%20dxf:no_such_property%20?equipment%20;%20dxf:still_no_such_property%20?department_area%20%7D%7D%7D
Resolving localhost (localhost)... --2016-10-17 11:03:21--
http://localhost:8080/tdb/sparql?query=PREFIX%20dxf:%20%3Curn:iungo:dxf/%3E%20SELECT%20*%20WHERE%20%7B%20GRAPH%20%3Curn:iungo:dxf/graph/309ea4ce-dbdf-4d92-9828-1d99d35d0bb4%3E%20%7B%20%7B%20?equipment%20dxf:attribute%20?attribute%20%20%7D%20.%20OPTIONAL%20%7B%20?tier_two_containment%20dxf:no_such_property%20?equipment%20;%20dxf:still_no_such_property%20?department_area%20%7D%7D%7D
Resolving localhost (localhost)... 127.0.0.1127.0.0.1

Connecting to localhost (localhost)|127.0.0.1|:8080... Connecting to
localhost (localhost)|127.0.0.1|:8080... connected.
connected.
HTTP request sent, awaiting response... HTTP request sent, awaiting
response... 200 OK
200 OK
Length: Length: 00 [text/plain]
 [text/plain]
Saving to: ‘q1’
Saving to: ‘q2’


q1  [ <=>
 ]   0  --.-KB/sin
0s

q2  [ <=>
 ]   0  --.-KB/sin
0s

2016-10-17 11:03:22 (0.00 B/s) - ‘q1’ saved [0/0]

2016-10-17 11:03:22 (0.00 B/s) - ‘q2’ saved [0/0]


and an exception is thrown from java.lang.String

Exception java.lang.StringIndexOutOfBoundsException: String index out of
range: 247
 [java] java.lang.String.<init>(String.java:205)
 [java] java.lang.StringBuilder.toString(StringBuilder.java:407)
 [java] org.apache.jena.atlas.iterator.AccString.get(AccString.java:52)
 [java] org.apache.jena.atlas.iterator.AccString.get(AccString.java:21)
 [java] org.apache.jena.atlas.iterator.Iter.reduce(Iter.java:165)
 [java] org.apache.jena.atlas.iterator.Iter.asString(Iter.java:483)
 [java] org.apache.jena.atlas.iterator.Iter.asString(Iter.java:479)
 [java]
org.apache.jena.sparql.engine.optimizer.reorder.ReorderTransformationSubstitution.reorder(ReorderTransformationSubstitution.java:85)
 [java]
org.apache.jena.sparql.engine.optimizer.reorder.ReorderTransformationSubstitution.reorderIndexes(ReorderTransformationSubstitution.java:69)
 [java]
org.apache.jena.tdb.solver.OpExecutorTDB1.reorder(OpExecutorTDB1.java:276)
 [java]
org.apache.jena.tdb.solver.OpExecutorTDB1.optimizeExecuteQuads(OpExecutorTDB1.java:230)
 [java]

Stall when committing a write transaction.

2016-08-08 Thread Dick Murray
im add#9
[Commit [19] [475000]] [2016-08-08T10:04:54.628Z]
TIME_STOPPED 2016-08-08T10:04:57.341Z AutocommitDatasetGraphShim add#9
[Commit [19] [475000]] [2016-08-08T10:04:57.341Z]/[PT2.713S]
TIME_STARTED 2016-08-08T10:04:57.591Z AutocommitDatasetGraphShim add#9
[Write [19]] [2016-08-08T10:04:57.591Z]
TIME_STOPPED 2016-08-08T10:04:57.592Z AutocommitDatasetGraphShim add#9
[Write [19]] [2016-08-08T10:04:57.592Z]/[PT0.001S]
TIME_STARTED 2016-08-08T10:04:58.339Z AutocommitDatasetGraphShim add#9
[Commit [20] [50]] [2016-08-08T10:04:58.339Z]
TIME_STOPPED 2016-08-08T10:05:01.459Z AutocommitDatasetGraphShim add#9
[Commit [20] [50]] [2016-08-08T10:05:01.459Z]/[PT3.12S]
TIME_STARTED 2016-08-08T10:05:01.709Z AutocommitDatasetGraphShim add#9
[Write [20]] [2016-08-08T10:05:01.709Z]
TIME_STOPPED 2016-08-08T10:05:01.710Z AutocommitDatasetGraphShim add#9
[Write [20]] [2016-08-08T10:05:01.710Z]/[PT0.001S]
DEBUG 2016-08-08T10:05:04.216Z RootLogger Thread-24 take()
METHOD_RETURN 2016-08-08T10:05:04.216Z AutocommitDatasetGraphShim apply#11
Return Timer [Complete
[2016-08-08T10:03:13.285Z]/[2016-08-08T10:05:04.216Z]/[PT1M50.931S]] Value
[org.iungo.result.Result]



*Dick Murray*
Technology Specialist



*Business Collaborator Limited*
9th Floor, Reading Bridge House, George Street, Reading, RG1 8LS, United
Kingdom

T 0044 7884 111729 *|* E dick.mur...@groupbc.com
<alistair.wa...@groupbc.com> *|* Twitter @BusCollaborator
<https://twitter.com/BusCollaborator> *|* http://www.groupbc.com


Re: Jena TDB OOME GC overhead limit exceeded

2016-07-27 Thread Dick Murray
The MWE in the previous email will work with any text file with an even number
of lines and will produce the odd [B values. I can't see anywhere obvious where
the non-Jena code is creating them; it's just odd that there are so many of them!

OK, that knocks the DBB idea on the head!

I'll set the mapped symbol and play with batch sizes. Can the map location
be configured or will it go after the TDB location?

Is "TDB2" what we discussed some time back? I'm happy to provide some
testing on that as I've ~2000 files to ETL via an automated process each
producing 3-4M quads...

Thanks Dick.
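
A hedged sketch of the two settings discussed in this thread; the symbol and field names are taken from the messages here, so treat them as assumptions rather than a verified API:

import org.apache.jena.tdb.TDB;
import org.apache.jena.tdb.transaction.TransactionManager;

public class TdbJournalSettings {
    public static void main(String[] args) {
        // Write journal blocks to a memory-mapped file rather than heap byte arrays.
        TDB.getContext().set(TDB.transactionJournalWriteBlockMode, "mapped");
        // Switch off commit amalgamation so each commit is written through immediately.
        TransactionManager.QueueBatchSize = 0;
    }
}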

On 27 Jul 2016 20:10, "Andy Seaborne" <a...@apache.org> wrote:
>
> On 27/07/16 13:19, Dick Murray wrote:
>>
>> ;-) Yes I did. But then I switched to the actual files I need to import
and
>> they produce ~3.5M triples...
>>
>> Using normal Jena 3.1 (i.e. no special context symbols set) the commit
>> after 100k triples works to import the file 10 times with the [B varying
>> between ~2Mb and ~4Mb. I'm currently testing a 20 instance pass.
>>
>> A batched commit works for this bulk load because if it fails after a
batch
>> commit I can remove the graph.
>>
>> For my understanding... TDB is holding the triples/block/journal in heap
>> until commit is called? But this doesn't account for the [B not being
>> cleared after a commit of 3.5M triples. It takes another pass plus ~2M
uncommitted triples before I get an OOME.
>
>
> And the [B have a strange average size.  A block is 8K.
>
>> Digging around and there are some references made to the
DirectByteBuffers
>> causing issues. IBM
>>
https://www.ibm.com/developerworks/community/blogs/kevgrig/entry/excessive_native_memory_usage_by_directbytebuffers?lang=en
>> links the problem to;
>>
>> Essentially the problem boils down to either:
>>
>>1. There are too many DBBs being allocated (or they are too large),
>>and/or
>>2. The DBBs are not being cleared up quickly enough.
>>
>
> TDB does not use DirectByteBuffers unless you ask it to.  They are not [B.
>
> .hasArray is false.
> .array() throws UnsupportedOperationException.
>
> (Grep the code for "allocateDirect" and trace back the use of single use
of BufferAllocatorDirect to the journal in "direct" mode.)
>
> I can believe that, if activated, the GC recycling would be slow. The
> code ought to recycle them (because you can't explicitly free them for some
> weird reason - they're little more than malloc).
>
> But they are not being used unless you ask for them.
>
> Journal entries are 8K unless they are commit records which are about 20
bytes (I think).
>
>
>
>>
>> and recommends using -XX:MaxDirectMemorySize=1024m to poke the GC via
>> System.gc(). Not sure if G1GC helps because of its new heap model...
>>
>> Would it be possible to get Jena to write its uncommitted triples to
disk
>> and then commit them to the TDB?
>
>
> Set TDB.transactionJournalWriteBlockMode to "mapped". That uses a disk
file.
>
>
>> Ok it's slower than RAM but until they are
>> committed only one thread has visibility anyway? Could direct that at a
>> different disk as well...
>>
>> Just before hitting send I'm at pass 13 and the [B maxed at just over 4Gb
>> before dropping back to 2Gb.
>
>
> Or use TDB2 :-)
>
> It has no problem loading 100m+ triples in a single transaction (the
space per transaction is fixed at about 80 bytes of transaction - disk
writes happen during the transaction not to a roll-forward journal). And it
should be a bit faster because writes happen once.
>
> Just need to find time to clean it up ...
>
> Andy
>
>
>>
>> Dick.
>>
>>
>>
>> On 27 July 2016 at 11:47, Andy Seaborne <a...@apache.org> wrote:
>>
>>> On 27/07/16 11:22, Dick Murray wrote:
>>>
>>>> Hello.
>>>>
>>>> Something doesn't add up here... I've run repeated tests with the
>>>> following
>>>> MWE on a 16GB machine with -Xms8g -Xmx8g and the I always get an OOME.
>>>>
>>>> What I don't understand is the size of [B increases with each pass
until
>>>> the OOME is thrown. The exact same process is run 5 times with a new
graph
>>>> for each set of triples.
>>>>
>>>> There are ~3.5M triples added within the transaction from a file which
is
>>>> a
>>>> "simple" text based file (30Mb) which is read in line pairs.
>>>>
>>>
>>> Err - you said 200k quads earlier!
>>>
>>> Set
>>>
>>> TransactionManager.QueueBatch

Re: Jena TDB OOME GC overhead limit exceeded

2016-07-27 Thread Dick Murray
;-) Yes I did. But then I switched to the actual files I need to import and
they produce ~3.5M triples...

Using normal Jena 3.1 (i.e. no special context symbols set) the commit
after 100k triples works to import the file 10 times with the [B varying
between ~2Mb and ~4Mb. I'm currently testing a 20 instance pass.

A batched commit works for this bulk load because if it fails after a batch
commit I can remove the graph.

For my understanding... TDB is holding the triples/block/journal in heap
until commit is called? But this doesn't account for the [B not being
cleared after a commit of 3.5M triples. It takes another pass plus ~2M
uncommitted triples before I get an OOME.

Digging around and there are some references made to the DirectByteBuffers
causing issues. IBM
https://www.ibm.com/developerworks/community/blogs/kevgrig/entry/excessive_native_memory_usage_by_directbytebuffers?lang=en
links the problem to;

Essentially the problem boils down to either:

   1. There are too many DBBs being allocated (or they are too large),
   and/or
   2. The DBBs are not being cleared up quickly enough.


and recommends using -XX:MaxDirectMemorySize=1024m to poke the GC via
System.gc(). Not sure if G1GC helps because of its new heap model...

Would it be possible to get Jena to write its uncommitted triples to disk
and then commit them to the TDB? Ok it's slower than RAM but until they are
committed only one thread has visibility anyway? Could direct that at a
different disk as well...

Just before hitting send I'm at pass 13 and the [B maxed at just over 4Gb
before dropping back to 2Gb.

Dick.
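
A minimal sketch of the batched-commit loading described above, assuming a TDB1 dataset on disk and a quad iterator produced by the ETL step; names and the batch size are illustrative:

import java.util.Iterator;
import org.apache.jena.query.Dataset;
import org.apache.jena.query.ReadWrite;
import org.apache.jena.sparql.core.DatasetGraph;
import org.apache.jena.sparql.core.Quad;
import org.apache.jena.tdb.TDBFactory;

public class BatchedLoad {
    public static void load(Iterator<Quad> quads) {
        Dataset dataset = TDBFactory.createDataset("DB");    // illustrative TDB location
        DatasetGraph dsg = dataset.asDatasetGraph();
        final int batchSize = 100_000;                       // commit every 100k, as in the thread
        int inBatch = 0;
        dataset.begin(ReadWrite.WRITE);
        try {
            while (quads.hasNext()) {
                dsg.add(quads.next());
                if (++inBatch == batchSize) {
                    dataset.commit();
                    dataset.end();
                    dataset.begin(ReadWrite.WRITE);          // start the next batch
                    inBatch = 0;
                }
            }
            dataset.commit();
        } finally {
            dataset.end();
        }
    }
}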



On 27 July 2016 at 11:47, Andy Seaborne <a...@apache.org> wrote:

> On 27/07/16 11:22, Dick Murray wrote:
>
>> Hello.
>>
>> Something doesn't add up here... I've run repeated tests with the
>> following
>> MWE on a 16GB machine with -Xms8g -Xmx8g and the I always get an OOME.
>>
>> What I don't understand is the size of [B increases with each pass until
>> the OOME is thrown. The exact same process is run 5 times with a new graph
>> for each set of triples.
>>
>> There are ~3.5M triples added within the transaction from a file which is
>> a
>> "simple" text based file (30Mb) which is read in line pairs.
>>
>
> Err - you said 200k quads earlier!
>
> Set
>
> TransactionManager.QueueBatchSize=0 ;
>
> and break the load into small units for now and see if that helps.
>
> One experiment would be to write the output to disk and load from a
> program that only does the TDB part.
>
> Andy
>
>
>
>> I've tested sequential loads of other text files (i.e. file x *5) and
>> other
>> text files loaded sequentally (i.e. file x, file y, file ...) and the same
>> result is exhibited.
>>
>> If I reduce -Xmx to 6g it will fail earlier.
>>
Changing the GC using -XX:+UseG1GC doesn't alter the outcome.
>>
>> I'm running on Ubuntu 16.04 with Java 1.8 and I can replicate this on
>> Centos 7 with Java 1.8.
>>
>> Any ideas?
>>
>> Regards Dick.
>>
>>
>
>


Re: Jena TDB OOME GC overhead limit exceeded

2016-07-27 Thread Dick Murray
-
   1:   1319938 7682030072  [B
   2:   1702288   81709824  java.nio.HeapByteBuffer
   3:   1281483   41007456  java.util.HashMap$Node
   4:770721   36994608
 org.apache.jena.ext.com.google.common.cache.LocalCache$StrongAccessEntry
   5:962343   30794976  org.apache.jena.tdb.base.block.Block
   6:737804   28430888  [C
   7:   310   27899112  [Ljava.util.HashMap$Node;
   8:  1834   27200896  [I
   9:935394   22449456  java.lang.Long
  10:328196   18378976  java.nio.ByteBufferAsIntBufferB
2016-07-27T09:52:40.574Z 3584944 commit
2016-07-27T09:52:47.761Z 312210256 8589934592 8589934592
2016-07-27T09:52:47.761Z jmap

 num #instances #bytes  class name
--
   1:   1319955 7682063456  [B
   2:   1702291   81709968  java.nio.HeapByteBuffer
   3:   1281483   41007456  java.util.HashMap$Node
   4:770723   36994704
 org.apache.jena.ext.com.google.common.cache.LocalCache$StrongAccessEntry
   5:962350   30795200  org.apache.jena.tdb.base.block.Block
   6:739988   28854128  [C
   7:  1847   28725888  [I
   8:   310   27899112  [Ljava.util.HashMap$Node;
   9:935404   22449696  java.lang.Long
  10:328196   18378976  java.nio.ByteBufferAsIntBufferB
2016-07-27T09:52:48.404Z end
2016-07-27T09:52:48.404Z Pass 5
2016-07-27T09:52:48.404Z 312210256 8589934592 8589934592
2016-07-27T09:52:48.404Z jmap

 num #instances #bytes  class name
--
   1:   1319967 7682096520  [B
   2:   1702292   81710016  java.nio.HeapByteBuffer
   3:   1281483   41007456  java.util.HashMap$Node
   4:770723   36994704
 org.apache.jena.ext.com.google.common.cache.LocalCache$StrongAccessEntry
   5:962350   30795200  org.apache.jena.tdb.base.block.Block
   6:742126   29280704  [C
   7:  1849   28158216  [I
   8:   310   27899112  [Ljava.util.HashMap$Node;
   9:935412   22449888  java.lang.Long
  10:328196   18378976  java.nio.ByteBufferAsIntBufferB
2016-07-27T09:52:49.082Z begin WRITE

Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread "main"


On 26 July 2016 at 17:31, Andy Seaborne <a...@apache.org> wrote:

> On 26/07/16 16:51, Dick Murray wrote:
>
>> Ok, I set that option and I get different OOME from the direct buffer
>> memory.
>>
>
> You now have:
>
> > java.lang.OutOfMemoryError:
> > Direct buffer memory
>
> So that means the direct memory space has run out, not the heap.
>
> You can increase direct memory but that isn't getting to the root cause.
>
> 200k quads isn't that many, though the JVM arg "-Xmx4g" is pretty small
> for largish transactions.  But you are doing it 5 times.  I calculated
> maybe 300M-1G of temp space but the calculation is a bit "it depends". But
> that is 25% of the heap allocated already.
>
> Try to switch off the commit amalgamation.  Commits get aggregated so many
> one triple insert transactions don't cause vast overheads.
>
> TransactionManager.QueueBatchSize=0 ;
>
> In addition, the unusual size (for TDB) of many byte[] and other objects
> suggests that the non Jena code is using a significant slice of space as
> well.
>
> (without setting TDB.transactionJournalWriteBlockMode).
>
> Not setting TDB.transactionJournalWriteBlockMode is better but an
> alternative is "mapped" mode instead of "direct".
>
> Andy
>
>
>> I then changed the GC using -XX:+UseG1GC (which still throws the OOME) and
>> I don't get why it's throwing the OOME.!?
>>
>>  [Eden: 2372.0M(2372.0M)->0.0B(1036.0M) Survivors: 84.0M->196.0M Heap:
>> 2979.4M(4096.0M)->720.4M(4096.0M)]
>>
>> Unless I'm mistaken I don't have a G1 Heap problem?
>>
>>
>> ...
>> [GC pause (G1 Evacuation Pause) (young), 0.1580047 secs]
>>[Parallel Time: 149.8 ms, GC Workers: 8]
>>   [GC Worker Start (ms): Min: 463499.3, Avg: 463499.4, Max: 463499.5,
>> Diff: 0.1]
>>   [Ext Root Scanning (ms): Min: 0.2, Avg: 0.2, Max: 0.3, Diff: 0.2,
>> Sum: 1.8]
>>   [Update RS (ms): Min: 12.5, Avg: 12.6, Max: 12.7, Diff: 0.2, Sum:
>> 100.9]
>>  [Processed Buffers: Min: 13, Avg: 15.6, Max: 18, Diff: 5, Sum:
>> 125]
>>   [Scan RS (ms): Min: 18.7, Avg: 18.8, Max: 18.8, Diff: 0.2, Sum:
>> 150.2]
>>   [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0,
>> Sum: 0.1]
>>   [Object Copy (ms): Min: 116.7, Avg: 116.7, Max: 116.8, Diff: 0.1,
>

Re: Jena TDB OOME GC overhead limit exceeded

2016-07-26 Thread Dick Murray
ies.Attrib.accept(Attrib.java:36)
org.iungo.dataset.bulkload.DXFDatasetBulkload$2.visit(DXFDatasetBulkload.java:379)
org.kabeja.entities.Insert.accept(Insert.java:56)
org.iungo.dataset.bulkload.DXFDatasetBulkload$2.visit(DXFDatasetBulkload.java:282)
org.kabeja.common.Layer.accept(Layer.java:61)
org.iungo.dataset.bulkload.DXFDatasetBulkload$2.visit(DXFDatasetBulkload.java:193)
org.kabeja.DraftDocument.accept(DraftDocument.java:100)
org.iungo.dataset.bulkload.DXFDatasetBulkload$6.get(DXFDatasetBulkload.java:504)
org.iungo.dataset.bulkload.DXFDatasetBulkload$6.get(DXFDatasetBulkload.java:1)
org.iungo.logger.MethodLogger.call(MethodLogger.java:104)
org.iungo.dataset.bulkload.DXFDatasetBulkload.bulkload(DXFDatasetBulkload.java:501)
org.iungo.dataset.node.DatasetNode$1$1.accept(DatasetNode.java:222)
org.iungo.dataset.node.DatasetNode$1$1.accept(DatasetNode.java:1)
org.iungo.queue.Q.process(Q.java:64)
org.iungo.queue.Q$1.run(Q.java:45)
java.lang.Thread.run(Thread.java:745)
Heap
 PSYoungGen  total 1325056K, used 139293K [0x00076ab0,
0x0007c000, 0x0007c000)
  eden space 1251328K, 11% used
[0x00076ab0,0x000773307780,0x0007b710)
  from space 73728K, 0% used
[0x0007bb80,0x0007bb80,0x0007c000)
  to   space 72704K, 0% used
[0x0007b710,0x0007b710,0x0007bb80)
 ParOldGen   total 2796544K, used 332700K [0x0006c000,
0x00076ab0, 0x00076ab0)
  object space 2796544K, 11% used
[0x0006c000,0x0006d44e71a8,0x00076ab0)
 Metaspace   used 18734K, capacity 19032K, committed 19328K, reserved
1067008K
  class spaceused 2265K, capacity 2366K, committed 2432K, reserved
1048576K


On 26 July 2016 at 14:40, Andy Seaborne <a...@apache.org> wrote:

> To build clean locally do the following:
>
> at the top level: not jena-tdb
>
> mvn clean install -Pbootstrap
> mvn install -Pdev
>
> (or "mvn clean install" but that builds and tests a lot more)
>
> Andy
>
>
>
> On 26/07/16 14:25, Andy Seaborne wrote:
>
>> On 26/07/16 14:11, Dick Murray wrote:
>>
>>> Hi.
>>>
>>> I'll set that and run the process again.
>>>
>>> As an aside I just pulled master and TDB won't compile because it can't
>>> find Multiset? Are there notes on getting the Jena Git repo into Eclipse? I
>>> want
>>> to put a count on the BufferAllocatorMem to see what it's doing. I've
>>> put a
>>> break point on but gave up on F8 after counting to 500
>>>
>>> Dick.
>>>
>>
>> Make sure you have all the dependencies successfully resolved with
>>  mvn -o dependency:tree.
>>
>> The Apache snapshot repo was having a bad day earlier and Multiset is
>> from org.apache.jena:jena-shaded-guava:jar
>>
>> Andy
>>
>>
>>> On 26 July 2016 at 11:05, Andy Seaborne <a...@apache.org> wrote:
>>>
>>> On 26/07/16 10:51, Dick Murray wrote:
>>>>
>>>> Hi.
>>>>>
>>>>> Where do you set "transactionJournalWriteBlockMode" please?
>>>>>
>>>>>
>>>> We don't - its off by default.
>>>>
>>>> It's a symbol you can set in the global context.
>>>> TDB.transactionJournalWriteBlockMode
>>>>
>>>> It is the only place that triggers DirectByteBuffers in TDB which I
>>>> see in
>>>> your jmap output.
>>>>
>>>>
>>>> Would you expect to see a large number of [B heap entries in a 3.1 TDB?
>>>>>
>>>>>
>>>> Probably not, particularly not retained ones.
>>>>
>>>> Looking at the average sizes:
>>>>
>>>>   1:132187  636210296  [B
>>>>>>>
>>>>>> Average size: 4812.9
>>>>
>>>>>   1:   1148534 1420727464  [B
>>>>>>>
>>>>>> Average size: 1237.0
>>>>
>>>>>   1:   1377821 2328657400  [B
>>>>>>>
>>>>>> Average size: 1690.101544395099218258394958
>>>>
>>>>>   1:333123 2285460248  [B
>>>>>>>
>>>>>> Average size: 6860.7
>>>>
>>>>>   1:333123 2285460248  [B
>>>>>>>
>>>>>> Average size: 6860.7
>>>>
>>>>>   1:333123 2285460248  [B
>>>>>>>
>>>>>> Average size: 6860.7
>>>>
>>>>>   1:934984 3083070024  [B
>>>>>>>

Re: Jena TDB OOME GC overhead limit exceeded

2016-07-26 Thread Dick Murray
Hi.

Where do you set "transactionJournalWriteBlockMode" please?

Would you expect to see a large number of [B heap entries in a 3.1 TDB?

Dick.

On 26 July 2016 at 10:39, Andy Seaborne <a...@apache.org> wrote:

> Dick,
>
> The report is embedded in your application setup with a lot of
> "org.iungo.dataset.bulkload"
>
> Just because the OOME occurs in TDB does not mean that the space is
> consumed in TDB - there may be a bigger memory hog elsewhere.
>
> Could you produce an RDF example?
>
> Maybe that file, already converted to RDF, and loaded with tdbloader?
>
> If TDB is using DirectByteBuffers, have you set
> "transactionJournalWriteBlockMode" to "direct"?
>
> You need to increase the direct memory space, not the heap.
>
> Andy
>
>
> On 26/07/16 10:14, Dick Murray wrote:
>
>> Hi.
>>
>> I've got a repeatable problem with Jena 3.1 when performing a bulk load.
>>
>> The bulk load converts a file into ~200k quads and adds them to a TDB
>> instance within a normal begin write, add quads and commit. Initially this
>> completes in 30-40 seconds, However if I repeat the process (with the same
>> file) on the 5th iteration I get a OOME exception. As I'm importing the
>> same file into different graphs I would expect the DAT file to stay the
>> same size after the first process and just the index files to grow.
>>
>> There are no other processes running against the TDB whilst this process
>> runs.
>>
>> Using jmap, the [B count grows with each run until finally I get the
>> exception.
>>
>> If I increase the Xmx the OOME occurs later.
>>
>> Any ideas?
>>
>> I've included details below, including jmap output which shows the heap
>> being used, and the JVM output which shows the GC (Allocation Failure)
>> entries transitioning to Full GC (Ergonomics) entries...
>>
>> Regards Dick.
>>
>> JVM args -Xms2g -Xmx4g -XX:+PrintGCDetails
>> -javaagent:/home/dick/eclipse-jee-neon/opt/classmexer-0_03/classmexer.jar
>>
>> dick@Dick-M3800:~$ uname -a
>> Linux Dick-M3800 4.4.0-31-generic #50-Ubuntu SMP Wed Jul 13 00:07:12 UTC
>> 2016 x86_64 x86_64 x86_64 GNU/Linux
>> dick@Dick-M3800:~$ java -version
>> openjdk version "1.8.0_91"
>> OpenJDK Runtime Environment (build 1.8.0_91-8u91-b14-0ubuntu4~16.04.1-b14)
>> OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
>> dick@Dick-M3800:~$
>>
>>
>> Output from jmap
>>
>> dick@Dick-M3800:~$ jmap -histo 15031 | head -25
>>
>>  num #instances #bytes  class name
>> --
>>1:132187  636210296  [B
>>2:500242   31189608  [C
>>3:533380   25602240
>>  org.apache.jena.ext.com.google.common.cache.LocalCache$StrongAccessEntry
>>4:468220   18728800  org.kabeja.math.Point3D
>>5:349351   16768848  org.kabeja.entities.Vertex
>>6:  1654   16374552  [I
>>7:445589   16219672  [Ljava.lang.Object;
>> dick@Dick-M3800:~$ jmap -histo 15031 | head -25
>>
>>  num #instances #bytes  class name
>> --
>>1:   1148534 1420727464  [B
>>2:   5961841  344220520  [C
>>3:   1335412   85466368  java.nio.DirectByteBuffer
>>4:   3453219   82877256  java.lang.String
>>5:573585   65399360  [I
>>6:   1261244   60539712
>>  org.apache.jena.ext.com.google.common.cache.LocalCache$StrongAccessEntry
>>7:945955   36365560  [Ljava.lang.Object;
>> dick@Dick-M3800:~$ jmap -histo 15031 | head -25
>>
>>  num #instances #bytes  class name
>> --
>>1:   1377821 2328657400  [B
>>2:   7566951  434495472  [C
>>3:   1717606  109926784  java.nio.DirectByteBuffer
>>4:   4339997  104159928  java.lang.String
>>5:749581   75578568  [I
>>6:   1485127   71286096
>>  org.apache.jena.ext.com.google.common.cache.LocalCache$StrongAccessEntry
>>7:   1089303   42230696  [Ljava.lang.Object;
>> dick@Dick-M3800:~$ jmap -histo 15031 | head -25
>>
>>  num #instances #bytes  class name
>> --
>>1:333123 2285460248  [B
>>2:604102   38062832  [C
>>3:660301   31694448
>>  org.ap

Jena TDB OOME GC overhead limit exceeded

2016-07-26 Thread Dick Murray
Hi.

I've got a repeatable problem with Jena 3.1 when performing a bulk load.

The bulk load converts a file into ~200k quads and adds them to a TDB
instance within a normal begin write, add quads and commit. Initially this
completes in 30-40 seconds. However, if I repeat the process (with the same
file), on the 5th iteration I get an OOME exception. As I'm importing the
same file into different graphs I would expect the DAT file to stay the
same size after the first process and just the index files to grow.

There are no other processes running against the TDB whilst this process
runs.

Using jmap, the [B count grows with each run until finally I get the exception.
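
(One useful variant here: jmap's :live option forces a full GC before taking
the histogram, so it separates genuinely retained [B instances from garbage
that is merely awaiting collection, e.g. "jmap -histo:live 15031 | head -25".)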

If I increase the Xmx the OOME occurs later.

Any ideas?

I've included details below, including jmap output which shows the heap
being used, and the JVM output which shows the GC (Allocation Failure)
entries transitioning to Full GC (Ergonomics) entries...

Regards Dick.

JVM args -Xms2g -Xmx4g -XX:+PrintGCDetails
-javaagent:/home/dick/eclipse-jee-neon/opt/classmexer-0_03/classmexer.jar

dick@Dick-M3800:~$ uname -a
Linux Dick-M3800 4.4.0-31-generic #50-Ubuntu SMP Wed Jul 13 00:07:12 UTC
2016 x86_64 x86_64 x86_64 GNU/Linux
dick@Dick-M3800:~$ java -version
openjdk version "1.8.0_91"
OpenJDK Runtime Environment (build 1.8.0_91-8u91-b14-0ubuntu4~16.04.1-b14)
OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
dick@Dick-M3800:~$


Output from jmap

dick@Dick-M3800:~$ jmap -histo 15031 | head -25

 num #instances #bytes  class name
--
   1:132187  636210296  [B
   2:500242   31189608  [C
   3:533380   25602240
 org.apache.jena.ext.com.google.common.cache.LocalCache$StrongAccessEntry
   4:468220   18728800  org.kabeja.math.Point3D
   5:349351   16768848  org.kabeja.entities.Vertex
   6:  1654   16374552  [I
   7:445589   16219672  [Ljava.lang.Object;
dick@Dick-M3800:~$ jmap -histo 15031 | head -25

 num #instances #bytes  class name
--
   1:   1148534 1420727464  [B
   2:   5961841  344220520  [C
   3:   1335412   85466368  java.nio.DirectByteBuffer
   4:   3453219   82877256  java.lang.String
   5:573585   65399360  [I
   6:   1261244   60539712
 org.apache.jena.ext.com.google.common.cache.LocalCache$StrongAccessEntry
   7:945955   36365560  [Ljava.lang.Object;
dick@Dick-M3800:~$ jmap -histo 15031 | head -25

 num #instances #bytes  class name
--
   1:   1377821 2328657400  [B
   2:   7566951  434495472  [C
   3:   1717606  109926784  java.nio.DirectByteBuffer
   4:   4339997  104159928  java.lang.String
   5:749581   75578568  [I
   6:   1485127   71286096
 org.apache.jena.ext.com.google.common.cache.LocalCache$StrongAccessEntry
   7:   1089303   42230696  [Ljava.lang.Object;
dick@Dick-M3800:~$ jmap -histo 15031 | head -25

 num #instances #bytes  class name
--
   1:333123 2285460248  [B
   2:604102   38062832  [C
   3:660301   31694448
 org.apache.jena.ext.com.google.common.cache.LocalCache$StrongAccessEntry
   4:468220   18728800  org.kabeja.math.Point3D
   5:349351   16768848  org.kabeja.entities.Vertex
   6:445689   16486104  [Ljava.lang.Object;
   7:590752   14178048  java.lang.String
   8:278273   13357104  java.nio.HeapByteBuffer
   9:530557   12733368  org.apache.jena.tdb.store.NodeId
  10:   420   11221544  [Ljava.util.HashMap$Node;
  11:334514   10704448  java.util.HashMap$Node
  12:660301   10564816
 org.apache.jena.ext.com.google.common.cache.LocalCache$StrongValueReference
  13:420443   10090632  org.kabeja.tools.LazyContainer
  14:2782498903968  org.apache.jena.tdb.base.block.Block
  15:2017198068760
 org.apache.jena.graph.impl.LiteralLabelImpl
  16:2271855452440  java.lang.Long
  17:2572474115952  org.apache.jena.graph.BlankNodeId
  18:2572474115952  org.apache.jena.graph.Node_Blank
  19:  16943770384  [I
  20:2017193227504  org.apache.jena.graph.Node_Literal
  21: 171992476656  org.kabeja.entities.Attrib
  22: 915702197680  java.lang.Double
dick@Dick-M3800:~$ jmap -histo 15031 | head -10

 num #instances #bytes  class name
--
   1:333123 2285460248  [B
   2:604102   38062832  [C
   3:660301   31694448
 org.apache.jena.ext.com.google.common.cache.LocalCache$StrongAccessEntry
   4:468220   18728800  

Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

2016-04-01 Thread Dick Murray
Hi.

I've pushed up a draft to https://github.com/dick-twocows/jena-dev.git.

This has two test cases;

Echo : which will echo back the find GSPO call i.e. call find ABCD and you
will get the Quad ABCD back. This does not cache between calls.
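
A minimal sketch of the Echo idea (illustrative only - not the code in the
repository above): answer a fully concrete G/S/P/O pattern with exactly that
quad, and anything else with nothing.

    import java.util.Collections;
    import java.util.Iterator;
    import org.apache.jena.graph.Node;
    import org.apache.jena.sparql.core.Quad;

    final class EchoFind {
        // Echo a fully-concrete find pattern back as a single quad.
        static Iterator<Quad> find(Node g, Node s, Node p, Node o) {
            if (g.isConcrete() && s.isConcrete() && p.isConcrete() && o.isConcrete())
                return Collections.singleton(new Quad(g, s, p, o)).iterator();
            return Collections.emptyIterator();
        }
    }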

CSV : which will transform a CSV file into Quads i.e. find GSPO will open
the CSV by mangling the G and cache against G ANY ANY ANY. This does cache
between calls i.e. the CSV is transformed once.

Will look at a simple JDBC test over the weekend if I get the time...

It has a POM so it should build with Maven.

Comments appreciated (I've probably hard coded something).

Dick.

On 30 March 2016 at 20:39, Andy Seaborne  wrote:

> On 29/03/16 12:23, Joint wrote:
>
>>
>>
>> Yep, that's mangled. I've refactored the code into a Jena package do
>> you want me to create a patch for testing or it can be pulled from my
>> github?
>>
>>
>> Dick
>>
>
> 
> One of the things any open source project has to manage is whether
> accepting a contribution is the right thing to do - factors like who will
> maintain it come in.  Sometimes it is better to have a module, sometimes it is
> better to have a related project. Jena has kept lists of related projects
> before - they get out-of-date as nobody wants to remove a live, albeit
> quiet, project.
>
>
> So there are two steps - understand what the code does, and then decide
> whether the right thing to do is to incorporate it.
>
> Post the github URL and we can look.
>
> For a contribution it is better if it is pushed to the project in some way
> (e.g. patch on JIRA, github PR) even if Apache Licensed.  Community over
> code.
>
> Andy
>


Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

2016-03-18 Thread Dick Murray
Node)
>
>
> but that seems like a real distortion of the semantics. Seems like the
>> AFS-Dev material is more to the point here.
>>
>
>
>
> Andy, what do you think it would take to get that stuff to Jena
>> master? Do you think it is ready for that? I would be happy to
>> refactor TIM to use it instead of the stuff it currently uses in
>> o.a.j.sparql.core.mem.
>>
>
> I don't think it's ready - it has not been used "in anger" and it may be
> the wrong design.  It needs trying out outside the codebase.  TIM works as
> it currently stands so there isn't a rush to put this in there.
>
>
> (digression:...)
>
> I was at a talk recently about high-performance Java and the issue of
> object churn was mentioned as being quite impactful on the GC as the heap
> size grows.  Once, a long time ago, object creation was expensive ... then
> CPUs got faster and the Java runtime smarter and it was less of an issue
> ... but it seems that it's returning as a factor.
>
> Inline lambdas are apparently faster than the same code with a class
> implementation - the compiler emits an invokedynamic for the lambda.
>
> And Java Streams can create a lot of short-lived objects.
>
> Andy
>
>
>
>> ---
>> A. Soroka
>> The University of Virginia Library
>>
>> On Mar 15, 2016, at 7:39 AM, Dick Murray <dandh...@gmail.com> wrote:
>>>
>>> Eureka moment! It returns a new Graph of a certain type. Whereas I need
>>> the
>>> graph node to determine where the underlying data is.
>>>
>>> Cheers Dick.
>>>
>>> On 15 March 2016 at 11:28, Andy Seaborne <a...@apache.org> wrote:
>>>
>>> On 15/03/16 10:30, Dick Murray wrote:
>>>>
>>>>> Sorry, supportsTransactionAbort() in
>>>>> AFS-Dev/src/main/java/projects/dsg2/DatasetGraphStorage.java
>>>>> (https://github.com/afs/AFS-Dev/tree/master/src/main/java/projects/dsg2)
>>>>>
>>>>>
>>>> Experimental code.
>>>>
>>>>
>>>>
>>>> supportsTransactionAbort is in the DatasetGraph interface in Jena.
>>>>
>>>>
>>>> DatasetGraphStorage is using TransactionalLock.createMRSW
>>>>
>>>> As mentioned, it needs cooperation from the underlying thing to be able
>>>> to
>>>> do aborts and MRSW does not provide that (it's external locking).
>>>>
>>>> DatasetGraphStorage doesn't presume that the storage unit is
>>>> transactional.
>>>>
>>>> After these discussions I've decided to create a DatasetGraphOnDemand
>>>> which
>>>>
>>>>> extends DatasetGraphMap and uses Union graphs.
>>>>>
>>>>> However in DatasetGraphMap shouldn't getGraphCreate() be
>>>>> getGraphCreate(Node graphNode) as otherwise it doesn't know what to
>>>>> create?
>>>>>
>>>>>
>>>> It creates a graph - addGraph(graphNode, g) is managing the naming.
>>>> Graphs don't know the name used (in other places one graph can have many
>>>> names).
>>>>
>>>> DatasetGraphMap is for a collection of independent graphs to be turned
>>>> into a dataset.
>>>>
>>>> Andy
>>>>
>>>>
>>>>  @Override
>>>>>  public Graph getGraph(Node graphNode)
>>>>>  {
>>>>>  Graph g = graphs.get(graphNode) ;
>>>>>  if ( g == null )
>>>>>  {
>>>>>  g = getGraphCreate() ;
>>>>>  if ( g != null )
>>>>>  addGraph(graphNode, g) ;
>>>>>  }
>>>>>  return g ;
>>>>>  }
>>>>>
>>>>>  /** Called from getGraph when a nonexistent graph is asked for.
>>>>>   * Return null for "nothing created as a graph"
>>>>>   */
>>>>>  protected Graph getGraphCreate() { return null ; }
>>>>>
>>>>> Dick.
>>>>>
>>>>>
>>>>>
>>>>
>>
>


Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

2016-03-15 Thread Dick Murray
Eureka moment! It returns a new Graph of a certain type. Whereas I need the
graph node to determine where the underlying data is.

Cheers Dick.

On 15 March 2016 at 11:28, Andy Seaborne <a...@apache.org> wrote:

> On 15/03/16 10:30, Dick Murray wrote:
>
>> Sorry, supportsTransactionAbort() in
>> AFS-Dev/src/main/java/projects/dsg2/DatasetGraphStorage.java
>> (https://github.com/afs/AFS-Dev/tree/master/src/main/java/projects/dsg2)
>
> Experimental code.
> *Experimental code.*
>
>
>
> supportsTransactionAbort is in the DatasetGraph interface in Jena.
>
>
> DatasetGraphStorage is using TransactionalLock.createMRSW
>
> As mentioned, it needs cooperation from the underlying thing to be able to
> do aborts and MRSW does not provide that (it's external locking).
>
> DatasetGraphStorage doesn't presume that the storage unit is transactional.
>
> After these discussions I've decided to create a DatasetGraphOnDemand which
>> extends DatasetGraphMap and uses Union graphs.
>>
>> However in DatasetGraphMap shouldn't getGraphCreate() be
>> getGraphCreate(Node graphNode) as otherwise it doesn't know what to
>> create?
>>
>
> It creates a graph - addGraph(graphNode, g) is managing the naming. Graphs
> don't know the name used (in other places one graph can have many names).
>
> DatasetGraphMap is for a collection of independent graphs to be turned
> into a dataset.
>
> Andy
>
>
>>  @Override
>>  public Graph getGraph(Node graphNode)
>>  {
>>  Graph g = graphs.get(graphNode) ;
>>  if ( g == null )
>>  {
>>  g = getGraphCreate() ;
>>  if ( g != null )
>>  addGraph(graphNode, g) ;
>>  }
>>  return g ;
>>  }
>>
>>  /** Called from getGraph when a nonexistent graph is asked for.
>>   * Return null for "nothing created as a graph"
>>   */
>>  protected Graph getGraphCreate() { return null ; }
>>
>> Dick.
>>
>>
>


Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

2016-03-15 Thread Dick Murray
Sorry, supportsTransactionAbort() in
AFS-Dev/src/main/java/projects/dsg2/DatasetGraphStorage.java
(https://github.com/afs/AFS-Dev/tree/master/src/main/java/projects/dsg2)

After these discussions I've decided to create a DatasetGraphOnDemand which
extends DatasetGraphMap and uses Union graphs.

However in DatasetGraphMap shouldn't getGraphCreate() be
getGraphCreate(Node graphNode) as otherwise it doesn't know what to create?

    @Override
    public Graph getGraph(Node graphNode)
    {
        Graph g = graphs.get(graphNode) ;
        if ( g == null )
        {
            g = getGraphCreate() ;
            if ( g != null )
                addGraph(graphNode, g) ;
        }
        return g ;
    }

    /** Called from getGraph when a nonexistent graph is asked for.
     *  Return null for "nothing created as a graph"
     */
    protected Graph getGraphCreate() { return null ; }

Dick.
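
A sketch of the variant being asked about (a hypothetical signature, not the
current DatasetGraphMap API): pass the graph name down so the factory method
can decide what kind of graph to create.

    // Hypothetical, for discussion only.
    protected Graph getGraphCreate(Node graphNode) {
        // e.g. inspect graphNode.getURI() and pick a suitable backing graph
        return GraphFactory.createDefaultGraph() ;
    }

    @Override
    public Graph getGraph(Node graphNode) {
        Graph g = graphs.get(graphNode) ;
        if ( g == null ) {
            g = getGraphCreate(graphNode) ;   // name-aware creation
            if ( g != null )
                addGraph(graphNode, g) ;
        }
        return g ;
    }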

On 14 March 2016 at 09:56, Andy Seaborne <a...@apache.org> wrote:

> On 14/03/16 07:31, Joint wrote:
>
>>
>>
>> 
>> That doesn't read well...
>> I tested two types of triple storage both of which use a concurrent map
>> to track the graphs. The first used the TripleTable and took write locks so
>> there was one write per graph. The second used a concurrent skip list set
>> and no write locks so there is no write contention.
>> Your dev code has a method canAbort set to return false.I was wondering
>> what the idea was?
>>
>
> Where is canAbort?
> Are you looking at the Jena code or Mantis code?
> Do you mean supportsTransactionAbort?
>
> A system can't provide a proper abort unless it can reconstruct the old
> state, either by having two copies (Txn in memory does this) or a log of
> some kind (TDB does this).
>
> For example, plain synchronization MRSW locking can't provide an abort
> operation. It needs the cooperation of components to do that.
>
> Andy
>
>
>
>
>
>> Dick
>>
>>  Original message 
>> From: Andy Seaborne <a...@apache.org>
>> Date: 13/03/2016  7:54 pm  (GMT+00:00)
>> To: users@jena.apache.org
>> Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
>>DatasetGraphInMemory
>>
>> On 10/03/16 20:10, Dick Murray wrote:
>>
>>> Hi. Yes re TriTable and TripleTable. I too like the storage interface
>>> which
>>> would work for my needs and make life simpler. A few points from me.
>>> Currently I wrap an existing dsg and cache the additional tuples into
>>> what
>>> I call the deferred DSG or DDSG. The finds return a DSG iterator and a
>>> DDSG
>>> iterator.
>>>
>>> The DDSG is in memory and I have a number of concrete classes which
>>> achieve
>>> the same end.
>>>
>>> Firstly I use a Jena core mem DSG and the find handles just add tuples as
>>> required into the HexTable because I don't have a default graph, i.e. it's
>>> never referenced because I need a graph URI to find the deferred data.
>>>
>>> The second is in common I have a concurrent map which handles recording
>>> what graphs have been deferred then I either use TriTable or a concurrent
>>> set of tuples to store the graph contents. When I'm using the TriTable I
>>> acquire the write lock and add tuples. So writes can occur in parallel to
>>> different graphs. I've experimented with the concurrent set by spoofing the
>>> write and just adding the tuples, i.e. no write lock contention per graph. I
>>> notice the DatasetGraphStorage
>>>
>>
>> 
>>
>> does not support txn abort? This gives an
>>> in memory DSG which doesn't have lock contention because it never
>>> locks...
>>> This is applicable in some circumstances and I think that the right
>>> deferred tuples is one of them?
>>>
>>> I also coded a DSG which supports a reentrant RW-with-upgrade lock, which
>>> allowed me to combine the two DSG's because I could promote the read
>>> lock.
>>>
>>> Andy, I notice your code has a txn interface with a read-to-write promotion
>>> indicator? Is an upgrade method being considered for the txn interface,
>>> because that was an issue I hit and why I have two DSG's. Code further up
&g

Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

2016-03-10 Thread Dick Murray
   Stream<Triple> find(Node s, Node p, Node o) ;
>
> //default Stream<Triple> find(Node s, Node p, Node o) {
> //return findDftGraph(s,p,o).map(Quad::asTriple) ;
> //}
>
> //Iterator   findUnionGraph(Node s, Node p, Node o) ;
> //Iterator   find(Node g, Node s, Node p, Node o) ;
>
>
>// contains
>
>default boolean contains(Node s, Node p, Node o)
>{ return find(s,p,o).findAny().isPresent() ; }
>default boolean contains(Node g, Node s, Node p, Node o)
>{ return find(g,s,p,o).findAny().isPresent() ; }
>
>    // Prefixes ??
> }
>
>
> https://github.com/afs/AFS-Dev/tree/master/src/main/java/projects/dsg2
> also has the companion DatasetGraphStorage.
>
>Andy
>
>
>
> On 04/03/16 12:03, Dick Murray wrote:
>> LOL. The perils of a succinct update with no detail!
>>
>> I understand the Jena SPI supports read/writes via transactions and I
also
>> know that the wrapper classes provide a best effort for some of the
>> overridden methods which do not always sit well when materializing
triples.
>> For example DatasetGraphBase provides public boolean containsGraph(Node
>> graphNode) {return contains(graphNode, Node.ANY, Node.ANY, Node.ANY);}
>> which results in a call to DatasetGraphBaseFind public Iterator
>> find(Node g, Node s, Node p, Node o) which might end up with something
>> being called in DatasetGraphInMemory depending on what has been extended
>> and overridden. This causes a problem for me because I shim the finds to
>> decide whether the triples have been materialized before calling the
>> overridden find. After extending DatasetGraphTriples and
>> DatasetGraphInMemory I realised that I had overridden most of the methods
>> so I stopped and implemented DatasetGraph and Transactional.
>>
>> In my scenario the underlying data (a vendor agnostic format to get
>> AutoCAD, Bentley, etc to work together) is never changed so the
>> DatasetGraph need not support writes. Whilst we need to provide semantic
>> access to these files, they result in ~100M triples each if
transformed,
>> there are 1000's of files, they can change multiple times per day and the
>> various disciplines typically only require a subset of triples.
>>
>> That said, in my DatasetGraph implementation if you call
>> begin(ReadWrite.WRITE) it throws a UOE. The same is true for the Graph
>> implementation in that it does not support external writes (throws UOE)
but
>> does implement writes internally (via TriTable) because it needs to write
>> the materialized triples to answer the find.
>>
>> So if we take
>>
>> select ?s
>> where {graph  {?s a
>> }
>>
>> Jena via the SPARQL query engine will perform the following abridged
>> process.
>>
>>- Jena begins a DG read transaction.
>>- Jena calls DG find(,
ANY,
>>a ).
>>- DG will;
>>   - check if the repository r has been loaded, i.e. matching the
>>   repository name URI spec fragment to a repository file on disk
>> and loading
>>   it into the SDAI session.
>>   - check if the model m has been loaded, i.e. matching the model
name
>>   URI spec fragment to a repository model and loading it into the
SDAI
>>   session.
>>  - If we have just loaded the SDAI model check if there is any
pre
>>  caching to be done which is just a set of find triples which
>> are handled as
>>  per the normal find detailed following.
>>   - We now have a G which wraps the SDAI model and uses TriTable to
>>hold materialized triples.
>>- DG will now call G.find(ANY, a
>>).
>>- G will check the find triple against a set of already materialized
>>find triples and if it misses;
>>   - G will search a set of triple handles which know how to
materialize
>>   triples for a given find triple and if found;
>>  - G begins a TriTable write transaction and for {ANY, a
>>  } (i.e the DG & G
>> are READ but
>>  the G TriTable is WRITE);
>> - Check the find triples again we might have been in a race
for
>> the find triple and lost...
>> - Load the correct Java class for entity e which involves
>> minting the FQCN using the schema s and entity e e.g.
>> ifc2x3 and ifcslab
>> become org.jsdai.ifc2x3.ifcslab.
>> - Use this to call the SDAI method findInstances(Class> extends Entity> entityClass) which returns zero or more
>> SDAI entities from
>> which we;
>> 

Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

2016-03-04 Thread Dick Murray
he fast response!
> >>>> I have a set of disk-based binary SDAI repositories which are
> based on ISO10303 parts 11/21/25/27 otherwise known as the
> EXPRESS/STEP/SDAI parts. In particular my files are IFC2x3 files which can
> be +1Gb. However after processing into a SDAI binary I typically see a size
> reduction e.g. 1.4Gb STEP file becomes a 1Gb SDAI repository. If I convert
> the STEP file into TDB I get +100M quads and a 50Gb folder. Multiplied by
> 1000's of similar sized STEP files...
> >>>> Typically only a small subset of the STEP file needs to be queried
> but sometimes other parts need to be queried. Hence the on demand caching
> and DatasetGraphInMemory. The aim is that in the find methods I check a
> cache and call the native SDAI find methods based on the node URIs in the
> case of a cache miss, calling the add methods for the minted tuples, then
> passing on the call to the super find. The underlying SDAI repositories are
> static so once a subject is cached no other work is required.
> >>>> As the DatasetGraphInMemory is commented as very fast quad and triple
> access it seemed a logical place to extend. The shim cache would be set to
> expire entries and limit the total number of tuples per repository. This
> is currently deployed on a 256GB RAM device.
> >>>> In the bigger picture I have a service very similar to Fuseki which
> allows SPARQL requests to be made against Datasets which are either TDB or
> SDAI cache backed.
> >>>> What was DatasetGraphInMemory created for..? ;-)
> >>>> Dick
> >>>>
> >>>>  Original message 
> >>>> From: "A. Soroka" <aj...@virginia.edu>
> >>>> Date: 12/02/2016  6:21 pm  (GMT+00:00)
> >>>> To: users@jena.apache.org
> >>>> Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
> DatasetGraphInMemory
> >>>>
> >>>> I wrote the DatasetGraphInMemory  code, but I suspect your question
> may be better answered by other folks who are more familiar with Jena's
> DatasetGraph implementations, or may actually not have anything to do with
> DatasetGraph (see below for why). I will try to give some background
> information, though.
> >>>>
> >>>> There are several paths by which DatasetGraphInMemory can be
> performing finds, but they come down to two places in the code, QuadTable::
> and TripleTable::find and in default operation, the concrete forms:
> >>>>
> >>>>
> https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapQuadTable.java#L100
> >>>>
> >>>> for Quads and
> >>>>
> >>>>
> https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapTripleTable.java#L99
> >>>>
> >>>> for Triples. Those methods are reused by all the differently-ordered
> indexes within Hex- or TriTable, each of which will answer a find by
> selecting an appropriately-ordered index based on the fixed and variable
> slots in the find pattern and using the concrete methods above to stream
> tuples back.
> >>>>
> >>>> As to why you are seeing your methods called in some places and not
> in others, DatasetGraphBaseFind features methods like findInDftGraph(),
> findInSpecificNamedGraph(), findInAnyNamedGraphs() etc. and that these are
> the methods that DatasetGraphInMemory is implementing. DSGInMemory does not
> make a selection between those methods— that is done by
> DatasetGraphBaseFind. So that is where you will find the logic that should
> answer your question.
> >>>>
> >>>> Can you say a little more about your use case? You seem to have some
> efficient representation in memory of your data (I hope it is in-memory—
> otherwise it is a very bad choice to subclass DSGInMemory) and you want to
> create tuples on the fly as queries are received. That is really not at all
> what DSGInMemory is for (DSGInMemory is using map structures for indexing
> and in default mode, uses persistent data structures to support
> transactionality). I am wondering whether you might not be much better
> served by tapping into Jena at a different place, perhaps implementing the
> Graph SPI directly. Or, if reusing DSGInMemory is the right choice, just
> implementing Quad- and TripleTable and using the constructor
> DatasetGraphInMemory(final QuadTable i, final TripleTable t).
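
A sketch of that constructor route (assuming the default in-memory table
implementations and the Jena 3.x package org.apache.jena.sparql.core.mem;
exact class names and constructors may differ between releases):

    QuadTable quads = new HexTable();
    TripleTable triples = new TriTable();
    DatasetGraph dsg = new DatasetGraphInMemory(quads, triples);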
> >>>>
> >>>> ---
> >>>> A. Soroka
> >>>> The University of Virginia Library
> >>>>
> >>>>> On Feb 12, 2016, at 12:58 PM, Dick Murray <dandh...@gmail.com>
> wrote:
> >>>>>
> >>>>> Hi.
> >>>>>
> >>>>> Does anyone know the "find" paths through DatasetGraphInMemory
> please?
> >>>>>
> >>>>> For example if I extend DatasetGraphInMemory and override
> >>>>> DatasetGraphBaseFind.find(node, Node, Node, Node) it breakpoints on
> "select
> >>>>> * where {?s ?p ?o}" however if I override the other
> >>>>> DatasetGraphBaseFind.find(...) methods, "select * where {graph ?g
> {?s ?p
> >>>>> ?o}}" does not trigger a breakpoint i.e. I don't know what method
> it's
> >>>>> calling (but as I type I'm guessing it's optimised to return the
> HexTable
> >>>>> nodes...).
> >>>>>
> >>>>> Would I be better off overriding HexTable and TriTable classes find
> methods
> >>>>> when I create the DatasetGraphInMemory? Are all finds guaranteed to
> end in
> >>>>> one of these methods?
> >>>>>
> >>>>> I need to know the root find methods so that I can shim them to
> create
> >>>>> triples/quads before they perform the find.
> >>>>>
> >>>>> I need to create Triples/Quads on demand (because a bulk load would
> create
> >>>>> ~100M triples but only ~1000 are ever queried) and the source binary
> form
> >>>>> is more efficient (binary ~1GB native tree versus TDB ~50GB ~100M
> quads)
> >>>>> than quads.
> >>>>>
> >>>>> Regards Dick Murray.
> >>>>
> >>>
> >>
> >
>
>


POM issue with 3.0.1

2016-02-05 Thread Dick Murray
Hi, I'm trying to get 3.0.1 to build using Eclipse/Maven but it refuses
to "find" it in the central repository.

I ran mvn dependency:get
-Dartifact=org.apache.jena:apache-jena-libs:jar:3.0.1 and got the
following...

Am I missing something..?

[INFO] Resolving org.apache.jena:apache-jena-libs:3.0.1:jar with transitive
dependencies
Downloading:
https://repo.maven.apache.org/maven2/org/apache/jena/apache-jena-libs/jar/apache-jena-libs-jar.pom
[WARNING] Missing POM for org.apache.jena:apache-jena-libs:3.0.1:jar
Downloading:
https://repo.maven.apache.org/maven2/org/apache/jena/apache-jena-libs/jar/apache-jena-libs-jar.3.0.1
[INFO]

[INFO] BUILD FAILURE
[INFO]

[INFO] Total time: 20.303 s
[INFO] Finished at: 2016-02-05T10:51:35+00:00
[INFO] Final Memory: 13M/239M
[INFO]

[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-dependency-plugin:2.8:get (default-cli) on
project iungo-core: Couldn't download artifact: Missing:
[ERROR] --
[ERROR] 1) org.apache.jena:apache-jena-libs:3.0.1:jar
[ERROR]
[ERROR] Try downloading the file manually from the project website.
[ERROR]
[ERROR] Then, install it using the command:
[ERROR] mvn install:install-file -DgroupId=org.apache.jena
-DartifactId=apache-jena-libs -Dversion=jar -Dpackaging=3.0.1
-Dfile=/path/to/file
[ERROR]
[ERROR] Alternatively, if you host your own repository you can deploy the
file there:
[ERROR] mvn deploy:deploy-file -DgroupId=org.apache.jena
-DartifactId=apache-jena-libs -Dversion=jar -Dpackaging=3.0.1
-Dfile=/path/to/file -Durl=[url] -DrepositoryId=[id]
[ERROR]
[ERROR] Path to dependency:
[ERROR] 1) org.apache.maven.plugins:maven-downloader-plugin:jar:1.0
[ERROR] 2) org.apache.jena:apache-jena-libs:3.0.1:jar
[ERROR]
[ERROR] --
[ERROR] 1 required artifact is missing.
[ERROR]
[ERROR] for artifact:
[ERROR] org.apache.maven.plugins:maven-downloader-plugin:jar:1.0
[ERROR]
[ERROR] from the specified remote repositories:
[ERROR] central (https://repo.maven.apache.org/maven2, releases=true,
snapshots=false)
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions,
please read the following articles:
[ERROR] [Help 1]
http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
dick@Dick-M3800:~/mvntest$
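
(For the record: apache-jena-libs is published as a POM-type artifact, and
dependency:get expects groupId:artifactId:version[:packaging], so a working
form of the command should be along these lines:

    mvn dependency:get -Dartifact=org.apache.jena:apache-jena-libs:3.0.1:pom

The output above shows "jar" being treated as the version, hence the odd
download URLs.)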


Re: Inserting large volumes into a RW TDB store.

2014-10-21 Thread Dick Murray
I might be confusing the DynamicDataset...

Dick

On 20 October 2014 20:40, Dick Murray dandh...@gmail.com wrote:

 Thanks that confirms what I thought.

 Crazy idea time!

 Am I correct in thinking that there is a dataset view which allows you
 to present multiple datasets as one? I'm sure I saw it in the codebase some
 time back?

 If I present the current datasets using this view I can create a new
 dataset and load in the new quads without a transaction then add it to a
 transient reference which is used by the system from then on and the old
 view would then be GC.

 This would keep the concurrency in the system and keep failures within a
 dataset. Currently the TDB is 53GB for the 120M triples and it's estimated
 that it will grow by the same amount every working day which equates to
 31,200M or 31B triples and 13,780GB or 14TB on disk in a year...

 Dick

 On 20 Oct 2014 17:56, Andy Seaborne a...@apache.org wrote:
 
  On 20/10/14 10:12, Dick Murray wrote:
 
  Hello all.
 
  Are there any pointers to inserting large volumes of data in a
 persistent
  RW TDB store please?
 
  I currently have a 8M line 500MB+ input file which is being parsed by
  JavaCC and the created quads inserted into a TDB store.
 
  The process generates 120M quads and takes just over 2hrs which is;
 
  60M quads/hr or
  1M quads/min or
  1 quads/sec.
 
  Parse is single-threaded (12% core utilization, i.e. 100% of one core) with -Xmx8GB
  (16GB available) on a i7 8 core and a 512GB SSD.
 
  I am working with the datasetGraph after opening the TDB store to remove
  any extra code which might slow the process down. I begin/commit a
  transaction for every 1000 input rows as prior to this an OOME occurred
 after
  ~3M input rows if I tried to wrap the entire load in a transaction. The
 TDB
  store is being read from so I am unable to use a TDB loader.
 
  I don't believe the runtime is poor but any pointers which would improve
  the speed...
 
 
  Dick,
 
  If you are loading into a live TDB store with transactions, there will
 be less performance than bulk loading offline.  The system is a bit read-centric.
 
  The only tuning parameter you have at your disposal is the commit size.
 1000 is very small - try more like 100K.
 
  This isn't inside Fuseki so some batching already occurs but the size of
 transactions themselves can make a difference.
 
  Andy
 



Inserting large volumes into a RW TDB store.

2014-10-20 Thread Dick Murray
Hello all.

Are there any pointers to inserting large volumes of data in a persistent
RW TDB store please?

I currently have a 8M line 500MB+ input file which is being parsed by
JavaCC and the created quads inserted into a TDB store.

The process generates 120M quads and takes just over 2hrs which is;

60M quads/hr or
1M quads/min or
1 quads/sec.

Parse is single-threaded (12% core utilization, i.e. 100% of one core) with -Xmx8GB
(16GB available) on a i7 8 core and a 512GB SSD.

I am working with the datasetGraph after opening the TDB store to remove
any extra code which might slow the process down. I begin/commit a
transaction for every 1000 input rows as prior to this an OOME occurred after
~3M input rows if I tried to wrap the entire load in a transaction. The TDB
store is being read from so I am unable to use a TDB loader.

I don't believe the runtime is poor but any pointers which would improve
the speed...

Dick.
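
A batching sketch for this kind of live-store load (commit every N quads
instead of every 1,000 input rows; package names are Jena 3.x style, and the
quad source is illustrative rather than the actual JavaCC parser):

    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.ReadWrite;
    import org.apache.jena.sparql.core.DatasetGraph;
    import org.apache.jena.sparql.core.Quad;
    import org.apache.jena.tdb.TDBFactory;

    public class BatchedLoad {
        static final int BATCH = 100_000;   // the order of magnitude suggested elsewhere in this thread

        public static void load(Iterable<Quad> quads, String tdbDir) {
            Dataset ds = TDBFactory.createDataset(tdbDir);
            DatasetGraph dsg = ds.asDatasetGraph();
            long n = 0;
            ds.begin(ReadWrite.WRITE);
            try {
                for (Quad q : quads) {
                    dsg.add(q);
                    if (++n % BATCH == 0) {   // commit and start a fresh write transaction
                        ds.commit();
                        ds.end();
                        ds.begin(ReadWrite.WRITE);
                    }
                }
                ds.commit();
            } finally {
                ds.end();
            }
        }
    }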


Re: Inserting large volumes into a RW TDB store.

2014-10-20 Thread Dick Murray
Thanks that confirms what I thought.

Crazy idea time!

Am I correct in thinking that there is a dataset view which allows you to
present multiple datasets as one? I'm sure I saw it in the codebase some
time back?

If I present the current datasets using this view I can create a new
dataset and load in the new quads without a transaction then add it to a
transient reference which is used by the system from then on and the old
view would then be GC.

This would keep the concurrency in the system and keep failures within a
dataset. Currently the TDB is 53GB for the 120M triples and it's estimated
that it will grow by the same amount every working day which equates to
31,200M or 31B triples and 13,780GB or 14TB on disk in a year...

Dick

On 20 Oct 2014 17:56, Andy Seaborne a...@apache.org wrote:

 On 20/10/14 10:12, Dick Murray wrote:

 Hello all.

 Are there any pointers to inserting large volumes of data in a persistent
 RW TDB store please?

 I currently have a 8M line 500MB+ input file which is being parsed by
 JavaCC and the created quads inserted into a TDB store.

 The process generates 120M quads and takes just over 2hrs which is;

 60M quads/hr or
 1M quads/min or
 1 quads/sec.

 Parse is single-threaded (12% core utilization, i.e. 100% of one core) with -Xmx8GB
 (16GB available) on a i7 8 core and a 512GB SSD.

 I am working with the datasetGraph after opening the TDB store to remove
 any extra code which might slow the process down. I begin/commit a
 transaction for every 1000 input rows as prior to this an OOME occurred
after
 ~3M input rows if I tried to wrap the entire load in a transaction. The
TDB
 store is being read from so I am unable to use a TDB loader.

 I don't believe the runtime is poor but any pointers which would improve
 the speed...


 Dick,

 If you are loading into a live TDB store with transactions, there will be
less performance than bulk loading offline.  The system is a bit read-centric.

 The only tuning parameter you have at your disposal is the commit size.
1000 is very small - try more like 100K.

 This isn't inside Fuseki so some batching already occurs but the size of
transactions themselves can make a difference.

 Andy



Re: Dynamic graph/model inference within a select.

2013-06-19 Thread Dick Murray
Hi sorry PICNIC moment...

I get the following results, which is what I need. Against the dataset I get
2 results for each graph. Against the DatasetGraphMap with the override I get 2
results for g2 but 4 results for g1 because of the RDFS inference.

select * where {{ ?g a <http://example.org/graph#Graph> } . graph ?g {
<http://example.org/trek/Triton> ?p ?o } }
----------------------------------------------------------------------------------------------------------------------
| g                            | p                                                | o                                    |
========================================================================================================================
| http://example.org/graphs/g1 | http://www.w3.org/1999/02/22-rdf-syntax-ns#type  | http://example.org/bikes#SingleSpeed |
| http://example.org/graphs/g1 | http://www.w3.org/2000/01/rdf-schema#label       | Triton                               |
| http://example.org/graphs/g2 | http://www.w3.org/1999/02/22-rdf-syntax-ns#type  | http://example.org/bikes#SingleSpeed |
| http://example.org/graphs/g2 | http://www.w3.org/2000/01/rdf-schema#label       | Triton                               |
----------------------------------------------------------------------------------------------------------------------

select * where {{ ?g a <http://example.org/graph#Graph> } . graph ?g {
<http://example.org/trek/Triton> ?p ?o } }
----------------------------------------------------------------------------------------------------------------------
| g                            | p                                                | o                                    |
========================================================================================================================
| http://example.org/graphs/g1 | http://www.w3.org/1999/02/22-rdf-syntax-ns#type  | http://example.org/bikes#SingleSpeed |
| http://example.org/graphs/g1 | http://www.w3.org/2000/01/rdf-schema#label       | Triton                               |
| http://example.org/graphs/g1 | http://www.w3.org/1999/02/22-rdf-syntax-ns#type  | http://example.org/bikes#Road        |
| http://example.org/graphs/g1 | http://www.w3.org/1999/02/22-rdf-syntax-ns#type  | http://example.org/bikes#Bike        |
| http://example.org/graphs/g2 | http://www.w3.org/1999/02/22-rdf-syntax-ns#type  | http://example.org/bikes#SingleSpeed |
| http://example.org/graphs/g2 | http://www.w3.org/2000/01/rdf-schema#label       | Triton                               |
----------------------------------------------------------------------------------------------------------------------


The problem with the 2-phase approach is that the phase 1 query isn't easy, and
then there's the question of whether it's atomic. Ideally it needs to be dynamic
as the triples/quads are iterated.

Are you saying that optimization might cause the getGraph not to be called?
Being called more than once for the same graph I can get around i.e. keep a
cache for the duration of the query.
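
A sketch of that per-query cache (illustrative names; asRDFS is the helper from
the example later in this thread):

    final Map<Node, Graph> cache = new HashMap<>();
    DatasetGraphMap dgm = new DatasetGraphMap(dataset.asDatasetGraph()) {
        @Override
        public Graph getGraph(Node graphNode) {
            Graph g = cache.get(graphNode);
            if (g == null) {
                g = asRDFS(super.getGraph(graphNode));   // wrap once per query
                cache.put(graphNode, g);
            }
            return g;
        }
    };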

Thanks again.

Dick.



On 19 June 2013 12:20, Andy Seaborne a...@apache.org wrote:

 On 18/06/13 18:22, Dick Murray wrote:

 I'm looking for dynamic inference based on the select. Given a dataset
 with
 multiple named graphs I would like the ability to wrap specific named
 graphs based on some form of filter when the select is processed.


 The dataset being queried can not be manipulated during the query.

 Whether getGraph is called is evaluator dependent (it's not in TDB which
 works on quads).

 There is no guarantee a query is executed in a particular order.  It could
 do the GRAPH bit before the dft graph access.  Currently, that's unlikely,
 but there is no guarantee.  Oh - and it may happen twice for rewritten
 queries (equality optimizations like to generate multiple more grounded
 accesses).


  Given the dataset D which contains the named graphs G1, G2, G3 I would
 like
 G2 to be returned with RDFS inference if it is queried in a select. I have
 achieved this by wrapping the graph as an InfModel and using a
 DatasetGraphMap but this requires that the graph be known before the
 select
 is executed. What I'm trying to find (if it exists) is the point during
 the
 select processing when the graph is identified and used? Does this exist
 in
 a TDB Dataset or is it just a set of quads?


 Such a point exists (OpExecutor.execute(OpGraph)) but because of
 optimization and/or converting to quads.

 A TDB datasetgraph is a set of triples (dft graph) + a set of quads (named
 graphs). Full quad-based optimization isn't really done currently but it
 will be in future so any internal approach is going to be vulnerable to
 changes.

 I think you need a 2-phase approach.

 Phase-1 is setup - query the data and determine which graphs to add
 inference to.

 Phase-2 : Build a new datasetgraph and then query that for the real
 answers.
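
A minimal sketch of that two-phase shape (Jena 3.x-style API; names and the
graph-type URI follow the example queries in this thread and are otherwise
illustrative):

    // Phase 1: find the graphs that should get RDFS inference.
    List<Node> inferredGraphs = new ArrayList<>();
    String phase1 = "SELECT ?g WHERE { ?g a <http://example.org/graph#Graph> }";
    try (QueryExecution qe = QueryExecutionFactory.create(phase1, dataset)) {
        ResultSet rs = qe.execSelect();
        while (rs.hasNext())
            inferredGraphs.add(rs.next().getResource("g").asNode());
    }
    // Phase 2: build a new dataset/DatasetGraph whose getGraph wraps those
    // graphs as RDFS InfModels (as in the DatasetGraphMap override elsewhere
    // in this thread) and run the real query against that.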

 Maybe that's what you are doing.  If you query dgm I'd expect to see the
 RDFS inferences but it does not show where you issue the query. Complete,
 minimal example?

 Andy

Re: Dynamic graph/model inference within a select.

2013-06-18 Thread Dick Murray
I'm looking for dynamic inference based on the select. Given a dataset with
multiple named graphs I would like the ability to wrap specific named
graphs based on some form of filter when the select is processed.

Given the dataset D which contains the named graphs G1, G2, G3 I would like
G2 to be returned with RDFS inference if it is queried in a select. I have
achieved this by wrapping the graph as an InfModel and using a
DatasetGraphMap but this requires that the graph be known before the select
is executed. What I'm trying to find (if it exists) is the point during the
select processing when the graph is identified and used? Does this exist in
a TDB Dataset or is it just a set of quads?

Dick.


On 18 June 2013 16:01, Andy Seaborne a...@apache.org wrote:

 Dick,

 I'm not completely sure what you're trying to do - a complete minimal
 example showing how the bits and pieces fit together would be good.  It
 seems to be querying the dataset without the inference graph.  I don't see
 where you query the dataset (and which one)

  if (graphNode.getURI().equals(types.getURI())) {

 if ( graphNode.equals(types.asNode()) ) {



 On 18/06/13 14:22, Dick Murray wrote:

 Hi.

 Is it possible to get at the graph i.e. the ?g (specifically the returned
 nodes) when the following query is executed?


 Yes - getGraph / getNamedModel depending on which level you're working at.


  SELECT  *
  WHERE
    { { ?g  <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
            <http://www.unit4.com/daas/graph#Graph> }
      GRAPH ?g
        { ?s ?p ?o }
    }

 When the result is instantiated I want to return the ?g as an RDFS
 infmodel. Ideally I want to decide what to return based on the ?g. I've
 traced the execSelect() and the ResultSetMem() but drew a blank as to
 where
 I can get at the ?g's!


 ResultSet.next().getResource("g") ;
or
 ResultSet.nextBinding().get(Var.alloc("g")) ;




 The following allows me to wrap the returned graph but this is static i.e.
 I need to know the ?g's to generate the dgm to pass to the
 QueryExecutionFactory.

 dataset.begin(ReadWrite.READ);
 DatasetGraphMap dgm = new DatasetGraphMap(dataset.asDatasetGraph()) {

     @Override
     public Graph getGraph(Node graphNode) {
         Graph g = super.getGraph(graphNode);
         if (graphNode.getURI().equals(types.getURI())) {
             g = asRDFS(g);
         }
         return g;
     }

     public Graph asRDFS(Graph g) {
         return ModelFactory.createRDFSModel(ModelFactory.createModelForGraph(g)).getGraph();
     }
 };
 Graph g = dgm.getGraph(types.asNode());
 info(g.size());
 dataset.end();

 For the following triples loaded in the default graph;

 @prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
 @prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
 @prefix graph:  <http://www.unit4.com/daas/graph#> .
 @prefix graphs: <http://www.unit4.com/daas/graphs/> .

 graph:Graph
     rdf:type rdfs:Class .
 graphs:g1
     rdf:type graph:Graph .

 and these loaded in a named graph <http://www.unit4.com/daas/graphs/g1> ;

 @prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
 @prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
 @prefix graph:  <http://www.unit4.com/daas/graph#> .
 @prefix graphs: <http://www.unit4.com/daas/graphs/> .

 graphs:g1
     rdfs:label "Graph 1" .

 A select returns;

 select * where {{ ?g a <http://www.unit4.com/daas/graph#Graph> }. graph ?g
 {?s ?p ?o}}
 -------------------------------------------------------------------------------------------------------------------------------
 | g                                   | s                                   | p                                           | o       |
 =================================================================================================================================
 | http://www.unit4.com/daas/graphs/g1 | http://www.unit4.com/daas/graphs/g1 | http://www.w3.org/2000/01/rdf-schema#label  | Graph 1 |
 -------------------------------------------------------------------------------------------------------------------------------
 

 What I want is for it to return about 40 more... :-)





Re: Issue adding Triples to Graph which has been added to a DatasetGraph.

2012-11-30 Thread Dick Murray
Hi Andy.

Thanks for the reply and just so I am sure I understand this... :-)

Looking at the Jena code for a SPARUL create graph urn:test against a
folder backed TDB store it gives the following stack;

Daemon Thread [Thread-24] (Suspended)
GraphMem(GraphMemBase).<init>(ReificationStyle) line: 52
GraphMem.<init>(ReificationStyle) line: 35
Factory.createGraphMem(ReificationStyle) line: 50
Factory.createDefaultGraph(ReificationStyle) line: 44
Factory.createDefaultGraph() line: 38
GraphFactory.createJenaDefaultGraph() line: 54
GraphFactory.createDefaultGraph() line: 48

U4DefaultGraph$U4UpdateEngineWorker(UpdateEngineWorker).visit(UpdateCreate)
line: 140
U4DefaultGraph$U4UpdateEngineWorker.visit(UpdateCreate) line: 369
UpdateCreate.visit(UpdateVisitor) line: 58
U4DefaultGraph$U4UpdateEngine.execute() line: 314
UpdateProcessorBase.execute() line: 56
U4DefaultGraph.graphSPARUL(U4HTTPDoRequestMessage) line: 1065
U4DefaultGraph$5.go(Object) line: 857
U4DefaultGraph$3.go(Object) line: 737
U4DefaultGraph(U4DefaultNode).nodeMessageHandler(U4Message) line:
713
U4DefaultNode$MessageHandler.run() line: 625
ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) line: 1110
ThreadPoolExecutor$Worker.run() line: 603
Thread.run() line: 722

Specifically GraphFactory.createDefaultGraph() line: 48 is;

/**
Answer a memory-based Graph with the Standard reification style.
*/
public static Graph createDefaultGraph()
{ return createDefaultGraph( ReificationStyle.Standard ); }

So Jena uses a memory based graph to fulfil the create graph request.

But uses add(Quad) to fulfil the insert data [into {iri}] {...} request as
detailed with the following stack.

DatasetGraphTxn(DatasetGraphTriplesQuads).add(Quad) line: 33
DatasetGraphTransaction(DatasetGraphTrackActive).add(Quad) line: 133
GraphStoreBasic(DatasetGraphWrapper).add(Quad) line: 72
UpdateEngineWorker.addToGraphStore(GraphStore, Quad) line: 484
U4DefaultGraph$U4UpdateEngineWorker(UpdateEngineWorker).visit(UpdateDataInsert)
line: 281

So I need to add the graph (as an empty mem-based graph) to the TDB-backed
dataset and then add quads to the dataset..?

Dick.
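
A sketch of the quad route (TDB stores quads, so named-graph data goes in as
quads on the dataset; Jena 3.x naming shown - the API of the day used
Node.createURI - and s/p/o stand in for the triple's nodes):

    Dataset d = TDBFactory.createDataset();
    Node g = NodeFactory.createURI("urn:graph");
    d.begin(ReadWrite.WRITE);
    try {
        d.asDatasetGraph().add(new Quad(g, s, p, o));
        d.commit();
    } finally {
        d.end();
    }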


On 29 November 2012 21:31, Andy Seaborne a...@apache.org wrote:

 On 29/11/12 21:24, Andy Seaborne wrote:

   final Dataset d = TDBFactory.createDataset();
   Node n = Node.createURI(urn:graph);
   d.addGraph(n, GraphFactory.**createDefaultGraph());

 (sorry for the delay replying)

 I'm afraid that isn't going to work.  I suspect it should be an error.


 It seems that the code adds in the contents of the graph into the dataset
 by copying over the triples - i.e. update before adding.

 Andy



 A TDB dataset is a triples+quad store.  You can't add an in-memory
 storage-backed graph to a dataset backed by TDB.

 If you want a mixed dataset, you can create an in-memory dataset and add
 in TDB backed graphs.

  Andy

 On 28/11/12 20:32, Dick Murray wrote:

 Hi all.

 I have an issue where triples which are added to a graph which has been
 added to a dataset are not visible in the dataset.

 However if I add the graph then add quads to the dataset with the quad
 graph node as the node used to add the graph the quads are visible.

 Is this expected?





Re: Extending UpdateVisitor to provide security within visit(UpdateVisitor visitor).

2012-10-30 Thread Dick Murray
) ;
for ( Update up : request ) {
up.visit(worker) ;
}
graphStore.finishRequest(request) ;
}
}

class CustomUpdateEngineWorker extends UpdateEngineWorker {

    public CustomUpdateEngineWorker(GraphStore graphStore, Binding initialBinding, Context context) {
        super(graphStore, initialBinding, context);
    }

    @Override
    public void visit(UpdateDrop update) {
        logger.info(String.format("visit(%s)", update));
        super.visit(update);
    }

}

}


On 29 October 2012 21:02, Andy Seaborne a...@apache.org wrote:

 There is a slightly tricky point here - if you deny an operation, then a
 partial operation gets done. That's OK on a transactional system - it simply
 aborts - but not if it isn't transactional storage.  It might be better to
 assess the update before dispatching it to the execution engine.

 That can be done with Rob's suggestion - do the whole of the request
 before any execution.

 Andy




 On 29/10/12 18:02, Rob Vesse wrote:

 Re Step 2 - I just made a commit to trunk so that with the latest code you
 can extend UpdateEngineMain and simply override the protected
 prepareWorker() method to return your custom UpdateVisitor rather than the
 default UpdateEngineWorker, i.e. this avoids the need to implement the
 execute() method yourself.


 Also in Step 3 you should be extending from UpdateEngineWorker rather than
 the non-existent UpdateWorker

 Hope this helps,

 Rob

 On 10/29/12 9:53 AM, Rob Vesse rve...@cray.com wrote:

  Hey Dick

 Yes we do this in our product in a production environment to replace the
 standard update handling with our own completely custom one

 Your desired extension is actually even easier than ours, extending
 Update
 evaluation basically requires you to do three things as follows.

 1 - Create and register a custom UpdateEngineFactory

 Create a class that implements the UpdateEngineFactory interface; this has
 two methods for you to implement.  Simply return true for the accept()
 method to indicate you wish to handle all updates and then for the
 create() method return a new instance of the class you create in Step 2

 Your code will need to ensure that this factory gets registered by
 calling
 UpdateEngineRegistry.add(new CustomUpdateEngineFactory()); in order for
 your code to intercept updates.

 2 - Create a custom UpdateEngine

 Create a class that extends from UpdateEngineBase and implements the
 abstract execute() method, you can simply modify the default
 implementation found in UpdateEngineMain like so:

 @Override
 public void execute()
 {
     graphStore.startRequest(request) ;
     CustomUpdateEngineWorker worker = new CustomUpdateEngineWorker(graphStore, startBinding, context) ;
     for ( Update up : request ) {
         up.visit(worker) ;
     }
     graphStore.finishRequest(request) ;
 }


 3 - Create your custom UpdateVisitor

 Create a class that extends from UpdateWorker; this is the class you are
 referencing as CustomUpdateEngineWorker from Step 2 - I assume you will pick a
 more appropriate name.  Then you simply override the methods that you
 want
 to add access control functionality to like so:

 @Override
 public void visit(UpdateCreate update)
 {
   if (deny(args)) {
 //Handle the error case
   } else {
 //Otherwise defer to normal logic
 super.visit(update);
   }
 }

 Hope this helps,

 Rob



 On 10/29/12 6:43 AM, Dick Murray dandh...@gmail.com wrote:

  Hi all

 I need to permit/deny certain SPARUL update operations e.g. deny create|
 drop graph.

 I've looked at the UpdateEngineMain and UpdateVisitor classes and was
 wondering if anyone has extended or encapsulated these before? Ideally
 I'd
 like to capture the visit just prior to the actual visit.

 i.e. the UpdateEngineWorker has...

 @Override
 public void visit(UpdateCreate update)
 {
 Node g = update.getGraph() ;
 if ( g == null )
 return ;
 if ( graphStore.containsGraph(g) )
 {
 if ( ! alwaysSilent && ! update.isSilent() )
     error("Graph store already contains graph : "+g) ;
 return ;
 }
 // In-memory specific
 graphStore.addGraph(g, GraphFactory.**createDefaultGraph()) ;
 }

 ...and I need a...

 if (deny(...)) {
 error("update create denied");
 return;

 Also need it to work whether the graphStore was from a Dataset or
 TDB...
 ideally... :-)

 Regards Dick.