Re: Clearing cache does not affect query execution time

2015-08-24 Thread Rob Vesse
Comments inline:

On 24/08/2015 14:28, Ankur Padia padiaan...@gmail.com wrote:

Hello everyone,

   I want to execute a batch of queries on a Fuseki server with default
settings, clearing the cache beforehand. To do that, I used the
following Linux command:

*sync; echo 3 | sudo tee /proc/sys/vm/drop_caches*

However, instead of execution times in seconds, which would mean the results
were fetched from the hard disk, I am getting results in milliseconds.

TDB, the storage layer used underneath Fuseki by default, uses memory-mapped
files together with its own in-process caches.

Dropping the OS caches while Fuseki is running therefore has little effect:
pages that are mapped into the live server process are generally not released
by drop_caches, and the in-process caches are not touched at all.



Also, the very first query takes seconds, whereas the queries that follow
it take a few milliseconds, even though they have the same number of triple
patterns.

There is also the issue of JVM class loading: the first few queries will
likely cause many classes to be loaded by the JVM, which is a one-time
start-up cost paid by any Java application.

If you are trying to do benchmarking then you should always run warm-ups
to bring the system to a hot state, since running from a cold state is
likely to hit discrepancies like this and won't give you reproducible
figures.

Rob
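
As a rough illustration of the warm-up advice above, here is a minimal timing harness in Python. It is only a sketch: `run_query` is a hypothetical callable that you would replace with code sending one query to your server, and the warm-up and run counts are arbitrary defaults, not recommendations from this thread.

```python
import statistics
import time
from typing import Callable, List


def benchmark(run_query: Callable[[], object], warmups: int = 5, runs: int = 10) -> List[float]:
    """Call run_query `warmups` times untimed, then return `runs` timings in seconds."""
    for _ in range(warmups):
        run_query()  # result discarded; these calls only heat caches and the JIT
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Stand-in workload; swap in a function that sends one query to the server.
    timings = benchmark(lambda: sum(range(100_000)))
    print(f"median {statistics.median(timings) * 1000:.2f} ms over {len(timings)} runs")
```

Reporting the median of the timed runs, rather than the first run, keeps the one-time class-loading and cache-filling costs out of the figure.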


Can anyone please explain what is happening inside Fuseki?
Fuseki was hosted on Linux (Fedora 17).


Ankur Padia.






Re: Fuseki 2 HA or on-the-fly backups?

2015-08-24 Thread Jason Levitt
Great info, thanks.

 Some organisations achieve this by running a load balancer in front of
 several replicas then co-ordinating the update process.

So, they're running the same query against other nodes behind the load
balancer to keep things in sync?

 You can do a live backup

So, an HTTP POST /$/backup/*{name}*  initiates a backup and that
results in a gzip-compressed N-Quads file.
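
For reference, the backup call described above could be built from Python roughly as follows. This is a sketch under assumptions: the server address `http://localhost:3030` and the dataset name `ds` are placeholders, and a real server may require authentication for its admin endpoints. The request is only constructed here, not sent.

```python
import urllib.request


def backup_request(server: str, dataset: str) -> urllib.request.Request:
    """Build (but do not send) the POST that asks Fuseki to back up `dataset`."""
    url = f"{server.rstrip('/')}/$/backup/{dataset.strip('/')}"
    return urllib.request.Request(url, data=b"", method="POST")


# Placeholder server and dataset name; adjust to your installation.
req = backup_request("http://localhost:3030", "/ds")
print(req.get_method(), req.full_url)
# To actually trigger the backup: urllib.request.urlopen(req)
```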

What does a restore look like from that file?

-J




On Mon, Aug 24, 2015 at 4:08 AM, Rob Vesse rve...@dotnetrdf.org wrote:
 Andy already answered 1 but more on 2

 Assuming you use TDB then in-memory checkpointing already happens.  TDB
 caches data into memory but fundamentally is a persistent disk backed
 database that uses write-ahead logging for transactions and failure
 recovery so this already happens automatically and is below the level of
 Fuseki (you get this behaviour wherever you use TDB provided you use it
 transactionally which Fuseki always does)

 Rob

 On 24/08/2015 05:51, Jason Levitt slimands...@gmail.com wrote:

Just wondering if there are any projects out there
to provide:

1) HA (high availability) configuration of Fuseki such
as mirroring or hot/standby failover.

2) Some kind of on-the-fly backup of Fuseki when it's
running in RAM. This might be similar to how Hadoop
1.x checkpoints the in-RAM namenode data structures.

BTW, are there any tools for testing the consistency of the Fuseki
data structures when Fuseki is temporarily halted?

Cheers,

Jason






Re: Apache Maven

2015-08-24 Thread aj...@virginia.edu
Apache Maven is the build management system used by Jena. You will find info 
here:

https://maven.apache.org/

If you are just using Jena as a framework or toolkit you will probably not need 
Maven. If you intend to do actual development work on Jena, you will need to 
become familiar with Maven. Most Eclipse installs do indeed have Maven 
pre-integrated.

---
A. Soroka
The University of Virginia Library

On Aug 24, 2015, at 10:29 AM, kumar rohit kumar.en...@gmail.com wrote:

 What Apache Maven in Eclipse is used for? In what type of application we
 need it?
 Also I have found various tutorials showing downloading, installing, and
 integrating Maven in Eclipse but I think Eclipse has already Maven
 installed and integrated. Is it so?



Apache Maven

2015-08-24 Thread kumar rohit
What is Apache Maven in Eclipse used for? For what type of application do
we need it?
Also, I have found various tutorials showing how to download, install, and
integrate Maven in Eclipse, but I think Eclipse already has Maven
installed and integrated. Is that so?


Re: Clearing cache does not affect query execution time

2015-08-24 Thread Andy Seaborne
There is another important cache: the node table cache, which is not a
file system cache at all.  It's an in-process cache, in Java.  It caches
what is otherwise random I/O and potentially expensive.


Which caching effects dominate execution depends on what your queries
are.


SELECT * { ?s ?p ?o } is almost all node table - the index is a scan and 
very read-friendly even when cold.  This is especially true if the 
database has had little in the way of updates since first built - the 
loaders generate some indexes, like the one that query will use, in 
large, disk aligned units.


The only way to get reliable (and realistic - servers run for a long 
time) results is to pre-warm the system as Rob says.
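
To make the in-process cache point concrete, here is a toy Python model. It is purely conceptual and is not how TDB implements its node table cache: `node_for_id` and the counter are invented for the illustration. It shows that once a lookup has been cached inside the process, repeating the query causes no further "disk" reads, whatever happens to the OS caches.

```python
from functools import lru_cache

disk_reads = 0  # counts simulated trips to the node table on disk


@lru_cache(maxsize=100_000)
def node_for_id(node_id: int) -> str:
    """Simulated node-table lookup; only cache misses reach the 'disk'."""
    global disk_reads
    disk_reads += 1
    return f"<http://example.org/node/{node_id}>"


def run_query(ids):
    """Resolve every node id the 'query' touches."""
    return [node_for_id(i) for i in ids]


run_query(range(1000))  # cold pass: every lookup misses the cache
cold = disk_reads
run_query(range(1000))  # warm pass: served from the in-process cache
print(f"cold pass reads: {cold}, warm pass extra reads: {disk_reads - cold}")
```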


Andy

On 24/08/15 15:09, Rob Vesse wrote:

Comments inline:

On 24/08/2015 14:28, Ankur Padia padiaan...@gmail.com wrote:


Hello everyone,

   I want to execute a bunch of queries on Fuseki server using default
settings and want to clear cache. To accomplish the goal mentioned, I used
following linux command to clean the cache,

*sync; echo 3 | sudo tee /proc/sys/vm/drop_caches*

However, instead of getting execution time in seconds which means results
was fetched from harddisk, I am getting results in millisecond .


TDB the storage layer used underneath Fuseki by default uses memory mapped
files for its own caching which is separate and distinct from any OS level
caching

Dropping the OS caches thus has no effect because TDB does not rely on OS
caching




Also, the very first query takes time in seconds however the later queries
following it takes few millisecond both having same number of triple
pattern.


There is also the issue of JVM class loading, the first few queries will
likely cause many classes to be loaded by the JVM which is a one time
start up cost paid by any Java application.

If you are trying to do benchmarking then you should always run warm-ups
to bring the system to a hot state since running from a cold state is
always likely to hit discrepancies like this and won't give you
reproducible figures

Rob



Can anyone please guide me through the phenomena happening inside Fuseki ?
Fuseki was hosted on linux, Fedora 17.


Ankur Padia.









Re: Fuseki 2 HA or on-the-fly backups?

2015-08-24 Thread Andy Seaborne

On 24/08/15 16:15, Jason Levitt wrote:

Great info, thanks.


Some organisations achieve this by running a load balancer in front of
several replicas then co-ordinating the update process.


So, they're running the same query against other nodes behind the load
balancer to keep things in sync?


You can do a live backup


So, an HTTP POST /$/backup/*{name}*  initiates a backup and that
results in a gzip-compressed N-Quads file.

What does a restore look like from that file?


You just load it into an empty database (tdbloader etc).

Andy
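
A sketch of that restore step, expressed as a small Python helper that prepares the tdbloader command line. The file name and directory are placeholders, and it is an assumption here that tdbloader reads the gzip-compressed N-Quads file directly; if it does not in your version, decompress the file first.

```python
import os
import tempfile
from typing import List


def restore_command(backup_file: str, db_dir: str) -> List[str]:
    """Return the tdbloader argv for loading `backup_file` into an empty `db_dir`."""
    if os.path.isdir(db_dir) and os.listdir(db_dir):
        raise ValueError(f"{db_dir} is not empty; restore into a fresh database")
    return ["tdbloader", f"--loc={db_dir}", backup_file]


# Demonstration with a throwaway empty directory and a placeholder file name.
with tempfile.TemporaryDirectory() as empty_dir:
    print(restore_command("backup.nq.gz", empty_dir))
# Run the command with subprocess.run(cmd, check=True) once tdbloader is on the PATH.
```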



-J




On Mon, Aug 24, 2015 at 4:08 AM, Rob Vesse rve...@dotnetrdf.org wrote:

Andy already answered 1 but more on 2

Assuming you use TDB then in-memory checkpointing already happens.  TDB
caches data into memory but fundamentally is a persistent disk backed
database that uses write-ahead logging for transactions and failure
recovery so this already happens automatically and is below the level of
Fuseki (you get this behaviour wherever you use TDB provided you use it
transactionally which Fuseki always does)

Rob

On 24/08/2015 05:51, Jason Levitt slimands...@gmail.com wrote:


Just wondering if there are any projects out there
to provide:

1) HA (high availability) configuration of Fuseki such
as mirroring or hot/standby failover.

2) Some kind of on-the-fly backup of Fuseki when it's
running in RAM. This might be similar to how Hadoop
1.x checkpoints the in-RAM namenode data structures.

BTW, are there any tools for testing the consistency of the Fuseki
data structures when Fuseki is temporarily halted?

Cheers,

Jason









Re: Apache Maven

2015-08-24 Thread kumar rohit
What does "actual development work on Jena" mean? Can you differentiate
between using Jena as a framework and actual development with Jena?


On Mon, Aug 24, 2015 at 4:21 PM, aj...@virginia.edu aj...@virginia.edu
wrote:

 Apache Maven is the build management system used by Jena. You will find
 info here:

 https://maven.apache.org/

 If you are just using Jena as a framework or toolkit you will probably not
 need Maven. If you intend to do actual development work on Jena, you will
 need to become familiar with Maven. Most Eclipse installs do indeed have
 Maven pre-integrated.

 ---
 A. Soroka
 The University of Virginia Library

 On Aug 24, 2015, at 10:29 AM, kumar rohit kumar.en...@gmail.com wrote:

  What Apache Maven in Eclipse is used for? In what type of application we
  need it?
  Also I have found various tutorials showing downloading, installing, and
  integrating Maven in Eclipse but I think Eclipse has already Maven
  installed and integrated. Is it so?




Re: Apache Maven

2015-08-24 Thread Colin Maudry
Hi Kumar,

What you describe is using Jena.

Doing development means writing code to extend Jena  functionality with
plugins, bug fixes, etc.

So it seems like you don't need Maven.

Colin

On 24/08/2015 18:10, kumar rohit wrote:
 Yes, I develop an ontology in Protege, import it into Jena, iterate through
 its triples, and run SPARQL queries using Jena. If this is writing Jena
 code, then what is using Jena code?
 Thanks for your time.


 On Mon, Aug 24, 2015 at 4:54 PM, aj...@virginia.edu aj...@virginia.edu
 wrote:

 Are you _using_ Jena's code or _writing new Jena code_? The latter is
 actual development work.

 ---
 A. Soroka
 The University of Virginia Library

 On Aug 24, 2015, at 11:52 AM, kumar rohit kumar.en...@gmail.com wrote:

 what does actual development work on Jena,  means? can you
 differentiate
 between Jena as a framework and actual development with Jena?


 On Mon, Aug 24, 2015 at 4:21 PM, aj...@virginia.edu aj...@virginia.edu
 wrote:

 Apache Maven is the build management system used by Jena. You will find
 info here:

 https://maven.apache.org/

 If you are just using Jena as a framework or toolkit you will probably
 not
 need Maven. If you intend to do actual development work on Jena, you
 will
 need to become familiar with Maven. Most Eclipse installs do indeed have
 Maven pre-integrated.

 ---
 A. Soroka
 The University of Virginia Library

 On Aug 24, 2015, at 10:29 AM, kumar rohit kumar.en...@gmail.com
 wrote:
 What Apache Maven in Eclipse is used for? In what type of application
 we
 need it?
 Also I have found various tutorials showing downloading, installing,
 and
 integrating Maven in Eclipse but I think Eclipse has already Maven
 installed and integrated. Is it so?





Re: Fuseki 2.0 configuration issues

2015-08-24 Thread Adrian Gschwend
On 24.08.15 13:05, Andy Seaborne wrote:

Hi Andy,

 Adrian - That would be perfect.  I don't think I need the data, just
 the setup.  It would also be useful to know exactly how you are
 making the call.  A per-query timeout is possible with the header
 Timeout: or parameter timeout=.

I use this curl command to execute it:

curl -H "Accept: application/n-triples" \
     --data-urlencode query@construct/map_municipality2classes.sparql \
     http://localhost:3030/bfs/sparql -o out/map_municipality2classes.nt

Didn't check the exact header sent though.
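
For anyone wanting to inspect what such a call sends, a roughly equivalent request can be built in Python as below. This is a sketch: the function name and the sample query are made up for the example, and the optional `timeout` form parameter is simply passed through as given, since its exact semantics are not spelled out in this thread.

```python
import urllib.parse
import urllib.request
from typing import Optional


def construct_request(endpoint: str, query: str,
                      timeout: Optional[str] = None) -> urllib.request.Request:
    """Build a form-encoded SPARQL POST that asks for N-Triples."""
    form = {"query": query}
    if timeout is not None:
        form["timeout"] = timeout  # passed through as-is
    body = urllib.parse.urlencode(form).encode("ascii")
    return urllib.request.Request(
        endpoint, data=body, headers={"Accept": "application/n-triples"}
    )


# Illustrative query only; the real one lives in the .sparql file.
req = construct_request(
    "http://localhost:3030/bfs/sparql",
    "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o } LIMIT 10",
)
print(req.get_method(), req.get_header("Accept"))
print(req.data.decode("ascii"))
```

Printing the request before sending it makes the method, the Accept header and the encoded body visible, which is the information asked for above.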

I've tarred everything including the data; it's just 14MB compressed.
Execute fuseki-server within the fuseki subdirectory; the shell script
fuseki-construct.sh fires a bunch of CONSTRUCT queries into the out directory.

http://ktk.netlabs.org/misc/fuseki-timeout.tar.bz2

Note that the one I mention above is commented-out right now in this
shell script.

regards

Adrian


Re: Apache Maven

2015-08-24 Thread kumar rohit
I got it.. thanks

On Mon, Aug 24, 2015 at 5:32 PM, Colin Maudry co...@maudry.com wrote:

 Hi Kumar,

 What you describe is using Jena.

 Doing development means writing code to extend Jena  functionality with
 plugins, bug fixes, etc.

 So it seems like you don't need Maven.

 Colin

 On 24/08/2015 18:10, kumar rohit wrote:
  yes I develop ontology in Protege, import it in Jena, iterate through its
  triples, and running SPARQL queries using Jena. If this is Jena code
  writing, then what is using jena code?
  Thanks for your time.
 
 
  On Mon, Aug 24, 2015 at 4:54 PM, aj...@virginia.edu aj...@virginia.edu
  wrote:
 
  Are you _using_ Jena's code or _writing new Jena code_? The latter is
  actual development work.
 
  ---
  A. Soroka
  The University of Virginia Library
 
  On Aug 24, 2015, at 11:52 AM, kumar rohit kumar.en...@gmail.com
 wrote:
 
  what does actual development work on Jena,  means? can you
  differentiate
  between Jena as a framework and actual development with Jena?
 
 
  On Mon, Aug 24, 2015 at 4:21 PM, aj...@virginia.edu 
 aj...@virginia.edu
  wrote:
 
  Apache Maven is the build management system used by Jena. You will
 find
  info here:
 
  https://maven.apache.org/
 
  If you are just using Jena as a framework or toolkit you will probably
  not
  need Maven. If you intend to do actual development work on Jena, you
  will
  need to become familiar with Maven. Most Eclipse installs do indeed
 have
  Maven pre-integrated.
 
  ---
  A. Soroka
  The University of Virginia Library
 
  On Aug 24, 2015, at 10:29 AM, kumar rohit kumar.en...@gmail.com
  wrote:
  What Apache Maven in Eclipse is used for? In what type of application
  we
  need it?
  Also I have found various tutorials showing downloading, installing,
  and
  integrating Maven in Eclipse but I think Eclipse has already Maven
  installed and integrated. Is it so?
 
 




Re: Upgrading from Fuseki 2.0 to 2.3

2015-08-24 Thread Colin Maudry
Thanks Adam!

I guess a System requirements section in here would be useful
https://jena.apache.org/documentation/fuseki2/

I'd be glad to add it if I could.

Colin

On 24/08/2015 19:27, aj...@virginia.edu wrote:
 Fuseki (and the rest of Jena) now requires Java 8. That's the problem you 
 have here.
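
 The "Unsupported major.minor version 52.0" in the trace quoted below is the class-file format version: 52 corresponds to Java 8, and the Java 7 runtime shown by java -version only accepts versions up to 51. As a small illustration (the helper name is made up for this example), the version can be read from the first eight bytes of any .class file:

```python
import struct


def class_file_java_version(header: bytes) -> int:
    """Map the major version in a class-file header to a Java release number."""
    # Layout: 4-byte magic, 2-byte minor version, 2-byte major version (big-endian).
    magic, _minor, major = struct.unpack(">IHH", header[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a Java class file")
    return major - 44  # major 51 -> Java 7, major 52 -> Java 8


# Synthetic header with major version 52, as reported in the error message.
sample = struct.pack(">IHH", 0xCAFEBABE, 0, 52)
print(class_file_java_version(sample))
```

 To check a real class, pass the first eight bytes read from the file opened in binary mode.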

 ---
 A. Soroka
 The University of Virginia Library

 On Aug 24, 2015, at 1:25 PM, Colin Maudry co...@maudry.com wrote:

 Hello,

 I've been using Fuseki 2.0 for months on Ubuntu 14.04, and realized a
 v2.3 was out.

 1. I downloaded it
 2. Added the necessary permissions chmod u+x fuseki-server
 3. ./fuseki-server --update --mem /datagouvfr

 Exception in thread "main" java.lang.UnsupportedClassVersionError:
 org/apache/jena/fuseki/cmd/FusekiCmd : Unsupported major.minor version 52.0
     at java.lang.ClassLoader.defineClass1(Native Method)
     at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
     at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
     at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
     at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
     at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
     at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
     at java.security.AccessController.doPrivileged(Native Method)
     at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
     at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
     at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
     at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482)

 $ java -version
 java version "1.7.0_79"
 OpenJDK Runtime Environment (IcedTea 2.5.6) (7u79-2.5.6-0ubuntu1.14.04.1)
 OpenJDK Client VM (build 24.79-b02, mixed mode, sharing)

 I tried with only ./fuseki-server (no parameter), same thing.

 Can something be wrong in my system configuration?

 Thanks,
 Colin




Re: Fuseki 2.0 configuration issues

2015-08-24 Thread Andy Seaborne

On 23/08/15 16:23, Adrian Gschwend wrote:

On 09.07.15 11:03, Andy Seaborne wrote:

Hi Andy,



Could this be JENA-918?

  Andy

https://issues.apache.org/jira/browse/JENA-918


oh I didn't see there is a tdb-config as well. I just changed it there
to 5000ms and I still get the error.

Can I somehow see what is set at runtime?


Not really, though it would be a good idea to annotate the log at the start
of the query with the timeout set.


that would definitely help for debugging. I can't see that in 2.3.0
release, right?


Right.




The snapshot development version is 2.3.0-SNAPSHOT

If you could try that, it would be very helpful.

The 3 second timeout is the marker for JENA-918 but if every setting
is different to 3000 then I can't explain the situation.  If you could
send me all the configuration files you use, I will try to recreate this
when I'm back online (I'm away this week)


I still have that on 2.3.0. Do you want just the configuration or also
the data which triggers it? It's gonna be open data so it's not a problem.

Would it be enough to tar the run directory or is this not portable?


Adrian - That would be perfect.  I don't think I need the data, just the 
setup.  It would also be useful to know exactly how you are making the 
call.  A per-query timeout is possible with the header Timeout: or 
parameter timeout=.


Andy



regards

Adrian





Re: About SPARQL predicates as variables...

2015-08-24 Thread Mark Feblowitz
It occurred to me that I had previously tested a related (sub)query and it 
seems very simple and quick, looking for predicates for a given entity:

PREFIX : <http://dbpedia.org/resource/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT (COUNT(DISTINCT ?predicate) AS ?PC)
WHERE {
  VALUES ?Entity { :LeBron_James }
  { SERVICE SILENT <http://dbpedia-live.openlinksw.com/sparql?timeout=4000> {
      SELECT DISTINCT * WHERE {
        ?Entity ?predicate ?Object
      }
    }
  }
}

The idea is to pull out the predicates first and apply them after.

The “VALUES” clause here is a placeholder for a local GRAPH clause. I will later
use something similar to retrieve N ?Entity bindings. The point is that I will
have N entities, each with different predicates, and I want to explore the
relationships for each of the entities.

Note that the binding occurs outside of the service.

A little closer to what I need, another query uses a second source,
plain old DBpedia:

SELECT DISTINCT * WHERE {
  VALUES ?Entity { :LeBron_James }
  { SELECT DISTINCT ?Entity ?predicate
    WHERE {
      { SERVICE SILENT <http://dbpedia-live.openlinksw.com/sparql?timeout=4000> {
          SELECT DISTINCT ?Entity ?predicate WHERE {
            ?Entity ?predicate ?Object
          }
        }
      }
      UNION
      { SERVICE SILENT <http://dbpedia.openlinksw.com/sparql?timeout=4000> {
          SELECT DISTINCT ?Entity ?predicate WHERE {
            ?Entity ?predicate ?Object
          }
        }
      }
    }
  }
}

That works pretty rapidly, too.

Finally, and given Andy’s comment about bottom-up processing, I was able to 
write the single endpoint case that works pretty well. It puts the query 
refinements at the top and the predicate generators in a nested query:

SERVICE <http://dbpedia-live.openlinksw.com/sparql?timeout=4000> {
  { SELECT DISTINCT ?Entity ?predicate ?A ?Person2 WHERE {
      ?Entity ?predicate ?A .
      FILTER ( isURI(?A) )
      FILTER (!STRSTARTS(STR(?A), "http://dbpedia.org/resource/Template:"))
      FILTER (!STRSTARTS(STR(?A), "http://dbpedia.org/ontology/wiki"))
      FILTER (?A != <http://dbpedia.org/resource/Category:Living_people>)
      FILTER (?A != <http://dbpedia.org/property/wordnet_type>)
      FILTER (?A != <http://www.w3.org/2002/07/owl#Thing>)
      ?Person2 ?predicate ?A .
      FILTER ( isURI(?Person2) )
      ?Person2 a do:Person .
      { SELECT ?Entity ?predicate WHERE {
          VALUES ?Entity { :LeBron_James }
          { SELECT DISTINCT * WHERE {
              ?Entity ?predicate ?Object .
              FILTER (?predicate != rdfs99:type)
            }
          }
        }
      }
    }
  }
}

It needs a bit of tuning, but it’s more responsive than I was expecting. 

Note that I’m filtering out some things that I don’t think are helpful. I would
have liked to use something like a VALUES statement to collapse the
?A != filters into a “blacklist”, but VALUES with negated filters seems to
work only as a whitelist.

Now, on to finishing off the query: pulling out the VALUES clause and replacing
it with a local GRAPH query outside of the SERVICE, and then on to replicating 
the query above and UNIONing the two patterns… without breaking the whole lot. 
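
On the “blacklist” wish mentioned above: SPARQL 1.1 has a NOT IN operator that can fold several ?A != tests into one filter. The Python helper below is only a sketch that generates such a clause from a list of IRIs; the function name is invented for the example, and whether the DBpedia endpoint evaluates the resulting filter any faster has not been tested here.

```python
from typing import Iterable


def not_in_filter(var: str, iris: Iterable[str]) -> str:
    """Render a single SPARQL FILTER excluding every IRI in `iris` for ?var."""
    listed = ", ".join(f"<{iri}>" for iri in iris)
    return f"FILTER (?{var} NOT IN ({listed}))"


# The three IRIs excluded one by one in the query above.
blacklist = [
    "http://dbpedia.org/resource/Category:Living_people",
    "http://dbpedia.org/property/wordnet_type",
    "http://www.w3.org/2002/07/owl#Thing",
]
print(not_in_filter("A", blacklist))
```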

All in all, the trickiest query I’ve ever crafted (so far).

Thanks to all for your suggestions.

Mark


 On Aug 22, 2015, at 1:27 PM, Andy Seaborne a...@apache.org wrote:
 
 On 22/08/15 15:51, Mark Feblowitz wrote:
 Andy -
 
 I did  try that in isolation, and also directly (not within a SERVICE block) 
 and also directly at the dbpedia sites. Neither worked.
 
 I do see that this form is expensive and have tried it with a number of 
 filters. I sent the very simplest to focus on the main question.
 
 
 If it's the retrieval costs of the query, filters don't help much. Only the
 simple filters like FILTER (?x = y) can be used to make index scanning
 more focused.
 
 As an alternative to BIND, you may find
 
 SELECT DISTINCT * where {
 ?Player a do:BasketballPlayer.
 ?Player ?r ?A.
 ?Player2 ?r ?A
 FILTER(?Player = someURI)
 }
 
 helps.  This is optimizable (ARQ does it!) to a BIND-like form
 
 
 SELECT DISTINCT * where {
 someURI a do:BasketballPlayer.
 someURI ?r ?A.
 ?Player2 ?r ?A
 BIND (someURI AS ?Player)
 }
 
 now, the optimizer has a chance, not guaranteed though.  An index join to 
 handle ?Player2 ?r ?A means that it's a few probes (the number of 
 properties for subject someURI).  A hash join without conditions however is 
 still very costly for that step.
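
 The "few probes" point can be illustrated with a toy in-memory index join in Python. This is not how ARQ or Virtuoso implement joins; the data, index layout and function name are all invented for the sketch. With the subject fixed, the join does one index probe per property of that subject, instead of scanning every triple.

```python
from collections import defaultdict

# Toy data standing in for a triple store.
triples = [
    ("lebron", "team", "cavaliers"),
    ("lebron", "birthPlace", "akron"),
    ("irving", "team", "cavaliers"),
    ("curry", "birthPlace", "akron"),
    ("curry", "team", "warriors"),
]

by_subject = defaultdict(list)   # s -> [(p, o)]
by_pred_obj = defaultdict(list)  # (p, o) -> [s]
for s, p, o in triples:
    by_subject[s].append((p, o))
    by_pred_obj[(p, o)].append(s)


def index_join(subject):
    """Join 'subject ?r ?A' with '?other ?r ?A' by probing the (p, o) index.

    Returns (rows, probes); rows include the subject itself, and probes
    counts how many index lookups were needed.
    """
    rows, probes = [], 0
    for p, o in by_subject[subject]:
        probes += 1  # one lookup per property of the fixed subject
        for other in by_pred_obj[(p, o)]:
            rows.append((other, p, o))
    return rows, probes


rows, probes = index_join("lebron")
print(probes, rows)
```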
 
 It's all down to the details of the version of Virtuoso at DBpedia. There is 
 an argument that this style of query is unusual - optimization is about 
 doing things for the likely cases.
 
   Andy
 
 Thanks,
 
 Mark
 
 On Aug 22, 2015,