Re: Clearing cache does not affect query execution time
Comments inline:

On 24/08/2015 14:28, Ankur Padia <padiaan...@gmail.com> wrote:

> Hello everyone, I want to execute a bunch of queries on a Fuseki server
> using default settings, and I want to clear the cache first. To accomplish
> this, I used the following Linux command to clear the cache:
>
>     sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
>
> However, instead of getting execution times in seconds, which would mean
> the results were fetched from the hard disk, I am getting results in
> milliseconds.

TDB, the storage layer used underneath Fuseki by default, uses memory-mapped
files for its own caching, which is separate and distinct from any OS-level
caching. Dropping the OS caches thus has no effect, because TDB does not
rely on OS caching.

> Also, the very first query takes seconds, whereas the later queries
> following it take a few milliseconds, both having the same number of
> triple patterns.

There is also the issue of JVM class loading: the first few queries will
likely cause many classes to be loaded by the JVM, which is a one-time
start-up cost paid by any Java application.

If you are trying to do benchmarking then you should always run warm-ups to
bring the system to a hot state, since running from a cold state is always
likely to hit discrepancies like this and won't give you reproducible
figures.

Rob

> Can anyone please guide me through what is happening inside Fuseki?
> Fuseki was hosted on Linux (Fedora 17).
>
> Ankur Padia.
Re: Fuseki 2 HA or on-the-fly backups?
Great info, thanks.

> Some organisations achieve this by running a load balancer in front of
> several replicas then co-ordinating the update process.

So, they're running the same query against other nodes behind the load
balancer to keep things in sync?

> You can do a live backup

So, an HTTP POST to /$/backup/{name} initiates a backup and that results in
a gzip-compressed N-Quads file. What does a restore look like from that
file?

-J

On Mon, Aug 24, 2015 at 4:08 AM, Rob Vesse <rve...@dotnetrdf.org> wrote:

> Andy already answered 1, but more on 2: assuming you use TDB, then
> in-memory checkpointing already happens. TDB caches data into memory but
> is fundamentally a persistent, disk-backed database that uses write-ahead
> logging for transactions and failure recovery, so this already happens
> automatically and is below the level of Fuseki (you get this behaviour
> wherever you use TDB, provided you use it transactionally, which Fuseki
> always does).
>
> Rob
>
> On 24/08/2015 05:51, Jason Levitt <slimands...@gmail.com> wrote:
>
>> Just wondering if there are any projects out there to provide:
>>
>> 1) HA (high availability) configuration of Fuseki, such as mirroring or
>>    hot/standby failover.
>> 2) Some kind of on-the-fly backup of Fuseki when it's running in RAM.
>>    This might be similar to how Hadoop 1.x checkpoints the in-RAM
>>    namenode data structures.
>>
>> BTW, are there any tools for testing the consistency of the Fuseki data
>> structures when Fuseki is temporarily halted?
>>
>> Cheers,
>> Jason
Re: Apache Maven
Apache Maven is the build management system used by Jena. You will find
info here: https://maven.apache.org/

If you are just using Jena as a framework or toolkit you will probably not
need Maven. If you intend to do actual development work on Jena, you will
need to become familiar with Maven. Most Eclipse installs do indeed have
Maven pre-integrated.

---
A. Soroka
The University of Virginia Library

On Aug 24, 2015, at 10:29 AM, kumar rohit <kumar.en...@gmail.com> wrote:

> What is Apache Maven in Eclipse used for? In what type of application do
> we need it? Also, I have found various tutorials showing how to download,
> install, and integrate Maven in Eclipse, but I think Eclipse already has
> Maven installed and integrated. Is that so?
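For anyone who does want to do development work on Jena itself, the typical Maven workflow is roughly the following sketch. The repository URL and goals are assumptions based on standard Apache practice at the time; check the Jena site for the current instructions.

```shell
# Fetch the Jena sources and build them with Maven.
# (Assumes git and Maven are already installed; repository URL is
# illustrative - see the Jena "Getting involved" page for the current one.)
git clone https://git-wip-us.apache.org/repos/asf/jena.git
cd jena

mvn clean install              # build all modules and run the test suite
mvn clean install -DskipTests  # faster, if you only need the artifacts
```

In Eclipse, such a checkout can then be imported via "Import > Existing Maven Projects", which is where the pre-integrated Maven support mentioned above comes in.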
Apache Maven
What is Apache Maven in Eclipse used for? In what type of application do we
need it? Also, I have found various tutorials showing how to download,
install, and integrate Maven in Eclipse, but I think Eclipse already has
Maven installed and integrated. Is that so?
Re: Clearing cache does not affect query execution time
There is another important cache: the node table cache, which is not a
filing-system cache at all. It is an in-process, in-Java cache. It caches
what is otherwise random, and potentially expensive, I/O.

It depends on what your queries are as to which caching effects dominate
execution. SELECT * { ?s ?p ?o } is almost all node table - the index
access is a scan and very read-friendly even when cold. This is especially
true if the database has had little in the way of updates since it was
first built - the loaders generate some indexes, like the one that query
will use, in large, disk-aligned units.

The only way to get reliable (and realistic - servers run for a long time)
results is to pre-warm the system, as Rob says.

Andy

On 24/08/15 15:09, Rob Vesse wrote:
> Comments inline:
>
> On 24/08/2015 14:28, Ankur Padia <padiaan...@gmail.com> wrote:
>
>> Hello everyone, I want to execute a bunch of queries on a Fuseki server
>> using default settings, and I want to clear the cache first. To
>> accomplish this, I used the following Linux command to clear the cache:
>>
>>     sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
>>
>> However, instead of getting execution times in seconds, which would mean
>> the results were fetched from the hard disk, I am getting results in
>> milliseconds.
>
> TDB, the storage layer used underneath Fuseki by default, uses
> memory-mapped files for its own caching, which is separate and distinct
> from any OS-level caching. Dropping the OS caches thus has no effect,
> because TDB does not rely on OS caching.
>
>> Also, the very first query takes seconds, whereas the later queries
>> following it take a few milliseconds, both having the same number of
>> triple patterns.
>
> There is also the issue of JVM class loading: the first few queries will
> likely cause many classes to be loaded by the JVM, which is a one-time
> start-up cost paid by any Java application.
>
> If you are trying to do benchmarking then you should always run warm-ups
> to bring the system to a hot state, since running from a cold state is
> always likely to hit discrepancies like this and won't give you
> reproducible figures.
>
> Rob
>
>> Can anyone please guide me through what is happening inside Fuseki?
>> Fuseki was hosted on Linux (Fedora 17).
>>
>> Ankur Padia.
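A minimal warm-up benchmarking loop along the lines Rob and Andy describe might look like the following sketch. The endpoint URL, query, and run counts are placeholders, not anything from the thread; it assumes a Fuseki server is already running locally.

```shell
#!/bin/sh
ENDPOINT='http://localhost:3030/ds/query'        # hypothetical endpoint
QUERY='SELECT * WHERE { ?s ?p ?o } LIMIT 1000'

# Warm-up runs: absorb JVM class loading and populate TDB's caches
# (memory-mapped files and the node table cache). Discard these timings.
for i in 1 2 3 4 5; do
  curl -s -G "$ENDPOINT" --data-urlencode "query=$QUERY" > /dev/null
done

# Measured runs: only these timings are meaningful.
for i in 1 2 3 4 5; do
  time curl -s -G "$ENDPOINT" --data-urlencode "query=$QUERY" > /dev/null
done
```

The point is simply that cold-state figures mix one-time costs into the measurement; only the post-warm-up timings are comparable across runs.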
Re: Fuseki 2 HA or on-the-fly backups?
On 24/08/15 16:15, Jason Levitt wrote:
> Great info, thanks.
>
>> Some organisations achieve this by running a load balancer in front of
>> several replicas then co-ordinating the update process.
>
> So, they're running the same query against other nodes behind the load
> balancer to keep things in sync?
>
>> You can do a live backup
>
> So, an HTTP POST to /$/backup/{name} initiates a backup and that results
> in a gzip-compressed N-Quads file. What does a restore look like from
> that file?

You just load it into an empty database (tdbloader etc).

Andy

> -J
>
> On Mon, Aug 24, 2015 at 4:08 AM, Rob Vesse <rve...@dotnetrdf.org> wrote:
>
>> Andy already answered 1, but more on 2: assuming you use TDB, then
>> in-memory checkpointing already happens. TDB caches data into memory but
>> is fundamentally a persistent, disk-backed database that uses
>> write-ahead logging for transactions and failure recovery, so this
>> already happens automatically and is below the level of Fuseki (you get
>> this behaviour wherever you use TDB, provided you use it
>> transactionally, which Fuseki always does).
>>
>> Rob
>>
>> On 24/08/2015 05:51, Jason Levitt <slimands...@gmail.com> wrote:
>>
>>> Just wondering if there are any projects out there to provide:
>>>
>>> 1) HA (high availability) configuration of Fuseki, such as mirroring or
>>>    hot/standby failover.
>>> 2) Some kind of on-the-fly backup of Fuseki when it's running in RAM.
>>>    This might be similar to how Hadoop 1.x checkpoints the in-RAM
>>>    namenode data structures.
>>>
>>> BTW, are there any tools for testing the consistency of the Fuseki data
>>> structures when Fuseki is temporarily halted?
>>>
>>> Cheers,
>>> Jason
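Putting the two halves of this exchange together, a backup-and-restore cycle might look like the following sketch. The dataset name, ports, file name, and paths are hypothetical; the backup endpoint and the use of tdbloader are from the thread itself.

```shell
# 1. Trigger a live backup of the dataset "ds" on a running Fuseki server.
curl -X POST 'http://localhost:3030/$/backup/ds'
# The server writes a gzip-compressed N-Quads file into its backups area.

# 2. Restore: load the backup into a fresh, empty TDB database.
#    (Backup file name below is illustrative.)
gunzip ds_2015-08-24_12-00-00.nq.gz
tdbloader --loc /var/fuseki/databases/ds-restored ds_2015-08-24_12-00-00.nq

# 3. Point a Fuseki service configuration at the restored database
#    location and start the server.
```

Because N-Quads preserves named graphs, the restored database should contain the same quads as the original at the moment the backup transaction started.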
Re: Apache Maven
What does "actual development work on Jena" mean? Can you differentiate
between using Jena as a framework and actual development with Jena?

On Mon, Aug 24, 2015 at 4:21 PM, aj...@virginia.edu wrote:

> Apache Maven is the build management system used by Jena. You will find
> info here: https://maven.apache.org/
>
> If you are just using Jena as a framework or toolkit you will probably
> not need Maven. If you intend to do actual development work on Jena, you
> will need to become familiar with Maven. Most Eclipse installs do indeed
> have Maven pre-integrated.
>
> ---
> A. Soroka
> The University of Virginia Library
>
> On Aug 24, 2015, at 10:29 AM, kumar rohit <kumar.en...@gmail.com> wrote:
>
>> What is Apache Maven in Eclipse used for? In what type of application do
>> we need it? Also, I have found various tutorials showing how to
>> download, install, and integrate Maven in Eclipse, but I think Eclipse
>> already has Maven installed and integrated. Is that so?
Re: Apache Maven
Hi Kumar,

What you describe is using Jena. Doing development means writing code to
extend Jena functionality: plugins, bug fixes, etc. So it seems like you
don't need Maven.

Colin

On 24/08/2015 18:10, kumar rohit wrote:
> Yes, I develop an ontology in Protege, import it into Jena, iterate
> through its triples, and run SPARQL queries using Jena. If this is
> writing Jena code, then what is using Jena code? Thanks for your time.
>
> On Mon, Aug 24, 2015 at 4:54 PM, aj...@virginia.edu wrote:
>
>> Are you _using_ Jena's code or _writing new Jena code_? The latter is
>> actual development work.
>>
>> ---
>> A. Soroka
>> The University of Virginia Library
>>
>> On Aug 24, 2015, at 11:52 AM, kumar rohit <kumar.en...@gmail.com> wrote:
>>
>>> What does "actual development work on Jena" mean? Can you differentiate
>>> between using Jena as a framework and actual development with Jena?
>>>
>>> On Mon, Aug 24, 2015 at 4:21 PM, aj...@virginia.edu wrote:
>>>
>>>> Apache Maven is the build management system used by Jena. You will
>>>> find info here: https://maven.apache.org/
>>>>
>>>> If you are just using Jena as a framework or toolkit you will probably
>>>> not need Maven. If you intend to do actual development work on Jena,
>>>> you will need to become familiar with Maven. Most Eclipse installs do
>>>> indeed have Maven pre-integrated.
>>>>
>>>> ---
>>>> A. Soroka
>>>> The University of Virginia Library
>>>>
>>>> On Aug 24, 2015, at 10:29 AM, kumar rohit <kumar.en...@gmail.com> wrote:
>>>>
>>>>> What is Apache Maven in Eclipse used for? In what type of application
>>>>> do we need it? Also, I have found various tutorials showing how to
>>>>> download, install, and integrate Maven in Eclipse, but I think
>>>>> Eclipse already has Maven installed and integrated. Is that so?
Re: Fuseki 2.0 configuration issues
On 24.08.15 13:05, Andy Seaborne wrote:
> Adrian - That would be perfect. I don't think I need the data, just the
> setup. It would also be useful to know exactly how you are making the
> call. A per-query timeout is possible with the header Timeout: or the
> parameter timeout=.

Hi Andy,

I use this curl command to execute it:

    curl -H 'Accept: application/n-triples' \
         --data-urlencode 'query@construct/map_municipality2classes.sparql' \
         http://localhost:3030/bfs/sparql \
         -o out/map_municipality2classes.nt

I didn't check the exact header sent, though.

I've tarred everything including the data; it's just 14 MB compressed.
Execute fuseki-server within the fuseki subdirectory; the shell script
fuseki-construct.sh fires a bunch of CONSTRUCT queries into the out
directory.

http://ktk.netlabs.org/misc/fuseki-timeout.tar.bz2

Note that the query I mention above is commented out right now in this
shell script.

regards

Adrian
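For reference, the per-query timeout Andy mentions can be supplied either as a request parameter or as an HTTP header. A hedged sketch against the endpoint from the thread (the timeout value and query are illustrative):

```shell
# Timeout via request parameter (milliseconds):
curl --data-urlencode 'query=SELECT * { ?s ?p ?o } LIMIT 10' \
     'http://localhost:3030/bfs/sparql?timeout=5000'

# Timeout via HTTP header:
curl -H 'Timeout: 5000' \
     --data-urlencode 'query=SELECT * { ?s ?p ?o } LIMIT 10' \
     http://localhost:3030/bfs/sparql
```

If a query exceeds the timeout, the server aborts it rather than returning complete results, which is why knowing exactly which header or parameter the client sends matters when debugging timeout behaviour.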
Re: Apache Maven
I got it.. thanks.

On Mon, Aug 24, 2015 at 5:32 PM, Colin Maudry <co...@maudry.com> wrote:

> Hi Kumar,
>
> What you describe is using Jena. Doing development means writing code to
> extend Jena functionality: plugins, bug fixes, etc. So it seems like you
> don't need Maven.
>
> Colin
>
> On 24/08/2015 18:10, kumar rohit wrote:
>> Yes, I develop an ontology in Protege, import it into Jena, iterate
>> through its triples, and run SPARQL queries using Jena. If this is
>> writing Jena code, then what is using Jena code? Thanks for your time.
>>
>> On Mon, Aug 24, 2015 at 4:54 PM, aj...@virginia.edu wrote:
>>> Are you _using_ Jena's code or _writing new Jena code_? The latter is
>>> actual development work.
>>>
>>> ---
>>> A. Soroka
>>> The University of Virginia Library
>>>
>>> On Aug 24, 2015, at 11:52 AM, kumar rohit <kumar.en...@gmail.com> wrote:
>>>> What does "actual development work on Jena" mean? Can you
>>>> differentiate between using Jena as a framework and actual
>>>> development with Jena?
>>>>
>>>> On Mon, Aug 24, 2015 at 4:21 PM, aj...@virginia.edu wrote:
>>>>> Apache Maven is the build management system used by Jena. You will
>>>>> find info here: https://maven.apache.org/
>>>>>
>>>>> If you are just using Jena as a framework or toolkit you will
>>>>> probably not need Maven. If you intend to do actual development work
>>>>> on Jena, you will need to become familiar with Maven. Most Eclipse
>>>>> installs do indeed have Maven pre-integrated.
>>>>>
>>>>> ---
>>>>> A. Soroka
>>>>> The University of Virginia Library
>>>>>
>>>>> On Aug 24, 2015, at 10:29 AM, kumar rohit <kumar.en...@gmail.com> wrote:
>>>>>> What is Apache Maven in Eclipse used for? In what type of
>>>>>> application do we need it? Also, I have found various tutorials
>>>>>> showing how to download, install, and integrate Maven in Eclipse,
>>>>>> but I think Eclipse already has Maven installed and integrated. Is
>>>>>> that so?
Re: Upgrading from Fuseki 2.0 to 2.3
Thanks Adam! I guess a "System requirements" section in here would be
useful: https://jena.apache.org/documentation/fuseki2/ I'd be glad to add
it if I could.

Colin

On 24/08/2015 19:27, aj...@virginia.edu wrote:
> Fuseki (and the rest of Jena) now requires Java 8. That's the problem
> you have here.
>
> ---
> A. Soroka
> The University of Virginia Library
>
> On Aug 24, 2015, at 1:25 PM, Colin Maudry <co...@maudry.com> wrote:
>
>> Hello,
>>
>> I've been using Fuseki 2.0 for months on Ubuntu 14.04, and realized a
>> v2.3 was out.
>>
>> 1. I downloaded it
>> 2. Added the necessary permissions: chmod u+x fuseki-server
>> 3. ./fuseki-server --update --mem /datagouvfr
>>
>>     Exception in thread "main" java.lang.UnsupportedClassVersionError:
>>     org/apache/jena/fuseki/cmd/FusekiCmd : Unsupported major.minor version 52.0
>>         at java.lang.ClassLoader.defineClass1(Native Method)
>>         at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
>>         at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>>         at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
>>         at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>>         at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>>         at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>>         at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>>         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>>         at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>>         at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482)
>>
>>     $ java -version
>>     java version "1.7.0_79"
>>     OpenJDK Runtime Environment (IcedTea 2.5.6) (7u79-2.5.6-0ubuntu1.14.04.1)
>>     OpenJDK Client VM (build 24.79-b02, mixed mode, sharing)
>>
>> I tried with only ./fuseki-server (no parameter), same thing. Can
>> something be wrong in my system configuration?
>>
>> Thanks,
>> Colin
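For anyone hitting the same UnsupportedClassVersionError: class-file major version 52 corresponds to Java 8, so a Java 7 JVM (as in the `java -version` output above) cannot load Fuseki 2.3 classes. A sketch of the fix on Ubuntu; the package name is an assumption, and on 14.04 OpenJDK 8 may only be available via a PPA or backport:

```shell
# Confirm the current JVM version; 1.7.x is too old for Fuseki 2.3.
java -version

# Install a Java 8 runtime. (Hypothetical package name; on Ubuntu 14.04
# this may first require: sudo add-apt-repository ppa:openjdk-r/ppa)
sudo apt-get update
sudo apt-get install openjdk-8-jre

# Select Java 8 as the system default JVM.
sudo update-alternatives --config java
```

After switching, `java -version` should report 1.8.x, and `./fuseki-server` should start normally.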
Re: Fuseki 2.0 configuration issues
On 23/08/15 16:23, Adrian Gschwend wrote:
> On 09.07.15 11:03, Andy Seaborne wrote:
>
>> Could this be JENA-918?
>> https://issues.apache.org/jira/browse/JENA-918
>
> Hi Andy,
>
> Oh, I didn't see there is a tdb-config as well. I just changed it there
> to 5000 ms and I still get the error. Can I somehow see what is set at
> runtime?

Not really, though it would be a good idea to annotate the log at the
start of the query with the timeout set.

> That would definitely help for debugging. I can't see that in the 2.3.0
> release, right?

Right. The snapshot development version is 2.3.0-SNAPSHOT. If you could
try that, it would be very helpful. The 3 second timeout is the marker for
JENA-918, but if every setting is different to 3000 then I can't explain
the situation.

>> If you could send me all the configuration files you use, I will try to
>> recreate this when I'm back online (I'm away this week).
>
> I still have that on 2.3.0. Do you want just the configuration or also
> the data which triggers it? It's gonna be open data so it's not a
> problem. Would it be enough to tar the run directory, or is this not
> portable?

Adrian - That would be perfect. I don't think I need the data, just the
setup. It would also be useful to know exactly how you are making the
call. A per-query timeout is possible with the header Timeout: or the
parameter timeout=.

Andy

> regards
> Adrian
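For context on the configuration side of this thread: a server-wide query timeout can be set through the ARQ context in a Fuseki assembler configuration. The fragment below is a sketch only; the value and the exact placement within a full config file are illustrative, and the Fuseki documentation should be consulted for the current form.

```turtle
# Hypothetical fragment: set a 5000 ms query timeout via the ARQ context.
# Assumes the usual ja: prefix, e.g.:
# PREFIX ja: <http://jena.hpl.hp.com/2005/11/Assembler#>

[] ja:context [
     ja:cxtName  "arq:queryTimeout" ;
     ja:cxtValue "5000"
] .
```

A per-request Timeout: header or timeout= parameter, as Andy notes, overrides behaviour for a single query, which is why it matters both what the config says and how the call is made.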
Re: About SPARQL predicates as variables...
It occurred to me that I had previously tested a related (sub)query and it
seems very simple and quick, looking for predicates for a given entity:

    PREFIX : <http://dbpedia.org/resource/>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

    SELECT (COUNT(DISTINCT ?predicate) AS ?PC)
    WHERE {
      VALUES ?Entity { :LeBron_James }
      { SERVICE SILENT <http://dbpedia-live.openlinksw.com/sparql?timeout=4000>
        { SELECT DISTINCT * WHERE { ?Entity ?predicate ?Object } }
      }
    }

The idea is to pull out the predicates first and apply them afterwards. The
VALUES clause is a shorthand for a local graph clause; I will later use
something similar to retrieve N ?Entity bindings. The point here is that I
will have N entities, each with different predicates, and I want to explore
the relationships for each of the entities. Note that the binding occurs
outside of the SERVICE.

A little closer to what I need, another query uses a second source, from
plain-old DBpedia:

    SELECT DISTINCT *
    WHERE {
      VALUES ?Entity { :LeBron_James }
      { SELECT DISTINCT ?Entity ?predicate
        WHERE {
          { SERVICE SILENT <http://dbpedia-live.openlinksw.com/sparql?timeout=4000>
            { SELECT DISTINCT ?Entity ?predicate
              WHERE { ?Entity ?predicate ?Object } } }
          UNION
          { SERVICE SILENT <http://dbpedia.openlinksw.com/sparql?timeout=4000>
            { SELECT DISTINCT ?Entity ?predicate
              WHERE { ?Entity ?predicate ?Object } } }
        }
      }
    }

That works pretty rapidly, too.

Finally, and given Andy's comment about bottom-up processing, I was able to
write the single-endpoint case that works pretty well. It puts the query
refinements at the top and the predicate generators in a nested query:

    SERVICE <http://dbpedia-live.openlinksw.com/sparql?timeout=4000> {
      { SELECT DISTINCT ?Entity ?predicate ?A ?Person2
        WHERE {
          ?Entity ?predicate ?A .
          FILTER ( isURI(?A) )
          FILTER ( !STRSTARTS(STR(?A), "http://dbpedia.org/resource/Template:") )
          FILTER ( !STRSTARTS(STR(?A), "http://dbpedia.org/ontology/wiki") )
          FILTER ( ?A != <http://dbpedia.org/resource/Category:Living_people> )
          FILTER ( ?A != <http://dbpedia.org/property/wordnet_type> )
          FILTER ( ?A != <http://www.w3.org/2002/07/owl#Thing> )
          ?Person2 ?predicate ?A .
          FILTER ( isURI(?Person2) )
          ?Person2 a do:Person .
          { SELECT ?Entity ?predicate
            WHERE {
              VALUES ?Entity { :LeBron_James }
              { SELECT DISTINCT *
                WHERE {
                  ?Entity ?predicate ?Object .
                  FILTER ( ?predicate != rdfs99:type )
                } }
            } }
        } }
    }

It needs a bit of tuning, but it's more responsive than I was expecting.
Note that I'm filtering out some things that I don't think are helpful. I
would have liked to have used something like a VALUES statement to collapse
the ?A != filters into a "blacklist", but VALUES with negated filters seems
only to work as a whitelist.

Now, on to finishing off the query: pulling out the VALUES clause and
replacing it with a local GRAPH query outside of the SERVICE, and then on
to replicating the query above and UNIONing the two patterns... without
breaking the whole lot.

All in all, the trickiest query I've ever crafted (so far). Thanks to all
for your suggestions.

Mark

On Aug 22, 2015, at 1:27 PM, Andy Seaborne <a...@apache.org> wrote:

> On 22/08/15 15:51, Mark Feblowitz wrote:
>> Andy - I did try that in isolation, and also directly (not within a
>> SERVICE block), and also directly at the dbpedia sites. Neither worked.
>> I do see that this form is expensive and have tried it with a number of
>> filters. I sent the very simplest to focus on the main question.
>
> If it's the retrieval costs of the query, filters don't help much. Only
> simple filters like FILTER (?x = <y>) can be used to make index scanning
> more focused.
>
> As an alternative to BIND, you may find
>
>     SELECT DISTINCT *
>     WHERE {
>       ?Player a do:BasketballPlayer .
>       ?Player ?r ?A .
>       ?Player2 ?r ?A .
>       FILTER ( ?Player = <someURI> )
>     }
>
> helps. This is optimizable (ARQ does it!) to a BIND-like form:
>
>     SELECT DISTINCT *
>     WHERE {
>       <someURI> a do:BasketballPlayer .
>       <someURI> ?r ?A .
>       ?Player2 ?r ?A .
>       BIND ( <someURI> AS ?Player )
>     }
>
> Now the optimizer has a chance, though it's not guaranteed. An index join
> to handle ?Player2 ?r ?A means that it's a few probes (the number of
> properties for subject <someURI>). A hash join without conditions,
> however, is still very costly for that step.
>
> It's all down to the details of the version of Virtuoso at DBpedia.
> There is an argument that this style of query is unusual - optimization
> is about doing things for the likely cases.
>
> Andy
>
>> Thanks,
>> Mark
>> On Aug 22, 2015,