Thanks Bruno, I haven't written a paper yet. I'm still in the process of experimenting.
Regards,
Siddhesh

On Thu, May 31, 2018 at 6:41 AM, Claude Warren <cla...@xenei.com> wrote:
> Just a quick note. There is a Cassandra implementation but no work has
> been done on performance tuning.
>
> On a second note. I did some work using Bloom filters to do partitioning
> that allows adding partitions on demand. Should work for triple store
> partitioning as well.
>
> Claude
>
> On Wed, May 30, 2018, 8:43 AM Siddhesh Rane <kingsid...@gmail.com> wrote:
>
>> For my undergraduate project I used a Fuseki 3.6.0 server backed by a
>> TDB dataset.
>> 30,000 unique SPARQL queries were made against the server by 6 nodes in
>> a Spark cluster, returning a total of 150 million triples.
>> As I used DBpedia's dataset, nearly all the entities from Wikipedia
>> were covered, so my experiment is somewhat like an exhaustive test.
>>
>> I write as someone quite new to Jena and SPARQL itself, so I may have
>> run into problems by doing things the wrong way or by not knowing of
>> better solutions.
>> Although Fuseki was a critical component in my pipeline, I could not
>> spend much time on learning it properly, so kindly forgive any
>> ignorance on my part.
>> I hope my experience will help the developers see how at least
>> newcomers are using this software.
>>
>>
>> DATA INGESTION
>>
>> This was the most tedious part of using Jena.
>> The ability to create a TDB database and upload data to it, all from
>> the browser, is a really nice feature. The difficult part is that the
>> memory required to do so is proportional to the size of the data being
>> uploaded. The largest file I tried to upload contained 158M triples
>> (24GB uncompressed, 1.5GB bz2 compressed) and the upload frequently
>> ran out of memory. I had to run Fuseki with -Xmx32g and only then did
>> it work. The command line tools faced the same problem.
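A possible workaround for the upload memory ceiling (a sketch I have not tested against Fuseki; the batch size and file handling are purely illustrative) is to split a line-oriented N-Triples dump into batches and upload each batch as a separate request, so that no single request needs the whole file in the heap:

```python
# Sketch: read an (optionally bz2-compressed) N-Triples dump in fixed-size
# batches of lines. N-Triples is one triple per line, so splitting on line
# boundaries is safe; Turtle's multi-line syntax would need a real parser.
import bz2
import itertools

def nt_batches(path, size=500_000):
    """Yield successive chunks of at most `size` triples (lines)."""
    opener = bz2.open if path.endswith(".bz2") else open
    with opener(path, "rt", encoding="utf-8") as f:
        while True:
            batch = list(itertools.islice(f, size))
            if not batch:
                return
            yield "".join(batch)
```

Each yielded chunk could then be POSTed to the dataset's upload endpoint in turn, keeping memory use proportional to the batch size rather than to the file size.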
>>
>> Another thing is that both the web interface and the command line tools
>> optionally accept gzip files, but not bzip2, whereas bzip2 is what both
>> Wikipedia and DBpedia use for their data dumps.
>> I tried to work around the issue with `bzcat file.ttl.bz2 | gzip >
>> named-pipe` and then using the named pipe for data ingestion, but that
>> did not work.
>>
>> I finally ended up using `tdbloader2`, which works with constant memory
>> and, as I read somewhere on the mailing list, produces the smallest
>> database.
>> There might be some SPARQL way of inserting data in batches and I
>> probably could have scripted that, but I had a project to complete and
>> so went with what appeared to be the most straightforward way of doing
>> things.
>>
>> Performance of tdbloader2:
>> On my 2017 Spectre x360 laptop with 16GB RAM, a dual-core i7-7500U CPU
>> and a 512GB SSD:
>> Phase I: 199,597,131 tuples; 2,941.31 seconds; 67,860.02 tuples/sec
>> Total time: 4609s
>>
>> I wanted to control which indexes are generated, because I knew the
>> access pattern of my SPARQL queries and also wanted a smaller DB.
>> I think there are toggles to decide which indexes get generated, but I
>> did not search much.
>> Controlling the indexes from tdbloader2 itself would be a great
>> option. If this feature already exists, please let me know.
>>
>> Another point of confusion is the version of the backing TDB database.
>> The `bin` folder contains tdb commands for both v1 and v2, but I'm not
>> sure which version Fuseki uses when I create a persistent store.
>> The database config file `config.ttl` does not mention any version.
>> I would appreciate it if someone could clear up this confusion for me.
>>
>> TL;DR please support bzip2, reading from pipes, and constant-memory
>> loading operations.
>>
>>
>> DATABASE PERFORMANCE, REPLICATION and SHARDING
>>
>> My project used Spark to distribute the load among a cluster of
>> machines. The input data was all the articles in Wikipedia.
>> Each partition of the data would contain about 250 articles. The first
>> SPARQL query was a DESCRIBE of these articles. A subsequent CONSTRUCT
>> query would fetch the labels for all the object resources in the model
>> returned by the first query. There were 44 cores in the cluster, so at
>> any time 44 partitions would generate 44*2=88 SPARQL queries. The
>> DESCRIBE query would run in milliseconds, whereas the CONSTRUCT query
>> would take 1-2 seconds because of its random-access nature.
>> Benchmarks were not comprehensive, just observation of log output.
>>
>> I got this performance when the entire database was resident in RAM,
>> as reported by `vmtouch` (https://hoytech.com/vmtouch/).
>> Without the database fully cached in memory, performance would degrade
>> to 500-1000 seconds per CONSTRUCT query. In my case the db was 16-19GB
>> in size, so it could be `vmtouch`ed into RAM on a 32GB, 8-core Xeon
>> machine.
>>
>> To increase performance further I replicated the db on an identical
>> machine and load-balanced queries between the two machines. The
>> execution time of my entire Spark app went down from 2 hours to 1
>> hour. A recent thread on this list talks about high availability and
>> replication. You can just assign different threads to query different
>> replicas of the db, with fallback to the other replica, and that would
>> be sufficient in most cases.
>>
>> Caching with vmtouch and replication are easy performance wins, but
>> they have their limitations.
>> They worked in this case because my DB was small enough to fit on a
>> 32GB machine and I had such machines at my disposal.
>> As data sizes keep growing every day, memory sizes cannot keep up and
>> no single machine can hold the entire database.
>> My cluster had a combined capacity of around 80GB, and if you leave
>> out the 12GB used by the Spark workers, there was 68GB of RAM
>> available which could have held the database.
>> But since each individual node had <8 GB RAM, the db could not fit on
>> any of them.
>> The solution here is for Fuseki itself to support some form of
>> sharding, so that it can access pieces of the index stored on many
>> machines with small amounts of RAM.
>> Alternatively, providing integration with other DBs like Redis,
>> Cassandra, or even Spark with GraphX would be a better way to scale to
>> bigger datasets.
>>
>> Jena's prime features for me were the Fuseki server, the Java library
>> for operating on local models and, though I have not used them yet but
>> plan to, the inference features.
>> I don't think the database layer is something the developers should
>> spend their energy on. Rather than turning TDB into a full-blown
>> distributed DB, it would be better to integrate with one of the open
>> source ones already available.
>>
>> TL;DR Database sharding is needed for scaling
>>
>>
>> QUERY PERFORMANCE
>>
>> I have noticed that combining the results of two or more simple
>> queries can be orders of magnitude faster than a single query which
>> performs the same function.
>> Take this example:
>>
>> # parent+child count
>> select ?type (count(?obj) as ?c) where {
>>   ?type rdfs:subClassOf owl:Thing .
>>   ?obj a/rdfs:subClassOf? ?type
>> }
>> group by ?type
>>
>> This query took 1022 seconds to execute. Inference features were not
>> used for any of the queries.
>>
>> We can split this into 2 simpler queries:
>>
>> # parent count; took 2.57 seconds
>> select ?type (count(?obj) as ?c) where {
>>   ?type rdfs:subClassOf owl:Thing .
>>   ?obj a ?type
>> }
>> group by ?type
>>
>> and
>>
>> # child count; took 10.86 seconds
>> select ?type (count(?obj) as ?c) where {
>>   ?type rdfs:subClassOf owl:Thing .
>>   ?obj a/rdfs:subClassOf ?type
>> }
>> group by ?type
>>
>> The simpler queries took a combined 2.57+10.86 = 13.43 seconds when
>> fired sequentially.
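Since the two simpler queries are independent, they can also be fired concurrently, one per replica. A minimal sketch of that fan-out pattern (the `run_query` body is a placeholder, not real Fuseki client code, and the endpoint handling is illustrative):

```python
# Sketch: fan independent SPARQL queries out across replica endpoints so
# wall time is roughly that of the slowest query rather than the sum.
from concurrent.futures import ThreadPoolExecutor

def run_query(endpoint, sparql):
    # Placeholder: in practice, POST `sparql` to the endpoint (e.g. with
    # urllib or SPARQLWrapper) and parse the result bindings.
    return (endpoint, sparql)

def fan_out(endpoints, queries):
    # Round-robin the queries across the available replicas and collect
    # the results in query order.
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        futures = [pool.submit(run_query, endpoints[i % len(endpoints)], q)
                   for i, q in enumerate(queries)]
        return [f.result() for f in futures]
```

With two replicas, e.g. `fan_out([replica1, replica2], [parent_query, child_query])`, the elapsed time is roughly that of the slower query (10.86 seconds here) instead of the 13.43-second sequential total.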
>> In a load-balanced scheme each query would be served by a different
>> server, and then the execution time of the longest query becomes the
>> bottleneck.
>> The combined query is 1022/13.43 = ~76 times slower than sequential
>> execution (and 1022/10.86 = ~94 times slower than parallel execution).
>>
>> Query optimization is something SQL still needs work on even after so
>> many years on the market.
>> SPARQL is relatively new, so it is understandable that its query
>> optimization may not be that developed yet.
>>
>> A small bug I encountered:
>> during execution some of my queries became malformed because fully
>> qualified IRIs are shortened to prefix:resource form in Turtle.
>> I believe this happens with IRIs that do not conform to the standards;
>> DBpedia has lots of them.
>> I got around it by removing the PREFIX dbr: ... entry from the query
>> so that IRIs were not rewritten.
>>
>>
>> CONCLUSION
>>
>> The fact that I could use Jena without losing focus on my project
>> shows how it gets out of your way and does its job.
>> Apart from the initial data ingestion hurdles, I have not had any
>> issues with using Jena.
>> The Fuseki interface was a crucial component for me in learning SPARQL
>> by example.
>> Overall it was a pleasure to use.
>> Kudos to the developers!
>> --
>> Your greatest regret is the email ID you chose in 8th grade