Just a quick note: there is a Cassandra implementation, but no work has been done on performance tuning.
On a second note: I did some work using Bloom filters to do partitioning that allows adding partitions on demand. It should work for triple store partitioning as well.

Claude

On Wed, May 30, 2018, 8:43 AM Siddhesh Rane <kingsid...@gmail.com> wrote:
> For my undergraduate project I used a Fuseki 3.6.0 server backed by a
> TDB dataset. 30000 unique SPARQL queries were made against the server
> by 6 nodes in a Spark cluster, returning a total of 150 million
> triples. As I used DBpedia's dataset, nearly all the entities from
> Wikipedia were covered, so my experiment is somewhat like an
> exhaustive test.
>
> I write as someone quite new to Jena and SPARQL itself, so I may have
> faced problems for doing things the wrong way, or not knowing better
> solutions. Although Fuseki was a critical component in my pipeline, I
> could not spend much time on learning it properly, so kindly forgive
> any ignorance on my part. I hope my experience will help the
> developers to know the ways in which at least newcomers are using
> this software.
>
>
> DATA INGESTION
>
> This was the most tedious part about using Jena.
> The ability to create a TDB database and upload data to it, all from
> the browser, is a really nice feature. The difficult part is that the
> memory required to do so is proportional to the size of the data
> being uploaded. The largest file that I was trying to upload
> contained 158M triples (24GB uncompressed, 1.5GB bz2 compressed) and
> Fuseki was frequently running out of memory. I had to set Fuseki to
> -Xmx32g and only then did it work. The command line tools faced the
> same problem.
>
> Another thing is that both the web interface and the command line
> tools optionally accept gzip files, but not bzip2, whereas bzip2 is
> used by both Wikipedia and DBpedia for their data dumps.
> I tried to work around the issue by `bzcat file.ttl.bz2 | gzip >
> named-pipe` and then using the named pipe for data ingestion, but
> that did not work.
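Since gzip input is accepted, one workaround is to recompress the dump ahead of time rather than through a named pipe. A minimal sketch using only Python's standard `bz2` and `gzip` modules (file names are illustrative, not from the original mail):

```python
import bz2
import gzip
import shutil

def bz2_to_gz(src: str, dst: str, chunk_size: int = 1 << 20) -> None:
    """Stream-recompress a .bz2 file to .gz without holding it in memory.

    Data is copied in fixed-size chunks, so peak memory stays constant
    regardless of the size of the dump.
    """
    with bz2.open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk_size)

# Example (hypothetical file names):
# bz2_to_gz("dbpedia-dump.ttl.bz2", "dbpedia-dump.ttl.gz")
```

The resulting `.ttl.gz` can then be fed to the web interface or the loaders that already understand gzip.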
>
> I finally ended up using `tdbloader2`, which works with constant
> memory and, as I read somewhere on the mailing list, produces the
> smallest database size. There might be some SPARQL way of inserting
> data in batches, and I probably could have scripted that, but I had a
> project to complete and so went with what appeared to be the most
> straightforward way of doing things.
>
> Performance of tdbloader2, on my 2017 Spectre x360 laptop with 16GB
> RAM, a dual-core i7-7500U CPU and a 512GB SSD:
> Phase I: 199,597,131 tuples; 2,941.31 seconds; 67,860.02 tuples/sec
> Total time: 4,609s
>
> I wanted to control which indexes are generated, because I knew the
> access pattern of my SPARQL queries and also wanted a smaller DB
> size. I think there are toggles to decide which indexes are
> generated, but I did not try to search much. Controlling the indexes
> from tdbloader2 itself would be a great option. If this feature
> already exists, please let me know.
>
> Another point of confusion is the version of the backing TDB
> database. The `bin` folder contains tdb commands for both v1 and v2,
> but I'm not sure which version Fuseki uses when I create a persistent
> store. The database config file `config.ttl` does not mention any
> version. I would appreciate it if someone could clear up this
> confusion for me.
>
> TL;DR please support bzip2, reading from pipes, and constant-memory
> loading operations.
>
>
> DATABASE PERFORMANCE, REPLICATION and SHARDING
>
> My project used Spark to distribute the load among a cluster of
> machines. The input data was all the articles in Wikipedia. Each
> partition of the data would contain about 250 articles. The first
> SPARQL query was to DESCRIBE these articles. A subsequent CONSTRUCT
> query would fetch the labels for all the object resources in the
> model returned by the first query. There were 44 cores in the
> cluster, so at any time 44 partitions would generate 44*2=88 SPARQL
> queries.
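The DESCRIBE-then-CONSTRUCT pattern described above can be sketched as a pair of query builders. The helper names and the VALUES-based label fetch are my own assumptions for illustration, not the author's actual code:

```python
def describe_query(articles):
    """Build a DESCRIBE query for one partition's batch of article IRIs."""
    return "DESCRIBE " + " ".join(f"<{iri}>" for iri in articles)

def label_query(resources):
    """Build a CONSTRUCT query fetching rdfs:label for each object
    resource found in the model returned by the DESCRIBE query."""
    values = " ".join(f"<{iri}>" for iri in resources)
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "CONSTRUCT { ?s rdfs:label ?l } WHERE {\n"
        f"  VALUES ?s {{ {values} }}\n"
        "  ?s rdfs:label ?l\n"
        "}"
    )
```

Each Spark partition would send the two query strings to the Fuseki endpoint in sequence, e.g. via an HTTP POST to its SPARQL endpoint.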
> The DESCRIBE query would run in milliseconds, whereas the CONSTRUCT
> query would take 1-2 seconds because of its random-access nature.
> Benchmarks were not comprehensive, just observation of log output.
>
> I got this performance when the entire database was resident in RAM,
> as reported by `vmtouch` (https://hoytech.com/vmtouch/).
> Without complete memory mapping, the performance would degrade to
> 500-1000 seconds per CONSTRUCT query. In my case the db was 16-19GB
> in size, so it could be `vmtouch`ed into RAM on a 32GB RAM, 8-core
> Xeon machine.
>
> To increase performance further, I replicated the db on an identical
> machine and load balanced queries between the two machines. The
> execution time of my entire Spark app went down from 2 hours to 1
> hour. A recent thread on this list talks about high availability and
> replication. You can just assign different threads to query different
> replicas of the db, with fallback on the other, and that would be
> sufficient in most cases.
>
> Memory mapping with vmtouch and replication are easy performance
> wins, but they have their limitations. They worked in this case
> because my DB was small enough to fit on a 32 GB machine and I had
> such machines at my disposal. As data sizes keep increasing every
> day, memory sizes cannot keep up and no single machine can hold the
> entire database. My cluster had a combined capacity of around 80 GB,
> and if you leave out the 12 GB used by the Spark workers, there was
> 68 GB of RAM available which could be used to hold the database.
> But since each individual node had <8 GB RAM, the db could not fit on
> any of them. The solution here is that Fuseki itself should support
> some form of sharding, so that it can access pieces of the index
> stored on lots of machines with small amounts of RAM.
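The thread-to-replica scheme with fallback could look like the sketch below. The class name, endpoint handling, and error policy are illustrative assumptions, not code from the mail; `send` stands in for whatever HTTP call actually executes the SPARQL query:

```python
import itertools

class ReplicaBalancer:
    """Round-robin over identical Fuseki replicas, falling back to the
    other replicas when one fails (illustrative sketch)."""

    def __init__(self, endpoints):
        self.endpoints = list(endpoints)
        self._cycle = itertools.cycle(range(len(self.endpoints)))

    def ordered_endpoints(self):
        """Endpoints to try for the next query: primary first, then fallbacks."""
        start = next(self._cycle)
        n = len(self.endpoints)
        return [self.endpoints[(start + i) % n] for i in range(n)]

    def query(self, send):
        """Try each replica in turn; `send(endpoint)` is a caller-supplied
        function that raises on failure (e.g. an HTTP POST of the query)."""
        last_err = None
        for ep in self.ordered_endpoints():
            try:
                return send(ep)
            except Exception as err:  # this replica failed; try the next one
                last_err = err
        raise last_err
```

Each worker thread would hold one balancer (or share a thread-safe variant), so load spreads across the replicas while any single failure only costs a retry.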
>
> Alternatively, providing integration with other dbs like Redis,
> Cassandra, or even Spark with GraphX would be a better solution to
> scale to bigger datasets.
>
> Jena's prime features for me were the Fuseki server, the Java
> library to operate on local models and, even though I did not use it
> but plan to, the inference features.
> I don't think the database layer is something that the developers
> should spend their energy on. Rather than turning TDB into a
> full-blown distributed DB, it would be better to integrate with one
> of the open source ones already available.
>
> TL;DR Database sharding is needed for scaling
>
>
> QUERY PERFORMANCE
>
> I have noticed that combining the results of two or more simple
> queries is orders of magnitude faster than a single query which
> performs the same function. Take this example:
>
> # parent+child count
> select ?type (count(?obj) as ?c) where {
>   ?type rdfs:subClassOf owl:Thing .
>   ?obj a/rdfs:subClassOf? ?type
> }
> group by ?type
>
> This query took 1022 seconds to execute. Inference features were not
> used for any of the queries.
>
> We can split this into 2 simpler queries:
>
> # parent count took 2.57 seconds
> select ?type (count(?obj) as ?c) where {
>   ?type rdfs:subClassOf owl:Thing .
>   ?obj a ?type
> }
> group by ?type
>
> and
>
> # child count took 10.86 seconds
> select ?type (count(?obj) as ?c) where {
>   ?type rdfs:subClassOf owl:Thing .
>   ?obj a/rdfs:subClassOf ?type
> }
> group by ?type
>
> The simpler queries took a combined time of 2.57+10.86=13.43 seconds
> when fired sequentially.
> In a load balanced scheme each query would be served by a different
> server, and then the execution time of the longest query becomes the
> bottleneck. The combined query is 1022/13.43 = ~76 times slower
> (1022/10.86 = ~94 times slower than parallel execution).
>
> Query optimization is something that is needed for SQL even after so
> many years of being on the market.
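Because both split queries group by the same ?type, and the zero-or-one path `a/rdfs:subClassOf?` is exactly the union of the two simpler patterns, recombining the split results client-side should be a per-key sum. A sketch, assuming each result set has already been read into a dict of ?type to ?c:

```python
from collections import Counter

def merge_counts(parent_counts, child_counts):
    """Combine the per-type counts from the two split SELECT queries.

    Summing the counts key-by-key reproduces what the single combined
    parent+child query would report for each ?type.
    """
    total = Counter(parent_counts)
    total.update(child_counts)  # Counter.update adds counts, not replaces
    return dict(total)
```

This is the kind of client-side recombination that makes the split-and-parallelize approach practical when the combined query is the bottleneck.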
> SPARQL is relatively new, so it is understandable that query
> optimization may not be that developed yet.
>
> A small bug I encountered: during execution, some of my queries got
> malformed because fully-qualified IRIs are shortened to
> prefix:resource format in Turtle. I believe this happens with IRIs
> that do not conform to standards; DBpedia has lots of them.
> I got around that by removing the PREFIX dbr: ... entry from the
> query, so IRIs were not rewritten.
>
>
> CONCLUSION
>
> The fact that I could use Jena without losing focus on my project
> shows how it gets out of your way and does its job.
> Apart from the initial data ingestion hurdles, I have not had any
> issues with using Jena. The Fuseki interface was a crucial component
> for me to learn SPARQL by example. Overall it was a pleasure to use.
> Kudos to the developers!
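Returning to the prefix-rewriting bug mentioned under QUERY PERFORMANCE: the hazard can be illustrated with a deliberately naive shortener. Characters like `(` and `)` are common in DBpedia IRIs but are not legal unescaped in a prefixed-name local part, so naive shortening yields a malformed SPARQL term. This sketch is illustrative only, not Jena's actual rewriting code:

```python
def naive_shorten(iri, prefixes):
    """Illustrative-only: shorten an IRI to prefix:local the naive way,
    without checking whether the local part is a legal prefixed name."""
    for prefix, ns in prefixes.items():
        if iri.startswith(ns):
            return f"{prefix}:{iri[len(ns):]}"
    return f"<{iri}>"

prefixes = {"dbr": "http://dbpedia.org/resource/"}
# "dbr:Mercury_(planet)" -- the unescaped parentheses make this an
# invalid term, which is why dropping the PREFIX line avoided the bug.
shortened = naive_shorten("http://dbpedia.org/resource/Mercury_(planet)", prefixes)
```

With no matching prefix declared, the function falls through to the safe `<...>` form, which matches the author's workaround of removing the PREFIX entry.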