Thanks Bruno, I haven't written a paper yet. I'm still in the process of experimenting.
Regards,
Siddhesh

On Thu, May 31, 2018 at 6:41 AM, Claude Warren <cla...@xenei.com> wrote:
> Just a quick note. There is a Cassandra implementation but no work has
> been done on performance tuning.
>
> On a second note. I did some work using Bloom filters to do partitioning
> that allows adding partitions on demand. Should work for triple store
> partitioning as well.
>
> Claude
>
> On Wed, May 30, 2018, 8:43 AM Siddhesh Rane <kingsid...@gmail.com> wrote:
>
>> For my undergraduate project I used a Fuseki 3.6.0 server backed by a
>> TDB dataset.
>> 30,000 unique SPARQL queries were made against the server by 6 nodes in
>> a Spark cluster, returning a total of 150 million triples.
>> As I used DBpedia's dataset, nearly all the entities from Wikipedia
>> were covered, so my experiment is somewhat like an exhaustive test.
>>
>> I write as someone quite new to Jena and SPARQL itself, so I may have
>> run into problems by doing things the wrong way or by not knowing of
>> better solutions.
>> Although Fuseki was a critical component in my pipeline, I could not
>> spend much time on learning it properly, so kindly forgive any
>> ignorance on my part.
>> I hope my experience will help the developers see how at least
>> newcomers are using this software.
>>
>>
>> DATA INGESTION
>>
>> This was the most tedious part of using Jena.
>> The ability to create a TDB database and upload data to it, all from
>> the browser, is a really nice feature. The difficult part is that the
>> memory required to do so is proportional to the size of the data being
>> uploaded. The largest file I tried to upload contained 158M triples
>> (24GB uncompressed, 1.5GB bz2 compressed) and the upload frequently
>> ran out of memory. I had to run Fuseki with -Xmx32g and only then did
>> it work. The command line tools faced the same problem.
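A possible workaround for the upload memory ceiling (a sketch I have not tested against Fuseki; the batch size and file handling are purely illustrative) is to split a line-oriented N-Triples dump into batches and upload each batch as a separate request, so that no single request needs the whole file in the heap:

```python
# Sketch: read an (optionally bz2-compressed) N-Triples dump in fixed-size
# batches of lines. N-Triples is one triple per line, so splitting on line
# boundaries is safe; Turtle's multi-line syntax would need a real parser.
import bz2
import itertools

def nt_batches(path, size=500_000):
    """Yield successive chunks of at most `size` triples (lines)."""
    opener = bz2.open if path.endswith(".bz2") else open
    with opener(path, "rt", encoding="utf-8") as f:
        while True:
            batch = list(itertools.islice(f, size))
            if not batch:
                return
            yield "".join(batch)
```

Each yielded chunk could then be POSTed to the dataset's upload endpoint in turn, keeping memory use proportional to the batch size rather than to the file size.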
>>
>> Another thing is that both the web interface and the command line tools
>> optionally accept gzip files, but not bzip2, whereas bzip2 is what both
>> Wikipedia and DBpedia use for their data dumps.
>> I tried to work around the issue with `bzcat file.ttl.bz2 | gzip >
>> named-pipe` and then using the named pipe for data ingestion, but that
>> did not work.
>>
>> I finally ended up using `tdbloader2`, which works with constant memory
>> and, as I read somewhere on the mailing list, produces the smallest
>> database.
>> There might be some SPARQL way of inserting data in batches and I
>> probably could have scripted that, but I had a project to complete and
>> so went with what appeared to be the most straightforward way of doing
>> things.
>>
>> Performance of tdbloader2:
>> On my 2017 Spectre x360 laptop with 16GB RAM, a dual-core i7-7500U CPU
>> and a 512GB SSD:
>> Phase I: 199,597,131 tuples; 2,941.31 seconds; 67,860.02 tuples/sec
>> Total time: 4609s
>>
>> I wanted to control which indexes are generated, because I knew the
>> access pattern of my SPARQL queries and also wanted a smaller DB.
>> I think there are toggles to decide which indexes get generated, but I
>> did not search much.
>> Controlling the indexes from tdbloader2 itself would be a great
>> option. If this feature already exists, please let me know.
>>
>> Another point of confusion is the version of the backing TDB database.
>> The `bin` folder contains tdb commands for both v1 and v2, but I'm not
>> sure which version Fuseki uses when I create a persistent store.
>> The database config file `config.ttl` does not mention any version.
>> I would appreciate it if someone could clear up this confusion for me.
>>
>> TL;DR please support bzip2, reading from pipes, and constant-memory
>> loading operations.
>>
>>
>> DATABASE PERFORMANCE, REPLICATION and SHARDING
>>
>> My project used Spark to distribute the load among a cluster of
>> machines. The input data was all the articles in Wikipedia.
>> Each partition of the data would contain about 250 articles. The first
>> SPARQL query was a DESCRIBE of these articles. A subsequent CONSTRUCT
>> query would fetch the labels for all the object resources in the model
>> returned by the first query. There were 44 cores in the cluster, so at
>> any time 44 partitions would generate 44*2=88 SPARQL queries. The
>> DESCRIBE query would run in milliseconds, whereas the CONSTRUCT query
>> would take 1-2 seconds because of its random-access nature.
>> Benchmarks were not comprehensive, just observation of log output.
>>
>> I got this performance when the entire database was resident in RAM,
>> as reported by `vmtouch` (https://hoytech.com/vmtouch/).
>> Without the database fully cached in memory, performance would degrade
>> to 500-1000 seconds per CONSTRUCT query. In my case the db was 16-19GB
>> in size, so it could be `vmtouch`ed into RAM on a 32GB, 8-core Xeon
>> machine.
>>
>> To increase performance further I replicated the db on an identical
>> machine and load-balanced queries between the two machines. The
>> execution time of my entire Spark app went down from 2 hours to 1
>> hour. A recent thread on this list talks about high availability and
>> replication. You can just assign different threads to query different
>> replicas of the db, with fallback to the other replica, and that would
>> be sufficient in most cases.
>>
>> Caching with vmtouch and replication are easy performance wins, but
>> they have their limitations.
>> They worked in this case because my DB was small enough to fit on a
>> 32GB machine and I had such machines at my disposal.
>> As data sizes keep growing every day, memory sizes cannot keep up and
>> no single machine can hold the entire database.
>> My cluster had a combined capacity of around 80GB, and if you leave
>> out the 12GB used by the Spark workers, there was 68GB of RAM
>> available which could have held the database.
>> But since each individual node had <8 GB RAM, the db could not fit on
>> any of them.
>> The solution here is for Fuseki itself to support some form of
>> sharding, so that it can access pieces of the index stored on many
>> machines with small amounts of RAM.
>> Alternatively, providing integration with other DBs like Redis,
>> Cassandra, or even Spark with GraphX would be a better way to scale to
>> bigger datasets.
>>
>> Jena's prime features for me were the Fuseki server, the Java library
>> for operating on local models and, though I have not used them yet but
>> plan to, the inference features.
>> I don't think the database layer is something the developers should
>> spend their energy on. Rather than turning TDB into a full-blown
>> distributed DB, it would be better to integrate with one of the open
>> source ones already available.
>>
>> TL;DR Database sharding is needed for scaling
>>
>>
>> QUERY PERFORMANCE
>>
>> I have noticed that combining the results of two or more simple
>> queries can be orders of magnitude faster than a single query which
>> performs the same function.
>> Take this example:
>>
>> # parent+child count
>> select ?type (count(?obj) as ?c) where {
>>   ?type rdfs:subClassOf owl:Thing .
>>   ?obj a/rdfs:subClassOf? ?type
>> }
>> group by ?type
>>
>> This query took 1022 seconds to execute. Inference features were not
>> used for any of the queries.
>>
>> We can split this into 2 simpler queries:
>>
>> # parent count; took 2.57 seconds
>> select ?type (count(?obj) as ?c) where {
>>   ?type rdfs:subClassOf owl:Thing .
>>   ?obj a ?type
>> }
>> group by ?type
>>
>> and
>>
>> # child count; took 10.86 seconds
>> select ?type (count(?obj) as ?c) where {
>>   ?type rdfs:subClassOf owl:Thing .
>>   ?obj a/rdfs:subClassOf ?type
>> }
>> group by ?type
>>
>> The simpler queries took a combined 2.57+10.86 = 13.43 seconds when
>> fired sequentially.
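Since the two simpler queries are independent, they can also be fired concurrently, one per replica. A minimal sketch of that fan-out pattern (the `run_query` body is a placeholder, not real Fuseki client code, and the endpoint handling is illustrative):

```python
# Sketch: fan independent SPARQL queries out across replica endpoints so
# wall time is roughly that of the slowest query rather than the sum.
from concurrent.futures import ThreadPoolExecutor

def run_query(endpoint, sparql):
    # Placeholder: in practice, POST `sparql` to the endpoint (e.g. with
    # urllib or SPARQLWrapper) and parse the result bindings.
    return (endpoint, sparql)

def fan_out(endpoints, queries):
    # Round-robin the queries across the available replicas and collect
    # the results in query order.
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        futures = [pool.submit(run_query, endpoints[i % len(endpoints)], q)
                   for i, q in enumerate(queries)]
        return [f.result() for f in futures]
```

With two replicas, e.g. `fan_out([replica1, replica2], [parent_query, child_query])`, the elapsed time is roughly that of the slower query (10.86 seconds here) instead of the 13.43-second sequential total.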
>> In a load-balanced scheme each query would be served by a different
>> server, and then the execution time of the longest query becomes the
>> bottleneck.
>> The combined query is 1022/13.43 = ~76 times slower than sequential
>> execution (and 1022/10.86 = ~94 times slower than parallel execution).
>>
>> Query optimization is something SQL still needs work on even after so
>> many years on the market.
>> SPARQL is relatively new, so it is understandable that its query
>> optimization may not be that developed yet.
>>
>> A small bug I encountered:
>> during execution some of my queries became malformed because fully
>> qualified IRIs are shortened to prefix:resource form in Turtle.
>> I believe this happens with IRIs that do not conform to the standards;
>> DBpedia has lots of them.
>> I got around it by removing the PREFIX dbr: ... entry from the query
>> so that IRIs were not rewritten.
>>
>>
>> CONCLUSION
>>
>> The fact that I could use Jena without losing focus on my project
>> shows how it gets out of your way and does its job.
>> Apart from the initial data ingestion hurdles, I have not had any
>> issues with using Jena.
>> The Fuseki interface was a crucial component for me in learning SPARQL
>> by example.
>> Overall it was a pleasure to use.
>> Kudos to the developers!
>> --
>> Your greatest regret is the email ID you chose in 8th grade