Just a quick note: there is a Cassandra implementation, but no work has been done on performance tuning.
On a second note: I did some work using Bloom filters to do partitioning that allows adding partitions on demand. It should work for triple store partitioning as well.

Claude

On Wed, May 30, 2018, 8:43 AM Siddhesh Rane <kingsid...@gmail.com> wrote:
> For my undergraduate project I used a Fuseki 3.6.0 server backed by a
> TDB dataset. 30000 unique SPARQL queries were made against the server
> by 6 nodes in a Spark cluster, returning a total of 150 million
> triples. As I used DBpedia's dataset, nearly all the entities from
> Wikipedia were covered, so my experiment is somewhat like an
> exhaustive test.
>
> I write as someone quite new to Jena and SPARQL itself, so I may have
> faced problems for doing things the wrong way, or not knowing better
> solutions. Although Fuseki was a critical component in my pipeline, I
> could not spend much time on learning it properly, so kindly forgive
> any ignorance on my part. I hope my experience will help the
> developers to know the ways in which at least newcomers are using
> this software.
>
>
> DATA INGESTION
>
> This was the most tedious part about using Jena.
> The ability to create a TDB database and upload data to it, all from
> the browser, is a really nice feature. The difficult part is that the
> memory required to do so is proportional to the size of the data
> being uploaded. The largest file that I was trying to upload
> contained 158M triples (24GB uncompressed, 1.5GB bz2 compressed) and
> Fuseki was frequently running out of memory. I had to set Fuseki to
> -Xmx32g and only then did it work. The command line tools faced the
> same problem.
>
> Another thing is that both the web interface and the command line
> tools optionally accept gzip files, but not bzip2, whereas bzip2 is
> used by both Wikipedia and DBpedia for their data dumps.
> I tried to work around the issue by `bzcat file.ttl.bz2 | gzip >
> named-pipe` and then using the named pipe for data ingestion, but
> that did not work.
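Since gzip input is accepted, one workaround is to recompress the dump ahead of time rather than through a named pipe. A minimal sketch using only Python's standard `bz2` and `gzip` modules (file names are illustrative, not from the original mail):

```python
import bz2
import gzip
import shutil

def bz2_to_gz(src: str, dst: str, chunk_size: int = 1 << 20) -> None:
    """Stream-recompress a .bz2 file to .gz without holding it in memory.

    Data is copied in fixed-size chunks, so peak memory stays constant
    regardless of the size of the dump.
    """
    with bz2.open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk_size)

# Example (hypothetical file names):
# bz2_to_gz("dbpedia-dump.ttl.bz2", "dbpedia-dump.ttl.gz")
```

The resulting `.ttl.gz` can then be fed to the web interface or the loaders that already understand gzip.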
>
> I finally ended up using `tdbloader2`, which works with constant
> memory and, as I read somewhere on the mailing list, produces the
> smallest database size. There might be some SPARQL way of inserting
> data in batches, and I probably could have scripted that, but I had a
> project to complete and so went with what appeared to be the most
> straightforward way of doing things.
>
> Performance of tdbloader2, on my 2017 Spectre x360 laptop with 16GB
> RAM, a dual-core i7-7500U CPU and a 512GB SSD:
> Phase I: 199,597,131 tuples; 2,941.31 seconds; 67,860.02 tuples/sec
> Total time: 4,609s
>
> I wanted to control which indexes are generated, because I knew the
> access pattern of my SPARQL queries and also wanted a smaller DB
> size. I think there are toggles to decide which indexes are
> generated, but I did not try to search much. Controlling the indexes
> from tdbloader2 itself would be a great option. If this feature
> already exists, please let me know.
>
> Another point of confusion is the version of the backing TDB
> database. The `bin` folder contains tdb commands for both v1 and v2,
> but I'm not sure which version Fuseki uses when I create a persistent
> store. The database config file `config.ttl` does not mention any
> version. I would appreciate it if someone could clear up this
> confusion for me.
>
> TL;DR please support bzip2, reading from pipes, and constant-memory
> loading operations.
>
>
> DATABASE PERFORMANCE, REPLICATION and SHARDING
>
> My project used Spark to distribute the load among a cluster of
> machines. The input data was all the articles in Wikipedia. Each
> partition of the data would contain about 250 articles. The first
> SPARQL query was to DESCRIBE these articles. A subsequent CONSTRUCT
> query would fetch the labels for all the object resources in the
> model returned by the first query. There were 44 cores in the
> cluster, so at any time 44 partitions would generate 44*2=88 SPARQL
> queries.
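The DESCRIBE-then-CONSTRUCT pattern described above can be sketched as a pair of query builders. The helper names and the VALUES-based label fetch are my own assumptions for illustration, not the author's actual code:

```python
def describe_query(articles):
    """Build a DESCRIBE query for one partition's batch of article IRIs."""
    return "DESCRIBE " + " ".join(f"<{iri}>" for iri in articles)

def label_query(resources):
    """Build a CONSTRUCT query fetching rdfs:label for each object
    resource found in the model returned by the DESCRIBE query."""
    values = " ".join(f"<{iri}>" for iri in resources)
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "CONSTRUCT { ?s rdfs:label ?l } WHERE {\n"
        f"  VALUES ?s {{ {values} }}\n"
        "  ?s rdfs:label ?l\n"
        "}"
    )
```

Each Spark partition would send the two query strings to the Fuseki endpoint in sequence, e.g. via an HTTP POST to its SPARQL endpoint.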
> The DESCRIBE query would run in milliseconds, whereas the CONSTRUCT
> query would take 1-2 seconds because of its random-access nature.
> Benchmarks were not comprehensive, just observation of log output.
>
> I got this performance when the entire database was resident in RAM,
> as reported by `vmtouch` (https://hoytech.com/vmtouch/).
> Without complete memory mapping, the performance would degrade to
> 500-1000 seconds per CONSTRUCT query. In my case the db was 16-19GB
> in size, so it could be `vmtouch`ed into RAM on a 32GB RAM, 8-core
> Xeon machine.
>
> To increase performance further, I replicated the db on an identical
> machine and load balanced queries between the two machines. The
> execution time of my entire Spark app went down from 2 hours to 1
> hour. A recent thread on this list talks about high availability and
> replication. You can just assign different threads to query different
> replicas of the db, with fallback on the other, and that would be
> sufficient in most cases.
>
> Memory mapping with vmtouch and replication are easy performance
> wins, but they have their limitations. They worked in this case
> because my DB was small enough to fit on a 32 GB machine and I had
> such machines at my disposal. As data sizes keep increasing every
> day, memory sizes cannot keep up and no single machine can hold the
> entire database. My cluster had a combined capacity of around 80 GB,
> and if you leave out the 12 GB used by the Spark workers, there was
> 68 GB of RAM available which could be used to hold the database.
> But since each individual node had <8 GB RAM, the db could not fit on
> any of them. The solution here is that Fuseki itself should support
> some form of sharding, so that it can access pieces of the index
> stored on lots of machines with small amounts of RAM.
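The thread-to-replica scheme with fallback could look like the sketch below. The class name, endpoint handling, and error policy are illustrative assumptions, not code from the mail; `send` stands in for whatever HTTP call actually executes the SPARQL query:

```python
import itertools

class ReplicaBalancer:
    """Round-robin over identical Fuseki replicas, falling back to the
    other replicas when one fails (illustrative sketch)."""

    def __init__(self, endpoints):
        self.endpoints = list(endpoints)
        self._cycle = itertools.cycle(range(len(self.endpoints)))

    def ordered_endpoints(self):
        """Endpoints to try for the next query: primary first, then fallbacks."""
        start = next(self._cycle)
        n = len(self.endpoints)
        return [self.endpoints[(start + i) % n] for i in range(n)]

    def query(self, send):
        """Try each replica in turn; `send(endpoint)` is a caller-supplied
        function that raises on failure (e.g. an HTTP POST of the query)."""
        last_err = None
        for ep in self.ordered_endpoints():
            try:
                return send(ep)
            except Exception as err:  # this replica failed; try the next one
                last_err = err
        raise last_err
```

Each worker thread would hold one balancer (or share a thread-safe variant), so load spreads across the replicas while any single failure only costs a retry.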
>
> Alternatively, providing integration with other dbs like Redis,
> Cassandra, or even Spark with GraphX would be a better solution to
> scale to bigger datasets.
>
> Jena's prime features for me were the Fuseki server, the Java
> library to operate on local models and, even though I did not use it
> but plan to, the inference features.
> I don't think the database layer is something that the developers
> should spend their energy on. Rather than turning TDB into a
> full-blown distributed DB, it would be better to integrate with one
> of the open source ones already available.
>
> TL;DR Database sharding is needed for scaling
>
>
> QUERY PERFORMANCE
>
> I have noticed that combining the results of two or more simple
> queries is orders of magnitude faster than a single query which
> performs the same function. Take this example:
>
> # parent+child count
> select ?type (count(?obj) as ?c) where {
>   ?type rdfs:subClassOf owl:Thing .
>   ?obj a/rdfs:subClassOf? ?type
> }
> group by ?type
>
> This query took 1022 seconds to execute. Inference features were not
> used for any of the queries.
>
> We can split this into 2 simpler queries:
>
> # parent count took 2.57 seconds
> select ?type (count(?obj) as ?c) where {
>   ?type rdfs:subClassOf owl:Thing .
>   ?obj a ?type
> }
> group by ?type
>
> and
>
> # child count took 10.86 seconds
> select ?type (count(?obj) as ?c) where {
>   ?type rdfs:subClassOf owl:Thing .
>   ?obj a/rdfs:subClassOf ?type
> }
> group by ?type
>
> The simpler queries took a combined time of 2.57+10.86=13.43 seconds
> when fired sequentially.
> In a load balanced scheme each query would be served by a different
> server, and then the execution time of the longest query becomes the
> bottleneck. The combined query is 1022/13.43 = ~76 times slower
> (1022/10.86 = ~94 times slower than parallel execution).
>
> Query optimization is something that is needed for SQL even after so
> many years of being on the market.
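Because both split queries group by the same ?type, and the zero-or-one path `a/rdfs:subClassOf?` is exactly the union of the two simpler patterns, recombining the split results client-side should be a per-key sum. A sketch, assuming each result set has already been read into a dict of ?type to ?c:

```python
from collections import Counter

def merge_counts(parent_counts, child_counts):
    """Combine the per-type counts from the two split SELECT queries.

    Summing the counts key-by-key reproduces what the single combined
    parent+child query would report for each ?type.
    """
    total = Counter(parent_counts)
    total.update(child_counts)  # Counter.update adds counts, not replaces
    return dict(total)
```

This is the kind of client-side recombination that makes the split-and-parallelize approach practical when the combined query is the bottleneck.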
> SPARQL is relatively new, so it is understandable that query
> optimization may not be that developed yet.
>
> A small bug I encountered: during execution, some of my queries got
> malformed because fully-qualified IRIs are shortened to
> prefix:resource format in Turtle. I believe this happens with IRIs
> that do not conform to standards; DBpedia has lots of them.
> I got around that by removing the PREFIX dbr: ... entry from the
> query, so IRIs were not rewritten.
>
>
> CONCLUSION
>
> The fact that I could use Jena without losing focus on my project
> shows how it gets out of your way and does its job.
> Apart from the initial data ingestion hurdles, I have not had any
> issues with using Jena. The Fuseki interface was a crucial component
> for me to learn SPARQL by example. Overall it was a pleasure to use.
> Kudos to the developers!
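Returning to the prefix-rewriting bug mentioned under QUERY PERFORMANCE: the hazard can be illustrated with a deliberately naive shortener. Characters like `(` and `)` are common in DBpedia IRIs but are not legal unescaped in a prefixed-name local part, so naive shortening yields a malformed SPARQL term. This sketch is illustrative only, not Jena's actual rewriting code:

```python
def naive_shorten(iri, prefixes):
    """Illustrative-only: shorten an IRI to prefix:local the naive way,
    without checking whether the local part is a legal prefixed name."""
    for prefix, ns in prefixes.items():
        if iri.startswith(ns):
            return f"{prefix}:{iri[len(ns):]}"
    return f"<{iri}>"

prefixes = {"dbr": "http://dbpedia.org/resource/"}
# "dbr:Mercury_(planet)" -- the unescaped parentheses make this an
# invalid term, which is why dropping the PREFIX line avoided the bug.
shortened = naive_shorten("http://dbpedia.org/resource/Mercury_(planet)", prefixes)
```

With no matching prefix declared, the function falls through to the safe `<...>` form, which matches the author's workaround of removing the PREFIX entry.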