Hi Stefan,
Like all such things, "it depends". It depends somewhat on the
structure of your data and a lot on the nature of your queries.
We run one public-facing service at that scale (about 400 million
triples) where the data is quite regular and flat, with no large
literals; queries tend to be selective retrieval of entries - possibly
retrieving quite a large number of entries, but no analytic queries.
Usage is low (less than 500k requests/month).
That's run off a load-balanced pair of AWS i3.large instances, so 16GB
of RAM and lots of fast NVMe disk, though the database on disk is only
around 30GB so a lot of the disk is unused. Each replica comfortably
runs Fuseki, an API in Tomcat and a Rails app providing a data
explorer, all on the one server. Replication is for fault tolerance
and zero-downtime upgrades; it's not needed to handle the load.
For a second service at the same sort of scale (a similar number of
triples) the data is still relatively flat, but queries can be a
little more complex and there's a background managed queue of more
complex analytic queries competing with simpler interactive foreground
requests. Usage levels are fairly high (significantly over 10M
requests/month). The database on disk is larger (larger, more diverse
literals) - I don't know the exact size offhand but I'd guess more
than twice the size. That service separates the API and applications
onto a different server tier from the data servers. The data servers
only run Fuseki but need at least 32GB to handle the throughput.
Indeed, we may increase that further so that under high load a bigger
fraction of the data can be cached in memory.
Dave
On 23/09/17 00:41, Dimov, Stefan wrote:
Thanks,
So if I have a node with a TDB of 300-400MT, what would be the
(minimum) appropriate RAM/disk size? Roughly speaking …
Regards,
Stefan
On 9/22/17, 4:36 PM, "[email protected]" <[email protected]> wrote:
Can't speak for Dave, but I would think he meant "mega-triples"; 300-400
million triples of data.
It's a common way of talking about the size of RDF datasets.
ajs6f
Dimov, Stefan wrote on 9/22/17 7:32 PM:
> Thanks Dave,
>
> Can you clarify, please, what do you mean by “reasonable memory
> footprint for the data scale. For larger data (300-400MT plus)”?
>
> “300-400MT” – do you mean 300-400MB of RAM, or by “MT” do you mean
> “Mega-Transfer”, or do you mean disk space?
>
> If 300-400MB is the RAM you are using, what is the corresponding
> disk size TDB takes in your particular case?
>
> Or if this is the disk space, what is the size of your RAM?
>
> Regards,
> Stefan
>
>
> On 9/22/17, 1:22 PM, "Dimov, Stefan" <[email protected]> wrote:
>
> Thanks, Dave!
>
> S.
>
> On 9/22/17, 12:21 AM, "Dave Reynolds" <[email protected]> wrote:
>
> Sorry, missed this question ...
>
> It depends on the scale of the data, the size of the Tomcat
> application, the machine sizes available, and how much API-side
> in-memory caching you want to do.
>
> We use both styles successfully. For modest data even at high load,
> or for large data at modest load, having both on the same machine
> works fine and is slightly easier to scale out, so long as your
> machines have a reasonable memory footprint for the data scale. For
> larger data (300-400MT plus) with either significant query rates or
> very memory-hungry applications we split the data and front-end
> tiers.
>
> Dave
>
> On 19/09/17 20:02, Dimov, Stefan wrote:
> > Thanks for the response!
> >
> > One more question:
> >
> > Would it be better if I put Tomcat on one machine and have Fuseki
> > on another?
> >
> > Provided they are both on the same network and the connection
> > between them is unobstructed, wouldn’t this improve the
> > performance, considering they don’t share memory/CPU?
> >
> > Regards,
> > Stefan
> >
> > On 9/19/17, 5:24 AM, "Dave Reynolds" <[email protected]> wrote:
> >
> > On 19/09/17 11:33, George News wrote:
> > >
> > > On 2017-09-19 09:57, Dave Reynolds wrote:
> > >> On 19/09/17 01:13, Dimov, Stefan wrote:
> > >>> Hi,
> > >>>
> > >>> I have a Tomcat setup that receives REST requests,
> > >>> “translates” them into SPARQL queries, invokes them on the
> > >>> underlying FUSEKI and returns the results:
> > >>>
> > >>>
> > >>> USER AGENT
> > >>> ^
> > >>> REST
> > >>> v
> > >>> ---------------
> > >>> TOMCAT
> > >>> ^
> > >>> REST
> > >>> v
> > >>> -------------
> > >>> FUSEKI
> > >>> ------------
> > >>> JENA
> > >>> -----------
> > >>> TDB
> > >>> ----------
> > >>>
> > >>> Would I be able to achieve a significant performance
> > >>> improvement if I use the JENA libraries directly and bypass
> > >>> FUSEKI?
> > >>
> > >> Unlikely. We successfully use the setup you describe for
> > >> dozens of services, some quite high load. We have a few which
> > >> go direct to Jena for legacy reasons and they show no
> > >> particular performance benefits.
> > >>
> > >> If your payloads can be large then make sure the way you are
> > >> driving Fuseki is streaming and doesn't accidentally store the
> > >> entire SPARQL results in your Tomcat app. This also means
> > >> choosing a streamable media type for your Fuseki requests.
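To make that concrete, here is a minimal sketch of a streaming SELECT
against Fuseki, assuming Jena 3.x; the endpoint URL and query are made
up, and the cast to QueryEngineHTTP is just one way to pick a
streamable results media type:

    import org.apache.jena.query.*;
    import org.apache.jena.riot.WebContent;
    import org.apache.jena.sparql.engine.http.QueryEngineHTTP;

    public class StreamingSelect {
        public static void main(String[] args) {
            String endpoint = "http://localhost:3030/ds/sparql"; // hypothetical Fuseki endpoint
            String sparql = "SELECT ?s WHERE { ?s ?p ?o }";

            // try-with-resources closes the execution (and its HTTP connection) for us
            try (QueryExecution qexec = QueryExecutionFactory.sparqlService(endpoint, sparql)) {
                // Ask for the SPARQL XML results format, which Jena parses as a stream
                ((QueryEngineHTTP) qexec).setSelectContentType(WebContent.contentTypeResultsXML);

                ResultSet rs = qexec.execSelect();
                while (rs.hasNext()) {
                    QuerySolution row = rs.next();
                    // Handle one row at a time; never copy the whole ResultSet into a list
                    System.out.println(row.get("s"));
                }
            }
        }
    }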
> > >
> > > I'm using Jena to create my own REST service and I'm facing
> > > some issues when SPARQL result sets are big. Could you please
> > > give me a hint on the streaming stuff from Fuseki so I can
> > > incorporate that into my REST service?
> >
> > If you are just doing SELECTs then it should be straightforward.
> > Of the SPARQL results media types, at least XML and TSV are
> > streaming. We just use Jena's QueryExecutionFactory.sparqlService
> > in the REST service to set up the execution. We wrap the ResultSet
> > from execSelect and process that one row at a time. Our wrapper
> > keeps track of the underlying QueryExecution so we can close that
> > when finished or in the event of a problem.
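A rough sketch of what such a wrapper could look like, assuming Jena
3.x - the class name and shape are illustrative, not the actual
wrapper described above:

    import java.util.Iterator;
    import org.apache.jena.query.*;

    // Iterates rows while keeping hold of the underlying QueryExecution,
    // so the connection can be released when finished or on error.
    public class ClosableResults implements Iterator<QuerySolution>, AutoCloseable {
        private final QueryExecution qexec;
        private final ResultSet results;
        private boolean closed = false;

        public ClosableResults(String endpoint, String sparql) {
            this.qexec = QueryExecutionFactory.sparqlService(endpoint, sparql);
            this.results = qexec.execSelect();
        }

        @Override public boolean hasNext() {
            boolean more = !closed && results.hasNext();
            if (!more) close(); // free the connection as soon as the rows run out
            return more;
        }

        @Override public QuerySolution next() { return results.next(); }

        @Override public void close() {
            if (!closed) { closed = true; qexec.close(); }
        }
    }

Used in a try-with-resources block, close() also runs if processing a
row throws part-way through.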
> >
> > For DESCRIBE/CONSTRUCT queries, use a streamable media type for
> > the RDF such as ntriples/nquads. We have less experience of that;
> > we tend to actually execute those in batches (a SELECT provides a
> > set of resource bindings and we then issue a DESCRIBE on those
> > resources one batch at a time).
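Roughly, that batching pattern might look like the following sketch,
again assuming Jena 3.x; the endpoint, class URI and batch size are
all invented:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.jena.query.*;
    import org.apache.jena.rdf.model.Model;

    public class BatchedDescribe {
        static final int BATCH = 50; // illustrative batch size

        public static void main(String[] args) {
            String endpoint = "http://localhost:3030/ds/sparql"; // hypothetical endpoint
            List<String> uris = new ArrayList<>();

            // Step 1: a SELECT provides the set of resource bindings
            String sel = "SELECT ?item WHERE { ?item a <http://example.org/Entry> }";
            try (QueryExecution qexec = QueryExecutionFactory.sparqlService(endpoint, sel)) {
                ResultSet rs = qexec.execSelect();
                while (rs.hasNext()) {
                    uris.add(rs.next().getResource("item").getURI());
                    if (uris.size() == BATCH) { describeBatch(endpoint, uris); uris.clear(); }
                }
            }
            if (!uris.isEmpty()) describeBatch(endpoint, uris);
        }

        // Step 2: one DESCRIBE per batch - a single DESCRIBE may name several resources
        static void describeBatch(String endpoint, List<String> uris) {
            StringBuilder q = new StringBuilder("DESCRIBE");
            for (String u : uris) q.append(" <").append(u).append(">");
            try (QueryExecution qexec = QueryExecutionFactory.sparqlService(endpoint, q.toString())) {
                Model batch = qexec.execDescribe();
                batch.write(System.out, "N-TRIPLES"); // emit each batch as streamable N-Triples
            }
        }
    }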
> >
> > Dave