Thanks, Andy. This is a collection of roughly 4.7M triples. How does the
triple count affect the memory estimate, relative to the total size on disk?
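For reference, here is the back-of-envelope arithmetic I'm working from,
using the figures in this thread (the "cache everything plus heap" rule at
the end is my own assumption, not a documented Jena guideline):

```python
# Figures from this thread; the sizing rule at the end is an
# assumption, not an official Jena recommendation.
triples = 4_700_000           # reported dataset size
tdb_bytes = 2.1 * 10**9       # .dat/.idn files after import + reindex
jvm_heap_bytes = 3 * 1024**3  # the 3072 MB heap mentioned in the thread

bytes_per_triple = tdb_bytes / triples
print(f"~{bytes_per_triple:.0f} bytes of TDB index per triple")

# If the whole index set fits in the OS file-system cache alongside
# the JVM heap, reads should rarely touch persistent storage.
ram_needed_gb = (tdb_bytes + jvm_heap_bytes) / 1024**3
print(f"~{ram_needed_gb:.1f} GB RAM for fully cached operation")
```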

Best regards,

-i

On Sun, Mar 27, 2016 at 6:05 AM Andy Seaborne <[email protected]> wrote:

> On 21/03/16 01:28, Ignacio Tripodi wrote:
> > Hey Andy,
> >
> > Sorry about the duplicate post, I just removed the one on StackOverflow.
> >
> > This is using Lucene. At 1.6 GB currently, most of the content is a
> > collection of (biological) taxonomic entities plus a few OWL definitions
> > to lay out the ontology, and as you correctly guessed, imported as TDB.
> > All .dat and .idn files after importing and rebuilding the indices add
> > up to about 2.1 GB. Would it be fair to assume that with 2.1 GB of free
> > memory in this case, we would be in an optimal situation for caching?
>
> Yes - that is a good starting point.
>
> (Counts in triples would be useful.)
>
> >
> > All SPARQL queries for partial string matches will be limited to only the
> > first handful of (say, 5) results. Should I consider large result sets in
> > the hardware estimations, regardless? Does Jena still have to internally
> > bring up the entire result set before filtering the response?
>
> For a text query, it does have to fetch all the text index results.
>
> Lucene's IndexSearcher.search method returns a TopDocs containing all
> the results (after Lucene's own limiting).
>
>         Andy
>
> > Your theory about swapping for the scenario of slow first requests
> > makes sense. I'm not too concerned about it (at least until I see how
> > it behaves in production).
> >
> > Many thanks for the insights,
> >
> > -i
> >
> >
> > On Sun, Mar 20, 2016 at 3:38 PM Andy Seaborne <[email protected]> wrote:
> >
> >> On 20/03/16 17:16, Ignacio Tripodi wrote:
> >>> Hello,
> >>>
> >>> I was wondering if you had any minimum hardware suggestions for a
> >>> Jena/Fuseki Linux deployment, based on the number of triples used. Is
> >>> there a rough guideline for how much RAM should be available in
> >>> production, as a function of the size of the imported RDF file
> >>> (currently less than 2 GB), the number of concurrent requests, etc.?
> >>>
> >>> The main use for this will be wildcarded text searches using the
> >>> Lucene full-text index (basically, unfiltered queries using the
> >>> reverse index). No SPARQL Update needed. Other resource-intensive
> >>> operations would be refreshing the RDF data monthly, followed by
> >>> rebuilding the indices. The test deployment on my 2012 MacBook runs
> >>> queries on the order of tens of ms (unless it's been idle for a
> >>> while, in which case the first query is usually on the order of
> >>> hundreds of ms for some reason), so I imagine the hardware
> >>> requirements can't be that stringent. If it helps, I had to increase
> >>> my Java heap size to 3072 MB.
> >>>
> >>> Thanks for any feedback you could provide!
> >>>
> >>
> >> [[
> >> This has been asked on StackOverflow - please copy answers from one
> >> place to the other.
> >> ]]
> >>
> >> 2 GB is the size in bytes - what is it in triples?
> >>
> >> Is this Lucene or Solr?
> >>
> >> Is the RDF data held in TDB as the storage? If so, then part of the
> >> memory use is due to TDB using memory-mapped files - these live in the
> >> OS file-system cache, not in the Java heap. The amount of space they
> >> need flexes with use (the OS does the flexing automatically).
> >>
> >> For TDB:
> >>
> >> TDB write transactions use memory for intermediate space.  Read requests
> >> do not normally take space over and above the database caching.
> >>
> >> If the data has many large literals, then more heap may be needed;
> >> otherwise the space is due to Lucene itself. The jena-text subsystem
> >> materializes results, so very large result sets may also be a factor.
> >>
> >> The slow first query after the server has been idle is possibly
> >> because either the machine is swapping and the in-RAM cached data got
> >> swapped out, or the file-system cache has displaced the data and it
> >> has to go back to persistent storage. If you were doing other things
> >> on the machine, the latter is more likely.
> >>
> >>          Andy
> >>
> >
>
>

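(An illustrative aside on the memory-mapped-file point above: a minimal
Python sketch, with the stdlib mmap module standing in for TDB's Java NIO
mappings. The file here is a throwaway temp file, not a real TDB index.)

```python
import mmap
import os
import tempfile

# Stand-in for a TDB .dat index file (the real files are the ones
# that added up to ~2.1 GB in this thread).
fd, path = tempfile.mkstemp()
os.write(fd, b"\x00" * mmap.PAGESIZE)
os.close(fd)

with open(path, "r+b") as f:
    # Map the file into the process address space, as TDB does from
    # Java via NIO MappedByteBuffers. Pages are faulted in lazily and
    # live in the OS page cache, not the managed (JVM) heap - which is
    # why TDB's cache shows up as OS file cache, and why the OS can
    # "flex" it automatically under memory pressure.
    mm = mmap.mmap(f.fileno(), 0)
    first = mm[0]      # touching a byte pulls its page into the cache
    mm[0] = 0x42       # writes go back through the same page cache
    mm.flush()
    mm.close()

with open(path, "rb") as f:
    written = f.read(1)

os.remove(path)
print(first, written)
```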