Excellent! I'll keep an eye out for the next release. In the meantime, this
provides enough background information.
Thanks, Andy

-i


On Wed, Mar 30, 2016 at 1:54 PM Andy Seaborne <[email protected]> wrote:

> On 29/03/16 18:38, Ignacio Tripodi wrote:
> > Thanks, Andy. This is a collection of roughly 4.7M triples. How would the
> > triple count affect the usage estimation, in relation to the total size?
> >
>
> From the RDF storage point of view, 4.7 million triples isn't big (it
> fits in memory [*] or gets so cached in TDB that it's effectively
> in-memory, much of which is not in-heap).  Together with the Lucene
> side, it looks fine for a current server-class machine (which seems to
> be in the 16G-to-32G range these days - this email will be out of date
> soon! [**])
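As a rough back-of-the-envelope check of the figures in this thread (~2.1Gb of .dat/.idn files for ~4.7 million triples), the on-disk cost works out to a few hundred bytes per triple. These are approximations taken from the thread, not Jena constants:

```shell
# Rough sketch: on-disk bytes per triple, using the approximate figures
# quoted in this thread (~2.1Gb of TDB/Lucene files, ~4.7M triples).
db_bytes=$(( 21 * 1073741824 / 10 ))   # ~2.1 GiB
triples=4700000
echo "approx bytes per triple on disk: $(( db_bytes / triples ))"
# → approx bytes per triple on disk: 479
```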
>
> An SSD is good.
>
> And generally, portables of any kind have slower I/O paths than servers.
>
>
>         Andy
>
> [*] and the amount of heap needed for parsed files decreases with the
> next release, due to some node caching.
>
>
> [**]
> Don't set a Java heap between 32G and 48G!
>
> https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/
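The reason, per the linked post, is that heaps up to roughly 32G can use 4-byte compressed object references, while larger heaps fall back to 8-byte references. A simplified arithmetic sketch of why a 35G heap can hold fewer reference slots than a 31G one (an illustration of the cliff, not a precise JVM model):

```shell
# Simplified illustration of the compressed-oops cliff: up to ~32G heaps
# the JVM can use 4-byte (compressed) object references; above that,
# references cost 8 bytes each.
GiB=1073741824
refs_31g=$(( 31 * GiB / 4 ))   # 31G heap, 4-byte references
refs_35g=$(( 35 * GiB / 8 ))   # 35G heap, 8-byte references
echo "reference slots in a 31G heap: $refs_31g"
echo "reference slots in a 35G heap: $refs_35g"
```

So a 35G heap is effectively smaller than a 31G one for reference-heavy data; go below ~32G or well above ~48G.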
>
> > Best regards,
> >
> > -i
> >
> > On Sun, Mar 27, 2016 at 6:05 AM Andy Seaborne <[email protected]> wrote:
> >
> >> On 21/03/16 01:28, Ignacio Tripodi wrote:
> >>> Hey Andy,
> >>>
> >>> Sorry about the duplicate post, I just removed the one on
> >>> StackOverflow.
> >>>
> >>> This is using Lucene. Currently at 1.6Gb, most of the content is a
> >>> collection of (biological) taxonomic entities plus a few owl
> >>> definitions to lay out the ontology, and as you correctly guessed,
> >>> imported as TDB. All .dat and .idn files after importing and
> >>> rebuilding the indices add up to about 2.1Gb. Would it be fair to
> >>> assume that with 2.1Gb of free memory in this case, we would be in
> >>> an optimal situation for caching?
> >>
> >> Yes - that is a good starting point.
> >>
> >> (Counts in triples would be useful.)
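A rough sizing sketch along these lines, using the figures from this thread (the 3072Mb Java heap the original poster configured, and ~2.1Gb of TDB/Lucene files to keep in the OS file cache); the 2 GB operating-system allowance is an assumption, not a Jena recommendation:

```shell
# Back-of-the-envelope RAM estimate (all figures approximate).
heap_gb=3        # Java heap the original poster configured (3072Mb)
file_cache_gb=3  # ~2.1Gb of .dat/.idn files, rounded up for headroom
os_gb=2          # assumed allowance for the OS and other processes
echo "ballpark RAM: $(( heap_gb + file_cache_gb + os_gb ))G"
# → ballpark RAM: 8G
```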
> >>
> >>>
> >>> All SPARQL queries for partial string matches will be limited to
> >>> only the first handful of (say, 5) results. Should I consider large
> >>> result sets in the hardware estimations regardless? Does Jena still
> >>> have to internally bring up the entire result set before filtering
> >>> the response?
> >>
> >> For a text query, it does have to get all the text index results.
> >>
> >> Lucene's IndexSearcher.search method returns TopDocs, which holds
> >> all the results (after Lucene's own limiting).
> >>
> >>          Andy
> >>
> >>> Your theory about swapping for the scenario of slow first requests
> >>> makes sense. I'm not too concerned about it (at least until I see
> >>> how it behaves in production).
> >>>
> >>> Many thanks for the insights,
> >>>
> >>> -i
> >>>
> >>>
> >>> On Sun, Mar 20, 2016 at 3:38 PM Andy Seaborne <[email protected]> wrote:
> >>>
> >>>> On 20/03/16 17:16, Ignacio Tripodi wrote:
> >>>>> Hello,
> >>>>>
> >>>>> I was wondering if you had any minimum hardware suggestions for a
> >>>>> Jena/Fuseki Linux deployment, based on the number of triples used.
> >>>>> Is there a rough guideline for how much RAM should be available in
> >>>>> production, as a function of the size of the imported RDF file
> >>>>> (currently less than 2Gb), number of concurrent requests, etc?
> >>>>>
> >>>>> The main use for this will be for wildcarded text searches using
> >>>>> the Lucene full-text index (basically, unfiltered queries using
> >>>>> the reverse index). No SPARQL Update needed. Other
> >>>>> resource-intensive operations would be refreshing the RDF data
> >>>>> monthly, followed by rebuilding indices. The test deployment on my
> >>>>> 2012 MacBook runs queries in the order of tens of ms (unless it's
> >>>>> been idle for a while, then the first query is usually in the
> >>>>> order of hundreds of ms for some reason), so I imagine the
> >>>>> hardware requirements can't be that stringent. If it helps, I had
> >>>>> to increase my Java heap size to 3072Mb.
> >>>>>
> >>>>> Thanks for any feedback you could provide!
> >>>>>
> >>>>
> >>>> [[
> >>>> This has been asked on StackOverflow - please copy answers from one
> >>>> place to the other.
> >>>> ]]
> >>>>
> >>>> 2G in bytes - what is it in triples?
> >>>>
> >>>> Is this Lucene or Solr?
> >>>>
> >>>> Is the RDF data held in TDB as the storage? If so, then part of
> >>>> the memory use is due to TDB using memory-mapped files - these
> >>>> live in the OS file system cache, not in the Java heap.  The
> >>>> amount of space needed flexes with use (the OS does the flexing
> >>>> automatically).
> >>>>
> >>>> For TDB:
> >>>>
> >>>> TDB write transactions use memory for intermediate space.  Read
> >>>> requests do not normally take space over and above the database
> >>>> caching.
> >>>>
> >>>> If the data has many large literals, then more heap may be needed;
> >>>> otherwise the space is due to Lucene itself.  The jena text
> >>>> subsystem materializes results, so very large result sets may also
> >>>> be a factor.
> >>>>
> >>>> The fact that being idle means the next query is slow is possibly
> >>>> because either the machine is swapping and the in-RAM cached data
> >>>> got swapped out, or the file system cache has displaced data and
> >>>> so it has to go to persistent storage.  If you were doing other
> >>>> things on the machine, it is more likely the latter.
> >>>>
> >>>>           Andy
> >>>>
> >>>
> >>
> >>
> >
>
>
