Excellent! I'll keep an eye out for the next release. In the meantime, this provides enough background information. Thanks, Andy
-i

On Wed, Mar 30, 2016 at 1:54 PM Andy Seaborne <[email protected]> wrote:

> On 29/03/16 18:38, Ignacio Tripodi wrote:
> > Thanks, Andy. This is a collection of roughly 4.7M triples. How would
> > the triple count affect the usage estimation, in relation to the total
> > size?
>
> From the RDF storage point of view, 4.7 million triples isn't big (it
> fits in memory [*], or gets so cached in TDB that it's effectively
> in-memory, much of which is not in-heap). Together with the Lucene side,
> it looks fine for a current server-class machine (that seems to be the
> 16G-to-32G range these days - this email will be out of date soon! [**])
>
> An SSD is good.
>
> And generally, portables of any kind have slower I/O paths than servers.
>
>     Andy
>
> [*] The amount of heap needed for parsed files decreases with the next
> release, due to some node caching.
>
> [**] Don't set a Java heap between 32G and 48G!
> https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/
>
> > Best regards,
> >
> > -i
> >
> > On Sun, Mar 27, 2016 at 6:05 AM Andy Seaborne <[email protected]> wrote:
> >
> >> On 21/03/16 01:28, Ignacio Tripodi wrote:
> >>> Hey Andy,
> >>>
> >>> Sorry about the duplicate post, I just removed the one on
> >>> StackOverflow.
> >>>
> >>> This is using Lucene. At currently 1.6Gb, most of the content is a
> >>> collection of (biological) taxonomic entities plus a few OWL
> >>> definitions to lay out the ontology, and as you correctly guessed,
> >>> imported as TDB. All .dat and .idn files after importing and
> >>> rebuilding the indices add up to about 2.1Gb. Would the assumption be
> >>> that if we have as much as 2.1Gb of free memory in this case, we
> >>> would be in an optimal situation for caching?
> >>
> >> Yes - that is a good starting point.
> >>
> >> (Counts in triples would be useful.)
> >>
> >>> All SPARQL queries for partial string matches will be limited to only
> >>> the first handful of (say, 5) results.
> >>> Should I consider large result sets in the hardware estimations,
> >>> regardless? Does Jena still have to internally bring up the entire
> >>> result set before filtering the response?
> >>
> >> For a text query, yes, it does have to get all the text index results.
> >>
> >> Lucene's IndexSearcher.search method returns TopDocs, which is all
> >> the results (after Lucene limiting).
> >>
> >>     Andy
> >>
> >>> Your theory about swapping for the scenario of slow first requests
> >>> makes sense. I'm not too concerned about it (at least until I see how
> >>> it behaves in production).
> >>>
> >>> Many thanks for the insights,
> >>>
> >>> -i
> >>>
> >>> On Sun, Mar 20, 2016 at 3:38 PM Andy Seaborne <[email protected]> wrote:
> >>>
> >>>> On 20/03/16 17:16, Ignacio Tripodi wrote:
> >>>>> Hello,
> >>>>>
> >>>>> I was wondering if you had any minimum hardware suggestions for a
> >>>>> Jena/Fuseki Linux deployment, based on the number of triples used.
> >>>>> Is there a rough guideline for how much RAM should be available in
> >>>>> production, as a function of the size of the imported RDF file
> >>>>> (currently less than 2Gb), number of concurrent requests, etc?
> >>>>>
> >>>>> The main use for this will be wildcarded text searches using the
> >>>>> Lucene full-text index (basically, unfiltered queries using the
> >>>>> reverse index). No SPARQL Update needed. Other resource-intensive
> >>>>> operations would be refreshing the RDF data monthly, followed by
> >>>>> rebuilding indices. The test deployment on my 2012 MacBook runs
> >>>>> queries in the order of tens of ms (unless it's been idle for a
> >>>>> while; then the first query is usually in the order of hundreds of
> >>>>> ms for some reason), so I imagine the hardware requirements can't
> >>>>> be that stringent. If it helps, I had to increase my Java heap size
> >>>>> to 3072Mb.
> >>>>>
> >>>>> Thanks for any feedback you could provide!
> >>>>
> >>>> [[
> >>>> This has been asked on StackOverflow - please copy answers from one
> >>>> place to the other.
> >>>> ]]
> >>>>
> >>>> 2G in bytes - what is it in triples?
> >>>>
> >>>> Is this Lucene or Solr?
> >>>>
> >>>> Is the RDF data held in TDB as the storage? If so, then part of the
> >>>> memory use is due to TDB using memory-mapped files - these exist in
> >>>> the OS file system cache, not in the Java heap. The amount of space
> >>>> needed flexes with use (the OS does the flexing automatically).
> >>>>
> >>>> For TDB:
> >>>>
> >>>> TDB write transactions use memory for intermediate space. Read
> >>>> requests do not normally take space over and above the database
> >>>> caching.
> >>>>
> >>>> If the data has many large literals, then more heap may be needed;
> >>>> otherwise the space is due to Lucene itself. The jena-text subsystem
> >>>> materializes results, so very large result sets may also be a
> >>>> factor.
> >>>>
> >>>> The fact that the next query after being idle is slow is possibly
> >>>> because either the machine is swapping and the in-RAM cached data
> >>>> got swapped out, or the file system cache has displaced data, so it
> >>>> has to go back to persistent storage. If you were doing other things
> >>>> on the machine, the latter is more likely.
> >>>>
> >>>> Andy
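A sketch of the kind of limited text query discussed in the thread (the endpoint name `/dataset`, the indexed property `rdfs:label`, and the search term are assumptions, not from the thread):

```shell
# Query a local Fuseki endpoint for the first 5 wildcard text matches.
# The "5" inside text:query is jena-text's Lucene-side result limit;
# LIMIT 5 additionally caps the SPARQL result rows.
curl -G 'http://localhost:3030/dataset/query' \
  --data-urlencode 'query=
    PREFIX text: <http://jena.apache.org/text#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?s ?label WHERE {
      (?s) text:query (rdfs:label "term*" 5) .
      ?s rdfs:label ?label .
    } LIMIT 5'
```

Without the Lucene-side limit, jena-text still materializes the full index hit list before SPARQL's LIMIT applies, which is the behaviour Andy describes above.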

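A quick way to inspect the memory picture described in the thread, assuming a Linux host as in the original question (a sketch, not part of the original exchange):

```shell
# TDB's memory-mapped files live in the OS page cache ("buff/cache"
# column), not in the Java heap, so leave RAM free beyond -Xmx for them.
free -h
# If the first query after an idle period is slow, look for swap
# activity: non-zero "si"/"so" columns mean pages are moving to/from
# swap, matching Andy's swapped-out-cache explanation.
vmstat 1 3
```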