Thanks, Andy. This is a collection of roughly 4.7M triples. How does the triple count factor into the memory estimate, relative to the total size on disk?
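Putting the two figures from this thread together (~4.7M triples, ~2.1Gb of TDB files on disk), a quick back-of-the-envelope calculation gives a bytes-per-triple ratio. This is only a rough heuristic, not an official Jena sizing formula:

```python
# Rough sizing from the figures in this thread:
#   ~4.7M triples, ~2.1 GiB of TDB .dat/.idn files on disk.
# A back-of-the-envelope sketch, not a Jena formula.

triples = 4_700_000
tdb_bytes_on_disk = 2.1 * 2**30  # .dat + .idn files after loading

bytes_per_triple = tdb_bytes_on_disk / triples
print(f"~{bytes_per_triple:.0f} bytes/triple on disk")  # ~480

# If the whole database fits in the OS file-system cache,
# reads rarely touch persistent storage:
print(f"aim for ~{tdb_bytes_on_disk / 2**30:.1f} GiB of free RAM for caching")
```

The ~480 bytes/triple figure includes all of TDB's indexes, so it can also be used the other way round: to estimate on-disk (and cache) size from a projected triple count.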
Best regards,

-i

On Sun, Mar 27, 2016 at 6:05 AM Andy Seaborne <[email protected]> wrote:

> On 21/03/16 01:28, Ignacio Tripodi wrote:
> > Hey Andy,
> >
> > Sorry about the duplicate post, I just removed the one on StackOverflow.
> >
> > This is using Lucene. At currently 1.6Gb, most of the content is a
> > collection of (biological) taxonomic entities plus a few OWL definitions
> > to lay out the ontology, and as you correctly guessed, imported as TDB.
> > All .dat and .idn files after importing and rebuilding the indices add
> > up to about 2.1Gb. Would the assumption be that if we have as much free
> > memory as 2.1Gb in this case, we would be in an optimal situation for
> > caching?
>
> Yes - that is a good starting point.
>
> (Counts in triples would be useful.)
>
> > All SPARQL queries for partial string matches will be limited to only
> > the first handful of (say, 5) results. Should I consider large result
> > sets in the hardware estimations, regardless? Does Jena still have to
> > internally bring up the entire result set before filtering the response?
>
> For a text query, it does have to get all the text index results.
>
> Lucene's IndexSearcher.search method returns TopDocs, which is all
> the results (after Lucene's own limiting).
>
> Andy
>
> > Your theory about swapping for the scenario of slow first requests
> > makes sense. I'm not too concerned about it (at least until I see how
> > it behaves in production).
> >
> > Many thanks for the insights,
> >
> > -i
> >
> > On Sun, Mar 20, 2016 at 3:38 PM Andy Seaborne <[email protected]> wrote:
> >
> >> On 20/03/16 17:16, Ignacio Tripodi wrote:
> >>> Hello,
> >>>
> >>> I was wondering if you had any minimum hardware suggestions for a
> >>> Jena/Fuseki Linux deployment, based on the number of triples used.
> >>> Is there a rough guideline for how much RAM should be available in
> >>> production, as a function of the size of the imported RDF file
> >>> (currently less than 2Gb), number of concurrent requests, etc?
> >>>
> >>> The main use for this will be for wildcarded text searches using the
> >>> Lucene full-text index (basically, unfiltered queries using the
> >>> reverse index). No SPARQL Update needed. Other resource-intensive
> >>> operations would be refreshing the RDF data monthly, followed by
> >>> rebuilding indices. The test deployment on my 2012 MacBook runs
> >>> queries in the order of tens of ms (unless it's been idle for a
> >>> while, then the first query is usually in the order of hundreds of
> >>> ms for some reason), so I imagine the hardware requirements can't be
> >>> that stringent. If it helps, I had to increase my Java heap size to
> >>> 3072Mb.
> >>>
> >>> Thanks for any feedback you could provide!
> >>>
> >>
> >> [[
> >> This has been asked on StackOverflow - please copy answers from one
> >> place to the other.
> >> ]]
> >>
> >> 2G in bytes - what is it in triples?
> >>
> >> Is this Lucene or Solr?
> >>
> >> Is the RDF data held in TDB as the storage? If so, then part of the
> >> memory use is due to TDB using memory-mapped files - these exist in
> >> the OS file system cache, not in the Java heap. The amount of space it
> >> needs flexes with use (the OS does the flexing automatically).
> >>
> >> For TDB:
> >>
> >> TDB write transactions use memory for intermediate space. Read
> >> requests do not normally take space over and above the database
> >> caching.
> >>
> >> If the data has many large literals, then more heap may be needed;
> >> otherwise the space is due to Lucene itself. The jena-text subsystem
> >> materializes results, so very large result sets may also be a factor.
> >>
> >> The fact that being idle makes the next query slow is possibly due to
> >> either the machine swapping, so the in-RAM cached data got swapped
> >> out, or the file system cache having displaced data, so it has to go
> >> back to persistent storage. If you were doing other things on the
> >> machine, it is more likely the latter.
> >>
> >> Andy
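For reference, the triple count Andy asks for, and the kind of limited text query discussed above, look roughly like this. (The `text:query` form assumes a standard jena-text setup indexing `rdfs:label`, and the "tax*" pattern is only a placeholder; the indexed property and prefixes depend on your assembler configuration.)

```
# Count all triples in the dataset (the figure Andy asks for):
SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }
```

```
PREFIX text: <http://jena.apache.org/text#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Wildcarded text search, limited to the first 5 results.
# Note: LIMIT trims the SPARQL result set, but per the discussion
# above, the Lucene search still materializes its full hit list first.
SELECT ?s ?label WHERE {
  ?s text:query (rdfs:label "tax*") .
  ?s rdfs:label ?label .
} LIMIT 5
```

Because of that materialization, a very broad wildcard (e.g. a single-letter prefix) can cost far more than the `LIMIT 5` suggests, which is worth keeping in mind for the hardware estimate.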
