Hey Andy,

Sorry about the duplicate post; I just removed the one on StackOverflow.
This is using Lucene. The data is currently 1.6 GB; most of the content is a collection of (biological) taxonomic entities plus a few OWL definitions to lay out the ontology, and as you correctly guessed, it is imported into TDB. After importing and rebuilding the indices, all the .dat and .idn files add up to about 2.1 GB. Would it be correct to assume that having at least that much free memory (2.1 GB in this case) puts us in an optimal situation for caching?

All SPARQL queries for partial string matches will be limited to only the first handful of (say, 5) results. Should I consider large result sets in the hardware estimates regardless? Does Jena still have to materialize the entire result set internally before filtering the response?

Your theory about swapping explaining the slow first requests makes sense. I'm not too concerned about it (at least until I see how it behaves in production).

Many thanks for the insights,
-i

On Sun, Mar 20, 2016 at 3:38 PM Andy Seaborne <[email protected]> wrote:

> On 20/03/16 17:16, Ignacio Tripodi wrote:
> > Hello,
> >
> > I was wondering if you had any minimum hardware suggestions for a
> > Jena/Fuseki Linux deployment, based on the number of triples used. Is
> > there a rough guideline for how much RAM should be available in
> > production, as a function of the size of the imported RDF file
> > (currently less than 2 GB), number of concurrent requests, etc.?
> >
> > The main use for this will be for wildcarded text searches using the
> > Lucene full-text index (basically, unfiltered queries using the
> > reverse index). No SPARQL Update needed. Other resource-intensive
> > operations would be refreshing the RDF data monthly, followed by
> > rebuilding indices. The test deployment on my 2012 MacBook runs
> > queries in the order of tens of ms (unless it's been idle for a
> > while, then the first query is usually in the order of hundreds of
> > ms for some reason), so I imagine the hardware requirements can't be
> > that stringent.
> > If it helps, I had to increase my Java heap size to 3072 MB.
> >
> > Thanks for any feedback you could provide!
>
> [[
> This has been asked on StackOverflow - please copy answers from one
> place to the other.
> ]]
>
> 2 GB in bytes - what is it in triples?
>
> Is this Lucene or Solr?
>
> Is the RDF data held in TDB as the storage? If so, then part of the
> memory use is due to TDB using memory-mapped files - these exist in
> the OS file system cache, not in the Java heap. The amount of space it
> needs flexes with use (the OS does the flexing automatically).
>
> For TDB:
>
> TDB write transactions use memory for intermediate space. Read
> requests do not normally take space over and above the database
> caching.
>
> If the data has many large literals, then more heap may be needed;
> otherwise the space is due to Lucene itself. The Jena text subsystem
> materializes results, so very large result sets may also be a factor.
>
> The fact that being idle means the next query is slow is possibly due
> to either the machine swapping (the in-RAM cached data got swapped
> out), or the file system cache having displaced data, so the query has
> to go to persistent storage. If you were doing other things on the
> machine, it is more likely the latter.
>
> Andy
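[For readers following this thread: a minimal sketch of the kind of limited wildcard text query discussed above, using the jena-text `text:query` property function. The prefix, the choice of `rdfs:label` as the indexed property, and the `"taxon*"` pattern are illustrative assumptions, not taken from the actual deployment.]

```sparql
PREFIX text: <http://jena.apache.org/text#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Return only the first handful of partial-string matches.
# LIMIT caps the response size, but the Lucene index may still
# score more candidate matches internally before the cutoff.
SELECT ?entity ?label
WHERE {
  ?entity text:query (rdfs:label "taxon*") ;
          rdfs:label ?label .
}
LIMIT 5
```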
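[On the 3072 MB heap mentioned above: one way this is commonly passed to the standalone server is via the `JVM_ARGS` environment variable read by the `fuseki-server` wrapper script. The dataset location and service name below are placeholders, and this assumes the script-based (non-WAR) deployment.]

```shell
# Hypothetical invocation: -Xmx caps the Java heap at 3072 MB.
# Note that TDB's memory-mapped files live outside this limit,
# in the OS file system cache.
JVM_ARGS="-Xmx3072M" ./fuseki-server --loc=/path/to/tdb /ds
```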
