Andy,

Thank you for the reply. I suspected that Jena/TDB might be targeted at somewhat different use cases. Is there a document somewhere that characterizes the assumptions about how Jena/TDB are expected to be used?

We'll explore our use case and let you know what we find.

Thank you again,
Chris

> On Aug 20, 2017, at 11:00, Andy Seaborne <[email protected]> wrote:
>
> I don't have any experience running anything like this. I was hoping to
> learn from other people's experiences.
>
> From a base-technology point of view, this isn't TDB's design centre so there
> may be hot-spots. The only real way to know if it is acceptable is to try an
> experiment. It will depend on what you want to do with the store.
>
> With 230K blobs of 17Kbytes, doing SPARQL-searching of text (regex(),
> contains()) will be expensive. So if that is a requirement, a text index is
> probably necessary whether you store the page content in RDF or not.
>
> One area will be the TDB node cache, the cache of internal TDB NodeId->
> RDFterm (Node). This is count-based, and does not consider the size of item
> cached. The cache is going to keep pages cached so it's going to use heap RAM
> especially as characters are 2 bytes. There again, it's only 10G or so.
>
> See the documentation for tuning caches:
> https://jena.apache.org/documentation/tdb/store-parameters.html
>
> Andy
>
>> On 19/08/17 15:20, Chris Tomlinson wrote:
>> Hi again,
>> Is anyone aware of any issues that may arise when storing triples in TDB
>> that have very large string literals (~17KB)?
>> The use case is illustrated below. This seems a reasonable question under
>> the assumption that literals are presumed to be small - like names, titles,
>> maybe summaries or abstracts and such, rather than entire pages of text.
>> Thanks,
>> Chris
>>> On Aug 17, 2017, at 12:48 PM, Chris Tomlinson <[email protected]>
>>> wrote:
>>>
>>> Hello,
>>>
>>> We have 23K texts averaging 10 pp/text (total pages: 229K) and ~17KB/page,
>>> for a total of 4GB of text. These texts are currently indexed via Lucene in
>>> an XMLdb and we’re wanting to know if there are any known issues regarding
>>> large literals in Jena.
>>>
>>> In other words we are considering storing the texts like:
>>>
>>> :Text_08357 a :EText ;
>>>     various metadata about the EText
>>>     :hasPage
>>>         [ :pageNum 1 ;
>>>           :content ". . . 17,000 Bytes . . ." ] ,
>>>         [ :pageNum 2 ;
>>>           :content ". . . 17,000 Bytes . . ." ] ,
>>>     . . .
>>>
>>> We know that Lucene is happy with this data, but we’re not sure whether
>>> Jena/TDB will be stressed with 229K triples with 17KB literals.
>>>
>>> Jena-text offers the possibility of indexing in Lucene via a separate
>>> process and just using the search in Jena without actually storing the
>>> literals in TDB. This is a somewhat complex configuration and it would be
>>> preferred to not use this approach unless the size of the literals will
>>> present a problem.
>>>
>>> Thank you,
>>> Chris
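
As a rough sanity check on the sizes discussed in this thread, the arithmetic can be sketched in a few lines of Python. This is only a back-of-envelope estimate, not a measurement of TDB. The page count and page size are the figures quoted above; the assumptions that one stored byte corresponds to one character, that every page literal ends up in the node cache at once, and that the JVM holds each character in 2 bytes are mine, and the function names are hypothetical.

```python
# Back-of-envelope sizing for the page literals described in the thread.
# Figures from the thread: 229K pages, ~17KB per page.
PAGES = 229_000
BYTES_PER_PAGE = 17_000

# Assumption: the JVM holds each character in 2 bytes, as Andy notes.
JAVA_BYTES_PER_CHAR = 2


def text_bytes(pages=PAGES, bytes_per_page=BYTES_PER_PAGE):
    """Total size of the page text as stored on disk."""
    return pages * bytes_per_page


def heap_bytes_if_all_cached(pages=PAGES,
                             chars_per_page=BYTES_PER_PAGE,
                             bytes_per_char=JAVA_BYTES_PER_CHAR):
    """Heap needed if every page literal sat in the node cache at once.

    Assumption: one stored byte is one character, which only holds for
    mostly single-byte text; object overhead is ignored.
    """
    return pages * chars_per_page * bytes_per_char


def gb(n_bytes):
    """Convert bytes to decimal gigabytes."""
    return n_bytes / 1e9


if __name__ == "__main__":
    print(f"text on disk:       {gb(text_bytes()):.1f} GB")
    print(f"heap if all cached: {gb(heap_bytes_if_all_cached()):.1f} GB")
```

Under these assumptions the text comes to a little under 4GB and the worst-case cached footprint to a little under 8GB, which is in the same range as the "4GB of text" and "10G or so" figures mentioned above.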
