I don't have any experience running anything like this. I was hoping
to learn from other people's experiences.
From a base-technology point of view, this isn't TDB's design centre, so
there may be hot-spots. The only real way to know if it is acceptable is
to try an experiment. It will depend on what you want to do with the store.
With 230K blobs of 17Kbytes, doing SPARQL-searching of text (regex(),
contains()) will be expensive. So if that is a requirement, a text index
is probably necessary whether you store the page content in RDF or not.
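To illustrate why the scan-style filters get expensive: a regex() or
contains() filter has to test every literal, while a text index looks a
term up directly. The sketch below is toy Python, not Jena code; the page
texts and the function names are made up for illustration.

```python
import re
from collections import defaultdict

# Toy stand-ins for page literals (hypothetical data, not the real corpus).
pages = {
    1: "the quick brown fox",
    2: "jumps over the lazy dog",
    3: "a fox in the snow",
}

def scan_search(pages, pattern):
    """Roughly what a regex() FILTER amounts to: test every literal in turn."""
    rx = re.compile(pattern)
    return sorted(pid for pid, text in pages.items() if rx.search(text))

def build_index(pages):
    """A very small inverted index: term -> set of page ids."""
    index = defaultdict(set)
    for pid, text in pages.items():
        for term in text.lower().split():
            index[term].add(pid)
    return index

def index_search(index, term):
    """Look the term up directly, without reading every page."""
    return sorted(index.get(term.lower(), set()))

index = build_index(pages)
print(scan_search(pages, r"\bfox\b"))
print(index_search(index, "fox"))
```

Both calls find the same pages, but the scan touches all of the text on
every query, which is the part that hurts with 230K literals of 17Kbytes.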
One area will be the TDB node cache, the cache of internal TDB NodeId ->
RDF term (Node). This is count-based and does not consider the size of
the items cached. The cache is going to keep pages cached, so it's going
to use heap RAM, especially as characters are 2 bytes. Then again, it's
only 10G or so.
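A rough back-of-envelope for that figure, as a Python sketch. It assumes
about one character per byte of the 17KB source text, 2 bytes per
character on the heap, and ignores per-object overhead. The cache entry
count used here is a hypothetical value, not a checked default.

```python
def worst_case_cache_bytes(cache_entries, num_literals, chars_per_literal,
                           bytes_per_char=2):
    """Upper bound on heap used by cached page literals.

    A count-based cache holds up to `cache_entries` items whatever their
    size, so at worst every cached item is a full page of text.
    """
    cached = min(cache_entries, num_literals)
    return cached * chars_per_literal * bytes_per_char

pages = 229_000            # page count from the original question
chars_per_page = 17_000    # assumption: ~1 character per byte of 17KB
cache_entries = 500_000    # hypothetical cache size, larger than the page count

estimate = worst_case_cache_bytes(cache_entries, pages, chars_per_page)
print(f"{estimate / 1e9:.1f} GB")
```

With a cache larger than the number of pages, every page can end up held
in the heap; a smaller entry count caps the total proportionally.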
See the documentation for tuning caches:
https://jena.apache.org/documentation/tdb/store-parameters.html
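As a sketch of what such tuning might look like: the page above describes
a JSON parameter file placed in the database directory. The Python below
only generates such a file's text. The two key names are written from
memory of that page and the sizes are hypothetical, so check both against
the documentation before using them.

```python
import json

def make_tdb_cfg(nodeid2node_cache_size, node2nodeid_cache_size):
    """Return JSON text for a store-parameters file.

    Key names are assumed from the store-parameters documentation;
    verify them there. The values are entry counts, not bytes.
    """
    params = {
        "tdb.node2nodeid_cache_size": node2nodeid_cache_size,
        "tdb.nodeid2node_cache_size": nodeid2node_cache_size,
    }
    return json.dumps(params, indent=2)

# Hypothetical sizes, chosen smaller than the page count so that the
# cache cannot hold every 17KB page at once.
cfg_text = make_tdb_cfg(nodeid2node_cache_size=50_000,
                        node2nodeid_cache_size=20_000)
print(cfg_text)
```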
Andy
On 19/08/17 15:20, Chris Tomlinson wrote:
Hi again,
Is anyone aware of any issues that may arise when storing triples in TDB that
have very large string literals (~17KB)?
The use case is illustrated below. This seems a reasonable question under the
assumption that literals are presumed to be small - like names, titles, maybe
summaries or abstracts and such, rather than entire pages of text.
Thanks,
Chris
On Aug 17, 2017, at 12:48 PM, Chris Tomlinson wrote:
Hello,
We have 23K texts averaging 10 pp/text (total pages: 229K) and ~17KB/page, for
a total of 4GB of text. These texts are currently indexed via Lucene in an
XMLdb and we’re wanting to know if there are any known issues regarding large
literals in Jena.
In other words we are considering storing the texts like:
    :Text_08357 a :EText ;
        various metadata about the EText
        :hasPage
            [ :pageNum 1 ;
              :content “. . . 17,000 Bytes . . .” ] ,
            [ :pageNum 2 ;
              :content “. . . 17,000 Bytes . . .” ] ,
            . . .
We know that Lucene is happy with this data, but we’re not sure whether
Jena/TDB will be stressed with 229K triples with 17KB literals.
Jena-text offers the possibility of indexing in Lucene via a separate
process and just using the search in Jena without actually storing the literals
in TDB. This is a somewhat complex configuration, and we would prefer not to
use this approach unless the size of the literals will present a problem.
Thank you,
Chris