Re: Performance with very long strings - Re: large literals best practice?
Andy,

Thank you for the reply. I suspected that Jena/TDB might be targeted at somewhat different use cases. Is there a document somewhere that characterizes the sort of assumptions about how Jena/TDB are expected to be used?

We'll explore our use case and let you know what we find.

Thank you again,
Chris

> On Aug 20, 2017, at 11:00, Andy Seaborne wrote:
>
> I don't have any experience running anything like this. I was hoping to
> learn from other people's experiences.
>
> From a base-technology point of view, this isn't TDB's design centre, so there
> may be hot-spots. The only real way to know if it is acceptable is to try an
> experiment. It will depend on what you want to do with the store.
>
> With 230K blobs of 17Kbytes, doing SPARQL-searching of text (regex(),
> contains()) will be expensive. So if that is a requirement, a text index is
> probably necessary whether you store the page content in RDF or not.
>
> One area to watch will be the TDB node cache, the cache of internal TDB
> NodeId -> RDF term (Node). This is count-based, and does not consider the
> size of the items cached. The cache is going to keep pages cached, so it's
> going to use heap RAM, especially as characters are 2 bytes. There again,
> it's only 10G or so.
>
> See the documentation for tuning caches:
> https://jena.apache.org/documentation/tdb/store-parameters.html
>
> Andy
>
>> On 19/08/17 15:20, Chris Tomlinson wrote:
>> Hi again,
>>
>> Is anyone aware of any issues that may arise when storing triples in TDB
>> that have very large string literals (~17KB)? The use case is illustrated
>> below. This seems a reasonable question under the assumption that literals
>> are presumed to be small - like names, titles, maybe summaries or abstracts
>> and such, rather than entire pages of text.
>>
>> Thanks,
>> Chris
>>
>>> On Aug 17, 2017, at 12:48 PM, Chris Tomlinson wrote:
>>>
>>> Hello,
>>>
>>> We have 23K texts averaging 10 pp/text (total pages: 229K) and ~17KB/page,
>>> for a total of 4GB of text. These texts are currently indexed via Lucene in
>>> an XMLdb, and we're wanting to know if there are any known issues regarding
>>> large literals in Jena.
>>>
>>> In other words, we are considering storing the texts like:
>>>
>>> :Text_08357 a :EText ;
>>>     # various metadata about the EText
>>>     :hasPage
>>>         [ :pageNum 1 ;
>>>           :content ". . . 17,000 bytes . . ." ] ,
>>>         [ :pageNum 2 ;
>>>           :content ". . . 17,000 bytes . . ." ] ,
>>>         . . .
>>>
>>> We know that Lucene is happy with this data, but we're not sure whether
>>> Jena/TDB will be stressed with 229K triples with 17KB literals.
>>>
>>> The jena-text module offers the possibility of indexing in Lucene via a
>>> separate process and just using the search in Jena without actually storing
>>> the literals in TDB. This is a somewhat complex configuration, and we would
>>> prefer not to use this approach unless the size of the literals will
>>> present a problem.
>>>
>>> Thank you,
>>> Chris
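The heap-size figure quoted in Andy's reply can be checked with quick arithmetic. This is a rough sketch: the 2-bytes-per-char figure assumes Java's UTF-16 in-memory strings and ignores per-object and cache-structure overhead, so the real footprint would be somewhat higher.

```python
# Rough heap estimate for caching all page literals as Java strings.
pages = 229_000          # total pages (23K texts x ~10 pages each)
chars_per_page = 17_000  # ~17KB of text per page
bytes_per_char = 2       # Java strings are UTF-16 in memory

total_bytes = pages * chars_per_page * bytes_per_char
print(f"{total_bytes / 10**9:.1f} GB")  # -> 7.8 GB
```

This lands in the same ballpark as the "only 10G or so" estimate above, once overhead is added.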
Re: Performance with very long strings - Re: large literals best practice?
I don't have any experience running anything like this. I was hoping to learn from other people's experiences.

From a base-technology point of view, this isn't TDB's design centre, so there may be hot-spots. The only real way to know if it is acceptable is to try an experiment. It will depend on what you want to do with the store.

With 230K blobs of 17Kbytes, doing SPARQL-searching of text (regex(), contains()) will be expensive. So if that is a requirement, a text index is probably necessary whether you store the page content in RDF or not.

One area to watch will be the TDB node cache, the cache of internal TDB NodeId -> RDF term (Node). This is count-based, and does not consider the size of the items cached. The cache is going to keep pages cached, so it's going to use heap RAM, especially as characters are 2 bytes. There again, it's only 10G or so.

See the documentation for tuning caches:
https://jena.apache.org/documentation/tdb/store-parameters.html

Andy

On 19/08/17 15:20, Chris Tomlinson wrote:
> Hi again,
>
> Is anyone aware of any issues that may arise when storing triples in TDB
> that have very large string literals (~17KB)? The use case is illustrated
> below. This seems a reasonable question under the assumption that literals
> are presumed to be small - like names, titles, maybe summaries or abstracts
> and such, rather than entire pages of text.
>
> Thanks,
> Chris
>
>> On Aug 17, 2017, at 12:48 PM, Chris Tomlinson wrote:
>>
>> Hello,
>>
>> We have 23K texts averaging 10 pp/text (total pages: 229K) and ~17KB/page,
>> for a total of 4GB of text. These texts are currently indexed via Lucene in
>> an XMLdb, and we're wanting to know if there are any known issues regarding
>> large literals in Jena.
>>
>> In other words, we are considering storing the texts like:
>>
>> :Text_08357 a :EText ;
>>     # various metadata about the EText
>>     :hasPage
>>         [ :pageNum 1 ;
>>           :content ". . . 17,000 bytes . . ." ] ,
>>         [ :pageNum 2 ;
>>           :content ". . . 17,000 bytes . . ." ] ,
>>         . . .
>>
>> We know that Lucene is happy with this data, but we're not sure whether
>> Jena/TDB will be stressed with 229K triples with 17KB literals.
>>
>> The jena-text module offers the possibility of indexing in Lucene via a
>> separate process and just using the search in Jena without actually storing
>> the literals in TDB. This is a somewhat complex configuration, and we would
>> prefer not to use this approach unless the size of the literals will
>> present a problem.
>>
>> Thank you,
>> Chris
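The cost difference Andy describes can be illustrated with two query sketches against the example data. The first forces TDB to fetch and scan every 17KB literal; the second delegates matching to a jena-text Lucene index, so only the hits are touched. The `:content` predicate and the `http://example.org/` namespace are taken from the example data and are placeholders, and the second query assumes a text index has already been configured.

```sparql
PREFIX :     <http://example.org/>
PREFIX text: <http://jena.apache.org/text#>

# Full scan: every stored page literal is materialized and regex-matched.
SELECT ?page WHERE {
  ?page :content ?pageText .
  FILTER regex(?pageText, "some phrase")
}

# With jena-text: Lucene selects the matching pages first.
SELECT ?page WHERE {
  ?page text:query (:content "some phrase") .
}
```

With 229K pages, the first form scales with the total text size on every query, while the second scales with the number of matches.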
Performance with very long strings - Re: large literals best practice?
Hi again,

Is anyone aware of any issues that may arise when storing triples in TDB that have very large string literals (~17KB)? The use case is illustrated below. This seems a reasonable question under the assumption that literals are presumed to be small - like names, titles, maybe summaries or abstracts and such, rather than entire pages of text.

Thanks,
Chris

> On Aug 17, 2017, at 12:48 PM, Chris Tomlinson wrote:
>
> Hello,
>
> We have 23K texts averaging 10 pp/text (total pages: 229K) and ~17KB/page,
> for a total of 4GB of text. These texts are currently indexed via Lucene in
> an XMLdb, and we're wanting to know if there are any known issues regarding
> large literals in Jena.
>
> In other words, we are considering storing the texts like:
>
> :Text_08357 a :EText ;
>     # various metadata about the EText
>     :hasPage
>         [ :pageNum 1 ;
>           :content ". . . 17,000 bytes . . ." ] ,
>         [ :pageNum 2 ;
>           :content ". . . 17,000 bytes . . ." ] ,
>         . . .
>
> We know that Lucene is happy with this data, but we're not sure whether
> Jena/TDB will be stressed with 229K triples with 17KB literals.
>
> The jena-text module offers the possibility of indexing in Lucene via a
> separate process and just using the search in Jena without actually storing
> the literals in TDB. This is a somewhat complex configuration, and we would
> prefer not to use this approach unless the size of the literals will
> present a problem.
>
> Thank you,
> Chris
large literals best practice?
Hello,

We have 23K texts averaging 10 pp/text (total pages: 229K) and ~17KB/page, for a total of 4GB of text. These texts are currently indexed via Lucene in an XMLdb, and we're wanting to know if there are any known issues regarding large literals in Jena.

In other words, we are considering storing the texts like:

:Text_08357 a :EText ;
    # various metadata about the EText
    :hasPage
        [ :pageNum 1 ;
          :content ". . . 17,000 bytes . . ." ] ,
        [ :pageNum 2 ;
          :content ". . . 17,000 bytes . . ." ] ,
        . . .

We know that Lucene is happy with this data, but we're not sure whether Jena/TDB will be stressed with 229K triples with 17KB literals.

The jena-text module offers the possibility of indexing in Lucene via a separate process and just using the search in Jena without actually storing the literals in TDB. This is a somewhat complex configuration, and we would prefer not to use this approach unless the size of the literals will present a problem.

Thank you,
Chris
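For reference, the "somewhat complex configuration" mentioned above is, in outline, a text-dataset assembler wrapping the TDB dataset. The following is a minimal, untested sketch assuming a Lucene index over the `:content` predicate from the example data; the file paths and the `example.org` namespace are placeholders, not real settings.

```turtle
PREFIX :     <http://example.org/>
PREFIX text: <http://jena.apache.org/text#>
PREFIX tdb:  <http://jena.hpl.hp.com/2008/tdb#>

# Text dataset = TDB dataset + Lucene index over :content.
<#dataset> a text:TextDataset ;
    text:dataset <#tdb> ;
    text:index   <#lucene> .

<#tdb> a tdb:DatasetTDB ;
    tdb:location "/path/to/tdb" .

<#lucene> a text:TextIndexLucene ;
    text:directory <file:/path/to/lucene-index> ;
    text:entityMap <#entMap> .

# Map the :content predicate into the Lucene "content" field.
<#entMap> a text:EntityMap ;
    text:entityField  "uri" ;
    text:defaultField "content" ;
    text:map (
        [ text:field "content" ; text:predicate :content ]
    ) .
```

Whether the literals also live in TDB, or only in Lucene via an external indexing process, is independent of this assembler shape.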