Reading this: "*unindexed* but stored field of sentences"
"Unindexed" immediately suggests to me that you actually do not need a
tokeniser at all.
Just run an external sentence splitter (in your indexing application), and
store the sentences as different values for a stored field.
Why t
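A rough sketch of what that could look like, using the JDK's BreakIterator as the external splitter and Lucene's StoredField for the unindexed values (the field names here are made up):

import java.text.BreakIterator;
import java.util.Locale;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;

public class SentenceSplittingIndexer {

  // Build one Lucene Document with an indexed body field (analyzed as usual)
  // and a stored-only, multivalued "sentence" field, one value per sentence.
  public static Document buildDocument(String text) {
    Document doc = new Document();
    doc.add(new TextField("body", text, Field.Store.NO));   // indexed, tokenized
    BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
    bi.setText(text);
    int start = bi.first();
    for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
      String sentence = text.substring(start, end).trim();
      if (!sentence.isEmpty()) {
        doc.add(new StoredField("sentence", sentence));      // stored, not indexed
      }
    }
    return doc;
  }
}

At search time the sentences come back in insertion order via doc.getValues("sentence").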
Thanks for the comprehensive explanation, I think option 3 best fits my app.
On 23/09/2015 22:53, Steve Rowe wrote:
Unless you need to be able to search on sentences-as-terms, i.e. exact sentence
matching, you should try to find an alternative; otherwise your term index will
be unnecessarily huge.
Three things come to mind:
1. A single Lucene index can host mixed document types, e.g. full documents and
sentenc
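A minimal sketch of what that first option could look like, assuming a "type" discriminator field and a "parent_id" field linking sentence documents back to their full document (both field names are made up):

import java.io.IOException;
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class MixedTypeIndexer {

  // Index the full document and each of its sentences as separate documents
  // in the same index, distinguished by a "type" field.
  public static void index(IndexWriter writer, String docId, String body,
                           List<String> sentences) throws IOException {
    Document full = new Document();
    full.add(new StringField("id", docId, Field.Store.YES));
    full.add(new StringField("type", "full", Field.Store.YES));
    full.add(new TextField("body", body, Field.Store.NO));
    writer.addDocument(full);

    for (String sentence : sentences) {
      Document sent = new Document();
      sent.add(new StringField("parent_id", docId, Field.Store.YES));
      sent.add(new StringField("type", "sentence", Field.Store.YES));
      sent.add(new TextField("text", sentence, Field.Store.YES));
      writer.addDocument(sent);
    }
  }
}

Queries can then filter on type:sentence or type:full so the two document shapes don't get mixed in one result list.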
Further to this problem, I have created a custom tokenizer but I cannot
get it loaded properly by Solr.
The error stacktrace:
Exception in thread "main" org.apache.solr.common.SolrException:
SolrCore 'myproject' is not available due to init failure: Could not
load c
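The stack trace is cut off here, but a common cause of "Could not load ..." init failures (only a guess) is pointing the schema at the Tokenizer class itself instead of at a TokenizerFactory, or not putting the jar somewhere Solr loads from. A minimal factory wrapper looks roughly like this (class names are hypothetical; SentenceTokenizer stands for your custom Tokenizer subclass):

import java.util.Map;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeFactory;

// Hypothetical factory for a custom SentenceTokenizer; Solr instantiates the
// factory named in the schema's <tokenizer class="..."/> element.
public class SentenceTokenizerFactory extends TokenizerFactory {

  public SentenceTokenizerFactory(Map<String, String> args) {
    super(args);
    if (!args.isEmpty()) {
      throw new IllegalArgumentException("Unknown parameters: " + args);
    }
  }

  @Override
  public Tokenizer create(AttributeFactory factory) {
    return new SentenceTokenizer(factory); // your custom Tokenizer subclass
  }
}

The factory, not the Tokenizer, is what the fieldType should name, and the jar has to be referenced from solrconfig.xml via a <lib/> directive or placed in a directory Solr already loads from.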
Thanks Steve.
It probably also makes sense to extract sentences and then store them.
But along with each sentence I also need to store its start/end offsets.
I'm not sure how to do that without creating a separate index that
stores each sentence as a document? Basically the field for sentence a
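One workable convention (an assumption, not the only way) is to keep a second, parallel multivalued stored field whose i-th value carries the offsets of the i-th sentence:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.StoredField;

public class SentenceOffsets {

  // The i-th value of "sentence_span" holds the start/end character offsets
  // of the i-th value of "sentence" (field names are made up).
  public static void addSentence(Document doc, String sentence, int start, int end) {
    doc.add(new StoredField("sentence", sentence));
    doc.add(new StoredField("sentence_span", start + "," + end));
  }
}

Document.getValues() returns stored values in the order they were added, so the two arrays stay aligned at retrieval time.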
Hi Ziqi,
Lucene has support for sentence chunking - see SegmentingTokenizerBase,
implemented in ThaiTokenizer and HMMChineseTokenizer. There is an example in
that class’s tests that creates tokens out of individual sentences:
TestSegmentingTokenizerBase.WholeSentenceTokenizer.
However, it
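Paraphrasing that test class from memory (treat this as a sketch, not the exact source), a whole-sentence tokenizer built on SegmentingTokenizerBase looks roughly like this:

import java.text.BreakIterator;
import java.util.Locale;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.util.SegmentingTokenizerBase;

// Emits one token per sentence, with offsets, by plugging a sentence
// BreakIterator into SegmentingTokenizerBase.
public class WholeSentenceTokenizer extends SegmentingTokenizerBase {

  private int sentenceStart, sentenceEnd;
  private boolean hasSentence;

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

  public WholeSentenceTokenizer() {
    super(BreakIterator.getSentenceInstance(Locale.ROOT));
  }

  @Override
  protected void setNextSentence(int sentenceStart, int sentenceEnd) {
    this.sentenceStart = sentenceStart;
    this.sentenceEnd = sentenceEnd;
    hasSentence = true;
  }

  @Override
  protected boolean incrementWord() {
    if (!hasSentence) {
      return false;
    }
    hasSentence = false;
    clearAttributes();
    termAtt.copyBuffer(buffer, sentenceStart, sentenceEnd - sentenceStart);
    offsetAtt.setOffset(correctOffset(offset + sentenceStart),
                        correctOffset(offset + sentenceEnd));
    return true;
  }
}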
Thanks, that is understood.
My application is a bit special in that I need both an indexed
field with standard tokenization and an unindexed but stored field of
sentences. Both must be present for each document.
I could possibly do with PatternTokenizer, but that is, of course, less
ac
Sentence recognition is usually an NLP problem. Probably best handled
outside of Solr. For example, you probably want to train and run a sentence
recognition algorithm, inject a sentence delimiter, then use that delimiter
as the basis for tokenization.
More info on sentence recognition
http://open
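A rough sketch of that pipeline, using the JDK's BreakIterator as a stand-in for a trained sentence detector and an unlikely control character as the injected delimiter (both choices are assumptions):

import java.text.BreakIterator;
import java.util.Locale;

public class SentenceDelimiterInjector {

  // Delimiter that is unlikely to occur in real text; a tokenizer can later
  // split on it to recover sentences.
  public static final String DELIM = "\u241E"; // SYMBOL FOR RECORD SEPARATOR

  public static String inject(String text) {
    BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
    bi.setText(text);
    StringBuilder out = new StringBuilder();
    int start = bi.first();
    for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
      out.append(text, start, end).append(DELIM);
    }
    return out.toString();
  }
}

With the delimiter in place, a PatternTokenizer configured to split on it would emit one token per sentence.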
Hi
I need a special kind of 'token' which is a sentence, so I need a
tokenizer that splits texts into sentences.
I wonder if there are already such or similar implementations?
If I have to implement it myself, I suppose I need to implement a
subclass of Tokenizer. Having looked at a few exist