Re: tokenize into sentences/sentence splitter

2015-09-24 Thread Alessandro Benedetti
Reading this: "*unindexed* but stored field of sentences". "Unindexed" immediately points me to the fact that you actually do not need a tokeniser at all. Just run an external sentence splitter (in your indexing application) and store the sentences as different values of a stored field. Why t…
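A sketch of that external-splitter approach, using the JDK's java.text.BreakIterator for splitting and SolrJ for indexing; the core URL and the field names "id", "text", and "sentences" are illustrative assumptions, not taken from the thread:

    import java.text.BreakIterator;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Locale;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class SentenceIndexer {
      public static void main(String[] args) throws Exception {
        String text = "First sentence. Second sentence! A third?";

        // Split into sentences outside Solr, before indexing.
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        for (int start = it.first(), end = it.next();
             end != BreakIterator.DONE;
             start = end, end = it.next()) {
          sentences.add(text.substring(start, end).trim());
        }

        // Store each sentence as one value of a multi-valued stored field.
        SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/myproject");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc1");
        doc.addField("text", text);        // indexed with a normal analyzer
        for (String s : sentences) {
          doc.addField("sentences", s);    // unindexed, stored, multiValued
        }
        solr.add(doc);
        solr.commit();
        solr.close();
      }
    }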

Re: tokenize into sentences/sentence splitter

2015-09-24 Thread Ziqi Zhang
Thanks for the comprehensive explanation, I think option 3 best fits my app. On 23/09/2015 22:53, Steve Rowe wrote: Unless you need to be able to search on sentences-as-terms, i.e. exact sentence matching, you should try to find an alternative; otherwise your term index will be unnecessarily huge…

Re: tokenize into sentences/sentence splitter

2015-09-23 Thread Steve Rowe
Unless you need to be able to search on sentences-as-terms, i.e. exact sentence matching, you should try to find an alternative; otherwise your term index will be unnecessarily huge. Three things come to mind: 1. A single Lucene index can host mixed document types, e.g. full documents and sentences…
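A sketch of option 1, indexing sentence-level documents next to full documents in the same index; the "type", "parent_id", and "sentence_text" fields and the example query are illustrative assumptions:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class SentenceDocIndexer {
      public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/myproject");

        // One document per sentence, tagged so sentence docs can be
        // filtered apart from the full documents in the same index.
        SolrInputDocument sentDoc = new SolrInputDocument();
        sentDoc.addField("id", "doc1_s0");
        sentDoc.addField("type", "sentence");    // vs. type:document
        sentDoc.addField("parent_id", "doc1");   // back-reference to the full doc
        sentDoc.addField("sentence_text", "First sentence.");
        solr.add(sentDoc);
        solr.commit();
        solr.close();

        // Exact sentence matching then becomes a filtered phrase query, e.g.
        //   q=sentence_text:"First sentence."  fq=type:sentence
      }
    }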

Re: tokenize into sentences/sentence splitter

2015-09-23 Thread Ziqi Zhang
Further to this problem, I have created a custom tokenizer but I cannot get it loaded properly by Solr. The error stacktrace: Exception in thread "main" org.apache.solr.common.SolrException: SolrCore 'myproject' is not available due to init failure: Could not load c…
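This kind of init failure is commonly caused by Solr not finding the class on the core's classpath, or by the schema referencing the Tokenizer class directly instead of a TokenizerFactory. A sketch of the usual wiring, where the jar path and factory class name are hypothetical:

    <!-- solrconfig.xml: put the jar containing the custom tokenizer
         on the core's classpath (path and name are hypothetical) -->
    <lib path="/path/to/myproject-analysis.jar"/>

    <!-- schema.xml: reference a TokenizerFactory, not the Tokenizer itself -->
    <fieldType name="sentences" class="solr.TextField">
      <analyzer>
        <tokenizer class="com.mycompany.analysis.SentenceTokenizerFactory"/>
      </analyzer>
    </fieldType>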

Re: tokenize into sentences/sentence splitter

2015-09-23 Thread Ziqi Zhang
Thanks Steve. It probably also makes sense to extract sentences and then store them. But along with each sentence I also need to store its start/end offsets. I'm not sure how to do that without creating a separate index that stores each sentence as a document? Basically the field for sentence a…
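One workaround sketch that avoids a second index: encode each sentence's character offsets into the stored value itself. The start|end|text encoding below is purely an assumption for illustration:

    import java.text.BreakIterator;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Locale;

    public class SentenceSpans {
      /** Returns stored-field values like "0|15|First sentence." so each
          value carries its own start/end offsets (encoding is illustrative). */
      public static List<String> sentenceValues(String text) {
        List<String> values = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        for (int start = it.first(), end = it.next();
             end != BreakIterator.DONE;
             start = end, end = it.next()) {
          values.add(start + "|" + end + "|" + text.substring(start, end));
        }
        return values;
      }

      public static void main(String[] args) {
        sentenceValues("First sentence. Second one!").forEach(System.out::println);
      }
    }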

Re: tokenize into sentences/sentence splitter

2015-09-23 Thread Steve Rowe
Hi Ziqi, Lucene has support for sentence chunking: see SegmentingTokenizerBase, implemented in ThaiTokenizer and HMMChineseTokenizer. There is an example in that class’s tests that creates tokens out of individual sentences: TestSegmentingTokenizerBase.WholeSentenceTokenizer. However, it…
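A sketch modeled on that test class: a tokenizer that emits each whole sentence as a single token. It assumes the SegmentingTokenizerBase API (setNextSentence/incrementWord plus the protected buffer and offset fields), so verify it against the Lucene version in use:

    import java.text.BreakIterator;
    import java.util.Locale;

    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.analysis.util.SegmentingTokenizerBase;

    /** Emits each sentence as a single token, after the pattern used in
     *  TestSegmentingTokenizerBase.WholeSentenceTokenizer. */
    public final class WholeSentenceTokenizer extends SegmentingTokenizerBase {

      private int sentenceStart, sentenceEnd;
      private boolean hasSentence;

      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

      public WholeSentenceTokenizer() {
        super(BreakIterator.getSentenceInstance(Locale.ROOT));
      }

      @Override
      protected void setNextSentence(int sentenceStart, int sentenceEnd) {
        this.sentenceStart = sentenceStart;
        this.sentenceEnd = sentenceEnd;
        hasSentence = true;  // one "word" (the whole sentence) per sentence
      }

      @Override
      protected boolean incrementWord() {
        if (!hasSentence) {
          return false;
        }
        hasSentence = false;
        clearAttributes();
        // buffer and offset are protected fields of SegmentingTokenizerBase
        termAtt.copyBuffer(buffer, sentenceStart, sentenceEnd - sentenceStart);
        offsetAtt.setOffset(correctOffset(offset + sentenceStart),
                            correctOffset(offset + sentenceEnd));
        return true;
      }
    }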

Re: tokenize into sentences/sentence splitter

2015-09-23 Thread Ziqi Zhang
Thanks, that is understood. My application is a bit special in that I need both an indexed field with standard tokenization and an unindexed but stored field of sentences. Both must be present for each document. I could possibly make do with PatternTokenizer, but that is, of course, less accurate…
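A schema.xml sketch of that field layout; the field and type names are illustrative assumptions:

    <!-- Indexed full-text field plus an unindexed, stored, multi-valued
         sentences field (names are illustrative) -->
    <field name="text"      type="text_general" indexed="true"  stored="false"/>
    <field name="sentences" type="string"       indexed="false" stored="true"
           multiValued="true"/>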

Re: tokenize into sentences/sentence splitter

2015-09-23 Thread Doug Turnbull
Sentence recognition is usually an NLP problem, probably best handled outside of Solr. For example, you probably want to train and run a sentence recognition algorithm, inject a sentence delimiter, then use that delimiter as the basis for tokenization. More info on sentence recognition: http://open…
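A sketch of that pipeline with Apache OpenNLP's sentence detector; the model file en-sent.bin and the U+2029 delimiter choice are assumptions:

    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;

    public class DelimiterInjector {
      // Paragraph-separator char as the sentence delimiter; any character
      // that never occurs in the text would do (this choice is an assumption).
      private static final String DELIM = "\u2029";

      public static String injectDelimiters(String text) throws Exception {
        // en-sent.bin is a pre-trained OpenNLP sentence model (hypothetical path)
        try (InputStream in = new FileInputStream("en-sent.bin")) {
          SentenceModel model = new SentenceModel(in);
          SentenceDetectorME detector = new SentenceDetectorME(model);
          return String.join(DELIM, detector.sentDetect(text));
        }
      }
    }

A PatternTokenizer configured on that delimiter could then split the field back into sentence tokens, which lines up with the PatternTokenizer idea mentioned elsewhere in this thread.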

tokenize into sentences/sentence splitter

2015-09-23 Thread Ziqi Zhang
Hi, I need a special kind of 'token' which is a sentence, so I need a tokenizer that splits text into sentences. I wonder if there are already such or similar implementations? If I have to implement it myself, I suppose I need to implement a subclass of Tokenizer. Having looked at a few existing…