Re: Analyzer for supporting hyphenated words
Hi Diego, let me try to help. I find this a little confusing. You first say:

"For our customer it is important to find the word
- wi-fi by wi, fi, wifi, wi-fi
- jean-pierre by jean, pierre, jean-pierre, jean-*"

but then:

"The (exact) query "FD-A320-REC-SIM-1" returns
FD-A320-REC-SIM-1
MIA-FD-A320-REC-SIM-1
SIN-FD-A320-REC-SIM-1
for our customer this is wrong because this exact phrase match query should only return the single entry FD-A320-REC-SIM-1"

Notice that the suffix "fi" in the first example plays the same role as the suffix "FD-A320-REC-SIM-1" in the second: you want the first to match and the second not to.

To qualify your requirement: do you want the user to be able to surround the query with "" to run the phrase as a NOT tokenized phrase? By default a phrase query is tokenized like any other query, but term positions affect the matching! If I have identified your requirement correctly, we can think about a solution!

Cheers

2015-07-17 9:41 GMT+01:00 Diego Socaceti :

> Hi all,
>
> I'm new to Lucene and tried to write my own analyzer to support
> hyphenated words like wi-fi, jean-pierre, etc.
> For our customer it is important to find the word
> - wi-fi by wi, fi, wifi, wi-fi
> - jean-pierre by jean, pierre, jean-pierre, jean-*
>
> The analyzer:
>
> public class SupportHyphenatedWordsAnalyzer extends Analyzer {
>
>     protected NormalizeCharMap charConvertMap;
>
>     public SupportHyphenatedWordsAnalyzer() {
>         initCharConvertMap();
>     }
>
>     protected void initCharConvertMap() {
>         NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
>         builder.add("\"", "");
>         charConvertMap = builder.build();
>     }
>
>     @Override
>     protected TokenStreamComponents createComponents(final String fieldName) {
>         final Tokenizer src = new WhitespaceTokenizer();
>         TokenStream tok = new WordDelimiterFilter(src,
>                 WordDelimiterFilter.PRESERVE_ORIGINAL
>                         | WordDelimiterFilter.GENERATE_WORD_PARTS
>                         | WordDelimiterFilter.GENERATE_NUMBER_PARTS
>                         | WordDelimiterFilter.CATENATE_WORDS,
>                 null);
>         tok = new LowerCaseFilter(tok);
>         tok = new LengthFilter(tok, 1, 255);
>         tok = new StopFilter(tok, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
>         return new TokenStreamComponents(src, tok);
>     }
>
>     @Override
>     protected Reader initReader(String fieldName, Reader reader) {
>         return new MappingCharFilter(charConvertMap, reader);
>     }
> }
>
> The analyzer seems to work except for exact phrase match queries.
>
> E.g. the following words are indexed:
>
> FD-A320-REC-SIM-1
> FD-A320-REC-SIM-10
> FD-A320-REC-SIM-11
> MIA-FD-A320-REC-SIM-1
> SIN-FD-A320-REC-SIM-1
>
> The (exact) query "FD-A320-REC-SIM-1" returns
>
> FD-A320-REC-SIM-1
> MIA-FD-A320-REC-SIM-1
> SIN-FD-A320-REC-SIM-1
>
> For our customer this is wrong, because this exact phrase match
> query should only return the single entry FD-A320-REC-SIM-1.
>
> Do you have any ideas or tips on how we have to change our current
> analyzer to support this requirement?
>
> Thanks and kind regards
> Diego

--
Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
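A quick way to see why the longer entries match the quoted query is to print the token stream the analyzer emits; term positions do the rest. A minimal sketch, assuming the SupportHyphenatedWordsAnalyzer quoted above compiles as shown and Lucene 5.x APIs:

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class PrintTokens {
    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new SupportHyphenatedWordsAnalyzer();
        // Analyze one of the longer indexed values and dump term + position.
        try (TokenStream ts = analyzer.tokenStream("f", "MIA-FD-A320-REC-SIM-1")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            PositionIncrementAttribute posIncr = ts.addAttribute(PositionIncrementAttribute.class);
            ts.reset();
            int pos = -1;
            while (ts.incrementToken()) {
                pos += posIncr.getPositionIncrement();
                System.out.println(pos + " -> " + term);
            }
            ts.end();
        }
    }
}

Alongside the preserved original and the catenated form, MIA-FD-A320-REC-SIM-1 yields the parts mia, fd, a320, rec, sim, 1 at consecutive positions. The quoted query "FD-A320-REC-SIM-1" analyzes into the same fd ... 1 sequence, which lines up against the tail of the longer entries; that positional alignment is exactly why they match.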
can I ?
Hi
I am new to Lucene, so: can I get back from a search the same documents (i.e. the same Java object instances) that I used in my add command to the index?

Thnx
Yechiel
Re: can I ?
Hi Yechiel,
if you mean the same Java object instances (i.e. the same objects, identified by the same references on the heap), the answer is no.

The document you send to Lucene to be added to the index is a different object type from the one you retrieve from a searcher. You index a *Document* (org.apache.lucene.document.Document) and retrieve from the searcher a *TopDocs* (org.apache.lucene.search.TopDocs), which is a collection of *ScoreDoc* (org.apache.lucene.search.ScoreDoc). A ScoreDoc simply contains the id of the document and its score. Once you have the id, you can use your IndexSearcher to fetch the stored values of that document, getting back a *StoredDocument* (org.apache.lucene.index.StoredDocument).

Remember that to retrieve the original (not analysed) field content at search time, the field needs to be stored: the stored content is an additional data structure that you build at indexing time.

I hope this clarifies your doubts.

Cheers

2015-07-21 12:09 GMT+01:00 Yechiel Feffer :

> Hi
> I am new to Lucene, so: can I get back from a search the same documents
> (i.e. the same Java object instances) that I used in my add command to
> the index?
>
> Thnx
> Yechiel

--
Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
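A minimal sketch of that round trip, assuming Lucene 5.x, where IndexSearcher.doc(int) rebuilds the stored fields into a fresh object; the index path and the "title" field are illustrative:

import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SearchAndFetch {
    public static void main(String[] args) throws IOException {
        Directory dir = FSDirectory.open(Paths.get("/path/to/index"));
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new TermQuery(new Term("title", "lucene")), 10);
            for (ScoreDoc sd : hits.scoreDocs) {
                // A new object rebuilt from the stored fields -- not the
                // instance originally passed to IndexWriter.addDocument().
                Document stored = searcher.doc(sd.doc);
                System.out.println(sd.score + "  " + stored.get("title"));
            }
        }
    }
}

Note that only fields indexed with stored=true come back populated here; everything else exists only in inverted form.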
Re: Analyzer for supporting hyphenated words
If you don't explicitly enable automatic phrase queries, the Lucene query parser will assume an OR operator over the sub-terms whenever a whitespace-delimited term analyzes into a sequence of terms.

See:
https://lucene.apache.org/core/5_2_0/queryparser/org/apache/lucene/queryparser/classic/QueryParserBase.html#setAutoGeneratePhraseQueries(boolean)

-- Jack Krupansky

On Fri, Jul 17, 2015 at 4:41 AM, Diego Socaceti wrote:

> Hi all,
>
> I'm new to Lucene and tried to write my own analyzer to support
> hyphenated words like wi-fi, jean-pierre, etc.
> For our customer it is important to find the word
> - wi-fi by wi, fi, wifi, wi-fi
> - jean-pierre by jean, pierre, jean-pierre, jean-*
>
> [analyzer code and example results snipped]
>
> Do you have any ideas or tips on how we have to change our current
> analyzer to support this requirement?
>
> Thanks and kind regards
> Diego
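For reference, a minimal sketch of enabling that flag on the classic QueryParser, assuming Lucene 5.x and the SupportHyphenatedWordsAnalyzer from this thread; the "content" field name is illustrative:

import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class AutoPhraseDemo {
    public static void main(String[] args) throws ParseException {
        QueryParser parser = new QueryParser("content", new SupportHyphenatedWordsAnalyzer());
        // With this flag set, an unquoted FD-A320-REC-SIM-1 that analyzes into
        // several sub-terms becomes a phrase query instead of an OR over them.
        parser.setAutoGeneratePhraseQueries(true);
        Query q = parser.parse("FD-A320-REC-SIM-1");
        System.out.println(q);  // e.g. content:"fd a320 rec sim 1"
    }
}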
Re: Analyzer for supporting hyphenated words
Hey Jack,
reading the doc:

"Set to true if phrase queries will be automatically generated when the analyzer returns more than one term from whitespace delimited text. NOTE: this behavior may not be suitable for all languages. Set to false if phrase queries should only be generated when surrounded by double quotes."

In this use case I guess the user is likely to use double quotes. The only problem he has seen so far is that the phrase query uses the query-time analyser to split the text into tokens. First we need feedback from him, but I guess he would like the phrase query NOT to tokenise the text within the double quotes. In that case we should find a way.

Cheers

2015-07-21 13:12 GMT+01:00 Jack Krupansky :

> If you don't explicitly enable automatic phrase queries, the Lucene query
> parser will assume an OR operator over the sub-terms whenever a
> whitespace-delimited term analyzes into a sequence of terms.
>
> See:
> https://lucene.apache.org/core/5_2_0/queryparser/org/apache/lucene/queryparser/classic/QueryParserBase.html#setAutoGeneratePhraseQueries(boolean)
>
> -- Jack Krupansky
>
> On Fri, Jul 17, 2015 at 4:41 AM, Diego Socaceti wrote:
>
> > Hi all,
> >
> > I'm new to Lucene and tried to write my own analyzer to support
> > hyphenated words like wi-fi, jean-pierre, etc.
> >
> > [analyzer code and example results snipped]
> >
> > Do you have any ideas or tips on how we have to change our current
> > analyzer to support this requirement?
> >
> > Thanks and kind regards
> > Diego

--
Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"
William Blake - Songs of Experience -1794 England
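If the requirement is confirmed, one possible direction (a sketch only, not something settled in this thread): index the whole value as a single lowercased token in a second, hypothetical "exact" field, and route double-quoted user input to that field. The class and field names below are illustrative:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.KeywordTokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;

// Keeps the whole field value as one token, only lowercasing it, so
// "FD-A320-REC-SIM-1" is indexed and queried as a single term.
public class ExactMatchAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer src = new KeywordTokenizer();
        TokenStream tok = new LowerCaseFilter(src);
        return new TokenStreamComponents(src, tok);
    }
}

At indexing time a PerFieldAnalyzerWrapper can apply this analyzer to the "exact" field only, while the hyphen-aware analyzer keeps serving the main field. A double-quoted query, run through the same analyzer, then becomes a single TermQuery against "exact" and returns only the entry FD-A320-REC-SIM-1.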
Re: Lucene 5.2.0 global ordinal based query time join on multiple indexes
It seems that if I create a MultiReader from my index searchers and build the ordinal map from that MultiReader (and use an IndexSearcher created from the MultiReader in createJoinQuery), then the correct results are found.

On Mon, Jul 20, 2015 at 5:48 PM, Alex Pang wrote:

> Hi,
>
> Does the global ordinal based query time join support joining on multiple
> indexes?
>
> From my testing on 2 indexes with a common join field, the document ids I
> get back from the ScoreDoc[] when searching are incorrect, though the
> number of results is the same as if I use the older join query.
>
> For the parent (to) index, the value of the join field is unique to each
> document.
> For the child (from) index, multiple documents can have the same value for
> the join field, which must be found in the parent index.
> Both indexes have a join field indexed with SortedDocValuesField.
>
> The parent index had 7 segments and the child index had 3 segments.
>
> The ordinal map is built with:
>
> SortedDocValues[] values = new SortedDocValues[searcher1
>         .getIndexReader().leaves().size()];
> for (LeafReaderContext leafContext : searcher1.getIndexReader().leaves()) {
>     values[leafContext.ord] = DocValues.getSorted(leafContext.reader(),
>             "join_field");
> }
> MultiDocValues.OrdinalMap ordinalMap = MultiDocValues.OrdinalMap.build(
>         searcher1.getIndexReader().getCoreCacheKey(), values,
>         PackedInts.DEFAULT);
>
> The join query:
>
> joinQuery = JoinUtil.createJoinQuery("join_field",
>         fromQuery,
>         new TermQuery(new Term("type", "to")), searcher2,
>         ScoreMode.Max, ordinalMap);
>
> Thanks,
> Alex
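For anyone finding this later, a sketch of the fix described above, assuming Lucene 5.2 APIs; reader1, reader2 and fromQuery are placeholders for the two underlying index readers and the child-side query:

import java.io.IOException;

import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.MultiDocValues;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.join.JoinUtil;
import org.apache.lucene.search.join.ScoreMode;
import org.apache.lucene.util.packed.PackedInts;

public class GlobalOrdinalJoinAcrossIndexes {
    public static Query buildJoinQuery(IndexReader reader1, IndexReader reader2,
                                       Query fromQuery) throws IOException {
        // One composite reader spanning both indexes, so that ordinals and
        // document ids are global across them.
        MultiReader multiReader = new MultiReader(reader1, reader2);
        IndexSearcher joinSearcher = new IndexSearcher(multiReader);

        // Collect the per-segment doc values of the join field over the
        // leaves of the MultiReader, not of a single index's searcher.
        SortedDocValues[] values = new SortedDocValues[multiReader.leaves().size()];
        for (LeafReaderContext leaf : multiReader.leaves()) {
            values[leaf.ord] = DocValues.getSorted(leaf.reader(), "join_field");
        }
        MultiDocValues.OrdinalMap ordinalMap = MultiDocValues.OrdinalMap.build(
                multiReader.getCoreCacheKey(), values, PackedInts.DEFAULT);

        // The searcher passed here must be the one built on the MultiReader,
        // so the returned doc ids are valid in that composite space.
        return JoinUtil.createJoinQuery("join_field", fromQuery,
                new TermQuery(new Term("type", "to")), joinSearcher,
                ScoreMode.Max, ordinalMap);
    }
}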