Re: Analyzer for supporting hyphenated words

2015-07-21 Thread Alessandro Benedetti
Hi Diego,
let me try to help.

I find this a little bit confusing:

"For our customer it is important to find the word
- *wi-fi* by wi, *fi*, wifi, wi-fi
- jean-pierre by jean, pierre, jean-pierre, jean-*"

But :
"
The (exact) query "*FD-A320-REC-SIM-1*" returns
FD-A320-REC-SIM-1
MIA-*FD-A320-REC-SIM-1*
SIN-FD-A320-REC-SIM-1

for our customer this is wrong because this exact phrase match
query should only return the single entry FD-A320-REC-SIM-1
"

If you notice, the suffix "fi" in the first example is comparable to the
suffix "FD-A320-REC-SIM-1" in the second.
To qualify your requirement:

Do you want the user to be able to surround the query with "" to run the
phrase query over the NOT tokenized phrase?
By default a phrase query is tokenized like any other query, but term
positions affect the matching!
Once I have correctly identified your requirement, we can think about a
solution!
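If the requirement is that a quoted query must match only the single full value, one common approach (my sketch, not something proposed in the thread; the field names "code" and "code_exact" are invented) is to index a second, untokenized copy of the field and run exact matches against it:

```java
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

public class ExactMatchSketch {

  public static int exactHits() throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new WhitespaceAnalyzer()));
    for (String code : new String[] {"FD-A320-REC-SIM-1", "MIA-FD-A320-REC-SIM-1"}) {
      Document doc = new Document();
      doc.add(new TextField("code", code, Field.Store.YES));        // analyzed: sub-token matches
      doc.add(new StringField("code_exact", code, Field.Store.NO)); // one token: exact matches only
      writer.addDocument(doc);
    }
    writer.close();

    IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
    // The untokenized field matches only the single full value, never the longer codes.
    return searcher.search(new TermQuery(new Term("code_exact", "FD-A320-REC-SIM-1")), 10).totalHits;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(exactHits());
  }
}
```

The analyzed field keeps the "wi / fi / wifi" behaviour, while the exact field sidesteps phrase-query tokenization entirely.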


Cheers



2015-07-17 9:41 GMT+01:00 Diego Socaceti :

> Hi all,
>
> I'm new to Lucene and tried to write my own analyzer to support
> hyphenated words like wi-fi, jean-pierre, etc.
> For our customer it is important to find the word
> - wi-fi by wi, fi, wifi, wi-fi
> - jean-pierre by jean, pierre, jean-pierre, jean-*
>
>
>
>
> The analyzer:
>
> import java.io.Reader;
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.Tokenizer;
> import org.apache.lucene.analysis.charfilter.MappingCharFilter;
> import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
> import org.apache.lucene.analysis.core.LowerCaseFilter;
> import org.apache.lucene.analysis.core.StopAnalyzer;
> import org.apache.lucene.analysis.core.StopFilter;
> import org.apache.lucene.analysis.core.WhitespaceTokenizer;
> import org.apache.lucene.analysis.miscellaneous.LengthFilter;
> import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
>
> public class SupportHyphenatedWordsAnalyzer extends Analyzer {
>
>   protected NormalizeCharMap charConvertMap;
>
>   public SupportHyphenatedWordsAnalyzer() {
>     initCharConvertMap();
>   }
>
>   protected void initCharConvertMap() {
>     NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
>     builder.add("\"", "");
>     charConvertMap = builder.build();
>   }
>
>   @Override
>   protected TokenStreamComponents createComponents(final String fieldName) {
>     final Tokenizer src = new WhitespaceTokenizer();
>     TokenStream tok = new WordDelimiterFilter(src,
>         WordDelimiterFilter.PRESERVE_ORIGINAL
>             | WordDelimiterFilter.GENERATE_WORD_PARTS
>             | WordDelimiterFilter.GENERATE_NUMBER_PARTS
>             | WordDelimiterFilter.CATENATE_WORDS,
>         null);
>     tok = new LowerCaseFilter(tok);
>     tok = new LengthFilter(tok, 1, 255);
>     tok = new StopFilter(tok, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
>     return new TokenStreamComponents(src, tok);
>   }
>
>   @Override
>   protected Reader initReader(String fieldName, Reader reader) {
>     return new MappingCharFilter(charConvertMap, reader);
>   }
> }
>
>
>
>
>
> The analyzer seems to work except for exact phrase match queries.
>
> e.g. the following words are indexed
>
> FD-A320-REC-SIM-1
> FD-A320-REC-SIM-10
> FD-A320-REC-SIM-11
> MIA-FD-A320-REC-SIM-1
> SIN-FD-A320-REC-SIM-1
>
>
> The (exact) query "FD-A320-REC-SIM-1" returns
> FD-A320-REC-SIM-1
> MIA-FD-A320-REC-SIM-1
> SIN-FD-A320-REC-SIM-1
>
> for our customer this is wrong because this exact phrase match
> query should only return the single entry FD-A320-REC-SIM-1
>
> Do you have any ideas or tips on how we have to change our current
> analyzer to support this requirement?
>
>
> Thanks and Kind regards
> Diego
>



-- 
--

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


can I ?

2015-07-21 Thread Yechiel Feffer
Hi,
I am new to Lucene, so:
can I get back from a search the same documents (i.e. the same Java object
instances) that I used in my add command to the index?

Thanks,
Yechiel


Re: can I ?

2015-07-21 Thread Alessandro Benedetti
Hi Yechiel,
if you mean the same Java object instances (i.e. the same objects,
identified by the same references in the heap space), the answer is no.

The document you send to Lucene to be added to the index is a different
object type from the one you can retrieve from a searcher.

You can index a *Document* (org.apache.lucene.document.Document) and
retrieve from the searcher a *TopDocs* (org.apache.lucene.search.TopDocs),
which is a collection of *ScoreDoc* (org.apache.lucene.search.ScoreDoc).

A ScoreDoc simply contains the id of the document and its score.
Once you have the id, you can use your IndexSearcher to actually get the
stored values for your Document (via IndexSearcher.doc(int), which returns
the stored fields as a new Document).

Remember that to retrieve the original (not analysed) field content at
search time, you need that field to be stored: you first have to build, at
indexing time, an additional data structure containing the stored content
of the fields.
I hope this clarifies your doubts.
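The round trip described above can be sketched as follows; this is my own illustration, not code from the thread, and the field name "title" and its value are made up:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;

public class StoredFieldsSketch {

  // Index one document, search, and read the stored value back.
  public static String firstTitle() throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
    Document doc = new Document();
    // Field.Store.YES builds the extra stored-fields structure at indexing time
    doc.add(new TextField("title", "hello lucene", Field.Store.YES));
    writer.addDocument(doc);
    writer.close();

    IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
    TopDocs top = searcher.search(new MatchAllDocsQuery(), 10);
    ScoreDoc sd = top.scoreDocs[0];          // just a doc id plus a score
    Document stored = searcher.doc(sd.doc);  // a NEW object, not the instance you indexed
    return stored.get("title");
  }

  public static void main(String[] args) throws Exception {
    System.out.println(firstTitle());
  }
}
```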

Cheers


2015-07-21 12:09 GMT+01:00 Yechiel Feffer :

> Hi
> I am new to Lucene so-
> Can I get from search the same documents (i.e. same java object instances)
> that I used in my add command to the index?
>
> Thnx
> Yechiel
>





Re: Analyzer for supporting hyphenated words

2015-07-21 Thread Jack Krupansky
If you don't explicitly enable automatic phrase queries, the Lucene query
parser will assume an OR operator over the sub-terms when a
whitespace-delimited term analyzes into a sequence of terms.

See:
https://lucene.apache.org/core/5_2_0/queryparser/org/apache/lucene/queryparser/classic/QueryParserBase.html#setAutoGeneratePhraseQueries(boolean)
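As a sketch of that setting (my illustration; the field name and example term are mine): with an analyzer that splits "wi-fi" into two tokens, the flag decides whether the parser emits a boolean OR or a phrase query:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class AutoPhraseSketch {

  public static Query parseWifi(boolean autoPhrase) throws Exception {
    QueryParser parser = new QueryParser("code", new StandardAnalyzer());
    // false (default): "wi-fi" analyzes to [wi, fi] and becomes (code:wi OR code:fi)
    // true: the same input becomes the phrase query code:"wi fi"
    parser.setAutoGeneratePhraseQueries(autoPhrase);
    return parser.parse("wi-fi");
  }

  public static void main(String[] args) throws Exception {
    System.out.println(parseWifi(false));
    System.out.println(parseWifi(true));
  }
}
```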


-- Jack Krupansky

On Fri, Jul 17, 2015 at 4:41 AM, Diego Socaceti  wrote:

> Hi all,
>
> i'm new to lucene and tried to write my own analyzer to support
> hyphenated words like wi-fi, jean-pierre, etc.
> [...]


Re: Analyzer for supporting hyphenated words

2015-07-21 Thread Alessandro Benedetti
Hey Jack, reading the doc:

" Set to true if phrase queries will be automatically generated when the
analyzer returns more than one term from whitespace delimited text. NOTE:
this behavior may not be suitable for all languages.

Set to false if phrase queries should only be generated when surrounded by
double quotes."


In this user's case, I guess he's likely to use double quotes.

The only problem he sees so far is that the phrase query uses the
query-time analyser to actually split the tokens.

First we need feedback from him, but I guess he would like the phrase
query to not tokenise the text within the double quotes.

In that case we should find a way.


Cheers

2015-07-21 13:12 GMT+01:00 Jack Krupansky :

> If you don't explicitly enable automatic phrase queries, the Lucene query
> parser will assume an OR operator on the sub-terms when a white
> space-delimited term analyzes into a sequence of terms.
>
> See:
>
> https://lucene.apache.org/core/5_2_0/queryparser/org/apache/lucene/queryparser/classic/QueryParserBase.html#setAutoGeneratePhraseQueries(boolean)
>
>
> -- Jack Krupansky
>
> On Fri, Jul 17, 2015 at 4:41 AM, Diego Socaceti 
> wrote:
>
> > Hi all,
> >
> > i'm new to lucene and tried to write my own analyzer to support
> > hyphenated words like wi-fi, jean-pierre, etc.
> > [...]
>





Re: Lucene 5.2.0 global ordinal based query time join on multiple indexes

2015-07-21 Thread Alex Pang
It seems that if I create a MultiReader from my index searchers, build the
ordinal map from that MultiReader, and use an IndexSearcher created from
the MultiReader in createJoinQuery, then the correct results are found.
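The fix described above might look like the following; this is my reconstruction under simplified assumptions (tiny single-segment in-memory indexes, made-up field values), not Alex's actual code:

```java
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.MultiDocValues;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.join.JoinUtil;
import org.apache.lucene.search.join.ScoreMode;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.packed.PackedInts;

public class MultiReaderJoinSketch {

  // Build a small index whose docs carry a type marker and a doc-values join field.
  static RAMDirectory index(String type, String... joinValues) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(new KeywordAnalyzer()));
    for (String v : joinValues) {
      Document d = new Document();
      d.add(new StringField("type", type, Field.Store.NO));
      d.add(new StringField("join_field", v, Field.Store.NO));
      d.add(new SortedDocValuesField("join_field", new BytesRef(v)));
      w.addDocument(d);
    }
    w.close();
    return dir;
  }

  public static int joinHits() throws Exception {
    // "from" side has join values a, a, b; "to" side has a, b, c.
    MultiReader multi = new MultiReader(
        DirectoryReader.open(index("from", "a", "a", "b")),
        DirectoryReader.open(index("to", "a", "b", "c")));
    IndexSearcher searcher = new IndexSearcher(multi);

    // Build the global ordinals over the ONE composite reader, so both sides agree.
    SortedDocValues[] values = new SortedDocValues[multi.leaves().size()];
    for (LeafReaderContext leaf : multi.leaves()) {
      values[leaf.ord] = DocValues.getSorted(leaf.reader(), "join_field");
    }
    MultiDocValues.OrdinalMap ordinalMap = MultiDocValues.OrdinalMap.build(
        multi.getCoreCacheKey(), values, PackedInts.DEFAULT);

    // Join every "from" doc to the "to" side; "c" has no from-side match.
    Query joinQuery = JoinUtil.createJoinQuery("join_field",
        new TermQuery(new Term("type", "from")),
        new TermQuery(new Term("type", "to")),
        searcher, ScoreMode.Max, ordinalMap);
    return searcher.search(joinQuery, 10).totalHits;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(joinHits());
  }
}
```

Because the global ordinal join resolves doc ids against the reader the ordinal map was built on, both the map and the searcher must come from the same composite reader.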


On Mon, Jul 20, 2015 at 5:48 PM, Alex Pang  wrote:

> Hi,
>
>
>
> Does the Global Ordinal based query time join support joining on multiple
> indexes?
>
>
>
> From my testing on 2 indexes with a common join field, the document ids I
> get back from the ScoreDoc[] when searching are incorrect, though the
> number of results is the same as if I use the older join query.
>
>
> For the parent (to) index, the value of the join field is unique to each
> document.
>
> For the child (from) index, multiple documents can have the same value for
> the join field, which must be found in the parent index.
>
> Both indexes have a join field indexed with SortedDocValuesField.
>
>
> The parent index had 7 segments and child index had 3 segments.
>
>
> Ordinal map is built with:
>
> SortedDocValues[] values =
>     new SortedDocValues[searcher1.getIndexReader().leaves().size()];
> for (LeafReaderContext leafContext : searcher1.getIndexReader().leaves()) {
>   values[leafContext.ord] =
>       DocValues.getSorted(leafContext.reader(), "join_field");
> }
> MultiDocValues.OrdinalMap ordinalMap = MultiDocValues.OrdinalMap.build(
>     searcher1.getIndexReader().getCoreCacheKey(), values,
>     PackedInts.DEFAULT);
>
>
> Join Query:
>
> joinQuery = JoinUtil.createJoinQuery("join_field", fromQuery,
>     new TermQuery(new Term("type", "to")), searcher2,
>     ScoreMode.Max, ordinalMap);
>
>
>
> Thanks,
>
> Alex
>