A little help with indexing joined words

2009-10-05 Thread Andrew McCombe
Hi
I am hoping someone can point me in the right direction with regard to
indexing words that are concatenated together to make other words or product
names.

We have indexed a product database and have come across some search terms
where zero results are returned. There are products in the index with
'Borderlands xxx xxx' and 'Dragonfly xx xxx' in the title, but searches for
'Borderland', 'Border Land' and 'Dragon Fly' all return zero results.

Where do I look to resolve this?  The product name field is indexed using a
text field type.

Thanks in advance
Andrew


Re: A little help with indexing joined words

2009-10-05 Thread Avlesh Singh

 We have indexed a product database and have come across some search terms
 where zero results are returned. There are products in the index with
 'Borderlands xxx xxx' and 'Dragonfly xx xxx' in the title, but searches for
 'Borderland', 'Border Land' and 'Dragon Fly' all return zero results.

Borderland should have worked for a regular text field. For all other
desired matches you can use EdgeNGramTokenizerFactory.

Cheers
Avlesh


Re: A little help with indexing joined words

2009-10-05 Thread Christian Zambrano
Using synonyms might be a better solution, because EdgeNGramTokenizerFactory
can generate a large number of tokens. That artificially inflates the number
of terms in the index, which in turn affects the IDF scores.
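
To make the IDF concern concrete, here is a minimal Python sketch (not Solr code). It assumes the idf formula used by Lucene's DefaultSimilarity, 1 + ln(numDocs / (docFreq + 1)); the four one-word titles and the helper names are made up for illustration.

```python
import math

def edge_ngrams(term, min_size=1, max_size=100):
    """Return the leading substrings of term, shortest first."""
    return [term[:n] for n in range(min_size, min(len(term), max_size) + 1)]

def classic_idf(doc_freq, num_docs):
    """idf in the style of DefaultSimilarity: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def doc_freqs(docs_tokens):
    """Count, for every term, the number of documents that contain it."""
    df = {}
    for tokens in docs_tokens:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    return df

# Hypothetical product titles, reduced to a single term each for brevity.
titles = ["dragonfly", "dragonage", "borderlands", "drakensang"]

plain_df = doc_freqs([[t] for t in titles])           # one term per title
gram_df = doc_freqs([edge_ngrams(t) for t in titles])  # every prefix is a term

for term in ("dra", "dragon", "dragonfly"):
    print(term, gram_df[term], round(classic_idf(gram_df[term], len(titles)), 3))
```

The n-grammed index holds far more distinct terms than the plain one, and short prefixes shared by many titles end up with a high document frequency and therefore a low idf, while the full word keeps a high one.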


A query for borderland should have returned results though. It is 
difficult to troubleshoot why it didn't without knowing what query you 
used, and what kind of analysis is taking place.


Have you tried using the analysis page in the admin section to see what
tokens get generated for 'Borderlands'?


Christian


Re: A little help with indexing joined words

2009-10-05 Thread Avlesh Singh

 Using synonyms might be a better solution, because EdgeNGramTokenizerFactory
 can generate a large number of tokens. That artificially inflates the number
 of terms in the index, which in turn affects the IDF scores.

Well, I don't see why anyone would need length-based normalization on such
matches. I have always set omitNorms on fields that use this filter.

Yes, synonyms might be an answer when you have a limited number of such words
(phrases) and their possible combinations.
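
As a rough illustration of the synonym route, the Python sketch below rewrites query tokens through a small hand-maintained mapping. In Solr this job would normally be done by SynonymFilterFactory with a synonyms file; the mapping entries and the function name here are hypothetical, and the sketch only shows why the approach suits a short, known list of variants.

```python
# Hypothetical hand-maintained mapping, playing the role of a synonyms file.
SYNONYMS = {
    ("dragon", "fly"): "dragonfly",
    ("border", "land"): "borderlands",
    ("borderland",): "borderlands",
}

def apply_synonyms(query, synonyms=SYNONYMS, max_len=2):
    """Lowercase and split the query, then replace known variants."""
    tokens = query.lower().split()
    out, i = [], 0
    while i < len(tokens):
        for n in range(max_len, 0, -1):  # try the longest variant first
            key = tuple(tokens[i:i + n])
            if len(key) == n and key in synonyms:
                out.append(synonyms[key])
                i += n
                break
        else:
            # No variant starts at this position: keep the token as it is.
            out.append(tokens[i])
            i += 1
    return out

print(apply_synonyms("Dragon Fly"))
print(apply_synonyms("Border Land xxx"))
```

Every variant has to be listed by hand, which is the limitation mentioned above.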

Cheers
Avlesh


Re: A little help with indexing joined words

2009-10-05 Thread Christian Zambrano
Would you mind explaining how omitNorms has any effect on the IDF problem
I described earlier?


I agree with your second sentence. I had to use the NGramTokenFilter to 
accommodate partial matches.


Re: A little help with indexing joined words

2009-10-05 Thread Robert Muir
FYI, if you don't want to turn off norms entirely, try this option in
Lucene 2.9's DefaultSimilarity:

public void setDiscountOverlaps(boolean v)

Determines whether overlap tokens (Tokens with 0 position increment)
are ignored when computing norm. By default this is false, meaning
overlap tokens are counted just like non-overlap tokens.
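
A small Python sketch of what this option changes, assuming the DefaultSimilarity length norm of 1 / sqrt(numTerms). The function names and the sample position increments are made up, and the one-byte encoding Lucene applies to stored norms is not modelled. Whether n-gram tokens actually count as overlaps depends on the position increments the filter emits, which the thread does not state.

```python
import math

def length_norm(num_terms):
    """Length norm in the style of DefaultSimilarity: 1 / sqrt(numTerms)."""
    return 1.0 / math.sqrt(num_terms)

def compute_norm(position_increments, discount_overlaps=False, boost=1.0):
    """Norm of one field, given the position increment of each of its tokens."""
    length = len(position_increments)
    # Overlap tokens are the ones stacked on the previous position.
    overlaps = sum(1 for inc in position_increments if inc == 0)
    num_terms = length - overlaps if discount_overlaps else length
    return boost * length_norm(num_terms)

# Hypothetical field: one real token followed by three injected overlap tokens.
increments = [1, 0, 0, 0]

print(compute_norm(increments))                          # overlaps counted
print(compute_norm(increments, discount_overlaps=True))  # overlaps ignored
```

With overlaps discounted, the field is scored as if it contained only its one real token, so injected tokens no longer make it look longer.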

 Well, I don't see why anyone would need length-based normalization on such
 matches. I have always set omitNorms on fields that use this filter.

--
Robert Muir
rcm...@gmail.com


Re: A little help with indexing joined words

2009-10-05 Thread Avlesh Singh
Zambrano, I was too quick to respond to your IDF explanation. I definitely
did not mean that IDF and length norms are the same thing.

Andrew, this is how I would have done it.
First, I would create a field type called prefix_text, as shown underneath,
in my schema.xml:
<fieldType name="prefix_text" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^a-z0-9])" replacement="" replace="all"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" maxGramSize="100"
            minGramSize="1"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^a-z0-9])" replacement="" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^(.{20})(.*)?" replacement="$1" replace="all"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Second, I would declare a field of this type and populate it (using
copyField) while indexing.

Third, while querying, I would query on both fields, boosting matches on the
original field well above those on the n-grammed field. In scenarios where
Dragon Fly is expected to match Dragonfly in the index, the query on the
original field would not return any matches, so the results from the
prefix_text field end up right at the top.
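
The effect of the two analyzer chains can be approximated in plain Python. This is only a simulation of the prefix_text analysis described above, not Solr itself; the function names are made up, and the sample titles come from the original question.

```python
import re

def normalize(text):
    """Keyword tokenizer + lowercase + strip everything outside a-z0-9."""
    return re.sub(r"[^a-z0-9]", "", text.lower())

def index_tokens(text, min_gram=1, max_gram=100):
    """Index-side chain: normalize, then emit the leading edge n-grams."""
    token = normalize(text)
    return {token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)}

def query_token(text, max_len=20):
    """Query-side chain: normalize, then keep at most the first 20 characters."""
    return normalize(text)[:max_len]

def matches(query, title):
    """True if the single query token equals one of the indexed n-grams."""
    return query_token(query) in index_tokens(title)

print(matches("Dragon Fly", "Dragonfly xx xxx"))
print(matches("Border Land", "Borderlands xxx xxx"))
print(matches("xxx", "Dragonfly xx xxx"))
```

Because the whole title is treated as one token, only queries that are a prefix of the normalized title match on this field, which is why the original field is still queried alongside it.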

Hope this helps.

Cheers
Avlesh
