RE: Can't mix Synonyms with Shingles?

2011-08-10 Thread Steven A Rowe
Hi Jeff,

You have configured ShingleFilterFactory with a token separator of "", so e.g. 
"International Corporation" will output the shingle "InternationalCorporation". 
If this is the form you want to use for synonym matching, it must exist in 
your synonym file.  Does it?

Steve
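
A minimal Python sketch of the shingling described above. It is not Lucene
code: the function name `shingles` is hypothetical, and it assumes a maximum
shingle size of 2 with the original tokens (unigrams) kept in the output.

```python
def shingles(tokens, max_size=2, separator=""):
    """Return (position, term) pairs: each original token, followed by the
    shingles that start at that token's position."""
    out = []
    for i, token in enumerate(tokens):
        position = i + 1                      # positions are 1-based here
        out.append((position, token))         # keep the original token
        for size in range(2, max_size + 1):
            if i + size <= len(tokens):       # enough tokens left for a shingle
                out.append((position, separator.join(tokens[i:i + size])))
    return out


stream = shingles(["International", "Corporation"])
for position, term in stream:
    print(position, term)
```

With an empty separator the two-word shingle is a single run-together term,
so a synonym rule would have to be written in that form to match it.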

> -----Original Message-----
> From: Jeff Wartes [mailto:jwar...@whitepages.com]
> Sent: Wednesday, August 10, 2011 3:43 PM
> To: solr-user@lucene.apache.org
> Subject: Can't mix Synonyms with Shingles?
>
>
> I would like to combine the ShingleFilterFactory with a
> SynonymFilterFactory in a field type.
>
> I've looked at something like this using the analysis.jsp tool:
>
> <fieldType name="TestTerm" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="1" generateNumberParts="1" stemEnglishPossessive="1"/>
>     <filter class="solr.ShingleFilterFactory" tokenSeparator=""/>
>     <filter class="solr.SynonymFilterFactory"
>             synonyms="synonyms.BusinessNames.txt" ignoreCase="true" expand="true"/>
>     ...
>   </analyzer>
>   <analyzer type="query">
>   ...
>   </analyzer>
> </fieldType>
>
> However, when a ShingleFilterFactory is applied first, the
> SynonymFilterFactory appears to do nothing.
> I haven't found any documentation or other warnings against this
> combination, and I don't want to apply shingles after synonyms (this
> works) because multi-word synonyms then cause severe term expansion. I
> don't really mind if the synonyms fail to match shingles (although I'd
> prefer they succeed), but I'd at least expect that synonyms would continue
> to match on the original tokens, as they do if I remove the
> ShingleFilterFactory.
>
> I'm using Solr 3.3; any clarification would be appreciated.
>
> Thanks,
>   -Jeff Wartes



RE: Can't mix Synonyms with Shingles?

2011-08-10 Thread Jeff Wartes

Hi Steven,

The token separator was certainly a deliberate choice. Are you saying that 
after applying shingles, synonyms can only match shingled terms? The term 
analysis suggests the original tokens still exist. 
You've made me realize that only certain synonyms seem to have problems, though, 
so it's not a blanket failure.

Take this synonym definition:
wamu, washington mutual bank, washington mutual

Indexing "wamu" looks like it works fine: there are no shingles, and all 
three synonym expansions appear to get indexed (expand=true). However, 
indexing "washington mutual" applies the shingles correctly (adds 
"washingtonmutual" to position 1), but the synonym expansion does not happen. I 
would still expect the synonym definition to match the original terms and index 
"wamu" along with the other terms.

Thanks.



RE: Can't mix Synonyms with Shingles?

2011-08-10 Thread Jeff Wartes

After some further playing around, I think I understand what's going on. 
Because the SynonymFilterFactory pays attention to term position when it 
inserts a multi-word synonym, I had assumed it also respected term position 
when scanning for matches (i.e., for a two-word synonym, I assumed it would 
look for the second word in position n+1 if it found the first word in 
position n).

This does not appear to be the case. It appears to find multi-word synonym 
matches by simply walking the list of terms, exhausting all the terms in 
position 1 before looking at any terms in position 2. The ShingleFilter 
adds terms to most positions, which throws off the 'adjacency' of the 
flattened list of terms. As a result, a two-word synonym can only match if it 
consists of the original term (position 1) followed by the added shingle 
(also in position 1).

Put another way, in the analysis.jsp display it does not scan for multi-word 
synonym tokens across then down; it scans down then across.


It doesn't look like there's a way to do what I'm trying to do (index shingles 
AND multi-word synonyms in one field) without writing my own filter.
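
The "down then across" behaviour can be modelled with a short Python sketch.
This only illustrates the behaviour as described in this thread; it is not the
SynonymFilter implementation, and the function name `flat_match` and the
(position, term) tuple layout are assumptions made for the example.

```python
def flat_match(stream, phrase):
    """Return True if `phrase` occurs as consecutive entries of the flattened
    token stream, ignoring positions (terms in one position are walked
    before any term in the next position)."""
    terms = [term for _position, term in stream]
    n = len(phrase)
    return any(terms[i:i + n] == phrase for i in range(len(terms) - n + 1))


# Token streams as (position, term) pairs, in the order they are emitted.
plain = [(1, "washington"), (2, "mutual")]
shingled = [(1, "washington"), (1, "washingtonmutual"), (2, "mutual")]

print(flat_match(plain, ["washington", "mutual"]))               # True
print(flat_match(shingled, ["washington", "mutual"]))            # False
print(flat_match(shingled, ["washington", "washingtonmutual"]))  # True
```

In the shingled stream the added shingle sits between "washington" and
"mutual" in the flattened order, so the two-word synonym no longer lines up.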



Re: Can't mix Synonyms with Shingles?

2011-08-10 Thread Robert Muir
On Wed, Aug 10, 2011 at 7:10 PM, Jeff Wartes <jwar...@whitepages.com> wrote:
>
> After some further playing around, I think I understand what's going on.
> Because the SynonymFilterFactory pays attention to term position when it
> inserts a multi-word synonym, I had assumed it scanned for matches in a way
> that respected term position as well. (i.e., for a two-word synonym, I assumed
> it would try to find the second word in position n+1 if it found the first
> word in position n)
>
> This does not appear to be the case. It appears to find multi-word synonym
> matches by simply walking the list of terms, exhausting all the terms in
> position one before looking at any terms in position two.

this is correct: and i think it would cause some serious bad
performance otherwise: if you have a tokenstream like this: (A B C) (D
E F) (G H I) ..., and are matching multiword synonyms, it can
potentially explode at least in terms of cpu time and all the
state-saving/restoring/copying and stuff it would need to start
considering the tokenstream as more of a token-confusion-network, and
it gets worse if you think about position increments > 1.
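
a rough python sketch of why this can blow up, assuming a naive
position-aware matcher would have to consider every way of picking one token
per position (the name `candidate_sequences` is hypothetical, and this is not
how any lucene filter is written):

```python
from itertools import product
from math import prod


def candidate_sequences(lattice):
    """Every way to pick one token per position, in position order."""
    return list(product(*lattice))


# The tokenstream (A B C) (D E F) (G H I): three tokens at each of three positions.
lattice = [["A", "B", "C"], ["D", "E", "F"], ["G", "H", "I"]]

sequences = candidate_sequences(lattice)
print(len(sequences))                      # 27
print(prod(len(p) for p in lattice))       # 27, the product of the per-position counts
```

with k tokens stacked at each of n positions the count is k**n, so it grows
exponentially with the length of the stream.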

at least recently in svn, the limitation is documented:
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/synonym/SynonymFilter.java

-- 
lucidimagination.com