Re: Index search questions; special cases

2006-11-19 Thread Chris Hostetter

: Chris, thanks for the tips (or should I say, detailed explanation!). I
: actually got it working! It was a pain at first (never did any java, and

good to know .. glad it worked out for you.

: If Solr is interested in the filter, just tell me (and how I should go
: about contributing it).

The full list of instructions on how to submit a patch can be found on the
wiki...
http://wiki.apache.org/solr/HowToContribute

...ideally a patch should include unit tests demonstrating the new
feature, but if you don't have any of those (and don't feel like writing
them) a patch can still be useful to other people (who might be
interested in writing unit tests to encourage getting the changes added).


if you do open a Jira issue and attach your code, please note this thread
and the URL of the original class in Nutch, so people who may stumble
across it in Jira know where the original version is.

-Hoss



Re: Index search questions; special cases

2006-11-18 Thread Michael Imbeault


Chris Hostetter wrote:

CommonGrams itself seems to have some other dependencies on Nutch because
of other utilities in the same class, but based on a quick skim, what you
really want is the nested "private static class Filter extends
TokenFilter", which doesn't really have any external dependencies.  If you
extract that class into some more specifically named CommonGramsFilter,
all you need after that to use it in Solr is a simple little
FilterFactory so you can reference it in your schema.xml ... you can use
the StopFilterFactory as a template since you'll need exactly the same
initialization (get the name of a word list file from the init params,
parse it, and build a word set out of it)...


Chris, thanks for the tips (or should I say, detailed explanation!). I
actually got it working! It was a pain at first (never did any Java, and
all this ant, junit, war, jar, java and .class stuff is confusing!). I had
some compile errors that I cleaned up. Playing around with the filter in
the admin panel analyzer yields the expected results; I can't thank you
enough for your help. I now use:


<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="0" catenateNumbers="0"
        catenateAll="0"/>
<filter class="solr.CommonGramsFilterFactory"
        words="stopwords-complete.txt" ignoreCase="true"/>
<filter class="solr.StopFilterFactory" words="stopwords-complete.txt"
        ignoreCase="true"/>


And it works perfectly.

If Solr is interested in the filter, just tell me (and how I should go
about contributing it).


Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212






Re: Index search questions; special cases

2006-11-15 Thread Sami Siren

Erik Hatcher wrote:

Yeah, the Nutch code is highly intertwined with its unique configuration 
infrastructure and makes it hard to pull pieces of it out like this.


This is a critique that has been heard a lot (mainly because it's true :)
It would be really cool if different camps of Lucene could build these
nice utilities to be usable between projects. Not exactly sure how this
could be accomplished, but anyway it's something to consider.


--
 Sami Siren


Re: Index search questions; special cases

2006-11-15 Thread Chris Hostetter

:  Yeah, the Nutch code is highly intertwined with its unique configuration
:  infrastructure and makes it hard to pull pieces of it out like this.

that CommonGrams inner Filter class seemed like it could be extracted
easily enough.

: This is a critique that has been heard a lot (mainly because it's true :)
: It would be really cool if different camps of Lucene could build these
: nice utilities to be usable between projects. Not exactly sure how this
: could be accomplished, but anyway it's something to consider.

[EMAIL PROTECTED] is probably the best place to raise this discussion if
you're interested in pursuing it ... i think the best way to deal with it
may just be on a case by case basis ... if you find cool code in
sub-project XYZ, start by working with XYZ-dev to refactor it into an
extractable chunk, then work with java-dev to promote it up in the
lucene Java code base, and then circle back to XYZ-dev to deprecate the
copy in the XYZ code repository and replace it with a dependency on the
newly promoted version.


-Hoss



Re: Index search questions; special cases

2006-11-14 Thread Chris Hostetter

:  : Nutch has phrase pre-filtering which helps with this. It indexes the
:  : phrase fragments as separate terms and uses that set of matches to
:  : filter the set of matching documents.

:  That reminds me ... i seem to remember someone saying once that Nutch also
:  builds word based n-grams out of its stop words, so searches on "the"
:  or "on" won't match anything because those words are never indexed as
:  single tokens, but if a document contains "the dog in the house" it would
:  match a search on "in the" because the Analyzer would treat that as a
:  single token "in_the".

: This looks like exactly what I'm looking for. Is it related to the above
: 'nutch pre-filtering'? This way if I stopword single letters and
: numbers, it would still index 'hepatitis_a' as a single token, and match
: a search on 'hepatitis a' (non-phrase search) without hitting 'a patient
: has hepatitis'? I guess I'd have to apply the filter to the query too,
: so it turns the query into hepatitis_a?

right ... i think we were both talking about the same feature, which Otis
says is in Nutch's CommonGrams class...

http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/analysis/CommonGrams.java?view=markup

: Any chance at all this kind of filter gets implemented into solr? If
: not, indications on how to do it myself would be appreciated - I can't

CommonGrams itself seems to have some other dependencies on Nutch because
of other utilities in the same class, but based on a quick skim, what you
really want is the nested "private static class Filter extends
TokenFilter", which doesn't really have any external dependencies.  If you
extract that class into some more specifically named CommonGramsFilter,
all you need after that to use it in Solr is a simple little
FilterFactory so you can reference it in your schema.xml ... you can use
the StopFilterFactory as a template since you'll need exactly the same
initialization (get the name of a word list file from the init params,
parse it, and build a word set out of it)...

http://svn.apache.org/viewvc/incubator/solr/trunk/src/java/org/apache/solr/analysis/StopFilterFactory.java?view=markup

...all you really need to change is that the create method should return
a new CommonGramsFilter instead of a StopFilter.

Incidentally: most of the code in CommonGrams.Filter seems to be dealing
with the buffering of tokens ... it may be easier to reimplement the logic
with Solr's BufferedTokenStream as a base class.

-Hoss



Re: Index search questions; special cases

2006-11-14 Thread Erik Hatcher


On Nov 14, 2006, at 2:00 PM, Chris Hostetter wrote:
CommonGrams itself seems to have some other dependencies on Nutch
because of other utilities in the same class, but based on a quick skim,
what you really want is the nested "private static class Filter extends
TokenFilter", which doesn't really have any external dependencies.
If you extract that class into some more specifically named
CommonGramsFilter,...


Yeah, the Nutch code is highly intertwined with its unique  
configuration infrastructure and makes it hard to pull pieces of it  
out like this.


Erik



Re: Index search questions; special cases

2006-11-13 Thread Walter Underwood
On 11/12/06 8:52 PM, Michael Imbeault [EMAIL PROTECTED]
wrote:

 Sadly I can't rely on users' smartness for this :) I have concerns that
 for stuff like "Hepatitis A", it will match just about every document
 containing "hepatitis" and the very common word 'a', anywhere in the
 document. I can't stopword single letters, because then there would be no
 way to find documents about 'hepatitis c' and not about 'hepatitis b',
 for example. I will test my solution and report; if you have any other
 ideas, just tell me.

Nutch has phrase pre-filtering which helps with this. It indexes the
phrase fragments as separate terms and uses that set of matches to
filter the set of matching documents.

Another approach is to implement protected phrases, similar to the
protected words in stemming. These would be protected from stopword
processing.

A list of exception words and phrases is a pretty common trick in
other engines. Otherwise, you go nuts trying to get your analyzer
to handle ".NET" and "vitamin a". I know that AltaVista and Inktomi
did this.

wunder
-- 
Walter Underwood
Search Guru, Netflix

 



Re: Index search questions; special cases

2006-11-13 Thread Yonik Seeley

On 11/13/06, Walter Underwood [EMAIL PROTECTED] wrote:

Another approach is to implement protected phrases, similar to the
protected words in stemming. These would be protected from stopword
processing.


One could use the synonym filter (which can handle multi-token
synonyms) to get this effect.

WordDelimiterFilter => SynonymFilter => StopwordFilter => Stemmer

The SynonymFilter could have the following config:
hepatitis a, hepatitis_a

Use expand=true on the indexing analyzer, and expand=false on the
query analyzer.

Then, a doc with "hepatitis a" will end up indexing "hepatitis" and
"hepatitis_a".
And at query time all the following searches will find the doc:
  text:hepatitis
  text:"hepatitis a"
  text:hepatitis-a


A list of exception words and phrases is a pretty common trick in
other engines. Otherwise, you go nuts trying to get your analyzer
to handle ".NET" and "vitamin a". I know that AltaVista and Inktomi
did this.


That's not a bad idea... most of the code from the multi-token
SynonymFilter could be reused to efficiently recognize multi-token
matches.

-Yonik


Re: Index search questions; special cases

2006-11-13 Thread Chris Hostetter

:  Sadly I can't rely on users' smartness for this :) I have concerns that
:  for stuff like "Hepatitis A", it will match just about every document
:  containing "hepatitis" and the very common word 'a', anywhere in the
:  document. I can't stopword single letters, because then there would be no
:  way to find documents about 'hepatitis c' and not about 'hepatitis b'

: Nutch has phrase pre-filtering which helps with this. It indexes the
: phrase fragments as separate terms and uses that set of matches to
: filter the set of matching documents.

That reminds me ... i seem to remember someone saying once that Nutch also
builds word based n-grams out of its stop words, so searches on "the"
or "on" won't match anything because those words are never indexed as
single tokens, but if a document contains "the dog in the house" it would
match a search on "in the" because the Analyzer would treat that as a
single token "in_the".

something like that might work as well.
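
To make the idea concrete, here is a small standalone Python sketch of the behaviour described above. It is not the Nutch CommonGrams code and it does not use Lucene; the function name common_grams, the underscore separator (borrowed from the "in_the" example) and the exact rule for when a pair is emitted are assumptions made for illustration only.

```python
def common_grams(tokens, common_words):
    """Drop common words as single tokens, but keep them glued to a neighbour.

    Illustrative only: a plain-Python approximation of the idea, not a
    port of any Lucene/Nutch class.
    """
    common = {w.lower() for w in common_words}
    tokens = [t.lower() for t in tokens]
    out = []
    for i, tok in enumerate(tokens):
        # Ordinary words are always emitted on their own.
        if tok not in common:
            out.append(tok)
        # A joined pair is emitted whenever either member is a common word.
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            if tok in common or nxt in common:
                out.append(tok + "_" + nxt)
    return out


doc = common_grams("the dog in the house".split(), {"the", "in", "on"})
# doc contains "in_the" but neither "the" nor "in" on their own
```

With output like this, a query for "in the" that goes through the same function becomes the single token in_the and can match, while a query for just "the" produces no tokens at all.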



-Hoss



Re: Index search questions; special cases

2006-11-13 Thread Yonik Seeley

On 11/12/06, Michael Imbeault [EMAIL PROTECTED] wrote:

- Somewhat related : Let's say I index "Polymyxin B". If I stopword
single letters, would a phrase search ("Polymyxin B") still find the
right documents (I don't think so, but still)? If not, I'll have to
index single letters; how do I prevent the same problem as in the first
question (i.e., a search on Polymyxin B yielding documents with
"Polymyxin" and "B", but not close to one another).


The general problem seems to be that you can't tell what should be in a
phrase search and what shouldn't.

You could try throwing everything in a sloppy phrase query, so at
least scores will go up when terms are closer together (in general).

You could also try an exact phrase query, and if you don't get enough
results, follow it up with another strategy (like what you have
below).


My thought is to parse the user query and rephrase it to do phrase
searches on nearby terms containing single letters / numbers. If a user
searches for HIV 1 hepatitis, I'd rewrite it as ("HIV 1" AND hepatitis) OR
("1 hepatitis" AND hiv). Is it a sensible solution?


That might work.
Whatever general strategy you end up trying, you can probably boost
relevancy with some domain specific knowledge injected with something
like the SynonymFilter.

-Yonik


Re: Index search questions; special cases

2006-11-13 Thread Yonik Seeley

On 11/13/06, Yonik Seeley [EMAIL PROTECTED] wrote:

The SynonymFilter could have the following config:
hepatitis a, hepatitis_a


Oops, the synonyms should be reversed like so:
hepatitis_a, hepatitis a
so that when expand=false for querying, "hepatitis a" is mapped to hepatitis_a
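
A rough way to picture the expand=true / expand=false difference is the Python sketch below. It is only a toy model, not Solr's SynonymFilter: the function name apply_synonyms and the data layout are made up for illustration, and it flattens everything into one list, whereas the real filter stacks the alternatives at overlapping token positions.

```python
def apply_synonyms(tokens, groups, expand):
    """Toy multi-token synonym mapping.

    groups: list of synonym groups; each group is a list of phrases
            (tuples of tokens) and its first phrase is the canonical form.
    expand: True  -> a match emits every phrase in its group
            False -> a match emits only the canonical (first) phrase
    """
    out = []
    i = 0
    while i < len(tokens):
        match = None
        for group in groups:
            for phrase in group:
                n = len(phrase)
                # Prefer the longest phrase that matches at this position.
                if tuple(tokens[i:i + n]) == phrase and (match is None or n > match[1]):
                    match = (group, n)
        if match is None:
            out.append(tokens[i])
            i += 1
            continue
        group, n = match
        for phrase in (group if expand else group[:1]):
            out.extend(phrase)
        i += n
    return out


# The corrected rule from this message: "hepatitis_a, hepatitis a"
GROUPS = [[("hepatitis_a",), ("hepatitis", "a")]]

indexed = apply_synonyms(["hepatitis", "a"], GROUPS, expand=True)
queried = apply_synonyms(["hepatitis", "a"], GROUPS, expand=False)
```

Here indexed holds hepatitis_a as well as the original words, while queried is reduced to hepatitis_a alone, which is why the order of the entries in the rule matters.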

-Yonik


Re: Index search questions; special cases

2006-11-13 Thread Erik Hatcher


On Nov 13, 2006, at 1:51 PM, Chris Hostetter wrote:
That reminds me ... i seem to remember someone saying once that
Nutch also builds word based n-grams out of its stop words, so
searches on "the" or "on" won't match anything because those words
are never indexed as single tokens, but if a document contains
"the dog in the house" it would match a search on "in the" because
the Analyzer would treat that as a single token "in_the".



Yup we covered this in LIA:

http://lucenebook.com/search?query=nutch+stop+words




Re: Index search questions; special cases

2006-11-13 Thread Otis Gospodnetic
Indeed.  CommonGrams.java in Nutch is the place to look.

Otis

- Original Message 
From: Erik Hatcher [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Monday, November 13, 2006 2:08:51 PM
Subject: Re: Index  search questions; special cases


On Nov 13, 2006, at 1:51 PM, Chris Hostetter wrote:
 That reminds me ... i seem to remember someone saying once that
 Nutch also builds word based n-grams out of its stop words, so
 searches on "the" or "on" won't match anything because those words
 are never indexed as single tokens, but if a document contains
 "the dog in the house" it would match a search on "in the" because
 the Analyzer would treat that as a single token "in_the".


Yup we covered this in LIA:

http://lucenebook.com/search?query=nutch+stop+words







Re: Index search questions; special cases

2006-11-13 Thread Michael Imbeault

Hello everyone,

Thanks for all your answers; synonym-based approaches won't work
because the medical / research field is evolving way too fast; the list
would be huge and would become unmaintainable very quickly. Anyway,
I can't rely on score because I'm sorting by date, so I need to
completely eliminate the "'hiv' in one part of the doc and '1' in another
part" problem (if I want docs that fit HIV-1, or Polymyxin B, or
hepatitis A - I don't want docs that fit 'A patient was cured of
hepatitis C' if I search for 'hepatitis a').

: Nutch has phrase pre-filtering which helps with this. It indexes the
: phrase fragments as separate terms and uses that set of matches to
: filter the set of matching documents.
  
Is this a filter that I could implement easily in Solr? I never did any
Java, but it can't be that complicated I guess. Any help would be
appreciated.



That reminds me ... i seem to remember someone saying once that Nutch also
builds word based n-grams out of its stop words, so searches on "the"
or "on" won't match anything because those words are never indexed as
single tokens, but if a document contains "the dog in the house" it would
match a search on "in the" because the Analyzer would treat that as a
single token "in_the".
  


This looks like exactly what I'm looking for. Is it related to the above
'nutch pre-filtering'? This way if I stopword single letters and
numbers, it would still index 'hepatitis_a' as a single token, and match
a search on 'hepatitis a' (non-phrase search) without hitting 'a patient
has hepatitis'? I guess I'd have to apply the filter to the query too,
so it turns the query into hepatitis_a?


Basically, it's another way to do what I proposed as a solution - rewrite
the query to include phrase queries when you find a stopword, if you
index them anyway. Still, this solution looks better, as the size of the
index would probably be smaller than if I didn't stopword single letters
at all? For reference, what I proposed was:


My thought is to parse the user query and rephrase it to do phrase
searches on nearby terms containing single letters / numbers. If a
user searches for HIV 1 hepatitis, I'd rewrite it as ("HIV 1" AND
hepatitis) OR ("1 hepatitis" AND hiv). Is it a sensible solution?

Any chance at all this kind of filter gets implemented in Solr? If
not, pointers on how to do it myself would be appreciated - I can't
say I have a clue right now (never did any Java, the only Lucene programming
I did was via a PHP bridge).
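
For reference, the query rewriting proposed above could look roughly like the Python sketch below. The function name rewrite_query, the definition of a "short" term (one letter or a run of digits) and the use of AND between the remaining terms are assumptions made for illustration; nothing here is part of Solr.

```python
import re

# Assumption for this sketch: a "short" term is one letter or a run of digits.
SHORT = re.compile(r"^(?:[A-Za-z]|\d+)$")


def rewrite_query(query):
    """Glue each short term to its left and right neighbour as a phrase."""
    terms = query.split()
    alternatives = []
    for i, term in enumerate(terms):
        if not SHORT.match(term):
            continue
        # First try the left neighbour, then the right neighbour.
        for lo, hi in ((i - 1, i), (i, i + 1)):
            if lo < 0 or hi >= len(terms):
                continue
            phrase = '"%s %s"' % (terms[lo], terms[hi])
            rest = terms[:lo] + terms[hi + 1:]
            alternative = " AND ".join([phrase] + rest)
            if alternative not in alternatives:
                alternatives.append(alternative)
    if not alternatives:
        # No short terms: leave the query as a plain conjunction.
        return " AND ".join(terms)
    return " OR ".join("(%s)" % a for a in alternatives)


print(rewrite_query("HIV 1 hepatitis"))
# prints: ("HIV 1" AND hepatitis) OR ("1 hepatitis" AND HIV)
```

The number of alternatives grows with the number of short terms in the query, which is one reason an index-time approach may be preferable.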


Thanks for the help,

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212





Re: Index search questions; special cases

2006-11-12 Thread Chris Hostetter

: - Let's say I index "HIV-1" with <filter
: class="solr.WordDelimiterFilterFactory" generateWordParts="1"
: generateNumberParts="1" catenateWords="1" catenateNumbers="1"
: catenateAll="1"/>. Would a search on "HIV AND 1" (or even "HIV-1", which
: after parsing by the above filter would yield "HIV1" or "HIV 1") also find
: documents which have HIV and the number 1 somewhere in the document,
: but not directly after HIV? If so, how should I fix this? I could boost
: score by proximity, but I'm doing a sort on date anyway, so I guess it
: would be pointless to do so.

A couple of things make your question really hard to answer ... first off,
you can specify different analyzer chains for index time and query time --
when dealing with the WordDelim filter (or the synonym filter) this is
frequently necessary -- so the answers to your questions really depend on
whether you use WordDelim at both index time and query time (or if you do
use it in both cases, but configure it differently)
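
As a very rough picture of what the WordDelim settings in the question do to a token like "HIV-1", consider the Python sketch below. It is a heavy simplification and not the real filter (which also handles case changes and other rules); the function name word_delimiter_parts and its single catenate_all switch are assumptions made for illustration.

```python
import re


def word_delimiter_parts(token, catenate_all=True):
    """Split a token into letter runs and digit runs, optionally adding
    the concatenation of all parts as an extra token."""
    parts = re.findall(r"[A-Za-z]+|\d+", token)
    out = list(parts)
    if catenate_all and len(parts) > 1:
        out.append("".join(parts))
    return out


index_side = word_delimiter_parts("HIV-1", catenate_all=True)
query_side = word_delimiter_parts("HIV-1", catenate_all=False)
```

The two calls give different token lists for the same input, which illustrates why the answer depends on how the index-time and query-time chains are configured.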

Have you by any chance played with the Analysis page on your Solr index?
  
http://localhost:8983/solr/admin/analysis.jsp?name=&verbose=on&highlight=on&qverbose=on

...it makes it really easy to see exactly how your various fields will get
parsed at index time and query time.  I would also suggest you use the
debugQuery=on option when doing some searches -- even if there aren't
any documents in your index, that will help you see how your query is
getting parsed and what Query structure QueryParser is building based on
the tokens it gets from each of the Analyzers.
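
If it helps, a request with debugQuery=on can be built along the lines of the Python sketch below. The host and port are taken from the analysis URL above; the /solr/select path and the helper name debug_query_url are assumptions about a default setup, so adjust them to your installation.

```python
from urllib.parse import urlencode


def debug_query_url(query, base="http://localhost:8983/solr/select"):
    """Build a search URL that asks Solr to include query debugging output."""
    params = {"q": query, "debugQuery": "on"}
    return base + "?" + urlencode(params)


url = debug_query_url('text:"HIV 1"')
```

Opening the resulting URL in a browser should show the parsed query in the debug section of the response.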

: - Somewhat related : Let's say I index Polymyxin B. If I stopword
: single letters, would a phrase search (Polymyxin B) still find the
: right documents (I don't think so, but still)? If not, I'll have to

depends on what the "right" documents are .. if you strip stopwords out
both at index time and at query time then it will ultimately match exactly
the same thing as a query on "Polymyxin", which i guess must be the "right"
documents since no documents will contain the letter "B", so what else
could be right? :)

: index single letters; how do I prevent the same problem as in the first
: question (i.e., a search on Polymyxin B yielding documents with
: Polymyxin and B, but not close to one another).
:
: My thought is to parse the user query and rephrase it to do phrase
: searches on nearby terms containing single letters / numbers. If a user
: searches for HIV 1 hepatitis, I'd rewrite it as ("HIV 1" AND hepatitis) OR
: ("1 hepatitis" AND hiv). Is it a sensible solution?

that's kind of a strange behavior for a search application to have ... you
might just want to trust that your users will be smart and if they find
that 'HIV 1 hepatitis' is matching docs where "1" doesn't appear near
"HIV" or "hepatitis" then they will start entering '"HIV 1" hepatitis' (or
'HIV "1 hepatitis"' if that's what they meant.)




-Hoss



Re: Index search questions; special cases

2006-11-12 Thread Michael Imbeault

Chris Hostetter wrote:

A couple of things make your question really hard to answer ... first off,
you can specify different analyzer chains for index time and query time --
when dealing with the WordDelim filter (or the synonym filter) this is
frequently necessary -- so the answers to your questions really depend on
whether you use WordDelim at both index time and query time (or if you do
use it in both cases, but configure it differently)
  

For clarification, I'm using the filter both at index and query time.

Have you by any chance played with the Analysis page on your Solr index?
  
http://localhost:8983/solr/admin/analysis.jsp?name=&verbose=on&highlight=on&qverbose=on

...it makes it really easy to see exactly how your various fields will get
parsed at index time and query time.  I would also suggest you use the
debugQuery=on option when doing some searches -- even if there aren't
any documents in your index, that will help you see how your query is
getting parsed and what Query structure QueryParser is building based on
the tokens it gets from each of the Analyzers.
  
Will try that, played with it in the past, but not for this particular 
problem, good idea :)

: My thought is to parse the user query and rephrase it to do phrase
: searches on nearby terms containing single letters / numbers. If a user
: searches for HIV 1 hepatitis, I'd rewrite it as ("HIV 1" AND hepatitis) OR
: ("1 hepatitis" AND hiv). Is it a sensible solution?

that's kind of a strange behavior for a search application to have ... you
might just want to trust that your users will be smart and if they find
that 'HIV 1 hepatitis' is matching docs where "1" doesn't appear near
"HIV" or "hepatitis" then they will start entering '"HIV 1" hepatitis' (or
'HIV "1 hepatitis"' if that's what they meant.)
  
Sadly I can't rely on users' smartness for this :) I have concerns that
for stuff like "Hepatitis A", it will match just about every document
containing "hepatitis" and the very common word 'a', anywhere in the
document. I can't stopword single letters, because then there would be no
way to find documents about 'hepatitis c' and not about 'hepatitis b',
for example. I will test my solution and report; if you have any other
ideas, just tell me.


And thanks for the help! :)