date:20070920

a bug in commit script?

2007-09-20 Thread Yu-Hui Jin

Hi, guys,

It seems there's a small bug in the bin/commit script for solr 1.2.

I was able to run snapinstaller successfully to install the index and open a
new searcher. (This is verified by querying the new docs through the web
admin UI.) However, the snapinstaller script failed due to the commit
script's failure. The commit.log shows:

2007/09/19 23:54:43 started by solruser
2007/09/19 23:54:43 command: /var/SolrHome/solr/bin/commit
2007/09/19 23:54:43 commit request to Solr at
http://localhost:6080/solr/update failed:
2007/09/19 23:54:43 <?xml version="1.0" encoding="UTF-8"?> <response> <lst
name="responseHeader"><int name="status">0</int><int
name="QTime">47</int></lst>
</response>
2007/09/19 23:54:43 failed (elapsed time: 0 sec)

I then checked the commit script which has the following line:

   echo $rs | grep '<result.*status="0"' > /dev/null 2>&1

However, this is the older pattern of the response. The XML schema changed
in 1.2.  Should someone fix this?
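
To illustrate the mismatch, here is a minimal sketch of a status check that accepts both response shapes. It is written in Python rather than the script's shell. The function name commit_succeeded is made up for this example, and the older shape (a result element carrying a status attribute) is inferred only from the grep pattern above. It is not the fix that went into Solr.

```python
import xml.etree.ElementTree as ET


def commit_succeeded(body):
    """Return True if a Solr update response reports status 0.

    Handles the 1.2-style <response> document shown in commit.log and,
    as an assumption, an older <result status="0"> document.
    """
    try:
        root = ET.fromstring(body.strip())
    except ET.ParseError:
        return False

    if root.tag == "response":
        # 1.2 style: <lst name="responseHeader"><int name="status">0</int>...
        status = root.find("./lst[@name='responseHeader']/int[@name='status']")
        return status is not None and (status.text or "").strip() == "0"

    if root.tag == "result":
        # Older style (assumed): status is an attribute of <result>.
        return root.get("status") == "0"

    return False
```

The point is only that the check should look at the status value inside responseHeader instead of matching the old result element.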


-- 
Regards,

-Hui

Re: a bug in commit script?

2007-09-20 Thread Chris Hostetter

: 
: It seems there's a small bug in the bin/commit script for solr 1.2.

A fix was already committed to the trunk for this as part of SOLR-282 (but 
there doesn't seem to be a note about it in the changelog)


-Hoss

Re: Solr Index - no segments* file found in org.apache.lucene.store.FSDirectory

2007-09-20 Thread Chris Hostetter


: Does this case arise when i do a search when there is no index?? -  If yes,
: then i guess the Exception can be made more meaningful.

in normal operation, i believe this shouldn't happen -- Solr will create 
the index for you on startup if there isn't one.  You're attempting a 
fairly advanced / non trivial approach where you aren't letting Solr 
manage the index for you.

you haven't given us any idea what the code you are using to build the 
index looks like -- but if i had to guess, i would bet that somewhere in 
there you are directly manipulating the file system directory -- not just 
the Lucene FSDirectory.  that's the only situation i can think of where 
that index directory would ever exist but be completely empty -- the low
level Lucene APIs should create a segments file as soon as you start 
adding docs, even if you haven't closed the writer yet.



-Hoss

Re: How can i make a distribute search on Solr?

2007-09-20 Thread David Welton

 Maybe I got this wrong...but isn't this what mapreduce is meant to deal with?
 eg,

 1) get the job (a query)
 2) map it to workers ( servers that provide search results from their own
 indexing)
 3) wait for the results from all workers that reply within acceptable 
 timeframe.
 4) comb through the lot of  results from all workers, reduce them according to
 your own biz rules (eg, remove dupes, sort them by quality / priority... here 
 possibly relying on the original parameters of the query in 1)
 5) return the reduced results to the frontend.

That seems to be how Sphinx works:

http://www.sphinxsearch.com/doc.html#distributed

Of course, the details of this are far over my head for either system,
so I don't really know if that's a sensible way of doing things or
not.
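
The five steps quoted at the top of this message amount to a scatter-gather pattern. A minimal sketch in Python, with simulated workers standing in for search servers: make_shard and scatter_gather are names invented here, the (doc_id, score) result shape is an assumption, and nothing below reflects how Solr or Sphinx actually implement distributed search.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def make_shard(results, delay=0.0):
    """Build a fake worker that returns fixed (doc_id, score) pairs."""
    def shard(query):
        time.sleep(delay)  # simulate a slow server
        return list(results)
    return shard


def scatter_gather(shards, query, timeout, limit=10):
    """Send query to every shard, keep answers that arrive in time, merge."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(shards)))
    futures = [pool.submit(shard, query) for shard in shards]  # step 2: map
    done, _late = wait(futures, timeout=timeout)               # step 3: wait
    pool.shutdown(wait=False)  # do not block on workers that were too slow

    best = {}
    for future in done:
        if future.exception() is not None:
            continue  # a failed worker contributes nothing
        for doc_id, score in future.result():
            # step 4: remove dupes, keeping the best score per document
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score

    # step 4 continued: sort by score, highest first (ties broken by id)
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]                                      # step 5
```

For example, scatter_gather([make_shard([("doc1", 0.9)]), make_shard([("doc1", 0.5), ("doc2", 0.7)])], "foo", timeout=1.0) merges the two answers into one ranked list with doc1 listed once.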

Ciao,
-- 
David N. Welton
http://www.welton.it/davidw/

Re: Strange behavior when searching with accents

2007-09-20 Thread Thorsten Scherler

On Thu, 2007-09-20 at 10:11 +0200, Thierry Collogne wrote:
 Hello,
 
 We are experiencing some strange behavior while searching with words
 containing accents.
 We are using two examples rené and matthé
 
 When we search for rené or for rene, we get the same results, so that is
 ok.
 But when we search for matthé or for matthe, we get two totally
 different results.
 
 Can someone tell me why this happens? We would like the results to be the
 same.

That highly depends on your schema. Do you use
<filter class="solr.ISOLatin1AccentFilterFactory"/>?

I am using the following and it works like a charm:
<fieldType name="stringSimilar" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"
            ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
  </analyzer>
  <analyzer type="query">
    <!--<tokenizer class="solr.LowerCaseTokenizerFactory"/>-->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"
            ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
  </analyzer>
</fieldType>

HTH

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions

Re: Strange behavior when searching with accents

2007-09-20 Thread Bertrand Delacretaz

On 9/20/07, Thierry Collogne [EMAIL PROTECTED] wrote:

 ..when we search for matthé or for matthe, we get two totally
 different results

The analyzer admin tool should help you find out what's happening, see
http://wiki.apache.org/solr/FAQ#head-b25df8c8393bbcca28f1f344c432975002e29ca9

-Bertrand

Re: Strange behavior when searching with accents

2007-09-20 Thread Thierry Collogne

We are using this schema definition

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory"
            synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
  </analyzer>
</fieldType>

I will take a look at the analyzer tool.

Thank you both for the quick response.

Re: Strange behavior when searching with accents

2007-09-20 Thread Thierry Collogne

I have entered the matthé term in the analyzer, but as far as I
understand, it should be ok. I have made some screenshots with the results.

http://farm2.static.flickr.com/1407/1412619772_0b697789cd_o.jpg

http://farm2.static.flickr.com/1245/1412619774_3351b287bc_o.jpg

I find it strange that the second screenshot doesn't give any matches.

Can someone take a look at them and perhaps clarify why it does not work?

Thank you.


Re: Strange behavior when searching with accents

2007-09-20 Thread Thorsten Scherler

On Thu, 2007-09-20 at 13:33 +0200, Thierry Collogne wrote:
 We are using this schema definition
 


Thierry, try to move the solr.ISOLatin1AccentFilterFactory up the filter
chain, like:

...
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ISOLatin1AccentFilterFactory"/>
...

for both indexing and query.

This way you make sure that all accents are gone before you do further
filtering.

You may need to reindex all documents to make sure you are not searching
against the old index.

HTH

salu2

-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions

Re: Strange behavior when searching with accents

2007-09-20 Thread Thorsten Scherler

On Thu, 2007-09-20 at 14:01 +0200, Thierry Collogne wrote:
 I have entered the matthé term in the analyzer, but as far as I
 understand, it should be ok. I have made some screenshots with the results.
 
 http://farm2.static.flickr.com/1407/1412619772_0b697789cd_o.jpg
 
 http://farm2.static.flickr.com/1245/1412619774_3351b287bc_o.jpg
 
 I find it strange that the second screenshot doesn't give any matches.
 
 Can someone take a look at them and perhaps clarify why it does not work?

See my other response, but in the 2nd screenshot the query field has been
changed to the unaccented form.

Also, you may want to use the verbose output option to analyze this better.

salu2

 
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions

Re: Strange behavior when searching with accents

2007-09-20 Thread Bertrand Delacretaz

On 9/20/07, Thierry Collogne [EMAIL PROTECTED] wrote:

 ...Thank you very much. Moving the
 <filter class="solr.ISOLatin1AccentFilterFactory"/> up in the chain fixed it

Yes, the problem was the EnglishPorterFilterFactory before the accents
removal: the stemmer doesn't know about accents, so no stemming
occurred on matthé whereas matthe was stemmed to matth.
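
The effect of filter order can be reproduced with a small Python sketch. The stemmer here is a toy: it implements a single rule modeled on step 5a of the Porter algorithm (dropping a final e) and nothing else, so it is not what EnglishPorterFilterFactory actually runs. All function names are invented for the example.

```python
import unicodedata

VOWELS = set("aeiou")


def is_consonant(word, i):
    """Porter's definition: not a, e, i, o, u, and not y after a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True  # note: an accented letter such as é lands here


def measure(stem):
    """Count vowel-to-consonant transitions (Porter's m)."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def ends_cvc(stem):
    """True if stem ends consonant-vowel-consonant, last one not w, x or y."""
    if len(stem) < 3:
        return False
    n = len(stem)
    return (is_consonant(stem, n - 3) and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")


def stem_final_e(word):
    """Toy stemmer: only the final-e rule, applied to plain ASCII 'e'."""
    if word.endswith("e"):
        stem = word[:-1]
        m = measure(stem)
        if m > 1 or (m == 1 and not ends_cvc(stem)):
            return stem
    return word


def strip_accents(text):
    """Drop combining marks after decomposition, so é becomes e."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def stem_then_strip(word):   # order in the original schema
    return strip_accents(stem_final_e(word.lower()))


def strip_then_stem(word):   # order after moving the accent filter up
    return stem_final_e(strip_accents(word.lower()))
```

With the stemmer first, matthé and matthe end up as different terms; with accent removal first, both take the same path. Under this one rule rene is left unchanged either way, which would be consistent with rené and rene matching from the start.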

BTW, your rené example makes me think you're indexing French; if
that's the case you might want to use a stemmer configured for that
language, for example

<filter
  class="solr.SnowballPorterFilterFactory"
  language="French"/>

-Bertrand

Re: Strange behavior when searching with accents

2007-09-20 Thread Thierry Collogne

Thorsten,

Thank you very much. Moving the
<filter class="solr.ISOLatin1AccentFilterFactory"/> up in the chain fixed it.


Re: Term extraction

2007-09-20 Thread Michael Kimsal

Not sure if this is in the same league or not, but Yahoo offers a term
extraction web service.

http://developer.yahoo.com/search/content/V1/termExtraction.html



On 9/20/07, Grant Ingersoll [EMAIL PROTECTED] wrote:

 You might investigate some tools like Alias-i's LingPipe or do some
 searches for phrase recognition software, etc.

 -Grant

 On Sep 19, 2007, at 9:58 PM, Pieter Berkel wrote:

  I'm currently looking at methods of term extraction and automatic keyword
  generation from indexed documents.  I've been experimenting with
  MoreLikeThis and values returned by the mlt.interestingTerms parameter and
  so far this approach has worked well.  However, I'd like to be able to
  analyze documents more intelligently to recognize phrase keywords such as
  "open source", "Microsoft Office", "Bill Gates" rather than splitting each
  word into separate tokens (the field is never used in search queries so
  matching is not an issue).  I've been looking at SynonymFilterFactory as a
  possible solution to this problem but haven't been able to work out the
  specifics of how to configure it for phrase mappings.
 
  Has anybody else dealt with this problem before or able to offer any
  insights into achieving the desired results?
 
  Thanks in advance,
  Pieter

 --
 Grant Ingersoll
 http://lucene.grantingersoll.com

 Lucene Helpful Hints:
 http://wiki.apache.org/lucene-java/BasicsOfPerformance
 http://wiki.apache.org/lucene-java/LuceneFAQ





-- 
Michael Kimsal
http://webdevradio.com

Re: Filter by Group

2007-09-20 Thread mark angelillo


Thanks, Pieter. I'll go for that then.

Mark

On Sep 19, 2007, at 10:15 PM, Pieter Berkel wrote:


Sounds like you're on the right track. If your groups overlap (i.e. a
document can be in group A and B), then you should ensure your groups
field is multivalued.

If you are searching for foo in documents contained in group A, then it
might be more efficient to use a filter query (fq) like:

q=foo&fq=groups:A

See the wiki page on common query parameters for more info:
http://wiki.apache.org/solr/CommonQueryParameters#head-6522ef80f22d0e50d2f12ec487758577506d6002
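
A small Python sketch of building such a request, covering both the single-group case and the "A, C and D" case from the original question. The host, port and /select path in SOLR_SELECT_URL are assumptions about a local install, the helper names are invented, and the groups field name is the one used in this thread. Combining values with OR inside parentheses is ordinary Lucene query syntax.

```python
from urllib.parse import urlencode

# Assumed location of the select handler; adjust to your own install.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def group_filter(groups):
    """Build the fq value restricting results to any of the given groups."""
    if not groups:
        raise ValueError("at least one group is required")
    if len(groups) == 1:
        return "groups:" + groups[0]
    return "groups:(" + " OR ".join(groups) + ")"


def build_query_url(text, groups):
    """Return a select URL with q for the search and fq for the group filter."""
    params = [("q", text), ("fq", group_filter(groups))]
    return SOLR_SELECT_URL + "?" + urlencode(params)
```

build_query_url("foo", ["A"]) produces the q=foo&fq=groups:A request above in URL-encoded form; passing ["A", "C", "D"] yields a single fq matching documents in any of the three groups.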


cheers,
Piete



On 20/09/2007, mark angelillo [EMAIL PROTECTED] wrote:


Hey all,

Let's say I have an index of one hundred documents, and these
documents are grouped into 4 groups A, B, C, and D. The groups do in
fact overlap. What would people recommend as the best way to apply a
search query and return only the documents that are in group A? Also,
how about if we run the same search query but return only those
documents in groups A, C and D?

I imagine that I could do this by indexing a text field populated
with the group names and adding something like groups:A to the
query but I'm wondering if there's a better solution.

Thanks in advance,
Mark

mark angelillo
snooth inc.
o: 646.723.4328
c: 484.437.9915
[EMAIL PROTECTED]
snooth -- 1.7 million ratings and counting...





mark angelillo
snooth inc.
o: 646.723.4328
c: 484.437.9915
[EMAIL PROTECTED]
snooth -- 1.7 million ratings and counting...

Re: Strange behavior when searching with accents

2007-09-20 Thread Thierry Collogne

We are indexing both french and dutch. I will take a look at
SnowballPorterFilterFactory later, but thanks for the advice.

Re: How can i make a distribute search on Solr?

2007-09-20 Thread Yonik Seeley

On 9/19/07, Norberto Meijome [EMAIL PROTECTED] wrote:
 Maybe I got this wrong...but isn't this what mapreduce is meant to deal with?

Not really... you could force a *lot* of different problems into
map-reduce (that's sort of the point... being able to automatically
parallelize a lot of different problems).  It really isn't the best
fit though, and would end up being much slower than a custom job.

Then there is the issue that the way map-reduce is implemented (like
hadoop) is also tuned for longer running batch jobs on huge data
(temporary files are used, external sorts, initial input, final output
is via files, etc).  Check out the google map-reduce paper - they
don't use it for their search side either.


Things are already progressing in the distributed search area:
https://issues.apache.org/jira/browse/SOLR-303
Hopefully I'll have time to dig into it more myself in a few weeks.

-Yonik

Re: Term extraction

2007-09-20 Thread Yonik Seeley

On 9/19/07, Pieter Berkel [EMAIL PROTECTED] wrote:
 However, I'd like to be able to
 analyze documents more intelligently to recognize phrase keywords such as
 open source, Microsoft Office, Bill Gates rather than splitting each
 word into separate tokens (the field is never used in search queries so
 matching is not an issue).  I've been looking at SynonymFilterFactory as a
 possible solution to this problem but haven't been able to work out the
 specifics of how to configure it for phrase mappings.

SynonymFilter works out-of-the-box with multi-token synonyms...

Microsoft Office => microsoft_office
Bill Gates, William Gates => bill_gates

Just don't use a word-delimiter filter if you use underscore to join words.
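
To make the idea concrete, here is a rough Python sketch of multi-token mapping over a token stream. It parses rules written in the "a b, c d => target" form and applies the longest match first. The function names, the case-insensitive matching and the longest-match strategy are choices made for this illustration and are not taken from the SynonymFilter source.

```python
def parse_rules(lines):
    """Parse 'phrase, phrase => target' lines into {token tuple: target}."""
    rules = {}
    for line in lines:
        left, right = line.split("=>")
        for phrase in left.split(","):
            key = tuple(phrase.lower().split())
            rules[key] = right.strip()
    return rules


def apply_rules(tokens, rules):
    """Replace each matching token sequence with its single target token."""
    max_len = max((len(key) for key in rules), default=0)
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in rules:
                out.append(rules[key])
                i += n
                break
        else:
            out.append(tokens[i])  # no rule matched at this position
            i += 1
    return out
```

Because the joined tokens contain an underscore, a later filter that splits on punctuation would break them apart again, which is the reason for the word-delimiter warning above.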

-Yonik

Re: Strange behavior when searching with accents

2007-09-20 Thread Thorsten Scherler

On Thu, 2007-09-20 at 15:27 +0200, Bertrand Delacretaz wrote:
 On 9/20/07, Thierry Collogne [EMAIL PROTECTED] wrote:
 
  ...Thank you very much. Moving the
  <filter class="solr.ISOLatin1AccentFilterFactory"/> up in the chain fixed it
 
 Yes, the problem was the EnglishPorterFilterFactory before the accents
 removal: the stemmer doesn't know about accents, so no stemming
 occurred on matthé whereas matthe was stemmed to matth.
 
 BTW, your rené example makes me think you're indexing French; if
 that's the case you might want to use a stemmer configured for that
 language, for example
 
 <filter
   class="solr.SnowballPorterFilterFactory"
   language="French"/>

Bertrand, does the French Snowball work fine?

A colleague of mine exchanged mails with Porter about the Spanish filter
and he came to the conclusion that it is not really working well for
Spanish:

So -orio on the whole changes meaning too much (acceso = access,
accessorio = accessory differ as much in Spanish as English); -atorio
similarly (aclarar to rinse, clear (in a very general sense), brighten
up; aclaratorio = explanatory).

Diminutives, augmentatives usually fall under (a) and (c). -illo, -ote,
-isimo are in this category. 

-al and -iz look like plausible candidates for ending removal, but,
unlike their English counterparts, removing them makes little difference
or improvement. Similarly with -ion removal after -s. 

There is a difficulty with pure vowel endings, and the stemmer can't
always get this right. So in English 'academic' is stemmed to 'academ'
but 'academy' does not lose the final -y (or -i). This explains the
residual vowels with -io, -ia endings etc.

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions

Re: How can i make a distribute search on Solr?

2007-09-20 Thread Norberto Meijome

On Thu, 20 Sep 2007 09:58:17 +0200
David Welton [EMAIL PROTECTED] wrote:

 That seems to be how Sphinx works:
 
 http://www.sphinxsearch.com/doc.html#distributed
 
 Of course, the details of this are far over my head for either system,
 so I don't really know if that's a sensible way of doing things or
 not.

thanks for the pointer. it does seem that it's pretty much what I had in
mind... but it doesn't seem to be based on Lucene (which I particularly like,
especially for the community...) ...

cheers,

_
{Beto|Norberto|Numard} Meijome

The freethinking of one age is the common sense of the next.
   Matthew Arnold

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.

Re: How can i make a distribute search on Solr?

2007-09-20 Thread Norberto Meijome

On Thu, 20 Sep 2007 09:53:46 -0400
Yonik Seeley [EMAIL PROTECTED] wrote:

 On 9/19/07, Norberto Meijome [EMAIL PROTECTED] wrote:
  Maybe I got this wrong...but isn't this what mapreduce is meant to deal 
  with?
 
 Not really... you could force a *lot* of different problems into
 map-reduce (that's sort of the point... being able to automatically
 parallelize a lot of different problems).  It really isn't the best
 fit though, and would end up being much slower than a custom job.

good point.. i wondered whether the whole sorting/whatever wasn't going to make
it far slower than something custom. I don't care about mapreduce in particular,
only about the effect: n indexers / searchers all fulfilling their part of the
overall search results.

 Then there is the issue that the way map-reduce is implemented (like
 hadoop) is also tuned for longer running batch jobs on huge data
 (temporary files are used, external sorts, initial input, final output
 is via files, etc).  

I see, didn't know this.

 Check out the google map-reduce paper - they
 don't use it for their search side either.

yeah, need to  :) 

 Things are already progressing in the distributed search area:
 https://issues.apache.org/jira/browse/SOLR-303
 Hopefully I'll have time to dig into it more myself in a few weeks.

excellent , thanks 
_
{Beto|Norberto|Numard} Meijome

He uses statistics as a drunken man uses lamp-posts ... for support rather
than illumination. Andrew Lang (1844-1912)

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.

RE: Index/Update Problems with Solrj/Tomcat and Larger Files

2007-09-20 Thread Daley, Kristopher M.

I am running against 1.2.  Where would I get the 1.3-dev version?  

I will try different versions of Tomcat and/or Jetty.  Thanks for all
your suggestions, I'll let you know.

-----Original Message-----
From: Ryan McKinley [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 19, 2007 8:30 PM
To: solr-user@lucene.apache.org
Subject: Re: Index/Update Problems with Solrj/Tomcat and Larger Files

 However, if I go to the tomcat server and restart it after I have issued
 the process command, the program returns and the documents are all
 posted correctly!

 Very strange behavior... am I somehow not closing the connection
 properly?

What version is the solr you are connecting to? 1.2 or 1.3-dev?  (I have
not tested against 1.2)

Does this only happen with tomcat?  If you run with jetty do you get the
same behavior?  (again, just stabs in the dark)

If you can make a small repeatable problem, post it in JIRA and I'll 
look into it.

ryan

Re: Strange behavior when searching with accents

2007-09-20 Thread Bertrand Delacretaz

On 9/20/07, Thorsten Scherler [EMAIL PROTECTED] wrote:
 ...Betrand, does the French Snowball work fine?...

I've seen some weirdnesses, like tennis and tenir (means to hold)
both stemmed to ten, but in all of our (simple) tests it was ok.

The application where we're using it does not require high precision
though, so it looked good enough and we didn't create very
extensive tests for it.

-Bertrand

Solr and FieldCache

2007-09-20 Thread Walter Ferrara

I have an index with several fields, but just one stored: ID (string,
unique).
I need to access that ID field for each of the top "nodes" docs in my
results (this is done inside a handler I wrote). The code looks like:

 Hits hits = searcher.search(query);
 for (int i = 0; i < nodes; i++) {
     id[i] = hits.doc(i).get("ID");
     score[i] = hits.score(i);
 }

I noticed that retrieving the ID this way is slow.

If I use the FieldCache, like:
 id[i] = FieldCache.DEFAULT.getStrings(searcher.getReader(), "ID")[hits.id(i)];
after the first execution (the initialization of the cache takes some
time), it seems to run much faster.

But what happens when SOLR reloads the index (after a commit or an
optimize, for example)?
Will it refresh the cache with the new reader (in the warmup process?), or
will it be the first query execution of that code (with the new reader)
that forces the refresh? (This could mean that every first query
after a reload will be slower.)
Is there any way to tell SOLR to cache and warm up this ID field when
needed?
 
Thanks,
Walter

Re: Solr and FieldCache

2007-09-20 Thread J.J. Larrea

At 5:30 PM +0200 9/20/07, Walter Ferrara wrote:
> I have an index with several fields, but just one stored: ID (string,
> unique).
> I need to access that ID field for each of the top nodes docs in my
> results (this is done inside a handler I wrote), code looks like:
>
>  Hits hits = searcher.search(query);
>  for (int i = 0; i < nodes; i++) {
>    id[i] = hits.doc(i).get(ID);
>    score[i] = hits.score(i);
>  }
>
> I noticed that retrieving the code is slow.
>
> if I use the FieldCache, like:
> id[i]=FieldCache.DEFAULT.getStrings(searcher.getReader(),
> ID)[hits.id(i)];

I assume you're putting FieldCache.DEFAULT.getStrings(searcher.getReader(),
ID) in an array outside the loop, saving 2 redundant method calls per 
iteration.
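
A minimal sketch of that hoisting, for illustration only: the class name and
the getStrings() stub below are hypothetical stand-ins for the FieldCache
call, not Lucene's API. The stub counts how often the array is requested, so
the effect of fetching it once before the loop is visible.

```java
public class HoistLookup {
    // Hypothetical stand-in for FieldCache.DEFAULT.getStrings(reader, ID):
    // counts how often the (potentially expensive) lookup is requested.
    static int lookups = 0;

    static String[] getStrings() {
        lookups++;
        return new String[] {"doc-a", "doc-b", "doc-c", "doc-d"};
    }

    // Fetches the array once, then indexes into it by document number.
    static String[] collectIds(int[] docIds) {
        String[] all = getStrings();              // hoisted out of the loop
        String[] ids = new String[docIds.length];
        for (int i = 0; i < docIds.length; i++) {
            ids[i] = all[docIds[i]];
        }
        return ids;
    }

    public static void main(String[] args) {
        String[] ids = collectIds(new int[] {2, 0, 3});
        System.out.println(String.join(",", ids) + " lookups=" + lookups);
    }
}
```

However many hits are processed, the lookup counter only goes up by one per
call to collectIds().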

> after the first execution (the initialization of the cache takes some
> time), it seems to run much faster.

Do note that FieldCache.DEFAULT is caching the indexed values, not the stored 
values.  Since your field is an ID you are probably indexing it in such a way 
that both are identical, e.g. with KeywordTokenizer, so you're not seeing a 
difference.

> But what happens when SOLR reloads the index (after a commit, or an
> optimize for example)?
> Will it refresh the cache with new reader (in the warmup process?), or
> it will be the first query execution of that code (with the new reader)
> that will force the refresh? (this could mean that every first query
> after a reload will be slower)

It is refreshed by Lucene the first time the FieldCache array is requested from 
the new IndexReader.

> Is there any way to tell SOLR to cache and warmup when needed this ID
> field?

Absolutely, just put a warmup query in solrconfig.xml which makes a request
that invokes FieldCache.DEFAULT.getStrings on that field.

Simplest would probably be to invoke your custom handler, perhaps passing 
arguments that limit it to only processing one document to limit the data which 
gets cached; since getStrings returns the entire array, one pass through your 
loop is fine.

If that's not easy with your handler, you could achieve the same effect by
setting up a handler which facets on the ID field, sorts by ID
(facet.sort=false), and only asks for a single value (facet.limit=1). The
entire id[docid] array will get scanned to count references to that ID, but
that ensures it gets paged in.

- J.J.

Re: Solr and FieldCache

2007-09-20 Thread Walter Ferrara

About the stored/indexed difference: ID is a string (= solr.StrField), so
FieldCache gives me what I need.

I'm just wondering, as this cached object could be (theoretically)
pretty big, do I need to worry about OOM errors? I know that FieldCache
uses weak maps, so I presume the cached array for the older reader(s) will
be gc-ed when the reader is no longer referenced (i.e. when Solr loads
the new one, after its warmup and so on), is that right?

Thanks
--


Faceting question

2007-09-20 Thread Cric Digs

I've been struggling with this a bit so here goes:

I'm using faceting to get some results. I also want to get another field -
the id field along with it. Is it possible to get that somehow in the facet
results?

Thanks!

Re: Solr and FieldCache

2007-09-20 Thread Yonik Seeley

On 9/20/07, Walter Ferrara [EMAIL PROTECTED] wrote:
> I'm just wondering, as this cached object could be (theoretically)
> pretty big, do I need to be aware of some OOM? I know that FieldCache
> uses weak maps, so I presume the cached array for the older reader(s) will
> be gc-ed when the reader is no longer referenced (i.e. when solr loads
> the new one, after its warmup and so on), is that right?

Right.  You will need room for two entries (one for the current
searcher and one for the warming searcher).
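
As a rough illustration of why two entries can be live at once, here is a
minimal sketch that uses a plain java.util.WeakHashMap keyed by reader. It is
not Lucene's FieldCache implementation; the Reader class, the getStrings()
method and the generated values are all hypothetical.

```java
import java.util.Map;
import java.util.WeakHashMap;

public class PerReaderCache {
    // Hypothetical stand-in for an IndexReader; only its identity matters here.
    static final class Reader {
        final String name;
        Reader(String name) { this.name = name; }
    }

    // Weak keys: an entry becomes collectable once its reader is unreachable.
    private final Map<Reader, String[]> cache = new WeakHashMap<>();
    int loads = 0;

    // Returns the cached array for this reader, building it on first request.
    String[] getStrings(Reader reader) {
        String[] values = cache.get(reader);
        if (values == null) {
            loads++;
            values = new String[] {reader.name + "-id0", reader.name + "-id1"};
            cache.put(reader, values);
        }
        return values;
    }

    int entries() { return cache.size(); }

    public static void main(String[] args) {
        PerReaderCache fc = new PerReaderCache();
        Reader current = new Reader("current");
        Reader warming = new Reader("warming");
        fc.getStrings(current);
        fc.getStrings(warming);   // warm-up request populates the new entry
        fc.getStrings(warming);   // later requests are cache hits
        System.out.println("entries=" + fc.entries() + " loads=" + fc.loads
            + " readers=" + current.name + "," + warming.name);
    }
}
```

While both readers are still referenced, the map holds one array per reader.
Once the old reader is dropped, its entry is eligible for garbage collection.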

-Yonik

Re: rsync start and enable for multiple solr instances within one tomcat

2007-09-20 Thread Yu-Hui Jin

OK, I should correct myself. For #1, I think we need to

1) configure a different port for each solr home dir (since they run on the
same host);
2) run the rsyncd-start script under each solr home's bin dir.
(btw, just to be clear, my understanding is that we should run rsyncd-start
after rsyncd-enable.)


Can someone confirm my understanding? Does my question #3 point to a
hard-coded "solr" that shouldn't be there?


Thanks,

-Hui



On 9/19/07, Yu-Hui Jin [EMAIL PROTECTED] wrote:
>
> Hi, there,
>
> So we are using Tomcat's JNDI method to set up multiple solr instances
> within a tomcat server. Each instance has a solr home directory.
>
> Now we want to set up collection distribution for all these solr home
> indexes. My understanding is:
>
> 1.  we only need to run rsyncd-start once, using the script under any of
> the solr home dirs.
> 2.  we need to run each of the rsyncd-enable scripts under the solr home's
> bin dirs.
> 3.  the wiki page at
> http://wiki.apache.org/solr/SolrCollectionDistributionScripts  keeps
> referring to solr/xxx. Is this solr the example solr home dir?  If so,
> would it be hard-coded in any of the scripts?  For example, I saw in
> snappuller line 226 (solr 1.2):
>
> ${stats} rsync://${master_host}:${rsyncd_port}/solr/${name}/
> ${data_dir}/${name}-wip
>
> Is the above solr a hard-coded solr home name? If so, it's not desirable
> since we have multiple solr homes with different names.  If not, what is
> this solr?
>
>
> thanks,
>
> -Hui




-- 
Regards,

-Hui

Re: rsync start and enable for multiple solr instances within one tomcat

2007-09-20 Thread Chris Hostetter


: 1) config different port for each solr home dir (since they run on the same
: host);

you mean a different rsync port, right? ... yes, the scripts as distributed
assume that each rsync daemon will be dedicated to a single solr
instance .. the idea being that even if you have 12 Solr instances
running on one servlet container port, you have 12 separate rsync ports so
you can start/stop and enable/disable them independently when doing index
rebuilds, etc...

: 2) run rsync-start script under each of the solr home's bin dir.
: (btw, just to make clear, we should run rsync-start after rsync-enable that
: I understand.)

correct, rsyncd-enable just sets the flag file so that rsyncd-start will
function ... the idea being that you can install rsyncd-start in such a
way that it will run whenever your port is started up, or whenever your box
is booted, but disable it from happening without removing the script from
those places.

: Can someone confirm my understanding? Does the #3 question suggests a
: hard-coded solr that shouldn't be?

solr/conf, solr/bin, solr/data, solr/logs ... all assume your solr home
directory is named solr/, but that's not a requirement.  It's a pretty
pervasive documentation shortcut that could be changed if someone wanted
to be systematic about it, but I don't think it's all that bad since
that's a decent common case



-Hoss

Re: Solr and FieldCache

2007-09-20 Thread Yonik Seeley

On 9/20/07, Walter Ferrara [EMAIL PROTECTED] wrote:
> I have an index with several fields, but just one stored: ID (string,
> unique).
> I need to access that ID field for each of the top nodes docs in my
> results (this is done inside a handler I wrote), code looks like:
>
>  Hits hits = searcher.search(query);
>  for (int i = 0; i < nodes; i++) {
>    id[i] = hits.doc(i).get(ID);
>    score[i] = hits.score(i);
>  }

What is the higher level use-case you are trying to address that makes
it necessary to write a plugin?

-Yonik

Re: rsync start and enable for multiple solr instances within one tomcat

2007-09-20 Thread Yu-Hui Jin

Thanks, Hoss.

For the last question, yes, I understand now that the docs are referring to
whatever we have named our solr home.  However, the last part of my question
still stands: it looks suspicious that the "solr" string is coded directly
in the script (in other cases the scripts usually use ${solr_root} to get to
specific dirs).  I've pasted the line again below:

I saw in snappuller line 226 (solr 1.2):

${stats} rsync://${master_host}:${rsyncd_port}/solr/${name}/
${data_dir}/${name}-wip

Is the above "solr" a hard-coded solr home name? If so, it's not desirable
since we have multiple solr homes with different names.  If not, what is
this "solr"?



Thanks,
-Hui








-- 
Regards,

-Hui

Re: Faceting question

2007-09-20 Thread Chris Hostetter


: I'm using faceting to get some results. I also want to get another field -
: the id field along with it. Is it possible to get that somehow in the facet
: results?

you're going to have to elaborate on what it is you are trying to do ... i 
genuinely have no idea what you are asking (and i think i'm usually pretty 
good at reading between the lines and guessing what people mean).



-Hoss

RE: Faceting question

2007-09-20 Thread Binkley, Peter

You mean, when it says that facet term "foo" has 10 documents, you want
those 10 ids? I think that will require a further query from your
application.

Peter 


Re: rsync start and enable for multiple solr instances within one tomcat

2007-09-20 Thread Yu-Hui Jin

OK, Hoss. I think I'll believe you, since nobody has raised any issue running
the script.  I'm about to try it out shortly with different solr home names.

So, just for my own understanding, where does this virtual "solr" path get
set? Is it in some config file or something?


thanks,

-Hui



On 9/20/07, Chris Hostetter [EMAIL PROTECTED] wrote:
>
> : home we have named.  However, there's still the last part of my question
> : that feels suspicious why the solr string is directly coded in the script
> : (unlike other cases they usually use ${solr_root} to get to specific dirs.
> : )   I pasted this line again below:
>
> sorry ... i didn't realize you were talking about the script, i thought
> you were talking about the docs.
>
> : I saw in snappuller line 226 (solr 1.2):
> :
> : ${stats} rsync://${master_host}:${rsyncd_port}/solr/${name}/
> : ${data_dir}/${name}-wip
> :
> : Is the above solr a hard-coded solr home name? If so, it's not desirable
>
> I'm not 100% positive, but I believe that is just an arbitrary virtual
> path relative to the root of the rsyncd server ... it could be anything, as
> long as snappuller and the rsyncd agree on what it is, so it's hardcoded
> to be solr.
>
> If we used ${solr_root} then the slaves and the master would have to use
> the exact same solr home directory.
>
>
> -Hoss




-- 
Regards,

-Hui

Re: rsync start and enable for multiple solr instances within one tomcat

2007-09-20 Thread Chris Hostetter


: So just to help my knowledge, where does this virtual setting of this solr
: string happen? Should it be in some config file or sth?

rsyncd-start creates an rsync config file on the fly ... much of it is 
constants, but it fills in the rsync port using a variable from your 
config.




-Hoss

Re: rsync start and enable for multiple solr instances within one tomcat

2007-09-20 Thread Bill Au

The "solr" that you are referring to in your third question is the
name of the rsync area which is mapped to the solr data directory.  This
is defined in the rsyncd configuration file which is generated on the
fly, as Chris has pointed out.  Take a look at rsyncd-start.

snappuller rsyncs the index from this 'solr' area (the command you have
quoted) on the master.  The name of the rsync area has nothing to do
with the name of the index.  We set up this area for rsyncd so that
one is restricted to this area when trying to access files on the
master through rsyncd.

The name of the rsyncd area does not have to be 'solr'.  It can be
anything as long as the value in rsyncd-start matches the value in
snappuller.
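
To make the mapping concrete, here is a small sketch of how the first path
segment of an rsync:// URL selects an area (rsync calls it a "module") rather
than a directory. This is only an illustration and not code from the Solr
scripts: the host name, port, data directory and snapshot name are assumed
values, and the map stands in for the generated rsyncd configuration.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class RsyncModule {
    // The map plays the role of one daemon's rsyncd config:
    // area (module) name -> directory on the master. Contents are hypothetical.
    static String resolve(Map<String, String> modules, String url) {
        URI uri = URI.create(url);
        // Path looks like /<module>/<rest>; the first segment is the module name.
        String[] parts = uri.getPath().split("/", 3);
        String module = parts.length > 1 ? parts[1] : "";
        String dir = modules.get(module);
        if (dir == null) {
            throw new IllegalArgumentException("unknown module: " + module);
        }
        return dir + "/" + (parts.length > 2 ? parts[2] : "");
    }

    public static void main(String[] args) {
        Map<String, String> modules = new HashMap<>();
        modules.put("solr", "/var/SolrHomeA/data");   // assumed data directory
        System.out.println(resolve(modules,
            "rsync://master.example.com:18983/solr/snapshot.20070920120000/"));
    }
}
```

Because each solr home runs its own rsync daemon on its own port, every
daemon can use the same area name while pointing at a different data
directory.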

Bill


Re: a bug in commit script?

2007-09-20 Thread Bill Au

That would be my bad.  I noticed the problem while fixing SOLR-282,
which is not related.  I fixed both problems instead of opening a
different bug for the response format issue.  I will update the change
log.
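
For anyone checking the response by hand, here is a sketch of a status check
that accepts either response format. It is not the fix that went into the
commit script: both patterns are assumptions, reconstructed from the garbled
log output and grep line quoted earlier in this thread, and the class name is
made up.

```java
import java.util.regex.Pattern;

public class CommitStatus {
    // Assumed shape of the Solr 1.2 response header:
    // <lst name="responseHeader"><int name="status">0</int>...</lst>
    private static final Pattern NEW_OK =
        Pattern.compile("<int name=\"status\">0</int>");

    // Assumed shape of the older response: a <result ... status="0"> element.
    private static final Pattern OLD_OK =
        Pattern.compile("<result.*status=\"0\"");

    // True if the response body reports status 0 in either format.
    static boolean succeeded(String response) {
        return NEW_OK.matcher(response).find() || OLD_OK.matcher(response).find();
    }

    public static void main(String[] args) {
        String response = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<response><lst name=\"responseHeader\">"
            + "<int name=\"status\">0</int><int name=\"QTime\">47</int>"
            + "</lst></response>";
        System.out.println(succeeded(response) ? "commit ok" : "commit failed");
    }
}
```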

Bill

On 9/20/07, Chris Hostetter [EMAIL PROTECTED] wrote:
> :
> : It seems there's a small bug in the bin/commit script for solr 1.2.
>
> A fix was already committed to the trunk for this as part of SOLR-282 (but
> there doesn't seem to be a note about it in the changelog)
>
>
> -Hoss

clarification needed for the Ranking score

2007-09-20 Thread Dilip.TS

Hi,
I need a clarification regarding Solr ranking.


Consider the scenario of searching for courses, ranked by the following
relevance:

a.  Courses with the term in the courseTitle, courseTag and
courseDescription appear first.
b.  Courses with the term in the courseTitle and the courseDescription
appear next.
c.  Courses with the term only in the courseTitle appear next.
d.  Courses with the term only in the courseDescription appear next.
e.  Courses with the term only in the courseTag appear last.

Let me know if my understanding is correct with the following solution:

 + (basequery) courseTitle^1 courseTag^1000 courseDescription^100;
courseTitle asc,  courseDescription asc,courseTag asc;

How do we set the relevancy while performing a search? Is there any
configuration for it in the solrconfig files?
Also, how do we set the term proximity?
Could you clarify?

Thanks in advance


Regards,
Dilip TS
