Re: Using Chinese / How to ?

2009-06-03 Thread James liu
1: modify your schema.xml, e.g.:

<fieldType name="text_cn" class="solr.TextField">
  <analyzer class="chineseAnalyzer"/>
</fieldType>

2: add your field:

<field name="yourfield" type="text_cn" indexed="true" stored="true"/>

3: add your analyzer jar to {solr_dir}\lib\

4: rebuild Solr and you will find the new build in {solr_dir}\dist

5: follow the tutorial to set up Solr

6: open the Solr admin page in your browser and use the analysis page to check
your analyzer; it will show you how each word is analyzed and which analyzer is used


-- 
regards
j.L ( I live in Shanghai, China)


Re: Dismax handler phrase matching question

2009-06-03 Thread Shalin Shekhar Mangar
On Wed, Jun 3, 2009 at 1:59 AM, anuvenk anuvenkat...@hotmail.com wrote:


 I have to search over multiple fields so passing everything in the 'q'
 might
 not be neat. Can something be done with the facet.query to accomplish this.
 I'm using the facet parameters. I'm not familiar with java so not sure if a
 function query could be used to accomplish this. Any other thoughts?


I don't think facet.query and function queries have anything to do with
this. Using the dismax params seems to be the right way.

-- 
Regards,
Shalin Shekhar Mangar.


Re: Phrase query search returns no result

2009-06-03 Thread SergeyG

Yes, Erick, I did. Actually the course of events was as follows. I started
with the example config files (solrconfig.xml & schema.xml) and added my own
fields. In my search I have 2 clauses: one for a phrase and one for a set of
keywords. And from the very beginning it worked fine. Until, on the second
day, one phrase ("It was as long as a tree") gave me back the wrong response.
Trying to find the reason, I started changing different parameters one by one
(field types - from text to string and back, copyFields, analyzers, etc.).
The result: I came to a situation where all the queries returned only
wrong responses. During my research I deleted all the indexed xml files
several times, which, in theory, should have cleaned up the index itself (as I
understand it). And then I decided to start all over again. The only two
differences from the very beginning were that I turned the StopWordsFilter
off (although I did that several times while playing with params; besides, the
phrase that initially caused trouble doesn't consist only of stop
words) and also that I commented out the copyField declarations for my own fields.
I'm still wondering what happened.

Thank you,
Sergey


Erick Erickson wrote:
 
 Did you by any chance change your schema? Rename a field? Change your
 analyzers? etc? between the time you originally
 generated your index and blowing it away?
 
 I'm wondering if blowing away your index and regenerating just
 caused any changes in how you index/search to get picked
 up...
 
 Best
 Erick
 
 On Tue, Jun 2, 2009 at 3:28 PM, SergeyG sgoldb...@mail.ru wrote:
 

 Hmmm... It looks a bit like magic. After 3 days of experimenting with various
 parameters and getting only wrong results, I deleted all the indexed data
 and left the minimum set of parameters: qs=default (I omitted it),
 StopWords=off (StopWordsFilter was commented out), no copyFields,
 requestHandler=standard. And guess what - it started producing the
 expected
 results! :) So for me the question remains: what was the cause of all the
 previous trouble?
 Anyway, thanks for the discussion.


 SergeyG wrote:
 
  Actually, my "phrase here"~0 (for an exact match) didn't work. I tried,
  just as an experiment, to put qs=100.
 
  Otis Gospodnetic wrote:
 
 
   And "your phrase here"~100 works?
 
   Otis
  --
  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 
 
  - Original Message 
  From: SergeyG sgoldb...@mail.ru
  To: solr-user@lucene.apache.org
  Sent: Tuesday, June 2, 2009 11:17:23 AM
  Subject: Re: Phrase query search returns no result
 
 
  Thanks, Otis.
 
  Checking for the stop words was the first thing I did after getting
 the
  empty result. Not all of those words are in the stopwords.txt file.
 Then
  just for experimenting purposes I commented out the StopWordsAnalyser
  during
  indexing and reindexed. But the phrase was not found again.
 
  Sergey
 
 
  Otis Gospodnetic wrote:
  
  
    Your stopwords were removed during indexing, so if all those terms were
    stopwords, and they likely were, none of them exist in the index now. You
    can double-check that with Luke. You need to remove stopwords from the
    index-time analyzer, too, and then reindex.
  
Otis
   --
   Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
  
  
  
   - Original Message 
   From: SergeyG
   To: solr-user@lucene.apache.org
   Sent: Tuesday, June 2, 2009 9:57:17 AM
   Subject: Phrase query search returns no result
  
  
   Hi,
  
    I'm trying to implement a full-text search but can't get the right result
    with a phrase query search. The field I search through was indexed as a
    text field. The phrase was "It was as long as a tree". During both
    indexing and searching the StopWordsFilter was on. For the search I used
    these settings:
  
  
  <str name="defType">dismax</str>
  <str name="echoParams">explicit</str>
  <str name="qf">title author category content</str>
  <str name="fl">id,title,author,isbn,category,content,score</str>
  <str name="qs">100</str>
  <str name="pf">content</str>
  
  
  
    But the returned docs list was empty. Using the Solr Admin console for
    debugging showed that parsedquery=+() ().
    Switching the StopwordsFilter off during searching didn't help either.
  
   Am I missing something?
  
   Thanks,
   Sergey
   --
   View this message in context:
  
 
 http://www.nabble.com/Phrase-query-search-returns-no-result-tp23833024p23833024.html
   Sent from the Solr - User mailing list archive at Nabble.com.
  
  
  
 
  --
  View this message in context:
 
 http://www.nabble.com/Phrase-query-search-returns-no-result-tp23833024p23834693.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 
 
 

 --
 View this message in context:
 http://www.nabble.com/Phrase-query-search-returns-no-result-tp23833024p23839134.html
 Sent from the Solr - User mailing list archive at Nabble.com.


 
 

-- 
View this message in context: 
http://www.nabble.com/Phrase-query-search-returns-no-result-tp23833024p23846362.html
Sent from the Solr - User mailing list archive at Nabble.com.



How contrib for solr memcache query cache

2009-06-03 Thread chenlbya

Hi all:

I want to contribute a memcache implementation of the Solr cache (only the
query result cache for now)

patch for solr 1.3: http://code.google.com/p/solr-side/issues/detail?id=1&can=1

solr-memcache.zip: http://solr-side.googlecode.com/files/solr-memcache.zip

=readme.txt=

MemcachedCache instead of the Solr queryResultCache (default LRUCache)
 

config in solrconfig.xml to use solr-memcache:

add newSearcher and firstSearcher listeners, such as:

<listener event="newSearcher" class="solr.MemcachedCache" />
<listener event="firstSearcher" class="solr.MemcachedCache" />

the listeners are used only to get the index version, which is needed to
build the memcached key

indexVersion is a static long field of MemcachedCache.java.

// originalKey is a QueryResultKey
memcached key = keyPrefix + indexVersion + "-" + originalKey.hashCode()
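The key scheme above can be sketched roughly as follows (a sketch with illustrative names, not the actual API of the patch):

```python
# Rough sketch of the memcached key scheme described in the readme: the
# key combines the configured prefix, the current index version, and the
# hash code of the original QueryResultKey. Names are illustrative.

def memcached_key(key_prefix, index_version, original_key):
    # original_key stands in for Solr's QueryResultKey; anything hashable works
    return "%s%d-%d" % (key_prefix, index_version, hash(original_key))

# e.g. memcached_key("shard-1-", 42, ("q=foo", "sort=score desc"))
```

Because the index version is part of the key, entries from an older index simply stop being looked up after a new searcher is opened, instead of having to be invalidated.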
 


<!--
MemcachedCache params:

memcachedHosts (required), comma-separated.
name (optional), no default.
expTime (optional), default 1800 s (= 30 minutes)
defaultPort (optional), default 11211
keyPrefix (optional), default ""
-->

<queryResultCache
    class="solr.MemcachedCache"
    memcachedHosts="192.168.0.100,192.168.0.101:1234,192.168.0.103"
    expTime="21600"
    defaultPort="11511"
    keyPrefix="shard-1-"/>
 

dep jars:

memcached-2.2.jar
spy-2.4.jar

solr-memcache.patch for solr 1.3:

download and unzip to d:/apache-solr-1.3.0

copy patch-build.xml and solr-memcache.patch to d:/apache-solr-1.3.0

D:\apache-solr-1.3.0>ant -f patch-build.xml -Dpatch.file=solr-memcache.patch
Buildfile: patch-build.xml

apply-patch:
[patch] patching file src/java/org/apache/solr/search/DocSet.java

BUILD SUCCESSFUL
Total time: 0 seconds

check that d:/apache-solr-1.3.0/contrib/solr-memcache exists; if it does not,
unzip solr-memcache.zip to that dir

and run dist:

D:\apache-solr-1.3.0>ant dist
...

look for D:\apache-solr-1.3.0\dist\apache-solr-memcache-1.3.0.jar
 


  ___
  Fun greeting cards are waiting for you to send - Yahoo! Mail greeting cards, newly launched!
http://card.mail.cn.yahoo.com/

Re: How contrib for solr memcache query cache

2009-06-03 Thread Noble Paul നോബിള്‍ नोब्ळ्
please raise this as an issue in Jira

https://issues.apache.org/jira/browse/SOLR

let us see what others think about this

On Wed, Jun 3, 2009 at 1:14 PM, chenl...@yahoo.com.cn wrote:

 Hi all:

 I want to contrib memcache implement solr cache (only test query result cache)

 [rest of quoted message trimmed]


--
-
Noble Paul | Principal Engineer| AOL | http://aol.com


Re: fq vs. q

2009-06-03 Thread Marc Sturlese

It's definitely not proper documentation but maybe it can give you a hand:
http://www.derivante.com/2009/04/27/100x-increase-in-solr-performance-and-throughput/
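As a rough mental model of the distinction being asked about (a toy sketch of the concept, not Solr internals): q produces scored, ranked hits, while each fq produces an unscored document set that is cached and intersected with the main result.

```python
# Toy model of the q-vs-fq distinction: the main query (q) yields scored
# hits; each filter query (fq) yields an unscored doc-id set that is
# cached and intersected with the main result. Purely illustrative.

def search(q_scores, filter_sets):
    """q_scores: doc id -> score; filter_sets: iterable of doc-id sets."""
    allowed = set(q_scores)
    for fs in filter_sets:   # each fq narrows the candidates...
        allowed &= fs        # ...without affecting scores
    # rank the surviving docs by the q score
    return sorted(allowed, key=lambda d: q_scores[d], reverse=True)

search({1: 0.9, 2: 0.5, 3: 0.7}, [{1, 3, 4}])  # docs 1 and 3 survive; 1 first
```

This is why exact-match criteria that repeat across many searches tend to belong in fq (reusable, cached, no scoring), while the terms that should influence ranking belong in q.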


Martin Davidsson-2 wrote:
 
 I've tried to read up on how to decide, when writing a query, what  
 criteria goes in the q parameter and what goes in the fq parameter, to  
 achieve optimal performance. Is there some documentation that  
 describes how each field is treated internally, or even better, some  
 kind of rule of thumb to help me decide how to split things up when  
 querying against one or more fields. In most cases, I'm looking for  
 exact matches but sometimes an occasional wildcard query shows up too.  
 Thank you!
 
 -- Martin
 
 

-- 
View this message in context: 
http://www.nabble.com/fq-vs.-q-tp23845282p23847845.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: fq vs. q

2009-06-03 Thread Anshuman Manur
wow! that was a good read!!!

On Wed, Jun 3, 2009 at 2:23 PM, Marc Sturlese marc.sturl...@gmail.comwrote:


 It's definitely not proper documentation but maybe can give you a hand:

 http://www.derivante.com/2009/04/27/100x-increase-in-solr-performance-and-throughput/


 Martin Davidsson-2 wrote:
 
  I've tried to read up on how to decide, when writing a query, what
  criteria goes in the q parameter and what goes in the fq parameter, to
  achieve optimal performance. Is there some documentation that
  describes how each field is treated internally, or even better, some
  kind of rule of thumb to help me decide how to split things up when
  querying against one or more fields. In most cases, I'm looking for
  exact matches but sometimes an occasional wildcard query shows up too.
  Thank you!
 
  -- Martin
 
 

 --
 View this message in context:
 http://www.nabble.com/fq-vs.-q-tp23845282p23847845.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: How contrib for solr memcache query cache

2009-06-03 Thread 林彬 陈
 
https://issues.apache.org/jira/browse/SOLR-1197


--- On Wed, 6/3/09, chenl...@yahoo.com.cn chenl...@yahoo.com.cn wrote:


From: chenl...@yahoo.com.cn chenl...@yahoo.com.cn
Subject: How contrib for solr memcache query cache
To: solr-user@lucene.apache.org
Date: Wednesday, June 3, 2009, 3:44 PM



Hi all:

I want to contrib memcache implement solr cache (only test query result cache)

[rest of quoted message trimmed]

indexing/crawling HTML + solr

2009-06-03 Thread Gena Batsyan

Hi!

to be short, where to start with the subject?

Any pointers to some [semi-]functional solutions that crawl the web as a
normal crawler, take care of html parsing, etc., and feed the crawled
stuff to Solr as documents via <add>?


regards!




Alphabetical index for faceting

2009-06-03 Thread Bertrand Mathieu
Hello,

My goal is to get an index for alphabetical faceting of titles. For this I'm
trying to define a fieldType meant to index first letter of text, with
stopwords removed. My problem is that without WordDelimiterFilterFactory
stopwords are not removed, and with it I end up with 2 tokens (and I'd like
to keep just the first one).

For example, the string "The Curse of Monkey Island" should be indexed as
"c".

Here is my field type definition as of now:

<fieldType name="alphabetical" class="solr.TextField"
           sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory" />
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_fr.txt"/>
    <filter class="solr.TrimFilterFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([0-9a-z]).*" replacement="$1" replace="all" />
  </analyzer>
</fieldType>

With my example it gives 3 tokens: "c", "m", "i".

I have not been able to find any documentation related to what I want to do
(wrong keywords in Google?). At this point I'm beginning to think that I
will have to write a custom filter to replace the
PatternReplaceFilterFactory: it would keep the first character of the first
token and discard everything else. Unfortunately I have not programmed in
Java for years, so I'd like to avoid that solution if possible.

And since I don't see my need as something uncommon, I am wondering what
I am missing. Any ideas?
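For what it's worth, the behaviour the custom filter would need can be sketched like this (the stopword list is illustrative, not stopwords_fr.txt; a real TokenFilter would operate on the single token emitted by KeywordTokenizerFactory):

```python
# Sketch of the desired indexing behaviour: drop stopwords from the
# title, then keep only the first letter of the first remaining word.
# The stopword set here is illustrative, not the real stopwords_fr.txt.

STOPWORDS = {"the", "a", "an", "of", "le", "la", "les"}

def facet_letter(title):
    words = [w for w in title.lower().split() if w not in STOPWORDS]
    return words[0][0] if words else ""

facet_letter("The Curse of Monkey Island")  # -> "c"
```

The key difference from the chain above is that stopword removal happens before the string is reduced to one letter, and only the first surviving word contributes a token.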

-- 
Bertrand Mathieu


Re: indexing/crawling HTML + solr

2009-06-03 Thread Otis Gospodnetic

Gena,

Besides droids (simpler, smaller components you can put together) there is also
Nutch, a bigger beast for large-scale crawling that can index crawled pages into
Solr - http://lucene.apache.org/nutch .

Otis


- Original Message 
 From: Gena Batsyan gbat...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Wednesday, June 3, 2009 6:09:36 AM
 Subject: indexing/crawling HTML + solr
 
 Hi!
 
 to be short, where to start with the subject?
 
 Any pointers to some [semi-]functional solutions that crawl the web as a
 normal crawler, take care of html parsing, etc., and feed the crawled
 stuff to Solr as documents via <add>?
 
 regards!



Re: How to avoid space on facet field

2009-06-03 Thread Bny Jo
Anshuman, thanks for your input. I will try that; I can understand what you
are suggesting.

Marc, I did not understand how your KeywordTokenizer approach works. Do I
have to define a separate field type like the ones in the example schema and
use that field? This is what I came up with:

<fieldType name="facet_tex" class="solr.TextField" sortMissingLast="true"
           omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
    <!-- The TrimFilter removes any leading or trailing whitespace -->
    <filter class="solr.TrimFilterFactory" />
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^a-z])" replacement="" replace="all" />
  </analyzer>
</fieldType>
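If it helps, here is a rough simulation of what that chain does to one value; since the KeywordTokenizer keeps the whole input as a single token, faceting sees one value per field:

```python
import re

# Rough simulation of the facet_tex analyzer chain above for one input.
# KeywordTokenizerFactory keeps the whole value as a single token, which
# is then lowercased, trimmed, and stripped of non-letters.

def facet_token(value):
    token = value                          # KeywordTokenizer: one token
    token = token.lower()                  # LowerCaseFilterFactory
    token = token.strip()                  # TrimFilterFactory
    token = re.sub(r"[^a-z]", "", token)   # PatternReplaceFilterFactory
    return token

facet_token("Dell, Inc.")  # one facet value: "dellinc"
```

Note the pattern filter here deletes spaces as well as punctuation, so "Dell, Inc." facets as the single value "dellinc" rather than splitting into "dell" and "inc".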



Thanks

Boney



From: Marc Sturlese marc.sturl...@gmail.com
To: solr-user@lucene.apache.org
Sent: Wednesday, June 3, 2009 3:45:49 AM
Subject: Re: How to avoid space on facet field


You can configure a facet_text instead of the normal text type. There you
use KeyWordTokenizer instead of StandardTokenizer. One of the advantages of
using it instead of string is that it will allow you to use synonyms,
stopwords and filters and all the properties from an analyzer.


Anshuman Manur wrote:
 
 Hey,
 
 From what you have written I'm guessing that in your schema.xml file, you
 have defined the field manu to be of type  text, which is good for
 keyword
 searches, as the text type indexes on whitespace, i.e. Dell Inc. is
 indexed
 as dell, inc. so keyword searches matches either dell or inc. But when you
 want to facet on a particular field, you want exact matches regardless of
 whitespace in between. In such cases it's a good idea to use the string
 type.
 Let me illustrate with an example based on my settings:
 
 Here are my fields:
 
    <!-- Core Fields -->
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="name" type="text" indexed="true" stored="true"/>
    <field name="manu" type="text" indexed="true" stored="true"/>
    <field name="sport" type="text" indexed="true" stored="true" />
    <field name="type" type="text" indexed="true" stored="true" />
    <field name="desc" type="text" indexed="true" stored="true" />
    <field name="ldesc" type="text" indexed="true" stored="true" />

    <!-- default text field for searching -->
    <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>

    <!-- exact string fields for faceting -->
    <field name="sport_exact" type="string" indexed="true" stored="false" />
    <field name="manu_exact" type="string" indexed="true" stored="false" />
    <field name="type_exact" type="string" indexed="true" stored="false" />

    <copyField source="manu" dest="text"/>
    <copyField source="name" dest="text"/>
    <copyField source="sport" dest="text"/>
    <copyField source="desc" dest="text"/>
    <copyField source="ldesc" dest="text"/>
    <copyField source="type" dest="text"/>

    <copyField source="manu" dest="manu_exact"/>
    <copyField source="sport" dest="sport_exact"/>
    <copyField source="type" dest="type_exact"/>
 
 So, when doing keyword searches I use the "text" field to search in all
 the fields, as I copyField all the other fields onto it.
 But for faceting I use the exact fields, which are of type string and
 don't split on whitespace.
 
 
 Anshu
 
 On Wed, Jun 3, 2009 at 1:50 AM, Bny Jo bny...@yahoo.com wrote:
 

 Hello,

 I am wondering why Solr is returning a manufacturer name field ("Dell,
 Inc") as "Dell" in one result and "Inc" in another. Is there a way to facet
 a field whose values contain spaces or delimiters?

 query.addFacetField("manu");
 query.setFacetMinCount(1);
 query.setIncludeScore(true);
 List<FacetField> facetFieldList = qr.getFacetFields();
 for (FacetField facetField : facetFieldList) {
     System.out.println(facetField.toString() + " Manufactures");
 }
 And it returns
 -
 [manu:[dell (5), inc (5), corp (1), sharp (1), sonic (1), view (1),
 viewson
 (1), vizo (1)]]




 
 

-- 
View this message in context: 
http://www.nabble.com/How-to-avoid-space-on-facet-field-tp23840037p23847742.html
Sent from the Solr - User mailing list archive at Nabble.com.


  

Re: How to avoid space on facet field

2009-06-03 Thread Marc Sturlese

Yeah, that's the point. Once you have this, you can use copyField as was
written above in the string example.

Bny Jo wrote:
 
 [quoted text trimmed]

-- 
View this message in context: 
http://www.nabble.com/How-to-avoid-space-on-facet-field-tp23840037p23850245.html
Sent from the Solr - User mailing list archive at Nabble.com.



Strange behaviour with copyField

2009-06-03 Thread James Grant
I've been hitting my head against a wall all morning trying to figure 
this out and haven't managed to get anywhere and wondered if anybody 
here can help.


I have defined a field type

   <fieldType name="text_au" class="solr.TextField"
              positionIncrementGap="100">
     <analyzer>
       <tokenizer class="solr.LowerCaseTokenizerFactory" />
     </analyzer>
   </fieldType>

I have two fields

<field name="au" type="text_au" indexed="true" stored="true"
       required="false" multiValued="true"/>
<field name="author" type="text_au" indexed="true" stored="false"
       multiValued="true"/>


and a copyField line

<copyField source="au" dest="author" />

The idea is to allow searching for authors, so a search for
author:(Hobbs A.U.) will match the au field value "Hobbs A. U."
(notice the space).


However, the query au:(Hobbs A.U.) matches and the query
author:(Hobbs A.U.) does not.


Any ideas?
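For reference, LowerCaseTokenizer emits maximal runs of letters, lowercased, so both forms should tokenize identically; here is a quick regex-based approximation of that behaviour (only roughly equivalent to Lucene's letter test):

```python
import re

# Approximation of LowerCaseTokenizer: emit maximal runs of letters,
# lowercased. Under this rule "Hobbs A.U." and "Hobbs A. U." produce the
# same tokens, which is why the au/author mismatch is surprising.

def lowercase_tokenize(text):
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

lowercase_tokenize("Hobbs A.U.")   # ['hobbs', 'a', 'u']
lowercase_tokenize("Hobbs A. U.")  # ['hobbs', 'a', 'u']
```

If both fields really used this tokenizer at both index and query time, the two queries would behave the same, so the admin analysis page is the natural place to check what the author field is actually doing.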

I'm using a Solr 1.4 snapshot

Regards

James




Re: Keyword Density

2009-06-03 Thread Alex Shevchenko
So, is there an ability to perform filtering as I described?

On Mon, Jun 1, 2009 at 22:24, Alex Shevchenko caeza...@gmail.com wrote:

  But I don't need to sort by this value. I need to cut results where
  this value (for a particular term of the query!) is not in some range.


 On Mon, Jun 1, 2009 at 22:20, Walter Underwood wunderw...@netflix.comwrote:

 That is the normal relevance scoring formula in Solr and Lucene.
 It is a bit fancier than that, but you don't have to do anything
 special to get that behavior.

 Solr also uses the inverse document frequency (rarity) of each
 word for weighting.

 Look up tf.idf for more info.

 wunder

 On 6/1/09 11:46 AM, Alex Shevchenko caeza...@gmail.com wrote:

   Something like that. Just not 'N times' but 'number of times foo
   appears / total number of words > some value'
 
  On Mon, Jun 1, 2009 at 21:00, Otis Gospodnetic
  otis_gospodne...@yahoo.comwrote:
 
 
  Hi Alex,
 
  Could you please provide an example of this?  Are you looking to do
  something like find all docs that match name:foo and where foo appears
  N
  times (in the name field) in the matching document?
 
   Otis
  --
  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 
 
  - Original Message 
  From: Alex Shevchenko caeza...@gmail.com
  To: solr-user@lucene.apache.org
  Sent: Monday, June 1, 2009 1:32:49 PM
  Subject: Re: Keyword Density
 
  HI All,
 
  Is there a way to perform filtering based on keyword density?
 
  Thanks
 
  --
  Alex Shevchenko





 --
 Alex Shevchenko




-- 
Alex Shevchenko


Re: NPE in dataimport.DebugLogger.peekStack (DIH Development Console)

2009-06-03 Thread Shalin Shekhar Mangar
This is fixed in trunk. The next nightly build will have this fix. Thanks!

On Tue, Jun 2, 2009 at 9:49 PM, Steffen B. s.baumg...@fhtw-berlin.dewrote:


 Glad to hear that it's not a problem with my setup.
 Thanks for taking care of it! :)


 Shalin Shekhar Mangar wrote:
 
  On Tue, Jun 2, 2009 at 8:06 PM, Steffen B.
  s.baumg...@fhtw-berlin.dewrote:
 
 
  I'm trying to debug my DI config on my Solr server and it constantly
  fails
  with a NullPointerException:
  Jun 2, 2009 4:20:46 PM org.apache.solr.handler.dataimport.DataImporter
  doFullImport
  SEVERE: Full Import failed
  java.lang.NullPointerException
 at
 
 
 org.apache.solr.handler.dataimport.DebugLogger.peekStack(DebugLogger.java:78)
 at
  org.apache.solr.handler.dataimport.DebugLogger.log(DebugLogger.java:98)
 at
  org.apache.solr.handler.dataimport.SolrWriter.log(SolrWriter.java:248)
 at...
 
  Running a normal full-import works just fine, but whenever I try to run
  the
  debugger, it gives me this error. I'm using the most recent Solr nightly
  build (2009-06-01) and the method in question is:
  private DebugInfo peekStack() {
 return debugStack.isEmpty() ? null : debugStack.peek();
  }
   I'm using a DI config that has been working fine for several previous
  builds, so that shouldn't be the problem... any ideas what the problem
  could
  be?
 
 
 
  A previous commit to change the EntityProcessor API broke this
  functionality. I'll open an issue and give a patch.
 
  --
  Regards,
  Shalin Shekhar Mangar.
 
 

 --
 View this message in context:
 http://www.nabble.com/NPE-in-dataimport.DebugLogger.peekStack-%28DIH-Development-Console%29-tp23833878p23835897.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Regards,
Shalin Shekhar Mangar.


Re: Strange behaviour with copyField

2009-06-03 Thread Otis Gospodnetic

James,

I don't see the error, but this is exactly what Solr Admin's analysis page will 
quickly help you with! :)

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: James Grant james.gr...@semantico.com
 To: solr-user@lucene.apache.org
 Sent: Wednesday, June 3, 2009 8:09:10 AM
 Subject: Strange behaviour with copyField
 
 I've been hitting my head against a wall all morning trying to figure this 
 out 
 and haven't managed to get anywhere and wondered if anybody here can help.
 
 I have defined a field type
 
   <fieldType name="text_au" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
       <tokenizer class="solr.LowerCaseTokenizerFactory" />
     </analyzer>
   </fieldType>
 
 I have two fields
 
   <field name="au" type="text_au" indexed="true" stored="true"
          required="false" multiValued="true"/>
   <field name="author" type="text_au" indexed="true" stored="false"
          multiValued="true"/>
 
 and a copyField line
 
   <copyField source="au" dest="author" />
 
 The idea is to allow searching for authors so a search for author:(Hobbs 
 A.U.) 
 will match the au field value Hobbs A. U. (notice the space).
 
 However the query au:(Hobbs A.U.) matches and the query author:(Hobbs
 A.U.) does not.
 
 Any ideas?
 
 I'm using a Solr 1.4 snapshot
 
 Regards
 
 James



Re: Keyword Density

2009-06-03 Thread Otis Gospodnetic

I don't think this is possible without changing Solr.
Or maybe it's possible with a custom Search Component that looks at all hits 
and checks the df (document frequency) for a term in each document?  Sounds 
like a very costly operation...
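A toy version of that per-hit check might look like the following (illustrative only; a real implementation would be a custom SearchComponent working with term vectors or stored field statistics, not raw text):

```python
# Toy version of the per-document density check described above: keep
# only hits whose term density (occurrences / total terms in the field)
# falls inside the requested range. Illustrative, not a Solr component.

def density_filter(docs, term, lo, hi):
    kept = []
    for doc_id, text in docs.items():
        words = text.lower().split()
        density = words.count(term) / float(len(words)) if words else 0.0
        if lo <= density <= hi:
            kept.append(doc_id)
    return kept

docs = {1: "foo bar foo baz", 2: "foo bar bar bar bar bar bar bar"}
density_filter(docs, "foo", 0.3, 1.0)  # doc 1 only (density 0.5 vs 0.125)
```

Doing this naively per hit is exactly the cost concern raised above: every matching document has to be re-examined after the main search.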

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: Alex Shevchenko caeza...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Wednesday, June 3, 2009 9:57:29 AM
 Subject: Re: Keyword Density
 
 So, is there an ability to perform filtering as I described?
 
 On Mon, Jun 1, 2009 at 22:24, Alex Shevchenko wrote:
 
  But I don't need to sort using this value. I need to cut results, where
  this value (for particular term of query!) not in some range.
 
 
  On Mon, Jun 1, 2009 at 22:20, Walter Underwood wrote:
 
  That is the normal relevance scoring formula in Solr and Lucene.
  It is a bit fancier than that, but you don't have to do anything
  special to get that behavior.
 
  Solr also uses the inverse document frequency (rarity) of each
  word for weighting.
 
  Look up tf.idf for more info.
 
  wunder
 
  On 6/1/09 11:46 AM, Alex Shevchenko wrote:
 
    Something like that. Just not 'appears N times' but 'number of times it
    appears / total number of terms'
  
   On Mon, Jun 1, 2009 at 21:00, Otis Gospodnetic
   wrote:
  
  
   Hi Alex,
  
   Could you please provide an example of this?  Are you looking to do
   something like find all docs that match name:foo and where foo appears
   N
   times (in the name field) in the matching document?
  
Otis
   --
   Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
  
  
  
   - Original Message 
   From: Alex Shevchenko 
   To: solr-user@lucene.apache.org
   Sent: Monday, June 1, 2009 1:32:49 PM
   Subject: Re: Keyword Density
  
   HI All,
  
   Is there a way to perform filtering based on keyword density?
  
   Thanks
  
   --
   Alex Shevchenko
 
 
 
 
 
  --
  Alex Shevchenko
 
 
 
 
 -- 
 Alex Shevchenko



filter on millions of IDs from external query

2009-06-03 Thread Ryan McKinley
I am working with an index of ~10 million documents.  The index  
does not change often.


I need to perform some external search criteria that will return some  
number of results -- this search could take up to 5 mins and return  
anywhere from 0-10M docs.


I would like to use the output of this long running query as a filter  
in solr.


Any suggestions on how to wire this all together?

My initial ideas (I have not implemented anything yet -- just want to  
check with you all before starting down the wrong path) is to:
* assume the index will always be optimized, in this case every id  
maps to a lucene int id.

* Store the results of the expensive query as a bitset.
* use the stored bitset in the lucene query.

I'm sure I can get this to work, but it seems kinda ugly (and  
brittle).  Any better thoughts on how to do this?  If we had some sort  
of external tagging interface, each document could just get tagged  
with what query it matches.
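Stripped of the Lucene Filter plumbing, the bitset idea above comes down to: run the expensive external query once, set a bit for each matching doc id, and intersect that set with candidate results at query time. Below is a minimal illustration with java.util.BitSet; the names (ExternalFilterSketch, runExpensiveQuery) are invented for the example and are not a Solr or Lucene API.

```java
import java.util.BitSet;

public class ExternalFilterSketch {

    // Pretend result of the long-running external query:
    // doc ids 2, 5 and 7 matched (ids are positions in the optimized index).
    static BitSet runExpensiveQuery(int maxDoc) {
        BitSet matches = new BitSet(maxDoc);
        matches.set(2);
        matches.set(5);
        matches.set(7);
        return matches;
    }

    public static void main(String[] args) {
        int maxDoc = 10;                 // size of the (optimized) index
        BitSet external = runExpensiveQuery(maxDoc);

        // Candidate doc ids produced by a normal Solr query:
        int[] candidates = {1, 2, 3, 5, 9};

        StringBuilder kept = new StringBuilder();
        for (int id : candidates) {
            if (external.get(id)) {      // keep only docs the external query matched
                if (kept.length() > 0) kept.append(",");
                kept.append(id);
            }
        }
        System.out.println(kept);        // 2,5
    }
}
```

The fragility Ryan mentions is visible here: the scheme only works while doc ids stay stable, i.e. while the index remains optimized and unchanged between the external run and the queries that use the bitset.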


thanks
ryan




MoreLikeThis query

2009-06-03 Thread SergeyG

Hi,

I'm adding the MoreLikeThis functionality to my search.

1. Do I understand it right that the query:
q=id:1&mlt=true&mlt.fl=content
will bring back documents in which the most important terms of the content
field are partly the same as those of the content field of the doc with
id=1?

2. Also, the full request url for the above mentioned query would be:
solr_base_url/select?q=id:1&mlt=true&mlt.fl=content
which is equivalent to the query:
solr_base_url/mlt?q=id:1&mlt.fl=content
But while the former query would be handled by the StandardRequestHandler
and executed by calling server.query(query), the latter query is handled by
the MoreLikeThisRequestHandler and there is no specific method to execute
it. Is this right? 
And if this is the case, how can the latter query be triggered?

Thanks,
Sergey
-- 
View this message in context: 
http://www.nabble.com/MoreLikeThis-query-tp23856526p23856526.html
Sent from the Solr - User mailing list archive at Nabble.com.



Solr search by segment

2009-06-03 Thread aidahaj

Hi,
I have an index in which I am always indexing the same documents
(re-indexing).
So I need to search for them by their segment number.
When I ask solrj for the documents by their segment [for example:
solrj.query("segment:20090603142546");], it doesn't return anything. I
checked the schema.xml and the field segment is stored and indexed.
What can I do?
I am looking forward to your help. Thanks.
-- 
View this message in context: 
http://www.nabble.com/Solr-search-by-segment-tp23856569p23856569.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Seattle / PNW Hadoop + Lucene User Group?

2009-06-03 Thread Bradford Stephens
Hey everyone!
I just wanted to give a BIG THANKS for everyone who came. We had over a
dozen people, and a few got lost at UW :)  [I would have sent this update
earlier, but I flew to Florida the day after the meeting].

If you didn't come, you missed quite a bit of learning and topics. Such as:

-Building a Social Media Analysis company on the Apache Cloud Stack
-Cancer detection in images using Hadoop
-Real-time OLAP
-Scalable Lucene using Katta and Hadoop
-Video and Network Flow
-Custom Ranking in Lucene

I'm going to update our wiki with the topics, and a few questions raised and
the lessons we've learned.

The next meetup will be June 24th. Be there, or be... boring :)

Cheers,
Bradford

On Thu, Apr 16, 2009 at 3:27 PM, Bradford Stephens 
bradfordsteph...@gmail.com wrote:

 Greetings,

 Would anybody be willing to join a PNW Hadoop and/or Lucene User Group
 with me in the Seattle area? I can donate some facilities, etc. -- I
 also always have topics to speak about :)

 Cheers,
 Bradford



Re: fq vs. q

2009-06-03 Thread Martin Davidsson
On Wed, Jun 3, 2009 at 1:53 AM, Marc Sturlese marc.sturl...@gmail.comwrote:


 It's definitely not proper documentation but maybe can give you a hand:

 http://www.derivante.com/2009/04/27/100x-increase-in-solr-performance-and-throughput/


 Martin Davidsson-2 wrote:
 
  I've tried to read up on how to decide, when writing a query, what
  criteria goes in the q parameter and what goes in the fq parameter, to
  achieve optimal performance. Is there some documentation that
  describes how each field is treated internally, or even better, some
  kind of rule of thumb to help me decide how to split things up when
  querying against one or more fields. In most cases, I'm looking for
  exact matches but sometimes an occasional wildcard query shows up too.
  Thank you!
 
  -- Martin
 
 

 --
 View this message in context:
 http://www.nabble.com/fq-vs.-q-tp23845282p23847845.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Thanks, I'd seen that article too. I totally agree that it's worth
understanding how things are treated under the hood. That's the kind of
literature I'm looking for I guess. Given that article, I wasn't sure what
the query would look like if I need to query against multiple fields. Let's
say I have a name field and a brand field and I want to find the Apple
iPod. Using only the 'q' param the query would look like
select?q=brand:Apple AND name:iPod

Is there a better query format that utilizes the fq field? Thanks again

-- Martin
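For the Apple iPod example above, one common split (not the only one) keeps the relevance-scored text in q and moves the exact-match constraint into fq, which is cached in the filterCache and does not influence scoring:

```
select?q=name:iPod&fq=brand:Apple
```

Each additional fq parameter adds another cached filter that is intersected with the query result, so exact-match constraints that repeat across many searches are cheap to reuse.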


Re: Solr vs Sphinx

2009-06-03 Thread Otis Gospodnetic

Hi,

Could you please start a new thread?


Thanks,
Otis


- Original Message 
 From: sunnyfr johanna...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Wednesday, June 3, 2009 10:20:06 AM
 Subject: Re: Solr vs Sphinx
 
 
 Hi guys,
 
 I have been working with Solr for several months now, and you really provide
 quick answers ... you're very nice to work with.
 But I have a huge issue that I couldn't fix after a lot of posts.
 
 My indexing takes one to two days to finish, for 8G of data indexed and 1.5M
 docs (OK, I have plenty of links in my table, but it still takes a long time).
 
 Second, I have to run an update every 20 minutes, and every update touches
 maybe 20,000 docs. When I use replication I must replicate the whole new,
 optimized index folder, because too much data has changed and too many
 segments need to be generated and merged. So I lose my cache and my CPU goes
 mad.
 
 And I can't get more than 20 requests/sec.
 
 
 
 
 Fergus McMenemie-2 wrote:
  
 Something that would be interesting is to share solr configs for  
 various types of indexing tasks.  From a solr configuration aimed at  
 indexing web pages to one doing large amounts of text to one that  
 indexes specific structured data.  I could see those being posted on  
 the wiki and helping folks who say I want to do X, is there an  
 example?.
 
 I think most folks start with the example Solr install and tweak from  
 there, which probably isn't the best path...
 
 Eric
  
  Yep a solr cookbook with lots of different example recipes. However
  these would need to be very actively maintained to ensure they always
  represented best practice. While using cocoon I made extensive use
  of the examples section of the cocoon website. However most of the,
  massive number of, examples represent obsolete cocoon practise. Or 
  there were four or five examples doing the same thing in different 
  ways with no text explaining the pros/cons of the different approaches.
  This held me, as a newcomer, back and gave a bad impression of cocoon.
  
  I was wondering about a performance hints page. I was caught by an
  issue indexing CSV content where the use of overwrite=false made
  an almost 3x difference to my indexing speed. Still do not really
  know why!
  
 
 On May 15, 2009, at 8:09 AM, Mark Miller wrote:
 
  In the spirit of good defaults:
 
  I think we should change the Solr highlighter to highlight phrase  
  queries by default, as well as prefix,range,wildcard constantscore  
  queries. Its awkward to have to tell people you have to turn those  
  on. I'd certainly prefer to have to turn them off if I have some  
  limitation rather than on.
  
  Yep I agree, all whizzy new features should ideally be on by default
  unless there is a significant performance penalty. It is not enough
  that to issue a default solrconfig.xml with the feature on, it has to
  be on by default inside the code.
   
 
  - Mark
 
 -
 Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 |
 http://www.opensourceconnections.com
 Free/Busy: http://tinyurl.com/eric-cal
  
  Fergus
  
  
 
 -- 
 View this message in context: 
 http://www.nabble.com/Solr-vs-Sphinx-tp23524676p23852364.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Token filter on multivalue field

2009-06-03 Thread Otis Gospodnetic

Hello,

It's ugly, but the first thing that came to mind was ThreadLocal.

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: David Giffin da...@giffin.org
 To: solr-user@lucene.apache.org
 Sent: Wednesday, June 3, 2009 1:57:42 PM
 Subject: Token filter on multivalue field
 
 Hi There,
 
 I'm working on a unique token filter, to eliminate duplicates on a
 multivalue field. My filter works properly for a single value field.
 It seems that a new TokenFilter is created for each value in the
 multivalue field. I need to maintain an array of used tokens across
 all of the values in the multivalue field. Is there a good way to do
 this? Here is my current code:
 
 import java.io.IOException;
 import java.util.ArrayList;
 
 import org.apache.lucene.analysis.Token;
 import org.apache.lucene.analysis.TokenFilter;
 import org.apache.lucene.analysis.TokenStream;
 
 public class UniqueTokenFilter extends TokenFilter {
 
     // Terms already emitted for the current field value
     private ArrayList<String> words;
 
     public UniqueTokenFilter(TokenStream input) {
         super(input);
         this.words = new ArrayList<String>();
     }
 
     @Override
     public final Token next(Token in) throws IOException {
         // Skip any token whose term text has already been seen
         for (Token token = input.next(in); token != null; token = input.next(in)) {
             if (!words.contains(token.term())) {
                 words.add(token.term());
                 return token;
             }
         }
         return null;
     }
 }
 
 Thanks,
 David
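Otis's ThreadLocal suggestion, stripped of the real Lucene TokenFilter API, might look like the sketch below: a per-thread seen-set that survives across the separate filter instances created for each value of a multivalued field, and that is reset when a new document starts. All names here (SeenTermsSketch, startDocument, filterValue) are invented for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SeenTermsSketch {

    // One seen-set per indexing thread, shared by every filter
    // instance created while that thread processes one document.
    private static final ThreadLocal<Set<String>> SEEN =
            ThreadLocal.withInitial(HashSet::new);

    // Call when a new document begins, so terms from the previous doc don't leak.
    static void startDocument() {
        SEEN.get().clear();
    }

    // Simulates one filter instance handling one value of a multivalued field.
    static List<String> filterValue(List<String> tokens) {
        List<String> out = new ArrayList<>();
        Set<String> seen = SEEN.get();
        for (String t : tokens) {
            if (seen.add(t)) {   // Set.add() returns false for duplicates
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        startDocument();
        // Two values of one multivalued field, processed by "separate" filters:
        List<String> first = filterValue(Arrays.asList("red", "blue"));
        List<String> second = filterValue(Arrays.asList("blue", "green"));
        System.out.println(first);   // [red, blue]
        System.out.println(second);  // [green] -- "blue" was seen in the first value
    }
}
```

The ugliness Otis concedes is the hidden coupling: nothing in the filter's interface says the state must be cleared between documents, so a missed startDocument() call silently drops tokens.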



Re: Strange behaviour with copyField

2009-06-03 Thread Grant Ingersoll


On Jun 3, 2009, at 5:09 AM, James Grant wrote:

I've been hitting my head against a wall all morning trying to  
figure this out and haven't managed to get anywhere and wondered if  
anybody here can help.


I have defined a field type

  <fieldType name="text_au" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.LowerCaseTokenizerFactory" />
    </analyzer>
  </fieldType>

I have two fields

  <field name="au" type="text_au" indexed="true" stored="true" required="false" multiValued="true"/>
  <field name="author" type="text_au" indexed="true" stored="false" multiValued="true"/>


I don't see the difference, as they are the same FieldType for each  
field, text_au.  Is this a typo or am I missing something?





and a copyField line

  <copyField source="au" dest="author" />

The idea is to allow searching for authors so a search for author: 
(Hobbs A.U.) will match the au field value Hobbs A. U. (notice  
the space).


What would lower casing do for handling the space?




However the query au:(Hobbs A.U.) matches and the query author:(Hobbs A.U.) does not.


Any ideas?



How are you indexing?

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:

http://www.lucidimagination.com/search



Re: Solr search by segment

2009-06-03 Thread aidahaj

I should clarify that I am running the Nutch-Solr integration and the
schema.xml files are the same in Nutch and in Solr.
-- 
View this message in context: 
http://www.nabble.com/Solr-search-by-segment-tp23856569p23859728.html
Sent from the Solr - User mailing list archive at Nabble.com.



Which caches should use the solr.FastLRUCache

2009-06-03 Thread Robert Purdy

Hey there, 

Anyone got any advice on which caches (filterCache, queryResultCache,
documentCache, fieldValueCache) should be implemented using
solr.FastLRUCache in Solr 1.4, and what are the pros & cons
vs. solr.LRUCache?

Thanks Robert.
-- 
View this message in context: 
http://www.nabble.com/Which-caches-should-use-the-solr.FastLRUCache-tp23860182p23860182.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: synonyms

2009-06-03 Thread anuvenk

I happened to revisit this post that I had started a long time back. I'm still
using the same query-time synonyms. Now I want to be able to map cities to
states in the synonyms, and I'm continuing to have this issue with multi-word
synonyms. Could you please explain again what you've done to overcome this
issue? I didn't quite understand what HIER_FAMILIY_01 and SYN_FAMILY_01
are. Thanks.

lorenzo zhak wrote:
 
 Hi,
 
 I had to work with this kind of side effect regarding multiword
 synonyms.
 We installed solr on our project that extensively uses synonyms, a big
 list that sometimes could bring out some wrong match as the one
 noticed by Anuvenk
 for instance
 
 dui = drunk driving defense
  or
 dui,drunk driving defense,drunk driving law
 query for dui matches dui = drunk driving defense and dui,drunk
 driving defense,drunk driving law
 
 in order to prevent this kind of behavior I gave for every synonyms
 family (saying a single line in the file) a unique identifier,
 so the list looks like :
 
 dui = HIER_FAMILIY_01
 drunk driving defense = HIER_FAMILIY_01
 SYN_FAMILY_01, dui,drunk driving defense,drunk driving law
 
 I also set the synonyms filter at index time with expand=false, and at
 query time with expand=false
 
 so in this way, the matched synonyms (multi words or single words) in
 documents are replaced with their family identifier, and not all the
 possibilities. Indexing with expand=true will add words in documents
 that could be matched alone, ignoring the fact that they belong to
 multiwords expression, and this could end up with a wrong match
 (intending syns mix) at query time.
 
 so in this way a query for dui, will be changed by the synonym
 filter at query time with HIER_FAMILIY_01 or SYN_FAMILY_01 so
 documents that contains only single words like drunk, driving or
 law will not be matched since only a document with the phrase drunk
 driving law would have been indexed with SYN_FAMILY_01.
 
 The approach worked pretty well on our project, and we did not notice
 any side effects on the searches; it only removes matched documents
 that were considered noise from the synonym-mix issue.
 
 I think it could be useful to add this kind of approach to the Solr
 synonyms filter section of the wiki.
 
 Cheers
 
 Laurent
 
 
 On Dec 2, 2007 3:41 PM, Otis Gospodnetic otis_gospodne...@yahoo.com
 wrote:
 Hi (changing to solr-user list)

 Yes it is, especially if the terms left of = are multi-spaced.  Check
 out the Wiki, one page there explains this nicely.

 Otis
 -
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

 - Original Message 
 From: anuvenk anuvenkat...@hotmail.com
 To: solr-...@lucene.apache.org
 Sent: Saturday, December 1, 2007 1:21:49 AM
 Subject: Re: synonyms


 Ideally, would it be a good idea to pass the index data through the
  synonyms
 filter while indexing?
 Also,
 say i have this mapping
 dui = drunk driving defense
  or
 dui,drunk driving defense,drunk driving law

 so matches for dui, will also bring up matches for drunk driving law
  (the
 whole phrase) or does it also bring up all matches for 'drunk' ,
 'driving','law'  ?



 Yonik Seeley wrote:
 
  On Nov 30, 2007 5:39 PM, anuvenk anuvenkat...@hotmail.com wrote:
  Should data be re-indexed everytime synonyms like
  word1,word2
  or
  word1 = word2
 
  are added to synonyms.txt
 
  Yes, if it changes the index (if it's used in the index anaylzer as
  opposed to just the query analyzer).
 
  -Yonik
 
 

 --
 View this message in context:
  http://www.nabble.com/synonyms-tf4925232.html#a14100346
 Sent from the Solr - Dev mailing list archive at Nabble.com.





 
 

-- 
View this message in context: 
http://www.nabble.com/Re%3A-synonyms-tp14116132p23860862.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Is there Downside to a huge synonyms file?

2009-06-03 Thread anuvenk

I tried adding some city-to-state mappings in the synonyms file. I'm using
the dismax handler for phrase matching. As and when I add more and more
city-to-state mappings, I end up with zero results for state-based searches.
E.g.: ca,california,los angeles
 ca,california,san diego
 ca,california,san francisco
 ca,california,burbank ... and so on
Now a city-based search returns a few other California results, but a
state-based search like dui california returns zero results.
I checked the parsedquery_toString and I see no 'OR', although the default
operator is 'OR' in the schema. It looks like it's trying to find matches for
all those cities, since they are mapped to 'california', and hence returns
zero results. How do I force dismax to use 'OR' and not 'AND' even though the
schema says 'OR'? Or is this just how dismax works? Can someone explain how
to overcome this problem?
Here is my custom request handler that extends dismax
<requestHandler name="qfacet" class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">name^2.0 text^0.8</str>
    <!-- up to 3 terms, all should match; 4 -> 3 should match; 5 -> 4 should
         match; 6 -> 5 should match; above 6 -> 90% should match -->
    <str name="mm">3&lt;-1 4&lt;-1 5&lt;-1 6&lt;90%</str>
    <str name="pf">text^0.8 name^2.0</str>
    <int name="qs">4</int>
    <int name="ps">4</int>
    <str name="fl">*,score</str>
  </lst>
  <lst name="invariants">
    <!--<str name="facet.field">resourceType</str>
    <str name="facet.field">category</str>
    <str name="facet.field">stateName</str>-->
    <str name="facet.sort">false</str>
    <int name="facet.mincount">1</int>
  </lst>
</requestHandler>

Thanks.



Otis Gospodnetic wrote:
 
 
 Hello,
 
 300K is a pretty small index.  I wouldn't worry about the number of
 synonyms unless you are turning a single term into dozens of ORed terms.
 
 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 
 
 - Original Message 
 From: anuvenk anuvenkat...@hotmail.com
 To: solr-user@lucene.apache.org
 Sent: Tuesday, June 2, 2009 11:28:43 PM
 Subject: Re: Is there Downside to a huge synonyms file?
 
 
 I'm using query time synonyms. I have more fields in my index though.
 This is
 just an example or sample of data from my index. Yes, we don't have
 millions
 of documents. Could be around 300,000 and might increase in future. The
 reason i'm using query time synonyms is because of the nature of my data.
 I
 can't re-index the data everytime i add or remove a synonym. But for this
 particular requirement is it best to have index time synonyms because of
 the
 multi-word synonym nature. Again if i add more cities list to the synonym
 file, I can't be re-indexing all the data over and over again. 
 
 
 
 anuvenk wrote:
  
  In my index i have legal faqs, forms, legal videos etc with a state
 field
  for each resource.
  Now if i search for real estate san diego, I want to be able to return
  other 'california' results i.e results from san francisco.
  I have the following fields in the index
  
  title                              state        description
  real estate san diego example 1    california   some description
  real estate carlsbad example 2     california   some desc
  
  so when i search for real estate san francisco, since there is no
 match, i
  want to be able to return the other real estate results in california
  instead of returning none. Because sometimes they might be searching
 for a
  real estate form and city probably doesn't matter. 
  
  I have two things in mind. One is adding a synonym mapping
  san diego, california
  carlsbad, california
  san francisco, california
  
  (which probably isn't the best way)
  hoping that search for san francisco real estate would map san
 francisco
  to california and hence return the other two california results
  
  OR
  
  adding the mapping of city to state in the index itself like..
  
  title                        state        city                                 description
  real estate san diego eg 1   california   carlsbad, san francisco, san diego   some description
  real estate carlsbad eg 2    california   carlsbad, san francisco, san diego   some description
  
  which of the above two is better. Does a huge synonym file affect
  performance. Or Is there a even better way? I'm sure there is but I
 can't
  put my finger on it yet  I'm not familiar with java either.
  
  
 
 -- 
 View this message in context: 
 http://www.nabble.com/Is-there-Downside-to-a-huge-synonyms-file--tp23842527p23844761.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 

-- 
View this message in context: 
http://www.nabble.com/Is-there-Downside-to-a-huge-synonyms-file--tp23842527p23861631.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Is there Downside to a huge synonyms file?

2009-06-03 Thread anuvenk

A small addition to my earlier post. I wonder if it's because of the 'mm'
param, which requires that for up to 3 words in the search phrase, all the
words should match. If I alter this now, I'd get irrelevant results for a
lot of popular 1-, 2-, and 3-word search terms. How do I solve this?

anuvenk wrote:
 
 I tried adding some city to state mappings in the synonyms file. I'm using
 the dismax handler for phrase matching. As and when I add more and more
 city-to-state mappings, I end up with zero results for state-based
 searches.
 Eg: ca,california,los angeles
  ca,california,san diego
  ca,california,san francisco
  ca,california,burbankand so on
 now a city based search returns a few other california results but a state
 based search like dui california is returning zero results. 
 I checked the parsedquery_toString and I see no 'OR' although the default
 operator is 'OR' in schema. It looks like its trying to find matches for
 all those cities as they are mapped to 'california' and hence returns zero
 results. How to force dismax to use 'OR' and not 'AND' even though the
 schema has 'OR'.
 Or is this how dismax works? Can someone explain how to overcome this
 problem. 
 Here is my custom request handler that extends dismax
 <requestHandler name="qfacet" class="solr.DisMaxRequestHandler">
   <lst name="defaults">
     <str name="echoParams">explicit</str>
     <float name="tie">0.01</float>
     <str name="qf">name^2.0 text^0.8</str>
     <!-- up to 3 terms, all should match; 4 -> 3 should match; 5 -> 4 should
          match; 6 -> 5 should match; above 6 -> 90% should match -->
     <str name="mm">3&lt;-1 4&lt;-1 5&lt;-1 6&lt;90%</str>
     <str name="pf">text^0.8 name^2.0</str>
     <int name="qs">4</int>
     <int name="ps">4</int>
     <str name="fl">*,score</str>
   </lst>
   <lst name="invariants">
     <!--<str name="facet.field">resourceType</str>
     <str name="facet.field">category</str>
     <str name="facet.field">stateName</str>-->
     <str name="facet.sort">false</str>
     <int name="facet.mincount">1</int>
   </lst>
 </requestHandler>
 
 Thanks.
 
 
 
 Otis Gospodnetic wrote:
 
 
 Hello,
 
 300K is a pretty small index.  I wouldn't worry about the number of
 synonyms unless you are turning a single term into dozens of ORed terms.
 
 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 
 
 - Original Message 
 From: anuvenk anuvenkat...@hotmail.com
 To: solr-user@lucene.apache.org
 Sent: Tuesday, June 2, 2009 11:28:43 PM
 Subject: Re: Is there Downside to a huge synonyms file?
 
 
 I'm using query time synonyms. I have more fields in my index though.
 This is
 just an example or sample of data from my index. Yes, we don't have
 millions
 of documents. Could be around 300,000 and might increase in future. The
 reason i'm using query time synonyms is because of the nature of my
 data. I
 can't re-index the data everytime i add or remove a synonym. But for
 this
 particular requirement is it best to have index time synonyms because of
 the
 multi-word synonym nature. Again if i add more cities list to the
 synonym
 file, I can't be re-indexing all the data over and over again. 
 
 
 
 anuvenk wrote:
  
  In my index i have legal faqs, forms, legal videos etc with a state
 field
  for each resource.
  Now if i search for real estate san diego, I want to be able to return
  other 'california' results i.e results from san francisco.
  I have the following fields in the index
  
  title                              state        description
  real estate san diego example 1    california   some description
  real estate carlsbad example 2     california   some desc
  
  so when i search for real estate san francisco, since there is no
 match, i
  want to be able to return the other real estate results in california
  instead of returning none. Because sometimes they might be searching
 for a
  real estate form and city probably doesn't matter. 
  
  I have two things in mind. One is adding a synonym mapping
  san diego, california
  carlsbad, california
  san francisco, california
  
  (which probably isn't the best way)
  hoping that search for san francisco real estate would map san
 francisco
  to california and hence return the other two california results
  
  OR
  
  adding the mapping of city to state in the index itself like..
  
  title                        state        city                                 description
  real estate san diego eg 1   california   carlsbad, san francisco, san diego   some description
  real estate carlsbad eg 2    california   carlsbad, san francisco, san diego   some description
  
  which of the above two is better. Does a huge synonym file affect
  performance. Or Is there a even better way? I'm sure there is but I
 can't
  put my finger on it yet  I'm not familiar with java either.
  
  
 
 -- 
 View this message in context: 
 

OPI: Article on Sunspot

2009-06-03 Thread Glen Newton
Sunspot: A Solr-Powered Search Engine for Ruby
http://www.linux-mag.com/id/7341

glen
http://zzzoot.blogspot.com/

-- 

-


where to find solr help/consultant

2009-06-03 Thread Larry Eitel
I am implementing Solr on a CentOS server. It involves handling multiple
languages. Where is the best place to look for developers experienced in
Solr who may be interested in a little consulting work? Mostly to give
some guidance, etc. IRC is rather quiet.
Thank you :)