RE: Best use of wildcard searches

2007-08-09 Thread Pierre-Yves LANDRON
Hello,

I'm in exactly the same situation as you. I've got some structured subjects
(e.g. subjects:main subject/sub subject/sub sub subject) and want to search
them as literals from a given level (subjects:main subject/*). As you know,
subjects:main subject/* doesn't work (but it should, shouldn't it?), so what
I've done is to replace the space character with the wildcard '?' in my
query, like this: subjects:main?subject/*. It works, even if it isn't very
elegant. Is it a great loss, performance-wise? Could somebody tell? Sure, a
better solution would be appreciated.

Kind regards,
Pierre-Yves Landron

 From: [EMAIL PROTECTED]
 Subject: Re: Best use of wildcard searches
 Date: Wed, 8 Aug 2007 14:59:36 -0700
 To: solr-user@lucene.apache.org

 OK.

 So a followup question..

 ?q=department_exact:Apparel%3EMen's%20Apparel*&fq=country_code:US&fq=brand_exact:adidas&fq=hibernated:true

 returns 0 results. Note the %20 in there for the space character.

 ?q=department_exact:Apparel%3EMen's*&fq=country_code:US&fq=brand_exact:adidas&fq=hibernated:true

 returns several, and the only change is that I've truncated "Men's
 Apparel*" to be "Men's*".

 (example department_exacts from this result set below..)

 Apparel>Men's Apparel>Sweatshirts>Hooded
 Apparel>Men's Apparel>Shirts>Tank Top>Workout
 Apparel>Men's Apparel>Sweatshirts
 Apparel>Men's Apparel>Sweatshirts
 Apparel>Men's Apparel>Jackets>Windbreaker

 Any ideas why "Apparel>Men's*" would work, but "Apparel>Men's Apparel*"
 would not?

 ++
 | Matthew Runo
 | Zappos Development
 | [EMAIL PROTECTED]
 | 702-943-7833
 ++

 On Aug 8, 2007, at 2:42 PM, Yonik Seeley wrote:

  On 8/8/07, Matthew Runo [EMAIL PROTECTED] wrote:
   I've been using the standard query handler to do searches like

   q=department_exact:"Foo>Bar>Baz>Qux"

   Now, let's assume I have lots of records, with various department
   trees...
   1. Foo>Bar>Baz>Qux
   2. Foo>Bar>Baz>Put
   3. Foo>Bar>Something With Spaces>Elese
   4. Foo>TotalyDifferentTree

   I'd like to get all the products at various levels, and all the
   levels below.

   I have a tokenized department field, and a copyField
   department_exact.

   I've been doing searches on the department_exact field, thinking I
   could do this..

   q=department_exact:"Foo>Bar*"

  A * inside quotes is literal.
  Try
  q=department_exact:"Foo>Bar"*
  Or if > is a reserved character, escape it with \
  q=department_exact:Foo\>Bar*

  If "Bar" is unique (only under Foo), you could use a copyfield to copy
  it to a regex tokenizer to split on ">" and then do a simple search on
  "Bar"

  -Yonik
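[Editorial aside: the space-escaping trick discussed above can be done mechanically. A small sketch in Python — the field and path values are just the examples from this thread — that backslash-escapes spaces so the query parser treats them as literals, then percent-encodes the query for use in a URL:]

```python
from urllib.parse import quote

def solr_prefix_query(field, path):
    """Build a prefix query on a hierarchical value, backslash-escaping
    spaces so the Lucene query parser treats them as literal characters."""
    escaped = path.replace(" ", "\\ ")
    return "%s:%s*" % (field, escaped)

q = solr_prefix_query("subjects", "main subject/sub subject")
print(q)                         # subjects:main\ subject/sub\ subject*
print("q=" + quote(q, safe=""))  # percent-encoded for the request URL
```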

Re: Best use of wildcard searches

2007-08-09 Thread Erick Erickson
I just saw an e-mail from Yonik suggesting escaping the space. I know
so little about Solr that all I can do is parrot Yonik...

Erick

On 8/8/07, Matthew Runo [EMAIL PROTECTED] wrote:

 OK.

 So a followup question..

 ?q=department_exact:Apparel%3EMen's%20Apparel*&fq=country_code:US&fq=brand_exact:adidas&fq=hibernated:true

 returns 0 results. Note the %20 in there for the space character.

 ?q=department_exact:Apparel%3EMen's*&fq=country_code:US&fq=brand_exact:adidas&fq=hibernated:true
 returns several, and the only change is that I've truncated "Men's
 Apparel*" to be "Men's*".

 (example department_exacts from this result set below..)

 Apparel>Men's Apparel>Sweatshirts>Hooded
 Apparel>Men's Apparel>Shirts>Tank Top>Workout
 Apparel>Men's Apparel>Sweatshirts
 Apparel>Men's Apparel>Sweatshirts
 Apparel>Men's Apparel>Jackets>Windbreaker

 

 Any ideas why "Apparel>Men's*" would work, but "Apparel>Men's
 Apparel*" would not?

 ++
   | Matthew Runo
   | Zappos Development
   | [EMAIL PROTECTED]
   | 702-943-7833
 ++


 On Aug 8, 2007, at 2:42 PM, Yonik Seeley wrote:

 On 8/8/07, Matthew Runo [EMAIL PROTECTED] wrote:
  I've been using the standard query handler to do searches like

  q=department_exact:"Foo>Bar>Baz>Qux"

  Now, let's assume I have lots of records, with various department
  trees...
  1. Foo>Bar>Baz>Qux
  2. Foo>Bar>Baz>Put
  3. Foo>Bar>Something With Spaces>Elese
  4. Foo>TotalyDifferentTree

  I'd like to get all the products at various levels, and all the
  levels below.

  I have a tokenized department field, and a copyField
  department_exact.

  I've been doing searches on the department_exact field, thinking I
  could do this..

  q=department_exact:"Foo>Bar*"

 A * inside quotes is literal.
 Try
 q=department_exact:"Foo>Bar"*
 Or if > is a reserved character, escape it with \
 q=department_exact:Foo\>Bar*

 If "Bar" is unique (only under Foo), you could use a copyfield to copy
 it to a regex tokenizer to split on ">" and then do a simple search on
 "Bar"

 -Yonik
 




Too many open files

2007-08-09 Thread Kevin Holmes
<result status="1">java.io.FileNotFoundException:
/usr/local/bin/apache-solr/enr/solr/data/index/_16ik.tii (Too many open
files)

 

When I'm importing, this is the error I get.  I know it's vague and
obscure.  Can someone suggest where to start?  I'll buy a bag of M&Ms
(not peanut) for anyone who can help me solve this*

 

*limit one bag per successful solution for a total maximum of 1 bag to
be given



question: how to divide the indexing into separate domains

2007-08-09 Thread Ben Shlomo, Yatir
Hi!

say I have 300 csv files that I need to index. 

Each one holds millions of lines (each line is a few fields separated by
commas)

Each csv file represents a different domain of data (e.g. file1 is
computers, file2 is flowers, etc)

There is no indication of the domain ID in the data inside the csv file

 

When I search I would like to specify the id of a specific domain

And I want solr to search only in this domain - to save time and reduce
the number of matches

I need to specify during indexing - the domain id of the csv file being
indexed

How do I do it?

 

 

Thanks 

 

 

 

p.s. 

I wish I could index like this:

curl "http://localhost:8080/solr/update/csv?stream.file=test.csv&fieldnames=field1,field2&f.domain.value=98765"
(where 98765 is the domain id for this specific csv file)
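[Editorial aside: if the CSV handler turns out not to support a constant-value parameter like the wished-for f.domain.value above, one workaround is to preprocess each file and append the domain id as an ordinary column before posting. A minimal sketch (field and file names are hypothetical):]

```python
import csv
import io

def add_domain_column(src, dst, domain_id, fieldname="domain"):
    """Copy CSV from src to dst, appending a constant domain column.

    Assumes the first row is a header line listing the field names.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst)
    header = next(reader)
    writer.writerow(header + [fieldname])
    for row in reader:
        writer.writerow(row + [str(domain_id)])

# Demonstrate on an in-memory file; in practice src/dst would be open files.
src = io.StringIO("field1,field2\na,b\nc,d\n")
dst = io.StringIO()
add_domain_column(src, dst, 98765)
print(dst.getvalue())
```

At search time the domain restriction then becomes an ordinary filter query, e.g. fq=domain:98765.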



RE: Too many open files

2007-08-09 Thread Kevin Holmes
You're a gentleman and a scholar.  I will donate the M&Ms to myself :).
Can you tell me from this snippet of my solrconfig.xml what I might
tweak to make this more betterer?

-KH

  <indexDefaults>
   <!-- Values here affect all index writers and act as a default unless
        overridden. -->
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>10</mergeFactor>
    <maxBufferedDocs>1000</maxBufferedDocs>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>1</maxFieldLength>
    <writeLockTimeout>1000</writeLockTimeout>
    <commitLockTimeout>1</commitLockTimeout>
  </indexDefaults>


RE: Too many open files

2007-08-09 Thread Jonathan Woods
You could try committing updates more frequently, or maybe optimising the
index beforehand (and even during!).  I imagine you could also change the
Solr config, if you have access to it, to tweak indexing (or index creation)
parameters - http://wiki.apache.org/solr/SolrConfigXml should be of use to
you here.

In the unlikely event I qualify for the M&Ms, I hereby donate them back to
you for giving to someone else!

Jon

-Original Message-
From: Kevin Holmes [mailto:[EMAIL PROTECTED] 
Sent: 09 August 2007 15:23
To: solr-user@lucene.apache.org
Subject: Too many open files

<result status="1">java.io.FileNotFoundException:
/usr/local/bin/apache-solr/enr/solr/data/index/_16ik.tii (Too many open
files)

 

When I'm importing, this is the error I get.  I know it's vague and obscure.
Can someone suggest where to start?  I'll buy a bag of M&Ms (not peanut) for
anyone who can help me solve this*

 

*limit one bag per successful solution for a total maximum of 1 bag to be
given




Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Kevin Holmes
I inherited an existing (working) solr indexing script that runs like
this:

 

Python script queries the mysql DB then calls bash script

Bash script performs a curl POST submit to solr

 

We're injecting about 1000 records / minute (constantly), frequently
pushing the edge of our CPU / RAM limitations.

 

I'm in the process of building a Perl script to use DBI and
lwp::simple::post that will perform this all from a single script
(instead of 3).

 

Two specific questions

1: Does anyone have a clever (or better) way to perform this process
efficiently?

 

2: Is there a way to inject into solr without using POST / curl / http?

 

Admittedly, I'm no solr expert - I'm starting from someone else's setup,
trying to reverse-engineer my way out.  Any input would be greatly
appreciated.



Re: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Clay Webster
Condensing the loader into a single executable sounds right if
you have performance problems. ;-)

You could also try adding multiple docs in a single post if you
notice your problems are with tcp setup time, though if you're
doing localhost connections that should be minimal.

If you're already local to the solr server, you might check out the
CSV slurper. http://wiki.apache.org/solr/UpdateCSV  It's a little
specialized.

And then there's of course the question of are you doing full
re-indexing or incremental indexing of changes?

--cw


On 8/9/07, Kevin Holmes [EMAIL PROTECTED] wrote:

 I inherited an existing (working) solr indexing script that runs like
 this:



 Python script queries the mysql DB then calls bash script

 Bash script performs a curl POST submit to solr



 We're injecting about 1000 records / minute (constantly), frequently
 pushing the edge of our CPU / RAM limitations.



 I'm in the process of building a Perl script to use DBI and
 lwp::simple::post that will perform this all from a single script
 (instead of 3).



 Two specific questions

 1: Does anyone have a clever (or better) way to perform this process
 efficiently?



 2: Is there a way to inject into solr without using POST / curl / http?



 Admittedly, I'm no solr expert - I'm starting from someone else's setup,
 trying to reverse-engineer my way out.  Any input would be greatly
 appreciated.




RE: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread David Whalen
What we're looking for is a way to inject *without* using
curl, or wget, or any other http-based communication.  We'd
like for the HTTP daemon to only handle search requests, not
indexing requests on top of them.

Plus, I have to believe there's a faster way to get documents
into solr/lucene than using curl

_
david whalen
senior applications developer
eNR Services, Inc.
[EMAIL PROTECTED]
203-849-7240
  

 -Original Message-
 From: Clay Webster [mailto:[EMAIL PROTECTED] 
 Sent: Thursday, August 09, 2007 11:43 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Any clever ideas to inject into solr? Without http?
 
 Condensing the loader into a single executable sounds right 
 if you have performance problems. ;-)
 
 You could also try adding multiple docs in a single post if 
 you notice your problems are with tcp setup time, though if 
 you're doing localhost connections that should be minimal.
 
 If you're already local to the solr server, you might check 
 out the CSV slurper. http://wiki.apache.org/solr/UpdateCSV  
 It's a little specialized.
 
 And then there's of course the question of are you doing 
 full re-indexing or incremental indexing of changes?
 
 --cw
 
 
 On 8/9/07, Kevin Holmes [EMAIL PROTECTED] wrote:
 
  I inherited an existing (working) solr indexing script that 
 runs like
  this:
 
 
 
  Python script queries the mysql DB then calls bash script
 
  Bash script performs a curl POST submit to solr
 
 
 
  We're injecting about 1000 records / minute (constantly), 
 frequently 
  pushing the edge of our CPU / RAM limitations.
 
 
 
  I'm in the process of building a Perl script to use DBI and 
  lwp::simple::post that will perform this all from a single script 
  (instead of 3).
 
 
 
  Two specific questions
 
  1: Does anyone have a clever (or better) way to perform 
 this process 
  efficiently?
 
 
 
  2: Is there a way to inject into solr without using POST / 
 curl / http?
 
 
 
  Admittedly, I'm no solr expert - I'm starting from someone else's 
  setup, trying to reverse-engineer my way out.  Any input would be 
  greatly appreciated.
 
 
 


Re: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Tobin Cataldo
(re)building the index separately (ie. on a different computer) and then 
replacing the active index may be an option.


David Whalen wrote:

What we're looking for is a way to inject *without* using
curl, or wget, or any other http-based communication.  We'd
like for the HTTP daemon to only handle search requests, not
indexing requests on top of them.

Plus, I have to believe there's a faster way to get documents
into solr/lucene than using curl

_
david whalen
senior applications developer
eNR Services, Inc.
[EMAIL PROTECTED]
203-849-7240
  

  

-Original Message-
From: Clay Webster [mailto:[EMAIL PROTECTED] 
Sent: Thursday, August 09, 2007 11:43 AM

To: solr-user@lucene.apache.org
Subject: Re: Any clever ideas to inject into solr? Without http?

Condensing the loader into a single executable sounds right 
if you have performance problems. ;-)


You could also try adding multiple docs in a single post if 
you notice your problems are with tcp setup time, though if 
you're doing localhost connections that should be minimal.


If you're already local to the solr server, you might check 
out the CSV slurper. http://wiki.apache.org/solr/UpdateCSV  
It's a little specialized.


And then there's of course the question of are you doing 
full re-indexing or incremental indexing of changes?


--cw


On 8/9/07, Kevin Holmes [EMAIL PROTECTED] wrote:

 I inherited an existing (working) solr indexing script that runs like
 this:

 Python script queries the mysql DB then calls bash script

 Bash script performs a curl POST submit to solr

 We're injecting about 1000 records / minute (constantly), frequently
 pushing the edge of our CPU / RAM limitations.

 I'm in the process of building a Perl script to use DBI and
 lwp::simple::post that will perform this all from a single script
 (instead of 3).

 Two specific questions

 1: Does anyone have a clever (or better) way to perform this process
 efficiently?

 2: Is there a way to inject into solr without using POST / curl / http?

 Admittedly, I'm no solr expert - I'm starting from someone else's
 setup, trying to reverse-engineer my way out.  Any input would be
 greatly appreciated.


Re: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Brian Whitman


On Aug 9, 2007, at 11:12 AM, Kevin Holmes wrote:




2: Is there a way to inject into solr without using POST / curl /  
http?




Check http://wiki.apache.org/solr/EmbeddedSolr

There are examples in Java and Cocoa that use the DirectSolrConnection
class, querying and updating Solr without a web server. It uses JNI in
the Cocoa case.

-b



Re: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Clay Webster
If it's a contention between search and indexing, separate them
via a query-slave and an index-master.

--cw

On 8/9/07, David Whalen [EMAIL PROTECTED] wrote:

 What we're looking for is a way to inject *without* using
 curl, or wget, or any other http-based communication.  We'd
 like for the HTTP daemon to only handle search requests, not
 indexing requests on top of them.

 Plus, I have to believe there's a faster way to get documents
 into solr/lucene than using curl

 _
 david whalen
 senior applications developer
 eNR Services, Inc.
 [EMAIL PROTECTED]
 203-849-7240


  -Original Message-
  From: Clay Webster [mailto:[EMAIL PROTECTED]
  Sent: Thursday, August 09, 2007 11:43 AM
  To: solr-user@lucene.apache.org
  Subject: Re: Any clever ideas to inject into solr? Without http?
 
  Condensing the loader into a single executable sounds right
  if you have performance problems. ;-)
 
  You could also try adding multiple docs in a single post if
  you notice your problems are with tcp setup time, though if
  you're doing localhost connections that should be minimal.
 
  If you're already local to the solr server, you might check
  out the CSV slurper. http://wiki.apache.org/solr/UpdateCSV
  It's a little specialized.
 
  And then there's of course the question of are you doing
  full re-indexing or incremental indexing of changes?
 
  --cw
 
 
  On 8/9/07, Kevin Holmes [EMAIL PROTECTED] wrote:
  
   I inherited an existing (working) solr indexing script that
  runs like
   this:
  
  
  
   Python script queries the mysql DB then calls bash script
  
   Bash script performs a curl POST submit to solr
  
  
  
   We're injecting about 1000 records / minute (constantly),
  frequently
   pushing the edge of our CPU / RAM limitations.
  
  
  
   I'm in the process of building a Perl script to use DBI and
   lwp::simple::post that will perform this all from a single script
   (instead of 3).
  
  
  
   Two specific questions
  
   1: Does anyone have a clever (or better) way to perform
  this process
   efficiently?
  
  
  
   2: Is there a way to inject into solr without using POST /
  curl / http?
  
  
  
   Admittedly, I'm no solr expert - I'm starting from someone else's
   setup, trying to reverse-engineer my way out.  Any input would be
   greatly appreciated.
  
  
 



Re: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Yonik Seeley
On 8/9/07, David Whalen [EMAIL PROTECTED] wrote:
 Plus, I have to believe there's a faster way to get documents
 into solr/lucene than using curl

One issue with HTTP is latency.  You can get around that by adding
multiple documents per request, or by using multiple threads
concurrently.

You can also bypass HTTP by using something like the CSV loader (very
light weight) and specifying a local file (via the stream.file parameter).
http://wiki.apache.org/solr/UpdateCSV
I doubt you will see much of a difference between reading locally vs
streaming over HTTP, but it might be interesting to see the exact
overhead.

-Yonik
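[Editorial aside: the "multiple documents per request" suggestion above can be sketched with nothing but the standard library. The <add><doc> envelope is Solr's standard XML update format; the field names and the URL here are just illustrative:]

```python
import urllib.request
from xml.sax.saxutils import escape

def build_add_xml(docs):
    """Render a batch of documents as a single Solr <add> message."""
    parts = ["<add>"]
    for doc in docs:
        parts.append("<doc>")
        for name, value in doc.items():
            parts.append('<field name="%s">%s</field>' % (name, escape(str(value))))
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)

def post_update(xml, url="http://localhost:8983/solr/update"):
    """POST one batch: one HTTP round trip for many documents."""
    req = urllib.request.Request(
        url, data=xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"})
    return urllib.request.urlopen(req).read()

batch = [{"id": 1, "name": "Essential Track Top"},
         {"id": 2, "name": "Hooded Sweatshirt"}]
xml = build_add_xml(batch)
print(xml)
# post_update(xml)  # uncomment when a Solr instance is running
```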


RE: Too many open files

2007-08-09 Thread Stu Hood


If you check out the documentation for mergeFactor, you'll find that adjusting
it downward can lower the number of open files. Just remember that it is a
speed tradeoff, and only lower it as much as you need to stop getting the
"too many open files" errors.

See this section:
http://www.onjava.com/pub/a/onjava/2003/03/05/lucene.html#indexing_speed

Thanks,
Stu

-Original Message-
From: Ard Schrijvers [EMAIL PROTECTED]
Sent: Thu, August 9, 2007 10:52 am
To: solr-user@lucene.apache.org
Subject: RE: Too many open files

Hello,

Setting useCompoundFile to true should avoid the problem. You could also try to
set the maximum number of open files higher, something like (I assume linux)

ulimit -n 8192

Ard
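[Editorial aside: to check what limit a Unix process is actually running with, the same numbers that ulimit -n reports can be read from Python's standard resource module; a quick sketch:]

```python
import resource

# The soft limit is what "ulimit -n" adjusts; the hard limit is the
# ceiling a non-root process may raise its soft limit to.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit:", soft)
print("hard limit:", hard)
```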


 
 You're a gentleman and a scholar.  I will donate the M&Ms to 
 myself :).
 Can you tell me from this snippet of my solrconfig.xml what I might
 tweak to make this more betterer?
 
 -KH
 
   <indexDefaults>
    <!-- Values here affect all index writers and act as a default unless
         overridden. -->
     <useCompoundFile>false</useCompoundFile>
     <mergeFactor>10</mergeFactor>
     <maxBufferedDocs>1000</maxBufferedDocs>
     <maxMergeDocs>2147483647</maxMergeDocs>
     <maxFieldLength>1</maxFieldLength>
     <writeLockTimeout>1000</writeLockTimeout>
     <commitLockTimeout>1</commitLockTimeout>
   </indexDefaults>
 


Re: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Yonik Seeley
On 8/9/07, Siegfried Goeschl [EMAIL PROTECTED] wrote:
 +) my colleague just finished a database import service running within
 the servlet container to avoid writing out the data to the file system
 and transmitting it over HTTP.

Most people doing this read data out of the database and construct the
XML in-memory for sending to Solr... one definitely doesn't want to
write intermediate stuff to the filesystem (unless perhaps it's a CSV
dump).

 +) I think there were some discussion regarding a generic database
 importer but nothing I'm aware of

Absolutely a needed feature... it's in the queue:
https://issues.apache.org/jira/browse/SOLR-103

But there will always be more complex cases, pulling from multiple
data sources, doing some merging and munging, etc.  The easiest way to
handle many of those would probably be via a scripting language that
does the app-specific merging+munging and then uses a Solr client
(which constructs in-memory CSV or XML and sends to Solr).

-Yonik
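[Editorial aside: the "construct in-memory CSV and send to Solr" pattern above can be sketched as follows; the record sources here are hypothetical stand-ins for the multiple data sources being merged:]

```python
import csv
import io

# Hypothetical records pulled from two different sources, merged in the script
products = {1: {"id": 1, "name": "Track Top"}, 2: {"id": 2, "name": "Sweatshirt"}}
prices = {1: 55.95, 2: 39.95}

# Build the CSV entirely in memory; nothing touches the filesystem.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name", "price"])
writer.writeheader()
for pid, rec in products.items():
    writer.writerow(dict(rec, price=prices.get(pid, "")))

payload = buf.getvalue()  # this is what would be POSTed to /solr/update/csv
print(payload)
```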


always fail to update the first time after I restart the server

2007-08-09 Thread Xuesong Luo
Hi, 
I noticed the first index update after I restart my jboss server always
fails with the exception below. Any update after that works fine. Does
anyone know what the problem is? The solr version I'm using is solr 1.2.

Thanks
Xuesong


2007-08-09 11:41:44,559 ERROR [STDERR] Aug 9, 2007 11:41:44 AM
org.apache.solr.core.SolrException log
SEVERE: java.io.IOException: Underlying input stream returned zero bytes
at
sun.nio.cs.StreamDecoder$CharsetSD.readBytes(StreamDecoder.java:415)
at
sun.nio.cs.StreamDecoder$CharsetSD.implRead(StreamDecoder.java:453)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:183)
at java.io.InputStreamReader.read(InputStreamReader.java:167)
at org.xmlpull.mxp1.MXParser.fillBuf(MXParser.java:2972)
at org.xmlpull.mxp1.MXParser.more(MXParser.java:3026)
at org.xmlpull.mxp1.MXParser.parseProlog(MXParser.java:1410)
at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1395)
at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)
at org.xmlpull.mxp1.MXParser.nextTag(MXParser.java:1078)
at
org.apache.solr.handler.XmlUpdateRequestHandler.update(XmlUpdateRequestH
andler.java:111)
at
org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpd
ateRequestHandler.java:84)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerB
ase.java:77)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.ja
va:191)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.j
ava:159)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Applica
tionFilterChain.java:202)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilt
erChain.java:173)
at
org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilte
r.java:96)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Applica
tionFilterChain.java:202)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilt
erChain.java:173)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValv
e.java:213)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValv
e.java:178)
at
org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAs
sociationValve.java:175)
at
org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.j
ava:74)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java
:126)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java
:105)
at
org.jboss.web.tomcat.tc5.jca.CachedConnectionValve.invoke(CachedConnecti
onValve.java:156)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.
java:107)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:1
48)
at
org.apache.jk.server.JkCoyoteHandler.invoke(JkCoyoteHandler.java:199)
at
org.apache.jk.common.HandlerRequest.invoke(HandlerRequest.java:282)
at
org.apache.jk.common.ChannelSocket.invoke(ChannelSocket.java:767)
at
org.apache.jk.common.ChannelSocket.processConnection(ChannelSocket.java:
697)
at
org.apache.jk.common.ChannelSocket$SocketConnection.runIt(ChannelSocket.
java:889)
at
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool
.java:684)
at java.lang.Thread.run(Thread.java:595)
2007-08-09 11:41:44,590 ERROR [STDERR] Aug 9, 2007 11:41:44 AM
org.apache.solr.core.SolrCore execute
INFO: /update/  0 78
2007-08-09 11:41:44,590 ERROR [STDERR] Aug 9, 2007 11:41:44 AM
org.apache.solr.core.SolrException log
SEVERE: java.io.IOException: Underlying input stream returned zero bytes
at
sun.nio.cs.StreamDecoder$CharsetSD.readBytes(StreamDecoder.java:415)
at
sun.nio.cs.StreamDecoder$CharsetSD.implRead(StreamDecoder.java:453)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:183)
at java.io.InputStreamReader.read(InputStreamReader.java:167)
at org.xmlpull.mxp1.MXParser.fillBuf(MXParser.java:2972)
at org.xmlpull.mxp1.MXParser.more(MXParser.java:3026)
at org.xmlpull.mxp1.MXParser.parseProlog(MXParser.java:1410)
at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1395)
at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)
at org.xmlpull.mxp1.MXParser.nextTag(MXParser.java:1078)
at
org.apache.solr.handler.XmlUpdateRequestHandler.update(XmlUpdateRequestH
andler.java:111)
at
org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpd
ateRequestHandler.java:84)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerB
ase.java:77)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
at

Synonym questions

2007-08-09 Thread Tom Hill
Hi -

Just looking at synonyms, and had a couple of questions.

1) For some of my synonyms, it seems to make sense to simply replace the
original word with the other (e.g. theatre => theater, so searches for
either will find either). For others, I want to add an alternate term while
preserving the original (e.g. cirque => circus, so searches for "circus"
find "Cirque du Soleil", but searches for "cirque" only match "cirque", not
"circus").

I was thinking that the best way to do this was with two different synonym
filters. The replace filter would be used both at index and query time, the
other only at index time.

Does doing this using two synonym filters make sense?

section from my schema.xml:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StandardFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
      <filter class="solr.SynonymFilterFactory"
              synonyms="synonyms_replace.txt" ignoreCase="true" expand="false"
              includeOrig="false"/>
      <filter class="solr.SynonymFilterFactory"
              synonyms="synonyms_add.txt" ignoreCase="true" expand="false"
              includeOrig="true"/>
      <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StandardFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
      <filter class="solr.SynonymFilterFactory"
              synonyms="synonyms_replace.txt" ignoreCase="true" expand="false"
              includeOrig="false"/>
      <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
  </analyzer>
</fieldType>

2) For this to work, I need to use includeOrig. It appears that
includeOrig is hard-coded to be false in SynonymFilterFactory. Is there
any reason for this? It's pretty easy to change (diff below); any reason
this should not be supported?

Thanks,

Tom

Diffing vs. my local  copy of 1.2, but it appears to be the same in HEAD.

--- src/java/org/apache/solr/analysis/SynonymFilterFactory.java
+++ src/java/org/apache/solr/analysis/SynonymFilterFactory.java (working copy)
@@ -37,6 +37,7 @@
 
     ignoreCase = getBoolean("ignoreCase",false);
     expand = getBoolean("expand",true);
+    includeOrig = getBoolean("includeOrig",false);
 
     if (synonyms != null) {
       List<String> wlist=null;
@@ -57,8 +58,9 @@
   private SynonymMap synMap;
   private boolean ignoreCase;
   private boolean expand;
+  private boolean includeOrig;
 
-  private static void parseRules(List<String> rules, SynonymMap map, String mappingSep, String synSep, boolean ignoreCase, boolean expansion) {
+  private void parseRules(List<String> rules, SynonymMap map, String mappingSep, String synSep, boolean ignoreCase, boolean expansion) {
     int count=0;
     for (String rule : rules) {
       // To use regexes, we need an expression that specifies an odd number of chars.
@@ -88,7 +90,6 @@
     }
   }
 
-  boolean includeOrig=false;
   for (List<String> fromToks : source) {
     count++;
     for (List<String> toToks : target) {
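[Editorial aside: the two-file approach described above might look like this — hypothetical entries, and the exact rule semantics depend on the expand/includeOrig settings discussed in this thread:]

```
# synonyms_replace.txt -- used at both index and query time;
# both spellings are mapped onto one canonical form
theatre => theater

# synonyms_add.txt -- used at index time only, with includeOrig=true,
# so the original token is kept and the synonym added alongside it
cirque => circus
```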


Re: Too many open files

2007-08-09 Thread Mike Klaas

On 9-Aug-07, at 7:52 AM, Ard Schrijvers wrote:



ulimit -n 8192


Unless you have an old, creaky box, I highly recommend simply upping  
your filedesc cap.


-Mike


Re: Best use of wildcard searches

2007-08-09 Thread Yonik Seeley
On 8/9/07, Matthew Runo [EMAIL PROTECTED] wrote:
 Hmm.. I just tried the following three queries...

 /?q=department_exact:Apparel>Men's?Apparel>Jackets*&fq=country_code:US&fq=brand_exact:adidas...
 (no results)

 /?q=department_exact:Apparel>Men's\ Apparel>Jackets*&fq=country_code:US&fq=brand_exact:adidas...
 (no results)

 /?q=Apparel>Men's\ Apparel>Jackets*&fq=country_code:US&fq=brand_exact:adidas...
 (results)

 I know that the string I'm searching for is stored in
 department_exact (copyField) and department (original field).

What's the department_exact fieldType look like?

-Yonik


Re: Best use of wildcard searches

2007-08-09 Thread Matthew Runo

Here you go.. I thought that the "string" type wasn't munged, so I used that...

<field name="department" type="text" indexed="true" stored="true"/>
<field name="department_exact" type="string" indexed="true" stored="true"/>

<copyField source="department" dest="department_exact"/>



++
 | Matthew Runo
 | Zappos Development
 | [EMAIL PROTECTED]
 | 702-943-7833
++


On Aug 9, 2007, at 12:26 PM, Yonik Seeley wrote:


On 8/9/07, Matthew Runo [EMAIL PROTECTED] wrote:

Hmm.. I just tried the following three queries...

/?q=department_exact:Apparel>Men's?Apparel>Jackets*&fq=country_code:US&fq=brand_exact:adidas...
(no results)

/?q=department_exact:Apparel>Men's\ Apparel>Jackets*&fq=country_code:US&fq=brand_exact:adidas...
(no results)

/?q=Apparel>Men's\ Apparel>Jackets*&fq=country_code:US&fq=brand_exact:adidas...
(results)

I know that the string I'm searching for is stored in
department_exact (copyField) and department (original field).


What's the department_exact fieldType look like?

-Yonik





Re: Best use of wildcard searches

2007-08-09 Thread Yonik Seeley
On 8/9/07, Matthew Runo [EMAIL PROTECTED] wrote:
 Here you go.. I thought that string wasn't munged, so I used that...

 <field name="department" type="text" indexed="true" stored="true"/>
 <field name="department_exact" type="string" indexed="true" stored="true"/>
 <copyField source="department" dest="department_exact"/>

Hmmm, that looks ok.  You re-indexed since department_exact was added?
If so, could you show the exact XML response containing a document
with department_exact in it, and then a prefix query on
department_exact that doesn't return that document (with debugQuery=on)?

-Yonik


RE: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Kevin Holmes
Is this a native feature, or do we need to get creative with scp from
one server to the other?


If it's a contention between search and indexing, separate  them
via a query-slave and an index-master.

--cw


Re: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Yonik Seeley
On 8/9/07, Kevin Holmes [EMAIL PROTECTED] wrote:
 Python script queries the mysql DB then calls bash script

 Bash script performs a curl POST submit to solr

For the most up-to-date solr client for python, check out
https://issues.apache.org/jira/browse/SOLR-216

-Yonik


Re: Best use of wildcard searches

2007-08-09 Thread Matthew Runo

Yes, we've reindexed several times. Here are four sample queries..

1 - ?q=department_exact:Apparel>Men's?Apparel>Jackets*&fq=country_code:US&fq=brand_exact:adidas
2 - ?q=department_exact:Apparel>Men's\ Apparel>Jackets*&fq=country_code:US&fq=brand_exact:adidas
3 - ?q=department_exact:Apparel>Men's Apparel>Jackets*&fq=country_code:US&fq=brand_exact:adidas
4 - ?q=Apparel>Men's Apparel>Jackets*&fq=country_code:US&fq=brand_exact:adidas

1 is the only one that has any data now.. very strange that it'd
change when I changed nothing in the index. But at any rate,
shouldn't "?" and "\ " give the same results?


Also attached is my schema.xml.
++
 | Matthew Runo
 | Zappos Development
 | [EMAIL PROTECTED]
 | 702-943-7833
++

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">2</int>
  <lst name="params">
    <str name="q">department_exact:Apparel&gt;Men's?Apparel&gt;Jackets*</str>
    <arr name="fq"><str>country_code:US</str><str>brand_exact:adidas</str></arr>
  </lst>
</lst>
<result name="response" numFound="35" start="0">
<doc>
  <str name="brand">adidas</str>
  <str name="brand_exact">adidas</str>
  <int name="brand_id">1</int>
  <arr name="country_code"><str>US</str></arr>
  <date name="create_date">2007-08-05T01:08:26Z</date>
  <str name="department">Apparel&gt;Men's Apparel&gt;Jackets&gt;Windbreaker</str>
  <str name="department_exact">Apparel&gt;Men's Apparel&gt;Jackets&gt;Windbreaker</str>
  <str name="gender">Mens</str>
  <str name="hibernated">true</str>
  <bool name="inStock">true</bool>
  <str name="name">Essential 3-Stripes Track Top</str>
  <int name="popularity">20018</int>
  <float name="price">55.95</float>
  <int name="product_id">7280433</int>
  <int name="style_id">333625</int>
</doc>
[snip: further docs with the same department_exact value]
</result>
</response>

Re: Best use of wildcard searches

2007-08-09 Thread Yonik Seeley
On 8/9/07, Matthew Runo [EMAIL PROTECTED] wrote:
 Yes, we've reindexed several times. Here are four sample queries...

 1 - ?q=department_exact:Apparel>Men's?Apparel>Jackets*&fq=country_code:US&fq=brand_exact:adidas
 2 - ?q=department_exact:Apparel>Men's\ Apparel>Jackets*&fq=country_code:US&fq=brand_exact:adidas
 3 - ?q=department_exact:Apparel>Men's Apparel>Jackets*&fq=country_code:US&fq=brand_exact:adidas
 4 - ?q=Apparel>Men's Apparel>Jackets*&fq=country_code:US&fq=brand_exact:adidas

 1 is the only one that has any data now.. very strange that it'd
 change when I changed nothing in the index. But at any rate,
 shouldn't ? and \  give the same results?

They translate to different queries.
But can I see the XML output for 1 and 2 with debugQuery=on&indent=on appended?

-Yonik


Re: Best use of wildcard searches

2007-08-09 Thread Yonik Seeley
On 8/9/07, Yonik Seeley [EMAIL PROTECTED] wrote:
 They translate to different queries.
 But can I see the XML output for 1 and 2 with debugQuery=on&indent=on
 appended?

Or perhaps wt=python would be less confusing, seeing that there
are '>' chars in there that would otherwise be escaped.

-Yonik


Re: Best use of wildcard searches

2007-08-09 Thread Matthew Runo

Sure thing!

Here's 1 and 2.

1 - just a space.
2 - a \ .

++
 | Matthew Runo
 | Zappos Development
 | [EMAIL PROTECTED]
 | 702-943-7833
++



On Aug 9, 2007, at 1:14 PM, Yonik Seeley wrote:


On 8/9/07, Yonik Seeley [EMAIL PROTECTED] wrote:

They translate to different queries.
But can I see the XML output for 1 and 2 with  
debugQuery=on&indent=on appended?


Or perhaps wt=python would be less confusing, seeing that there
are '>' chars in there that would otherwise be escaped.

-Yonik





Re: Best use of wildcard searches

2007-08-09 Thread Matthew Runo
Hm, I don't see any attachments, I'm forwarding them to you directly.  
Would anyone else like to see them?


++
 | Matthew Runo
 | Zappos Development
 | [EMAIL PROTECTED]
 | 702-943-7833
++


On Aug 9, 2007, at 1:20 PM, Matthew Runo wrote:


Sure thing!

Heres 1, and 2.

1 - just a space.
2 - a \ .

++
 | Matthew Runo
 | Zappos Development
 | [EMAIL PROTECTED]
 | 702-943-7833
++


On Aug 9, 2007, at 1:14 PM, Yonik Seeley wrote:


On 8/9/07, Yonik Seeley [EMAIL PROTECTED] wrote:

They translate to different queries.
But can I see the XML output for 1 and 2 with  
debugQuery=on&indent=on appended?


Or perhaps wt=python would be less confusing, seeing that there
are '>' chars in there that would otherwise be escaped.

-Yonik







Re: Best use of wildcard searches

2007-08-09 Thread Matthew Runo
Feel free to run some queries yourself. We opened the firewall for  
this box...


http://66.209.92.171:8080/solr/select/?q=department_exact:Apparel%3EMen's\%20Apparel%3EJackets*&fq=country_code:US&fq=brand_exact:adidas&wt=python




++
 | Matthew Runo
 | Zappos Development
 | [EMAIL PROTECTED]
 | 702-943-7833
++


On Aug 9, 2007, at 1:21 PM, Matthew Runo wrote:

Hm, I don't see any attachments, I'm forwarding them to you  
directly. Would anyone else like to see them?


++
 | Matthew Runo
 | Zappos Development
 | [EMAIL PROTECTED]
 | 702-943-7833
++


On Aug 9, 2007, at 1:20 PM, Matthew Runo wrote:


Sure thing!

Heres 1, and 2.

1 - just a space.
2 - a \ .

++
 | Matthew Runo
 | Zappos Development
 | [EMAIL PROTECTED]
 | 702-943-7833
++


On Aug 9, 2007, at 1:14 PM, Yonik Seeley wrote:


On 8/9/07, Yonik Seeley [EMAIL PROTECTED] wrote:

They translate to different queries.
But can I see the XML output for 1 and 2 with  
debugQuery=on&indent=on appended?


Or perhaps wt=python would be less confusing, seeing that there
are '>' chars in there that would otherwise be escaped.

-Yonik









Re: Best use of wildcard searches

2007-08-09 Thread Yonik Seeley
On 8/9/07, Matthew Runo [EMAIL PROTECTED] wrote:
 Feel free to run some queries yourself. We opened the firewall for
 this box...

 http://66.209.92.171:8080/solr/select/?q=department_exact:Apparel%3EMen's\%20Apparel%3EJackets*&fq=country_code:US&fq=brand_exact:adidas&wt=python

OK, so this query is returning results, right?
So what query isn't returning any results (but should) now?

-Yonik


Re: Best use of wildcard searches

2007-08-09 Thread Matthew Runo
http://66.209.92.171:8080/solr/select/?q=department_exact:Apparel%3EMen's%20Apparel%3EJackets*&fq=country_code:US&fq=brand_exact:adidas&wt=python


The same exact query, with... wait..

Wow. I'm making myself look like an idiot.

I swear that these queries didn't work the first time I ran them...

But now \  (an escaped space) and ? give the same results, as would be
expected, while a bare unescaped space returns nothing.


I'm sorry for wasting your time, but I do appreciate the help!

++
 | Matthew Runo
 | Zappos Development
 | [EMAIL PROTECTED]
 | 702-943-7833
++


On Aug 9, 2007, at 1:35 PM, Yonik Seeley wrote:


On 8/9/07, Matthew Runo [EMAIL PROTECTED] wrote:

Feel free to run some queries yourself. We opened the firewall for
this box...

http://66.209.92.171:8080/solr/select/?q=department_exact:Apparel%3EMen's\%20Apparel%3EJackets*&fq=country_code:US&fq=brand_exact:adidas&wt=python


OK, so this query is returning results, right?
So what query isn't returning any results (but should) now?

-Yonik





Re: Best use of wildcard searches

2007-08-09 Thread Yonik Seeley
On 8/9/07, Matthew Runo [EMAIL PROTECTED] wrote:
 http://66.209.92.171:8080/solr/select/?q=department_exact:Apparel%3EMen's%20Apparel%3EJackets*&fq=country_code:US&fq=brand_exact:adidas&wt=python

 The same exact query, with... wait..

 Wow. I'm making myself look like an idiot.

 I swear that these queries didn't work the first time I ran them...

 But now \  (an escaped space) and ? give the same results, as would be
 expected, while a bare unescaped space returns nothing.

 I'm sorry for wasting your time, but I do appreciate the help!

lol - these things can happen when too many levels of escaping are needed.
Hopefully we can improve the situation in the future to get rid of the
query parser escaping for certain queries such as prefix and term.

-Yonik
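The two escaping layers (backslash for the query parser, then percent-encoding for HTTP) are easy to mix up, as this thread shows. A minimal Python sketch of how a client might build such a prefix query; the helper name and the sample values are illustrative, not from Solr:

```python
import urllib.parse

def prefix_query(field, value):
    """Build a Solr prefix-query q parameter on a literal value:
    backslash-escape spaces for the query parser, then percent-encode
    the whole thing for the HTTP request line."""
    escaped = value.replace(" ", "\\ ")  # query-parser level: \ before each space
    return "q=" + urllib.parse.quote(f"{field}:{escaped}*", safe="")  # HTTP level

print(prefix_query("department_exact", "Apparel>Men's Apparel>Jackets"))
```

Doing only one of the two steps produces exactly the kind of silently-empty result sets described above.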


Creating a document blurb when nothing is returned from highlight feature

2007-08-09 Thread Benjamin Higgins
Hi all, I'd like to provide a blurb of documents matching a search in
the case when there is no text highlighted.  I assumed that perhaps the
highlighter would give me back the first few words in a document if this
occurred, but it doesn't.  My conundrum is that I'd rather not grab the
whole document body field because some of them are large.  Is there some
way I can request from Lucene the first N words or lines from a field?

 

Thanks.

 

Ben



Re: Creating a document blurb when nothing is returned from highlight feature

2007-08-09 Thread Yonik Seeley
On 8/9/07, Benjamin Higgins [EMAIL PROTECTED] wrote:
 Hi all, I'd like to provide a blurb of documents matching a search in
 the case when there is no text highlighted.  I assumed that perhaps the
 highlighter would give me back the first few words in a document if this
 occurred, but it doesn't.  My conundrum is that I'd rather not grab the
 whole document body field because some of them are large.  Is there some
 way I can request from Lucene the first N words or lines from a field?

Hmmm, that does sound like a reasonable thing, and I guess it belongs
in/with the highlighter functionality?

Could you open a JIRA issue to track this?

-Yonik


Re: Creating a document blurb when nothing is returned from highlight feature

2007-08-09 Thread Mike Klaas

On 9-Aug-07, at 2:10 PM, Benjamin Higgins wrote:


Hi all, I'd like to provide a blurb of documents matching a search in
the case when there is no text highlighted.  I assumed that perhaps  
the
highlighter would give me back the first few words in a document if  
this
occurred, but it doesn't.  My conundrum is that I'd rather not grab  
the
whole document body field because some of them are large.  Is there  
some

way I can request from Lucene the first N words or lines from a field?


The way I deal with this is that I modified the highlighter fragment  
scorer to return a positive (but low) score for the first few  
fragments of a doc.  This will work, but tends not to provide great  
summaries and will definitely still fetch and process the entire doc  
contents.


The better way to do this is to generate a better general summary  
yourself and store it in a separate field; this can be used if no  
highlighting is generated (or, capability in Solr to automatically  
substitute a field in the case of no highlighting would be cool).  I  
might even implement this if there is sufficient interest :).


Unfortunately, the highlighter does not know (and really has no way of
knowing) what parts of a doc matched, so it would still have to try  
highlighting first.


Note that you can control the cpu usage for long fields by setting  
hl.maxAnalyzedChars (will be in the next release).


best,
-Mike
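For Mike's separate-summary-field approach, the blurb can be generated client-side at index time. A rough sketch in Python; the word limit and the sample text are arbitrary assumptions:

```python
def make_blurb(text, max_words=30):
    """Fallback summary: the first max_words whitespace-delimited words,
    with an ellipsis appended if the text was truncated."""
    words = text.split()
    blurb = " ".join(words[:max_words])
    if len(words) > max_words:
        blurb += " ..."
    return blurb

print(make_blurb("The Essential 3-Stripes Track Top is a full-zip jacket", 5))
```

The resulting string would be stored in its own field and returned whenever the highlighter produces no snippet.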


Returning a list of matching words

2007-08-09 Thread Thiago Jackiw
This may be obvious but I can't get my head straight. Is there a way
to return a list of matching words that a record got matched against?
For instance:

record_a: ruby, solr, mysql, rails
record_b: solr, java

Then ?q=solr+OR+rails would return the matched words for the records

record_a: solr, rails
record_b: solr

I'm not looking into using the highlight feature for that.

Thanks,

--
Thiago Jackiw


Multivalued fields and the 'copyField' operator

2007-08-09 Thread Lance Norskog
I'm adding a field to be the source of the spellcheck database.  Since that
is its only job, it has raw text lower-cased, de-Latin1'd, and
de-duplicated.
 
Since it is only for the spellcheck DB, it does not need to keep duplicates.
I specified it as 'multiValued=false and used copyField from a few other
fields to populate it. The Analyser promptly blew up, claiming that I was
putting multiple values in a single-valued field. I changed it to
multiValued=true, but now it keeps separate copies of the different
fields, which usually overlap.
 
Would it make sense for multiple copyField operations to work with a
single-valued field?  Since single-valued fields are a new feature in Solr,
I assume these little corner cases have not come to light before.
 
I defer to The Wisdom Of The Elder Council.
 
Thanks,
 
Lance


RE: Creating a document blurb when nothing is returned from highlight feature

2007-08-09 Thread Benjamin Higgins
Thanks Mike.  I didn't think of creating a blurb beforehand, but that's
a great solution.  I'll probably do that.  Yonik, I can still add a JIRA
issue if you'd like, though.

Ben

-Original Message-
From: Mike Klaas [mailto:[EMAIL PROTECTED] 
Sent: Thursday, August 09, 2007 2:32 PM
To: solr-user@lucene.apache.org
Subject: Re: Creating a document blurb when nothing is returned from
highlight feature

On 9-Aug-07, at 2:10 PM, Benjamin Higgins wrote:

 Hi all, I'd like to provide a blurb of documents matching a search in
 the case when there is no text highlighted.  I assumed that perhaps  
 the
 highlighter would give me back the first few words in a document if  
 this
 occurred, but it doesn't.  My conundrum is that I'd rather not grab  
 the
 whole document body field because some of them are large.  Is there  
 some
 way I can request from Lucene the first N words or lines from a field?

The way I deal with this is that I modified the highlighter fragment  
scorer to return a positive (but low) score for the first few  
fragments of a doc.  This will work, but tends not to provide great  
summaries and will definitely still fetch and process the entire doc  
contents.

The better way to do this is to generate a better general summary  
yourself and store it in a separate field; this can be used if no  
highlighting is generated (or, capability in Solr to automatically  
substitute a field in the case of no highlighting would be cool).  I  
might even implement this if there is sufficient interest :).

Unfortunately, the highlighter does not know (and realy has no way of  
knowing) what parts of a doc matched, so it would still have to try  
highlighting first.

Note that you can control the cpu usage for long fields by setting  
hl.maxAnalyzedChars (will be in the next release).

best,
-Mike


Is it possible to know from where in the field highlighed text comes from?

2007-08-09 Thread Benjamin Higgins
Hi again,

 

It'd be nice to know what the starting line number is for highlighted
snippets.  I imagine others might find it useful to know the starting
byte offset.  Is there an easy way to add this in?  I'm not afraid of
hacking the source if it's not too involved.

 

Thanks.

 

Ben



tomcat and solr multiple instances

2007-08-09 Thread Jae Joo
Hi,

 

I have built 2 solr instance - one is example and the other is
ca_companies.

 

The ca_companies solr instance is working fine, but example is not
working...

 

In the admin page, /solr/admin, for example instance, it shows that

 

Cwd=/rpt/src/apache-solr-1.2.0/ca_companies/solr/conf  

-- this should be 

Cwd=/rpt/src/apache-solr-1.2.0/example/solr/conf  

 

SolrHome=/rpt/src/apache-solr-1.2.0/example/solr/

 

Anyone know why?

 

If I run Jetty for instance example, it is working well...

 

Thanks,

 

Jae Joo



EmbeddedSolr and optimize

2007-08-09 Thread Sundling, Paul
http://wiki.apache.org/solr/EmbeddedSolr
 
Following the example on connecting to the Index directly without using
HTTP, I tried to optimize by passing the true flag to the
CommitUpdateCommand.
 
When optimizing an index with Lucene directly it doubles the size of the
index temporarily and then deletes the old segments that were optimized.
Instead, what happened was the old segments were still there.  Calling
optimize a second time did remove the old segments.
 
With Lucene it's usually:
writer.optimize();
writer.close();
 
So what method call do I need to make afterwards so I don't have to call
optimize a second time with the Solr API?
 
public void index() {
//do stuff 
while (loop) {
  //add millions of documents and commit at intervals
}
optimize(); // optimize to reduce file handles
optimize();//clean up old segments which still existed
//WHAT SHOULD BE HERE INSTEAD OF ANOTHER OPTIMIZE?
}
 
public void commit() throws IOException {
commit(false);
}
 
public void optimize() throws IOException {
    logger.info("Optimizing an index temporarily doubles the size of the index,"
        + " but reduces the number of files");
    commit(true);
}
 
private static void commit(boolean optimize) throws IOException {
UpdateHandler updateHandler = core.getUpdateHandler(); 
CommitUpdateCommand commitcmd = new CommitUpdateCommand(optimize);
updateHandler.commit(commitcmd);
}
 
Paul Sundling



RE: tomcat and solr multiple instances

2007-08-09 Thread Jae Joo
Here are the Catalina/localhost/ files.
For the example instance:

<Context docBase="/rpt/src/apache-solr-1.2.0/dist/solr.war"
         debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String"
               value="/rpt/src/apache-solr-1.2.0/example/solr"
               override="true" />
</Context>


For the ca_companies instance:

<Context docBase="/rpt/src/apache-solr-1.2.0/dist/solr.war"
         debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String"
               value="/rpt/src/apache-solr-1.2.0/ca_companies/solr"
               override="true" />
</Context>


URLs:
http://host:8080/solr/admin -- pointing at the example instance (Problem...)
http://host:8080/solr_ca/admin -- pointing at the ca_companies instance (it
is working)

-Original Message-
From: Jae Joo [mailto:[EMAIL PROTECTED] 
Sent: Thursday, August 09, 2007 5:45 PM
To: solr-user@lucene.apache.org
Subject: tomcat and solr multiple instances

Hi,

 

I have built 2 solr instance - one is example and the other is
ca_companies.

 

The ca_companies solr instance is working find, but example is not
working...

 

In the admin page, /solr/admin, for example instance, it shows that

 

Cwd=/rpt/src/apache-solr-1.2.0/ca_companies/solr/conf  

-- this should be 

Cwd=/rpt/src/apache-solr-1.2.0/example/solr/conf  

 

SolrHome=/rpt/src/apache-solr-1.2.0/example/solr/

 

Any one knows why?

 

If I run Jetty for instance example, it is working well...

 

Thanks,

 

Jae Joo



Re: Returning a list of matching words

2007-08-09 Thread Yonik Seeley
On 8/9/07, Thiago Jackiw [EMAIL PROTECTED] wrote:
 This may be obvious but I can't get my head straight. Is there a way
 to return a list of matching words that a record got matched against?

Unfortunately no... lucene doesn't provide that capability with
standard queries.
You could do it (slower) with additional queries of course:

q=solr OR rails&rows=5  // retrieves the top docs
q=solr OR rails&fq=solr&fl=id  // see which top docs match solr
q=solr OR rails&fq=rails&fl=id  // see which top docs match rails

-Yonik
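Sketching that approach in Python: issue the base query once, then one fq-filtered variant per term, and intersect the returned ids. The data below is simulated rather than fetched from a live Solr instance:

```python
def matched_terms(base_ids, ids_by_term):
    """Given doc ids from the base query and, per term, the doc ids from
    the same query with fq=<term> added, report which terms matched each doc."""
    return {doc_id: [t for t, ids in ids_by_term.items() if doc_id in ids]
            for doc_id in base_ids}

# Simulated responses for q=solr OR rails (no network calls here):
base = ["record_a", "record_b"]            # from q=solr OR rails&rows=5
per_term = {
    "solr": {"record_a", "record_b"},      # from ...&fq=solr&fl=id
    "rails": {"record_a"},                 # from ...&fq=rails&fl=id
}
print(matched_terms(base, per_term))
```

Note this costs one extra query per search term, which is why it is slower than a built-in mechanism would be.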


RE: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Lance Norskog
Jython is a Python interpreter implemented in Java. (I have a lot of Python
code.)

Total throughput in the servlet is very sensitive to the total number of
servlet sockets available v.s. the number of CPUs.

The different analysers have very different performance.

You might leave some data in the DB, instead of storing it all in the index.

Underlying this all, you have a sneaky network performance problem. Your
successive posts do not reuse a TCP socket. Obvious: re-opening a new socket
each post takes time. Not obvious: your server has sockets building up in
TIME_WAIT state.  (This means the sockets are shutting down. Having both
ends agree to close the connection is metaphysically difficult. The TCP/IP
spec even has a bug in this area.) Sockets building up can cause TCP resources
to run low or run out entirely. Your kernel configuration may be weak in this
area.

Lance

-Original Message-
From: Kevin Holmes [mailto:[EMAIL PROTECTED] 
Sent: Thursday, August 09, 2007 8:13 AM
To: solr-user@lucene.apache.org
Subject: Any clever ideas to inject into solr? Without http?

I inherited an existing (working) solr indexing script that runs like
this:

 

Python script queries the mysql DB then calls bash script

Bash script performs a curl POST submit to solr

 

We're injecting about 1000 records / minute (constantly), frequently pushing
the edge of our CPU / RAM limitations.

 

I'm in the process of building a Perl script to use DBI and
lwp::simple::post that will perform this all from a single script (instead
of 3).

 

Two specific questions

1: Does anyone have a clever (or better) way to perform this process
efficiently?

 

2: Is there a way to inject into solr without using POST / curl / http?

 

Admittedly, I'm no solr expert - I'm starting from someone else's setup,
trying to reverse-engineer my way out.  Any input would be greatly
appreciated.
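One way to collapse the Python-plus-bash pipeline into a single script, and to reuse one TCP connection per batch as suggested above, is to build the <add> XML and POST it with Python's http.client. A sketch only; the host, port, and /solr/update path are assumptions for a default Solr setup:

```python
import http.client
from xml.sax.saxutils import escape

def add_xml(docs):
    """Render a list of field dicts as a Solr <add> message."""
    parts = ["<add>"]
    for doc in docs:
        parts.append("<doc>")
        for name, value in doc.items():
            parts.append('<field name="%s">%s</field>' % (name, escape(str(value))))
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)

def post_batch(docs, host="localhost", port=8983):
    """POST one batch over a single connection, instead of spawning one
    curl process (and one TCP socket) per record."""
    conn = http.client.HTTPConnection(host, port)
    conn.request("POST", "/solr/update", add_xml(docs),
                 {"Content-Type": "text/xml; charset=utf-8"})
    return conn.getresponse().status

print(add_xml([{"id": 42, "name": "Track Top"}]))
```

Batching many documents per POST also cuts the per-request overhead that shows up at 1000 records/minute.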




Re: Multivalued fields and the 'copyField' operator

2007-08-09 Thread Yonik Seeley
On 8/9/07, Lance Norskog [EMAIL PROTECTED] wrote:
 I'm adding a field to be the source of the spellcheck database.  Since that
 is its only job, it has raw text lower-cased, de-Latin1'd, and
 de-duplicated.

 Since it is only for the spellcheck DB, it does not need to keep duplicates.

Duplicate token values (words) or duplicate field values?
Could you give some examples?

-Yonik


Re: tomcat and solr multiple instances

2007-08-09 Thread Pieter Berkel
The current working directory (Cwd) is the directory from which you started
the Tomcat server and is not dependent on the Solr instance configurations.
So as long as SolrHome is correct for each Solr instance, you shouldn't have
a problem.

cheers,
Piete



On 10/08/07, Jae Joo [EMAIL PROTECTED] wrote:

 Here are the Catalina/localhost/ files
 For example instance
 <Context docBase="/rpt/src/apache-solr-1.2.0/dist/solr.war"
          debug="0" crossContext="true">
   <Environment name="solr/home" type="java.lang.String"
                value="/rpt/src/apache-solr-1.2.0/example/solr"
                override="true" />
 </Context>


 For ca_companies instance

 <Context docBase="/rpt/src/apache-solr-1.2.0/dist/solr.war"
          debug="0" crossContext="true">
   <Environment name="solr/home" type="java.lang.String"
                value="/rpt/src/apache-solr-1.2.0/ca_companies/solr"
                override="true" />
 </Context>


 Urls
 http://host:8080/solr/admin -- pointint example instance (Problem...)
 http://host:8080/solr_ca/admin -- pointing ca-companies instance (it
 is working)

 -Original Message-
 From: Jae Joo [mailto:[EMAIL PROTECTED]
 Sent: Thursday, August 09, 2007 5:45 PM
 To: solr-user@lucene.apache.org
 Subject: tomcat and solr multiple instances

 Hi,



 I have built 2 solr instance - one is example and the other is
 ca_companies.



 The ca_companies solr instance is working find, but example is not
 working...



 In the admin page, /solr/admin, for example instance, it shows that



 Cwd=/rpt/src/apache-solr-1.2.0/ca_companies/solr/conf

 -- this should be

 Cwd=/rpt/src/apache-solr-1.2.0/example/solr/conf



 SolrHome=/rpt/src/apache-solr-1.2.0/example/solr/



 Any one knows why?



 If I run Jetty for instance example, it is working well...



 Thanks,



 Jae Joo




Re: Creating a document blurb when nothing is returned from highlight feature

2007-08-09 Thread Sean Timm
It should probably be configurable: (1) return nothing if no match, (2) 
substitute with an alternate field, (3) return first sentence or N 
number of tokens.

-Sean

Yonik Seeley wrote on 8/9/2007, 5:50 PM:

  On 8/9/07, Benjamin Higgins [EMAIL PROTECTED] wrote:
   Thanks Mike.  I didn't think of creating a blurb beforehand, but that's
   a great solution.  I'll probably do that.  Yonik, I can still add a
  JIRA
   issue if you'd like, though.
 
  Always 10 different ways to tackle the same problem in the search
  space, and that's why it's great to have a lot of people around for
  different ideas/approaches.
 
  I do think opening a JIRA issue would be worth it, even if Mike's
  approach yields superior results.  It seems like a reasonable
  expectation to always get something back as a document summary without
  having to create a specific field for that.
 
  -Yonik
 




Re: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Norberto Meijome
On Thu, 9 Aug 2007 15:23:03 -0700
Lance Norskog [EMAIL PROTECTED] wrote:

 Underlying this all, you have a sneaky network performance problem. Your
 successive posts do not reuse a TCP socket. Obvious: re-opening a new socket
 each post takes time. Not obvious: your server has sockets building up in
 TIME_WAIT state.  (This means the sockets are shutting down. Having both
 ends agree to close the connection is metaphysically difficult. The TCP/IP
 spec even has a bug in this area.) Sockets building up can use TCP resources
 to run low or may run out. Your kernel configuration may be weak in this
 area.

Good point. and putting my pedantic hat on here, it may not necessarily be 
'kernel configuration', but network stack - not sure what OS the OP is using.
B
_
{Beto|Norberto|Numard} Meijome

All parts should go together without forcing. You must remember that the parts 
you are reassembling were disassembled by you.
 Therefore, if you can't get them together again, there must be a reason. 
 By all means, do not use hammer.
   IBM maintenance manual, 1975

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


RE: Multivalued fields and the 'copyField' operator

2007-08-09 Thread Lance Norskog
If we have a field spellcheck_db, and have two copyField lines for it:

<fieldType name="spellcheck" ... />  (basically the "text" type without stemming)

<field name="title" type="string" />
<field name="description" type="string" />

<field name="spellcheck_db" multiValued="false"
       type="spellcheck" indexed="true" stored="false"
       required="true" />

<copyField source="title" dest="spellcheck_db" />
<copyField source="description" dest="spellcheck_db" />

All I want to do is make a pile of words as input to the spellcheck feature.

If I index with this, the spellcheck Analyser class complains that I'm
putting two values in a multiValued=false field. Since I have to make it
multiValued, the same word in successive values is not collapsed into one
mention of the word.

I suppose this is an 'out' case, and not worth any major internal rework.

Thanks for your time,

Lance
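Until copyField can collapse multiple sources into a single-valued field, one workaround is to merge and de-duplicate the source fields client-side before indexing. A hypothetical sketch; the function name and sample values are mine, not from Solr:

```python
def spellcheck_value(*field_values):
    """Merge several source fields into one lower-cased, de-duplicated
    word list, suitable for a single-valued spellcheck field."""
    seen = []
    for value in field_values:
        for word in value.lower().split():
            if word not in seen:
                seen.append(word)  # keep first occurrence, drop repeats
    return " ".join(seen)

print(spellcheck_value("Red Track Jacket", "red windbreaker jacket"))
```

The single merged string would then be sent as the one value of spellcheck_db.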

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley
Sent: Thursday, August 09, 2007 5:28 PM
To: solr-user@lucene.apache.org
Subject: Re: Multivalued fields and the 'copyField' operator

On 8/9/07, Lance Norskog [EMAIL PROTECTED] wrote:
 I'm adding a field to be the source of the spellcheck database.  Since 
 that is its only job, it has raw text lower-cased, de-Latin1'd, and 
 de-duplicated.

 Since it is only for the spellcheck DB, it does not need to keep
duplicates.

Duplicate token values (words) or duplicate field values?
Could you give some examples?

-Yonik



Re: [newbie] how to debug the schema?

2007-08-09 Thread Franz Allan Valencia See
Good day,

danc86 of #lucene gave me the answer - I was not storing the fields :-)

Thanks,
Franz

On 8/9/07, Ryan McKinley [EMAIL PROTECTED] wrote:

 
  [QUESTION]
  What could be the problem? .Or what else can I do to debug this problem?
 

 In general 'luke' is a great tool to figure out what may be happening in
 the index.

 (assuming you are running 1.2) check your schema fields from:
 http://localhost:8983/solr/admin/luke?show=schema

 Using http://www.getopt.org/luke/ is also very useful.





RE: Best use of wildcard searches

2007-08-09 Thread Jonathan Woods
Maybe there's a different way, in which path-like values like this are
treated explicitly.

I use a similar approach to Matthew at www.colfes.com, where all pages are
generated from Lucene searches according to filters on a couple of
hierarchical categories ('spaces'), i.e. subject and organisational unit.
From that experience, a few things occur to me here:

1.  The structure of any particular category/space is not immediately
derivable from data, so unless we're Google or doing something RDF-like
they're something you define up front.  For this reason, and because it
makes internationalisation easier, I feel you should model this kind of
standing data independently of its representation.

So instead of searching for Departments>Men's Apparel>Jackets, I index (and
search for) a String /departments/mensapparel/jackets/, and use a simple
standing-data mapping to resolve each of the nodes along the path to a
human-readable form when necessary.  In my case, the values for any
particular resource (e.g. a news article) are defined by CMS users from
drop-downs.

2.  In my Lucene library, I redundantly indexed paths like
/departments/mensapparel/jackets/ into successive fragments, together with
the whole path value:

/departments
/departments/mensapparel
/departments/mensapparel/jackets
/departments/mensapparel/jackets/

using my own PathAnalyzer (extends Analyzer, of course) which makes it very
fast to query on path fragments: all goods anywhere in the men's apparel
section - query on /departments/mensapparel; all goods categorised as
exactly in the men's apparel section - query on
/departments/mensapparel/.

I implemented all queries like this as filters, and cached the filter
definitions.  I guess Solr's query optimisation and filter caching do all
this out of the box, so it may end up being just as fast to use the kind of
PrefixQuery suggested in this thread.
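The redundant fragment indexing described in point 2 can be sketched in a few lines of Python; a client (or a custom analyzer) would emit these tokens for the path field. The function name is mine, not from the PathAnalyzer mentioned here:

```python
def path_fragments(path):
    """Emit each ancestor fragment of a slash-separated path, plus the
    full path with a trailing slash for exact-level matches."""
    parts = [p for p in path.strip("/").split("/") if p]
    frags = ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]
    if frags:
        frags.append(frags[-1] + "/")  # exact-level token
    return frags

print(path_fragments("/departments/mensapparel/jackets/"))
```

Querying on /departments/mensapparel then matches anything in the men's apparel subtree, while /departments/mensapparel/ matches only items categorised at exactly that level.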

3.  However, I can post/attach/donate PathAnalyzer if anyone thinks it might
still be useful.  I started off calling it HierarchyValueAnalyzer, then
TreeNodePathAnalyzer, but now that it's PathAnalyzer I can't help thinking
it might have lots of applications.

Jon 

 -Original Message-
 From: Yonik Seeley [mailto:[EMAIL PROTECTED] 
 Sent: 09 August 2007 21:50
 To: solr-user@lucene.apache.org
 Subject: Re: Best use of wildcard searches
 
 On 8/9/07, Matthew Runo [EMAIL PROTECTED] wrote:
  http://66.209.92.171:8080/solr/select/?q=department_exact:Apparel%
  3EMen's%20Apparel%
  3EJackets*fq=country_code:USfq=brand_exact:adidaswt=python
 
  The same exact query, with... wait..
 
  Wow. I'm making myself look like an idiot.
 
  I swear that these queries didn't work the first time I ran them...
 
  But now \  (an escaped space) and ? give the same results, as would be
  expected, while a bare unescaped space returns nothing.
 
  I'm sorry for wasting your time, but I do appreciate the help!
 
 lol - these things can happen when too many levels of
 escaping are needed.
 Hopefully we can improve the situation in the future to get 
 rid of the query parser escaping for certain queries such as 
 prefix and term.