Solr commit takes too long

2007-09-10 Thread Marius Hanganu

Hi,

We're having a problem when committing to Solr.

Our application commits right after each update - we need the data to be
available instantaneously. The index's size is about 166M, and Solr has
1024M on a dual quad.


The update takes a few milliseconds, but the commit takes about 1 minute.

Could you please recommend what we should check for? Or perhaps some 
tuning parameters?


Thanks,
Marius
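
Commit cost at this size is usually dominated by opening a new searcher
and autowarming its caches rather than by the write itself; a hedged
sketch of the solrconfig.xml knobs involved (values illustrative):

  <filterCache class="solr.LRUCache" size="512" initialSize="512"
               autowarmCount="64"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512"
                    autowarmCount="64"/>
  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <!-- warming queries run on every commit; trim or remove them
           if commits must return quickly -->
    </arr>
  </listener>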


Re: caching query result

2007-09-10 Thread Jae Joo
Here is the response XML, faceted by multiple fields including state:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1782</int>
    <lst name="params">
      <str name="facet.limit">-1</str>
      <str name="wt"/>
      <str name="rows">10</str>
      <str name="start">0</str>
      <str name="sort">score desc</str>
      <str name="facet">true</str>
      <str name="facet.mincount">1</str>
      <str name="fl">duns_number,company_name,phys_state,phys_city,score</str>
      <str name="q">phys_country:United States</str>
      <str name="qt"/>
      <str name="version">2.2</str>
      <str name="explainOther"/>
      <str name="hl.fl"/>
      <arr name="facet.field">
        <str>sales_range</str>
        <str>total_emp_range</str>
        <str>company_type</str>
        <str>phys_state</str>
        <str>sic1</str>
      </arr>
      <str name="indent">on</str>
    </lst>
  </lst>

On 9/6/07, Yonik Seeley [EMAIL PROTECTED] wrote:

 On 9/6/07, Jae Joo [EMAIL PROTECTED] wrote:
  I have 13 million documents and facet by state (50). If there is a
  mechanism to cache, I may get faster results back.

 How fast are you getting results back with standard field faceting
 (facet.field=state)?
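
For comparison, a single-field facet request of the kind asked about above
looks like this (host, port, and path illustrative):

  http://localhost:8983/solr/select?q=phys_country:%22United%20States%22&facet=true&facet.field=phys_state&facet.limit=-1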



RE: Solr and KStem

2007-09-10 Thread Wagner,Harry
Yes, I don't think the licensing will be a problem as KStem already
includes a wrapper for Lucene.

Cheers!
harry

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Friday, September 07, 2007 4:40 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr and KStem

Look for KStem in Lucene JIRA.  Many years ago something KStem-related
was contributed, and there was a discussion about licenses then.

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: Walter Underwood [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Friday, September 7, 2007 4:31:25 PM
Subject: Re: Solr and KStem

Even if KStem isn't ASL, we could include the plug-in code
with notes about how to get the stemmer. Or, the Solr plug-in
could be contributed to the group that manages the KStem
distribution:

  http://ciir.cs.umass.edu/cgi-bin/downloads/downloads.cgi

wunder

On 9/7/07 12:59 PM, Yonik Seeley [EMAIL PROTECTED] wrote:

 On 9/7/07, Wagner,Harry [EMAIL PROTECTED] wrote:
 I've implemented a Solr plug-in that wraps KStem for Solr use.  KStem is
 considered to be more appropriate for library usage since it is much
 less aggressive than Porter (i.e., searches for organization do NOT
 match on organ!). If there is any interest in feeding this back into
 Solr I would be happy to contribute it.
 
 Absolutely.
 We need to make sure that the license for that k-stemmer is ASL
 compatible of course.
 
 -Yonik






quirks with sorting

2007-09-10 Thread David Whalen
Hi All.

I'm seeing a weird problem with sorting that I can't figure out.

I have a query that uses two fields -- a source column and a
date column.  I search on the source and I sort by the date
descending.

What I'm seeing is that depending on the value in the source,
the date sort works in reverse.

For example, the query:

content_source:(mv); content_date desc

returns 2007-09-10T09:25:00.000Z in its first row, which is what
I expect.

BUT, the query:

content_source:(thomson); content_date desc

returns 2008-08-17T00:00:00.000Z, which is the first date we
put into SOLR.

So, simply by changing the value in the field, the sort seems
to be reversed (or ignored outright).

Now, before you ask, I did a sanity-check query to make sure
that there is in fact data for that source from today, and there
is.

Can anyone help shed some light on this?

TIA

DW


Re: quirks with sorting

2007-09-10 Thread Yonik Seeley
On 9/10/07, David Whalen [EMAIL PROTECTED] wrote:
 I'm seeing a weird problem with sorting that I can't figure out.

 I have a query that uses two fields -- a source column and a
 date column.  I search on the source and I sort by the date
 descending.

 What I'm seeing is that depending on the value in the source,
 the date sort works in reverse.

 For example, the query:

 content_source:(mv); content_date desc

 returns 2007-09-10T09:25:00.000Z in its first row, which is what
 I expect.

 BUT, the query:

 content_source:(thomson); content_date desc

 returns 2008-08-17T00:00:00.000Z, which is the first date we
 put into SOLR.

Isn't it the last (highest date) since it's 2008?

-Yonik


RE: quirks with sorting

2007-09-10 Thread David Whalen
<red-faced>

You know, I must have looked at that date 10 times and I never
noticed the year.

Sorry everyone!

</red-faced>

  

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf 
 Of Yonik Seeley
 Sent: Monday, September 10, 2007 11:23 AM
 To: solr-user@lucene.apache.org
 Subject: Re: quirks with sorting
 
 On 9/10/07, David Whalen [EMAIL PROTECTED] wrote:
  I'm seeing a weird problem with sorting that I can't figure out.
 
  I have a query that uses two fields -- a source column and a date 
  column.  I search on the source and I sort by the date descending.
 
  What I'm seeing is that depending on the value in the 
 source, the date 
  sort works in reverse.
 
  For example, the query:
 
  content_source:(mv); content_date desc
 
  returns 2007-09-10T09:25:00.000Z in its first row, which is what I 
  expect.
 
  BUT, the query:
 
  content_source:(thomson); content_date desc
 
  returns 2008-08-17T00:00:00.000Z, which is the first date 
 we put into 
  SOLR.
 
 Isn't it the last (highest date) since it's 2008?
 
 -Yonik
 


Re: Distribution Information?

2007-09-10 Thread Bill Au
I guess your solr home isn't configured correctly.  FYI, you can set
master_status_dir to a full path name (i.e., /opt/solr/logs/clients in
your case).

Bill
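
For reference, that would be a key=value entry in solr/conf/scripts.conf
along these lines (path taken from this thread):

  master_status_dir=/opt/solr/logs/clients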

On 9/7/07, Matthew Runo [EMAIL PROTECTED] wrote:

 OK. I made the change, but it seemed not to pick up the files.

 When I changed distributiondump.jsp to say...

 File masterdir = new File("/opt/solr/logs/clients");

 it worked. Thank you for your help!

 ++
   | Matthew Runo
   | Zappos Development
   | [EMAIL PROTECTED]
   | 702-943-7833
 ++


 On Sep 7, 2007, at 2:21 PM, Bill Au wrote:

  I just double checked distributiondump.jsp.  The directory where it
  looks for
  status files is hard coded to logs/clients.  So for now
  master_status_dir in
  your solr/conf/scripts.conf has to be set to that so the scripts
  will put
  the status files there.  It looks like they are currently in your logs
  directory.  The status files are snapshot.current.search2 and
  snapshot.status.search2.
 
  Bill
 
  On 9/7/07, Matthew Runo [EMAIL PROTECTED] wrote:
 
  Actually I don't have the clients directory...
 
  [EMAIL PROTECTED]: .../logs]$ pwd
  /opt/solr/logs
  [EMAIL PROTECTED]: .../logs]$ ls
  rsyncd-enabled  rsyncd.log  rsyncd.pid  snapcleaner.log
  snapshooter.log  snapshot.current.search2  snapshot.status.search2
  [EMAIL PROTECTED]: .../logs]$
 
 
  It does look like it could be a path issue. I wonder why, though, no
  clients sub directory was created.
 
  ++
| Matthew Runo
| Zappos Development
| [EMAIL PROTECTED]
| 702-943-7833
  ++
 
 
  On Sep 7, 2007, at 7:43 AM, Bill Au wrote:
 
  In that case, definitely take a look at SOLR-333:
 
  http://issues.apache.org/jira/browse/SOLR-333
 
  On the master there should be a logs/clients directory.  Do you
  have any
  files in there?
 
  Bill
 
  On 9/6/07, Matthew Runo [EMAIL PROTECTED] wrote:
 
  Well, I do get...
 
  Distribution Info
  Master Server
 
  No distribution info present
 
  ...
 
  But there appears to be no information filled in.
 
  ++
| Matthew Runo
| Zappos Development
| [EMAIL PROTECTED]
| 702-943-7833
  ++
 
 
  On Sep 6, 2007, at 6:09 AM, Bill Au wrote:
 
  That is very strange.  Even if there is something wrong with the
  config or
  code, the static HTML contained in distributiondump.jsp should
  show
  up.
 
  Are you using the latest version of the JSP?  There has been a
  recent fix:
 
  http://issues.apache.org/jira/browse/SOLR-333
 
  Bill
 
  On 9/5/07, Matthew Runo [EMAIL PROTECTED] wrote:
 
  When I load distributiondump.jsp, there is no output in my
  catalina.out file.
 
  ++
| Matthew Runo
| Zappos Development
| [EMAIL PROTECTED]
| 702-943-7833
  ++
 
 
  On Sep 5, 2007, at 1:55 PM, Matthew Runo wrote:
 
  Not that I've noticed. I'll do a more careful grep soon here - I
  just got back from a long weekend.
 
  ++
   | Matthew Runo
   | Zappos Development
   | [EMAIL PROTECTED]
   | 702-943-7833
  ++
 
 
  On Aug 31, 2007, at 6:12 PM, Bill Au wrote:
 
  Are there any error message in your appserver log files?
 
  Bill
 
  On 8/31/07, Matthew Runo [EMAIL PROTECTED] wrote:
  Hello!
 
  /solr/admin/distributiondump.jsp
 
  This server is set up as a master server, and other servers
  use
  the
  replication scripts to pull updates from it every few
  minutes. My
  distribution information screen is blank.. and I couldn't
  find any
  information on fixing this in the wiki.
 
  Any chance someone would be able to explain how to get this
  page
  working, or what I'm doing wrong?
 
  ++
| Matthew Runo
| Zappos Development
| [EMAIL PROTECTED]
| 702-943-7833
  ++
 
 
 
 
 
 
 
 
 
 
 




Re: My Solr index keeps growing

2007-09-10 Thread Yonik Seeley
On 9/10/07, Robin Bonin [EMAIL PROTECTED] wrote:
 I had created a new index over the weekend, and the final size was a
 few hundred megs.
 I just checked and now the index folder is up to 1.7 Gig. Is this due
 to results being cached? Can I set a limit to how large the index will
 grow? Is there anything else that could be affecting this file size?

"index" normally refers to the index files on disk... is this what you mean?
If so, it shouldn't grow unless new documents are added.

-Yonik


Re: My Solr index keeps growing

2007-09-10 Thread Robin Bonin
Yes, I am talking about the files in the solr/data/index folder.
So that folder should stay the same size unless documents are added
and, I guess, commit and optimize are run.

I'll have to watch my app and make sure it is not adding some extra
stuff to the index I am not aware of.
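
One note: disk usage can spike temporarily while Lucene merges segments,
and space held by deleted or overwritten documents is only reclaimed on a
merge or optimize. An optimize can be issued like this (host and port
illustrative):

  curl http://localhost:8983/solr/update -H 'Content-type:text/xml' --data-binary '<optimize/>'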

On 9/10/07, Yonik Seeley [EMAIL PROTECTED] wrote:
 On 9/10/07, Robin Bonin [EMAIL PROTECTED] wrote:
  I had created a new index over the weekend, and the final size was a
  few hundred megs.
  I just checked and now the index folder is up to 1.7 Gig. Is this due
  to results being cached? Can I set a limit to how large the index will
  grow? Is there anything else that could be affecting this file size?

 "index" normally refers to the index files on disk... is this what you
 mean?
 If so, it shouldn't grow unless new documents are added.

 -Yonik



Re: How to patch

2007-09-10 Thread Mike Klaas


On 9-Sep-07, at 8:57 PM, James liu wrote:


I want to try this patch:
https://issues.apache.org/jira/browse/SOLR-139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel

I downloaded the Solr 1.2 release and ran:

patch < SOLR-269*.patch (while in
'/tmp/apache-solr-1.2.0/src/test/org/apache/solr/update')


Patches should generally be applied from the top-level solr directory
with 'patch -p0'.


-Mike
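
A minimal shell sketch of that, assuming the patch was generated from the
top of the source tree:

  cd /tmp/apache-solr-1.2.0
  patch -p0 --dry-run < SOLR-269.patch   # preview; check for rejects first
  patch -p0 < SOLR-269.patch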


RE: adding without overriding dups - DirectUpdateHandler2.java does not implement?

2007-09-10 Thread Lance Norskog
I was unclear.  Our use case is that for some data sources we submit the
same thing over and over. Overwriting deletes the first one and we end up
with long commit times, and also we lose the earliest known date for the
document. We would like to have the second update attempt dropped. So we
would like to use allowDups=false overwritePending=false
overwriteCommitted=false.  In DUH2, this case is rejected and contains the
comment:

  // this would need a reader to implement (to be able to check committed
  // before adding.)

Anyway, I think we'll live with it.

Thanks,

Lance Norskog

-Original Message-
From: Mike Klaas [mailto:[EMAIL PROTECTED] 
Sent: Friday, September 07, 2007 2:47 PM
To: solr-user@lucene.apache.org
Subject: Re: adding without overriding dups - DirectUpdateHandler2.java does
not implement?

On 7-Sep-07, at 1:35 PM, Lance Norskog wrote:

 Hi-

 It appears that DirectUpdateHandler2.java does not actually implement 
 the parameters that control whether to override existing documents.
 Should I use

No?  allowDups=true overwritePending=false overwriteCommitted=false should
result in adding docs with no overwriting with DUH2.

As yonik said, overwriting is the default behaviour.  It is based on
uniqueKey, which must be defined for overwriting to work.

 DirectUpdateHandler instead? Apparently DUH is slower than DUH2, but 
 DUH implements these parameters.  (We do so many overwrites that 
 switching to DUH is probably a win.)

DUH also does not implement many newer update features, like autoCommit.

-Mike
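
For reference, the overwrite behaviour keys off the unique key declared in
schema.xml, e.g. (field name illustrative):

  <uniqueKey>id</uniqueKey>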



Re: Solr and KStem

2007-09-10 Thread Mike Klaas

Hi Harry,

Thanks for your contribution!  Unfortunately, we can't include it in  
Solr unless the necessary legal hurdles are cleared.


An issue needs to be opened on http://issues.apache.org/jira/browse/SOLR
and you have to attach the file and check the "Grant license to ASF"
button.  It is also important to verify that you have the legal
right to grant the code to ASF (since it is probably your employer's  
intellectual property).


Legal issues are a hassle, but are unavoidable, I'm afraid.

Thanks again,
-Mike

On 10-Sep-07, at 10:22 AM, Wagner,Harry wrote:


Hi Yonik,
The modified KStemmer source is attached. The original KStemFilter is
now wrapped (and replaced) by KStemFilterFactory.  I also changed the
path to avoid any naming collisions with existing Lucene code.

I included the jar file also, for anyone who wants to just drop and
play:

- put KStem2.jar in your solr/lib directory.
- change your schema to use: <filter
class="org.oclc.solr.analysis.KStemFilterFactory" cacheSize="2"/>
- restart your app server

I don't know if you credit contributions, but if so please include  
OCLC.

Seems only fair since I did this on their dime :)

Cheers!
harry


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik
Seeley
Sent: Friday, September 07, 2007 3:59 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr and KStem

On 9/7/07, Wagner,Harry [EMAIL PROTECTED] wrote:

I've implemented a Solr plug-in that wraps KStem for Solr use.  KStem is
considered to be more appropriate for library usage since it is much
less aggressive than Porter (i.e., searches for organization do NOT
match on organ!). If there is any interest in feeding this back into
Solr I would be happy to contribute it.


Absolutely.
We need to make sure that the license for that k-stemmer is ASL
compatible of course.

-Yonik
kstem_solr.tar.gz




Re: New user question: How to show all stored fields in a result

2007-09-10 Thread melkink

Well, I figured out my problem.  User error of course ;-)

I was processing documents in two separate steps.  The first step added the
id and the doctext fields.  The second step did an update to add the
metadata.  I didn't realize that an update command replaced the whole
document rather than just the pieces you specify.  I altered the process so
that everything was added in one step and now things are working much
better.

The other change I made (which may or may not have contributed to the
solution) was to remove all line breaks from the text being submitted to the
doctext field.  The line breaks were causing Solr to interpret the text as
having multiple values and forced me to put a multiValued="true" attribute
in the schema.xml.  Removing the line breaks allowed me to remove this
attribute.

*Breathes giant sigh of relief*
-- 
View this message in context: 
http://www.nabble.com/New-user-question%3A-How-to-show-all-stored-fields-in-a-result-tf4394773.html#a12599438
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr and KStem

2007-09-10 Thread Yonik Seeley
Some other notes:
I just read the license... it's nice and short, and appears to be ASL
compatible to me.
We could either include the source in Solr and build it, or add it as
a pre-compiled jar into lib.
The FilterFactory should probably have its package changed to
org.apache.solr.analysis (definitely if it will be included in source
form in our repository).


-Yonik

On 9/10/07, Mike Klaas [EMAIL PROTECTED] wrote:
 Hi Harry,

 Thanks for your contribution!  Unfortunately, we can't include it in
 Solr unless the necessary legal hurdles are cleared.

 An issue needs to be opened on http://issues.apache.org/jira/browse/SOLR
 and you have to attach the file and check the "Grant license to ASF"
 button.  It is also important to verify that you have the legal
 right to grant the code to ASF (since it is probably your employer's
 intellectual property).

 Legal issues are a hassle, but are unavoidable, I'm afraid.

 Thanks again,
 -Mike

 On 10-Sep-07, at 10:22 AM, Wagner,Harry wrote:

  Hi Yonik,
  The modified KStemmer source is attached. The original KStemFilter is
  now wrapped (and replaced) by KStemFilterFactory.  I also changed the
  path to avoid any naming collisions with existing Lucene code.
 
  I included the jar file also, for anyone who wants to just drop and
  play:
 
  - put KStem2.jar in your solr/lib directory.
  - change your schema to use: <filter
  class="org.oclc.solr.analysis.KStemFilterFactory" cacheSize="2"/>
  - restart your app server
 
  I don't know if you credit contributions, but if so please include
  OCLC.
  Seems only fair since I did this on their dime :)
 
  Cheers!
  harry
 
 
  -Original Message-
  From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik
  Seeley
  Sent: Friday, September 07, 2007 3:59 PM
  To: solr-user@lucene.apache.org
  Subject: Re: Solr and KStem
 
  On 9/7/07, Wagner,Harry [EMAIL PROTECTED] wrote:
  I've implemented a Solr plug-in that wraps KStem for Solr use.  KStem is
  considered to be more appropriate for library usage since it is much
  less aggressive than Porter (i.e., searches for organization do NOT
  match on organ!). If there is any interest in feeding this back into
  Solr I would be happy to contribute it.
 
  Absolutely.
  We need to make sure that the license for that k-stemmer is ASL
  compatible of course.
 
  -Yonik
  kstem_solr.tar.gz




RE: Solr and KStem

2007-09-10 Thread Wagner,Harry
Hi Yonik and Mike,
No problem regarding my employer.  I've checked and they are happy to
contribute it.  I'm not sure what to do about the KStem code though.  It
was originally written by Bob Krovetz and then modified for Lucene by
Sergio Guzman-Lara (both from UMASS Amherst).  I modified the Guzman
version for Solr.  Perhaps I should contribute only what I modified,
with instructions for making it work?

Let me know... harry

-Original Message-
From: Mike Klaas [mailto:[EMAIL PROTECTED] 
Sent: Monday, September 10, 2007 2:49 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr and KStem

Hi Harry,

Thanks for your contribution!  Unfortunately, we can't include it in  
Solr unless the necessary legal hurdles are cleared.

An issue needs to be opened on http://issues.apache.org/jira/browse/SOLR
and you have to attach the file and check the "Grant license to ASF"
button.  It is also important to verify that you have the legal
right to grant the code to ASF (since it is probably your employer's  
intellectual property).

Legal issues are a hassle, but are unavoidable, I'm afraid.

Thanks again,
-Mike

On 10-Sep-07, at 10:22 AM, Wagner,Harry wrote:

 Hi Yonik,
 The modified KStemmer source is attached. The original KStemFilter is
 now wrapped (and replaced) by KStemFilterFactory.  I also changed the
 path to avoid any naming collisions with existing Lucene code.

 I included the jar file also, for anyone who wants to just drop and
 play:

 - put KStem2.jar in your solr/lib directory.
 - change your schema to use: <filter
 class="org.oclc.solr.analysis.KStemFilterFactory" cacheSize="2"/>
 - restart your app server

 I don't know if you credit contributions, but if so please include  
 OCLC.
 Seems only fair since I did this on their dime :)

 Cheers!
 harry


 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik
 Seeley
 Sent: Friday, September 07, 2007 3:59 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Solr and KStem

 On 9/7/07, Wagner,Harry [EMAIL PROTECTED] wrote:
 I've implemented a Solr plug-in that wraps KStem for Solr use.  KStem is
 considered to be more appropriate for library usage since it is much
 less aggressive than Porter (i.e., searches for organization do NOT
 match on organ!). If there is any interest in feeding this back into
 Solr I would be happy to contribute it.

 Absolutely.
 We need to make sure that the license for that k-stemmer is ASL
 compatible of course.

 -Yonik
 kstem_solr.tar.gz



Re: DirectSolrConnection, write.lock and Too Many Open Files

2007-09-10 Thread Adrian Sutton
We use DirectSolrConnection via JNI in a couple of client apps that  
sometimes have 100s of thousands of new docs as fast as Solr will  
have them. It would crash relentlessly if I didn't force all calls  
to update or query to be on the same thread using objc's  
@synchronized and a message queue. I never narrowed down if this  
was a solr issue or a JNI one.


That doesn't sound promising. I'll throw in synchronization around
the update code and see what happens. That doesn't seem good for
performance though. Can Solr as a web app handle multiple updates at
once, or does it synchronize to avoid it?


Thanks,

Adrian Sutton
http://www.symphonious.net


Re: DirectSolrConnection, write.lock and Too Many Open Files

2007-09-10 Thread Mike Klaas

On 10-Sep-07, at 1:50 PM, Adrian Sutton wrote:

We use DirectSolrConnection via JNI in a couple of client apps  
that sometimes have 100s of thousands of new docs as fast as Solr  
will have them. It would crash relentlessly if I didn't force all  
calls to update or query to be on the same thread using objc's  
@synchronized and a message queue. I never narrowed down if this  
was a solr issue or a JNI one.


 That doesn't sound promising. I'll throw in synchronization around
 the update code and see what happens. That doesn't seem good for
 performance though. Can Solr as a web app handle multiple updates
 at once, or does it synchronize to avoid it?


Solr can handle multiple simultaneous updates.  The entire request  
processing is concurrent, as is the document analysis.  Only the  
final write is synchronized (this includes lucene segment merging).


In the future, segment merging will occur in a separate thread,  
further improving concurrency.


-Mike


Re: DirectSolrConnection, write.lock and Too Many Open Files

2007-09-10 Thread Brian Whitman


On Sep 10, 2007, at 5:00 PM, Mike Klaas wrote:


On 10-Sep-07, at 1:50 PM, Adrian Sutton wrote:

We use DirectSolrConnection via JNI in a couple of client apps  
that sometimes have 100s of thousands of new docs as fast as Solr  
will have them. It would crash relentlessly if I didn't force all  
calls to update or query to be on the same thread using objc's  
@synchronized and a message queue. I never narrowed down if this  
was a solr issue or a JNI one.


That doesn't sound promising. I'll throw in synchronization around
the update code and see what happens. That doesn't seem good for
performance though. Can Solr as a web app handle multiple updates
at once, or does it synchronize to avoid it?


Solr can handle multiple simultaneous updates.  The entire request  
processing is concurrent, as is the document analysis.  Only the  
final write is synchronized (this includes lucene segment merging).





Yes, I do want to disclaim that it's very likely my thread problems
are an implementation detail w/ JNI, nothing to do w/ DSC.


-b




Re: DirectSolrConnection, write.lock and Too Many Open Files

2007-09-10 Thread Ryan McKinley


The other problem is that after some time we get a Too Many Open Files 
error when autocommit fires. 


Have you checked your ulimit settings?

http://wiki.apache.org/lucene-java/LuceneFAQ#head-48921635adf2c968f7936dc07d51dfb40d638b82

ulimit -n <number>

As mike mentioned, you may also want to use 'single' as the lockType. 
In solrconfig set:


<indexDefaults>
  ...
  <lockType>single</lockType>
</indexDefaults>




I could of course switch to using the Solr webapp since we're running in 
Tomcat anyway, however I really like the ability to have a single WAR 
file that contains everything and also not have to worry about actually 
making HTTP requests and the complexity that adds.




This sounds like a good candidate to try solrj:
 http://wiki.apache.org/solr/Solrj

This way you write your app independent of how you connect to Solr.  It
also takes care of the XML parsing for you and lets you work with
objects rather than strings.


ryan
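
A minimal solrj sketch of that pattern (a sketch against the trunk API of
the time; class names and the example field are illustrative):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.SolrInputDocument;

  public class SolrjExample {
    public static void main(String[] args) throws Exception {
      // Connect over HTTP; solrj handles the XML wire format for you.
      CommonsHttpSolrServer server =
          new CommonsHttpSolrServer("http://localhost:8983/solr");

      // Add a document and commit it.
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "example-1");
      doc.addField("doctext", "some body text");
      server.add(doc);
      server.commit();

      // Query it back as objects rather than raw XML strings.
      QueryResponse rsp = server.query(new SolrQuery("doctext:body"));
      System.out.println(rsp.getResults().getNumFound() + " hits");
    }
  }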


Removing lengthNorm from the calculation

2007-09-10 Thread Kyle Banerjee
I know I'm missing something really obvious, but I'm spinning my
wheels figuring out how to eliminate lengthNorm from the calculations.

The specific problem I'm trying to solve is that naive queries are
resulting in crummy short records near the top of the list. The
reality is that the longer records tend to be higher quality, so if
anything, they need to be emphasized.

However, I'm missing something simple. Any advice or a pointer to an
example I could model off would be greatly appreciated. Thanks,

kyle


Re: Removing lengthNorm from the calculation

2007-09-10 Thread Yonik Seeley
If you aren't using index-time document boosting, or field boosting
for that field specifically, then set omitNorms=true for that field
in the schema, shut down solr, completely remove the index, and then
re-index.

The norms for each field consist of the index-time boost multiplied by
the length normalization.

-Yonik


On 9/10/07, Kyle Banerjee [EMAIL PROTECTED] wrote:
 I know I'm missing something really obvious, but I'm spinning my
 wheels figuring out how to eliminate lengthNorm from the calculations.

 The specific problem I'm trying to solve is that naive queries are
 resulting in crummy short records near the top of the list. The
 reality is that the longer records tend to be higher quality, so if
 anything, they need to be emphasized.

 However, I'm missing something simple. Any advice or a pointer to an
 example I could model off would be greatly appreciated. Thanks,

 kyle
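
A schema.xml sketch of that change (field name and type illustrative):

  <field name="doctext" type="text" indexed="true" stored="true" omitNorms="true"/>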


Re: DirectSolrConnection, write.lock and Too Many Open Files

2007-09-10 Thread Ryan McKinley

Adrian Sutton wrote:

On 11/09/2007, at 7:21 AM, Ryan McKinley wrote:

The other problem is that after some time we get a Too Many Open 
Files error when autocommit fires.


Have you checked your ulimit settings?

http://wiki.apache.org/lucene-java/LuceneFAQ#head-48921635adf2c968f7936dc07d51dfb40d638b82 



ulimit -n number.


Yeah I'm aware of the ulimit, I'm just keen to identify what's causing 
it to happen before starting to increase limits. Given the write.lock 
errors as well I'm particularly suspicious of it. That said, most likely 
it happens whenever a search and a write are happening at the same time 
and two sets of the files get opened which is enough to kick it over the 
limit. The fact that it fixes itself is a good indication that it's not 
a file handle leak.


Lucene opens a lot of files.  It can easily get beyond 1024 (I think
that's the default).  I'm no expert on how the file handling works, but I
think more files are open if you are searching and writing at the same time.


If you can't increase the limit you can try:
  <useCompoundFile>true</useCompoundFile>

It is slower, but an option if you are unable to change the ulimit on the
deployed machines.





As mike mentioned, you may also want to use 'single' as the lockType. 
In solrconfig set:


<indexDefaults>
  ...
  <lockType>single</lockType>
</indexDefaults>


I'll give that a go. Looks like it didn't make it into Solr 1.2 so I'll 
try upgrading to the nightly build.




If you need to use this in production soon, I'd suggest sticking with 
1.2 for a while.  There has been a LOT of action in trunk and it may be 
good to let it settle before upgrading a production system.


You should not need to upgrade to fix the write.lock and Too Many Open 
Files problem.  Try increasing ulimit or using a compoundfile before 
upgrading.




Just when you think you know everything on the wiki 


someone finally updates it!




Re: DirectSolrConnection, write.lock and Too Many Open Files

2007-09-10 Thread Adrian Sutton

On 11/09/2007, at 8:46 AM, Ryan McKinley wrote:
Lucene opens a lot of files.  It can easily get beyond 1024 (I
think that's the default).  I'm no expert on how the file handling works,
but I think more files are open if you are searching and writing at  
the same time.


If you can't increase the limit you can try:
 <useCompoundFile>true</useCompoundFile>

It is slower, but if you are unable to change the ulimit on the  
deployed machines


I've done a bit of poking on the server and ulimit doesn't seem to be  
the problem:

e2wiki:~$ ulimit
unlimited
e2wiki:~$ cat /proc/sys/fs/file-max
170355

So there's either something going on behind my back (quite possible,  
it's a VM) or lucene is opening a really insane number of files. I  
did check that those values were the same for the tomcat55 user that  
Tomcat actually runs as.  An lsof -p on the Tomcat process always
shows 40 files in use; the total open files sits around 1000-1500
even when reindexing all the content. I'll watch it a bit more over
time and see what happens.


I notice that Confluence recommends at least 20 for the max file  
limit, at least before they switched to compound indexing so it's  
possible that the 170355 limit could be reached, but it seems  
unlikely with our load.


If you need to use this in production soon, I'd suggest sticking  
with 1.2 for a while.  There has been a LOT of action in trunk and  
it may be good to let it settle before upgrading a production system.


You should not need to upgrade to fix the write.lock and Too Many  
Open Files problem.  Try increasing ulimit or using a compoundfile  
before upgrading.


We're quite a way off of real production, it's just internal use at  
the moment (on the real product server, but we're a small company so  
we can handle having some problems). I'll try out the current nightly  
build and see how it goes, as much as anything out of interest but  
probably won't pull new builds very often.


Thanks again,

Adrian Sutton
http://www.symphonious.net


Re: Removing lengthNorm from the calculation

2007-09-10 Thread Mike Klaas

On 10-Sep-07, at 3:31 PM, Kyle Banerjee wrote:


I know I'm missing something really obvious, but I'm spinning my
wheels figuring out how to eliminate lengthNorm from the calculations.

The specific problem I'm trying to solve is that naive queries are
resulting in crummy short records near the top of the list. The
reality is that the longer records tend to be higher quality, so if
anything, they need to be emphasized.

However, I'm missing something simple. Any advice or a pointer to an
example I could model off would be greatly appreciated. Thanks,


My lengthNorm() method is filled with clauses like:

} else if ("whatever".equals(fieldName)) {
  // treat short documents as if they were at least MIN_LENGTH tokens long
  return super.lengthNorm(fieldName, Math.max(numTokens, MIN_LENGTH));
}

where MIN_LENGTH can be quite long for some fields.

-Mike
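
For reference, a custom Similarity like that is wired in through
schema.xml (class name illustrative):

  <similarity class="org.example.LengthFloorSimilarity"/>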


Re: DirectSolrConnection, write.lock and Too Many Open Files

2007-09-10 Thread Ryan McKinley


I've done a bit of poking on the server and ulimit doesn't seem to be 
the problem:

e2wiki:~$ ulimit
unlimited
e2wiki:~$ cat /proc/sys/fs/file-max
170355


try: ulimit -n

ulimit on its own is something else.  On my machine I get:

[EMAIL PROTECTED]:~$ ulimit
unlimited
[EMAIL PROTECTED]:~$ cat /proc/sys/fs/file-max
364770
[EMAIL PROTECTED]:~$ ulimit -n
1024


I have to run:
ulimit -n 2

to get lucene to run w/ a large index...


Re: New user question: How to show all stored fields in a result

2007-09-10 Thread Erik Hatcher


On Sep 10, 2007, at 3:07 PM, Mike Klaas wrote:

On 10-Sep-07, at 11:54 AM, melkink wrote:



The other change I made (which may or may not have contributed to the
solution) was to remove all line breaks from the text being  
submitted to the
doctext field.  The line breaks were causing solr to interpret the  
text as
having multiple values and forced me to put a multiValued="true"
attribute
in the schema.xml.  Removing the line breaks allowed me to remove  
this

attribute.


Interesting--I've never seen this behaviour (I definitely store  
fields with linebreaks in strings).


Are you sure that it isn't your own framework that is generating  
multiple field entries for this input case?


Interestingly the solr-ruby library would create multiple field  
versions before I fixed the issue.  A document like this:


   {:id => 123, :text => "a newline\nin the middle"}

would require text to be multiValued.  The reason was because the  
magic under the covers looks at the field value objects and iterates  
over them if they implement the #each method.  String#each returns  
each _line_ - *sigh* (going away in later versions of Ruby, thank  
goodness).


melkink - are you using solr-ruby?  If so, that bug has been fixed in  
later versions ;)


Erik



Re: DirectSolrConnection, write.lock and Too Many Open Files

2007-09-10 Thread Adrian Sutton

On 11/09/2007, at 9:48 AM, Ryan McKinley wrote:

try: ulimit -n

ulimit on its own is something else.  On my machine I get:

[EMAIL PROTECTED]:~$ ulimit
unlimited
[EMAIL PROTECTED]:~$ cat /proc/sys/fs/file-max
364770
[EMAIL PROTECTED]:~$ ulimit -n
1024


I have to run:
ulimit -n 2

to get lucene to run w/ a large index...


Bingo, I'm an idiot - or rather, I now know *why* I'm an idiot. :)   
I'll give it a go.


Also, this is likely to be the cause of my write.lock problems - the
Too many files exception just occurred and the write.lock file gets
left around (should have seen that one coming too).


Thanks for your help, I'm anticipating that this will solve our  
problems.


Regards,

Adrian Sutton
http://www.symphonious.net


Re: Solr and KStem

2007-09-10 Thread Bill Fowler
Hello,

I would like to test this and have a few questions (please excuse what may
seem naive questions).

I would like to verify that this is purely a configuration feature -- since
the schema.xml defines the analysis/tokenizer chain, no other changes are
required.  Also, the source seems to say that a lower case factory needs to
be farther down the tokenizer chain.  So does this mean that the KStem
factory appears before the lower case filter factory in the schema.xml?  Is
there a recommended (required?) tokenizer factory?  I am using the
WhiteSpaceFactory, which seems OK.  Finally, I take it that I need to remove
the EnglishPorterFilterFactory item in the schema.xml -- or no?

Thanks,

Bill
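
For what it's worth, a hedged schema.xml sketch consistent with those
notes (KStem ahead of lowercasing, whitespace tokenization, Porter filter
removed; the ordering and cacheSize come from this thread, the rest is
illustrative):

  <fieldType name="text" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="org.oclc.solr.analysis.KStemFilterFactory" cacheSize="2"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>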



On 9/10/07, Wagner,Harry [EMAIL PROTECTED] wrote:

 Hi Yonik,
 The modified KStemmer source is attached. The original KStemFilter is
 now wrapped (and replaced) by KStemFilterFactory.  I also changed the
 path to avoid any naming collisions with existing Lucene code.

 I included the jar file also, for anyone who wants to just drop and
 play:

 - put KStem2.jar in your solr/lib directory.
 - change your schema to use: <filter
 class="org.oclc.solr.analysis.KStemFilterFactory" cacheSize="2"/>
 - restart your app server

 I don't know if you credit contributions, but if so please include OCLC.
 Seems only fair since I did this on their dime :)

 Cheers!
 harry


 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik
 Seeley
 Sent: Friday, September 07, 2007 3:59 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Solr and KStem

 On 9/7/07, Wagner,Harry [EMAIL PROTECTED] wrote:
  I've implemented a Solr plug-in that wraps KStem for Solr use.  KStem is
  considered to be more appropriate for library usage since it is much
  less aggressive than Porter (i.e., searches for organization do NOT
  match on organ!). If there is any interest in feeding this back into
  Solr I would be happy to contribute it.

 Absolutely.
 We need to make sure that the license for that k-stemmer is ASL
 compatible of course.

 -Yonik