Fwd: Performance help for heavy indexing workload

2008-02-12 Thread James Brady

Hi again,
More analysis showed that the extraordinarily long query times only  
appeared when I specify a sort. A concrete example:


For a querystring such as: ?indent=on&version=2.2&q=apache+user_id%3A39&start=0&rows=1&fl=*%2Cscore&qt=standard&wt=standard&explainOther=

The QTime is ~500ms.
For a querystring such as: ?indent=on&version=2.2&q=apache+user_id%3A39&start=0&rows=1&fl=*%2Cscore&qt=standard&wt=standard&explainOther=&sort=date_added%20asc

The QTime is ~75s

I.e. I am using the StandardRequestHandler to search for a user  
entered term (apache above) and filtering by a user_id field.


This seems to be the case for every sort option except score asc and  
score desc. Please tell me Solr doesn't sort all matching documents  
before applying boolean filters?


James

Begin forwarded message:


From: James Brady [EMAIL PROTECTED]
Date: 11 February 2008 23:38:16 GMT-08:00
To: solr-user@lucene.apache.org
Subject: Performance help for heavy indexing workload

Hello,
I'm looking for some configuration guidance to help improve  
performance of my application, which tends to do a lot more  
indexing than searching.


At present, it needs to index around two documents / sec - a  
document being the stripped content of a webpage. However,  
performance was so poor that I've had to disable indexing of the  
webpage content as an emergency measure. In addition, some search  
queries take an inordinate length of time - regularly over 60 seconds.


This is running on a medium sized EC2 instance (2 x 2GHz Opterons  
and 8GB RAM), and there's not too much else going on on the box. In  
total, there are about 1.5m documents in the index.


I'm using a fairly standard configuration - the things I've tried  
changing so far have been parameters like maxMergeDocs, mergeFactor  
and the autoCommit options. I'm only using the  
StandardRequestHandler, no faceting. I have a scheduled task  
causing a database commit every 15 seconds.


Obviously, every workload varies, but could anyone comment on  
whether this sort of hardware should, with proper configuration, be  
able to manage this sort of workload?


I can't see signs of Solr being IO-bound, CPU-bound or memory-bound,  
although my scheduled commit operation, or perhaps GC, does  
spike up the CPU utilisation at intervals.


Any help appreciated!
James




Re: SolrJ and Unique Doc ID

2008-02-12 Thread Chris Hostetter
:  Honestly: i can't think of a single use case where client code would care
:  about what the uniqueKey field is, unless it already *knew* what the
:  uniqueKey field is.
: 
: :-)  Abstractions allow one to use different implementations.  My
: client/display doesn't know about Solr, it just knows it can search and the
: Solr implementation part of it can be pointed at any Solr instance (or other
: search engines as well), thus it needs to be able to reflect on Solr.  The
: unique key is a pretty generally useful thing across implementations.

but why does your client/display care which field is the uniqueKey field?  
knowing which fields it might query or ask for in the fl list sure -- but 
why need to know about the uniqueKey field specifically?

I could have an index of people where i document that the SSN field is 
unique, and never even tell you that it's not the 'uniqueKey' field -- 
that could be some completely unrelated field i don't want you to know 
about called customerId -- but that doesn't affect you as a client, you 
can still query on whatever you want, get back whatever docs you want, 
etc...  the only thing you can't do is delete by id (since you can't be 
sure which field is the uniqueKey) but you can always delete by query.

: In fact, I wish all the ReqHandlers had an introspection option, where one
: could see what params are supported as well.

you and me both -- but the introspection shouldn't be intrinsic to the 
RequestHandler - as the Solr admin i may not want to expose all of those 
options to my clients...

http://wiki.apache.org/solr/MakeSolrMoreSelfService


-Hoss



Filter Query

2008-02-12 Thread Evgeniy Strokin
Hello... Let's say I have one query like this:
NAME:Smith
I need to restrict the result and I'm doing this:
NAME:Smith AND AGE:30
Also, I can do this using fq parameter:
q=NAME:Smith&fq=AGE:30
The result of second and third queries should be the same, right?
But why should I use fq then? In which cases is it better? Can you give me an 
example to better understand the problem?
 
Thank you
Gene

Re: Performance help for heavy indexing workload

2008-02-12 Thread Mike Klaas

On 11-Feb-08, at 11:38 PM, James Brady wrote:


Hello,
I'm looking for some configuration guidance to help improve  
performance of my application, which tends to do a lot more  
indexing than searching.


At present, it needs to index around two documents / sec - a  
document being the stripped content of a webpage. However,  
performance was so poor that I've had to disable indexing of the  
webpage content as an emergency measure. In addition, some search  
queries take an inordinate length of time - regularly over 60 seconds.


This is running on a medium sized EC2 instance (2 x 2GHz Opterons  
and 8GB RAM), and there's not too much else going on on the box. In  
total, there are about 1.5m documents in the index.


I'm using a fairly standard configuration - the things I've tried  
changing so far have been parameters like maxMergeDocs, mergeFactor  
and the autoCommit options. I'm only using the  
StandardRequestHandler, no faceting. I have a scheduled task  
causing a database commit every 15 seconds.


By "database commit" do you mean a Solr commit?  If so, that is far  
too frequent if you are sorting on big fields.


I use Solr to serve queries for ~10m docs on a medium size EC2  
instance.  This is an optimized configuration where highlighting is  
broken off into a separate index, and load balanced into two  
subindices of 5m docs a piece.  I do a good deal of faceting but no  
sorting.  The only reason that this is possible is that the index is  
only updated every few days.


On another box we have a several hundred thousand document index  
which is updated relatively frequently (autocommit time: 20s).  These  
are merged with the static-er index to create an illusion of real-time  
index updates.


When lucene supports efficient, reopen()able fieldcache updates, this  
situation might improve, but the above architecture would still  
probably be better.  Note that the second index can be on the same  
machine.


-Mike


Re: SolrJ and Unique Doc ID

2008-02-12 Thread Erik Hatcher


On Feb 12, 2008, at 3:44 PM, Grant Ingersoll wrote:

On Feb 12, 2008, at 2:10 PM, Chris Hostetter wrote:

:  Honestly: i can't think of a single use case where client code  
would care
:  about what the uniqueKey field is, unless it already *knew*  
what the

:  uniqueKey field is.
:
: :-)  Abstractions allow one to use different implementations.  My
: client/display doesn't know about Solr, it just knows it can  
search and the
: Solr implementation part of it can be pointed at any Solr  
instance (or other
: search engines as well), thus it needs to be able to reflect  
on Solr.  The
: unique key is a pretty generally useful thing across  
implementations.


but why does your client/display care which field is the uniqueKey  
field?
knowing which fields it might query or ask for in the fl list sure  
-- but

why need to know about the uniqueKey field specifically?


How do I generate URLs to retrieve a document against any given  
Solr instance that I happen to be pointing at without knowing which  
field is the document id?


One cool technique, not instead of your change to Luke RH (a needed 
change IMO) but another way to go about it - we have a 
DocumentRequestHandler that takes a uniqueKey parameter and would 
retrieve and return that single document without having to specify 
the field name explicitly.


Erik



RE: Performance help for heavy indexing workload

2008-02-12 Thread Lance Norskog
1) autowarming: it means that if you have a cached query or similar, and do
a commit, it then reloads each cached query. This is configured in solrconfig.xml.
2) sorting is a pig. A sort creates an array of N integers where N is the
size of the index, not the query. If the sorted field is anything but an
integer, a second array of size N is created with a copy of the field's
contents.  If you want a field to sort fast, you have to make it an int or
make an integer-format shadow field.

3) Large query return sets cause out-of-memory exceptions. If the Solr is
only doing queries, this is OK: the instance keeps working. We find that if
the Solr is also indexing when you hit an out-of-memory, the instance is
unusable until you restart the Java container. This is with Tomcat 5 and
Linux RHEL4 with the standard Linux file system.

4) This can also be done by having one index. You do a mass delete on stuff
from 8 days ago.  There is a larger IT commitment in running multiple Solrs
or Lucene files. This is not Oracle or MySQL, where it is well-behaved and
you get cute little UIs to run everything. A large Solr index with
continuous indexing is not a turnkey application.

5) Be sure to check out 'filters'. These are really useful for trimming
queries if you have commonly used subsets of the index, like language =
English.
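The shadow-field idea in point 2 can be sketched as a schema.xml fragment. This is an illustrative assumption, not from the original mail: the field names (date_added, date_added_sort) are hypothetical, and the indexing client is assumed to write the integer value (e.g. epoch seconds) alongside the original field.

```xml
<!-- Hypothetical schema.xml fragment: sort on an integer shadow field
     instead of the original date/text field, so the FieldCache only
     needs a single int array of size N. Names are illustrative. -->
<field name="date_added"      type="date" indexed="true" stored="true"/>
<!-- epoch seconds, written by the indexing client alongside date_added -->
<field name="date_added_sort" type="sint" indexed="true" stored="false"/>
```

Queries would then use sort=date_added_sort asc instead of sorting on the original field.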

We were new to Solr and Lucene and transferred over a several-million-record
index from FAST in 3 weeks. There is a learning curve, but it is an
impressive app.

Lance

-Original Message-
From: James Brady [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 12, 2008 12:41 PM
To: solr-user@lucene.apache.org
Subject: Re: Performance help for heavy indexing workload

Hi - thanks to everyone for their responses.

A couple of extra pieces of data which should help me optimise - documents
are very rarely updated once in the index, and I can throw away index data
older than 7 days.

So, based on advice from Mike and Walter, it seems my best option will be to
have seven separate indices. 6 indices will never change and hold data from
the six previous days. One index will change and will hold data from the
current day. Deletions and updates will be handled by effectively storing a
revocation list in the mutable index.

In this way, I will only need to perform Solr commits (yes, I did mean Solr
commits rather than database commits below - my apologies) on the current
day's index, and closing and opening new searchers for these commits
shouldn't be as painful as it is currently.

To do this, I need to work out how to do the following:
- parallel multi search through Solr
- move to a new index on a scheduled basis (probably commit and optimise the
index at this point)
- ideally, properly warm new searchers in the background to further improve
search performance on the changing index

Does that sound like a reasonable strategy in general, and has anyone got
advice on the specific points I raise above?

Thanks,
James

On 12 Feb 2008, at 11:45, Mike Klaas wrote:

 On 11-Feb-08, at 11:38 PM, James Brady wrote:

 Hello,
 I'm looking for some configuration guidance to help improve 
 performance of my application, which tends to do a lot more indexing 
 than searching.

 At present, it needs to index around two documents / sec - a document 
 being the stripped content of a webpage. However, performance was so 
 poor that I've had to disable indexing of the webpage content as an 
 emergency measure. In addition, some search queries take an 
 inordinate length of time - regularly over 60 seconds.

 This is running on a medium sized EC2 instance (2 x 2GHz Opterons and 
 8GB RAM), and there's not too much else going on on the box.
 In total, there are about 1.5m documents in the index.

 I'm using a fairly standard configuration - the things I've tried 
 changing so far have been parameters like maxMergeDocs, mergeFactor 
 and the autoCommit options. I'm only using the 
 StandardRequestHandler, no faceting. I have a scheduled task causing 
 a database commit every 15 seconds.

 By database commit do you mean solr commit?  If so, that is far 
 too frequent if you are sorting on big fields.

 I use Solr to serve queries for ~10m docs on a medium size EC2 
 instance.  This is an optimized configuration where highlighting is 
 broken off into a separate index, and load balanced into two 
 subindices of 5m docs a piece.  I do a good deal of faceting but no 
 sorting.  The only reason that this is possible is that the index is 
 only updated every few days.

 On another box we have a several hundred thousand document index  
 which is updated relatively frequently (autocommit time: 20s).   
 These are merged with the static-er index to create an illusion of 
 real-time index updates.

 When lucene supports efficient, reopen()able fieldcache updates, this 
 situation might improve, but the above architecture would still 
 probably be better.  Note that the second index can be on the same 
 machine.

 -Mike




Using embedded Solr with admin GUI

2008-02-12 Thread Ken Krugler

Hi all,

We're moving towards embedding multiple Solr cores, versus using 
multiple Solr webapps, as a way of simplifying our build/deploy and 
also getting more control over the startup/update process.


But I'd hate to lose that handy GUI for inspecting the schema and 
(most importantly) trying out queries with explain turned on.


Has anybody tried this dual-mode method of operation? Thoughts on 
whether it's workable, and what the issues would be?


I've taken a quick look at the .jsp and supporting Java code, and 
have some ideas on what would be needed, but I'm hoping there's an 
easy(er) approach than just whacking at the admin support code.


Thanks,

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
If you can't find it, you can't fix it


Re: what is searcher

2008-02-12 Thread Briggs
Searcher is the main search abstraction in Lucene. It defines the
methods used for querying an underlying index(es).

See: 
http://lucene.apache.org/java/2_3_0/api/org/apache/lucene/search/Searcher.html

On Feb 12, 2008 10:33 PM, Mochamad bahri nurhabbibi [EMAIL PROTECTED] wrote:
 hello all..

 I am learning SOLR since 2 days ago.

 I have to make a training/presentation about SOLR for the rest of my colleagues
 at my company.

 my question is: what is searcher ?

 this term seems to be found everywhere, but
 there's no exact definition of it either in Google or the SOLR wiki.

 anyone please help me..

 thank you

 regards

 - habibi-




-- 
Conscious decisions by conscious minds are what make reality real


what is searcher

2008-02-12 Thread Mochamad bahri nurhabbibi
hello all..

I am learning SOLR since 2 days ago.

I have to make a training/presentation about SOLR for the rest of my colleagues
at my company.

my question is: what is searcher ?

this term seems to be found everywhere, but
there's no exact definition of it either in Google or the SOLR wiki.

anyone please help me..

thank you

regards

- habibi-


Re: Performance help for heavy indexing workload

2008-02-12 Thread James Brady

Hi - thanks to everyone for their responses.

A couple of extra pieces of data which should help me optimise -  
documents are very rarely updated once in the index, and I can throw  
away index data older than 7 days.


So, based on advice from Mike and Walter, it seems my best option  
will be to have seven separate indices. 6 indices will never change  
and hold data from the six previous days. One index will change and  
will hold data from the current day. Deletions and updates will be  
handled by effectively storing a revocation list in the mutable index.


In this way, I will only need to perform Solr commits (yes, I did  
mean Solr commits rather than database commits below - my apologies)  
on the current day's index, and closing and opening new searchers for  
these commits shouldn't be as painful as it is currently.


To do this, I need to work out how to do the following:
- parallel multi search through Solr
- move to a new index on a scheduled basis (probably commit and  
optimise the index at this point)
- ideally, properly warm new searchers in the background to further  
improve search performance on the changing index


Does that sound like a reasonable strategy in general, and has anyone  
got advice on the specific points I raise above?


Thanks,
James
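The revocation-list idea above can be sketched as a client-side merge step. This is purely illustrative (all function and field names are assumptions, not anything Solr provides): results from the frozen daily indices are filtered against ids revoked in the mutable index, and today's documents supersede frozen versions.

```python
# Sketch: merge hits from immutable daily indices with a "revocation list"
# maintained alongside the mutable (current-day) index.

def merge_results(daily_hits, current_hits, revoked_ids):
    """daily_hits: doc dicts from the six frozen indices;
    current_hits: docs from today's mutable index;
    revoked_ids: ids deleted or updated since the frozen indices were built."""
    # drop frozen docs that have been deleted or re-indexed today
    live = [d for d in daily_hits if d["id"] not in revoked_ids]
    # docs added or re-added today supersede their frozen versions
    return live + current_hits

daily = [{"id": 1, "text": "old"}, {"id": 2, "text": "stale"}]
today = [{"id": 2, "text": "fresh"}]
print(merge_results(daily, today, revoked_ids={2}))
# [{'id': 1, 'text': 'old'}, {'id': 2, 'text': 'fresh'}]
```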

On 12 Feb 2008, at 11:45, Mike Klaas wrote:


On 11-Feb-08, at 11:38 PM, James Brady wrote:


Hello,
I'm looking for some configuration guidance to help improve  
performance of my application, which tends to do a lot more  
indexing than searching.


At present, it needs to index around two documents / sec - a  
document being the stripped content of a webpage. However,  
performance was so poor that I've had to disable indexing of the  
webpage content as an emergency measure. In addition, some search  
queries take an inordinate length of time - regularly over 60  
seconds.


This is running on a medium sized EC2 instance (2 x 2GHz Opterons  
and 8GB RAM), and there's not too much else going on on the box.  
In total, there are about 1.5m documents in the index.


I'm using a fairly standard configuration - the things I've tried  
changing so far have been parameters like maxMergeDocs,  
mergeFactor and the autoCommit options. I'm only using the  
StandardRequestHandler, no faceting. I have a scheduled task  
causing a database commit every 15 seconds.


By database commit do you mean solr commit?  If so, that is far  
too frequent if you are sorting on big fields.


I use Solr to serve queries for ~10m docs on a medium size EC2  
instance.  This is an optimized configuration where highlighting is  
broken off into a separate index, and load balanced into two  
subindices of 5m docs a piece.  I do a good deal of faceting but no  
sorting.  The only reason that this is possible is that the index  
is only updated every few days.


On another box we have a several hundred thousand document index  
which is updated relatively frequently (autocommit time: 20s).   
These are merged with the static-er index to create an illusion of  
real-time index updates.


When lucene supports efficient, reopen()able fieldcache updates,  
this situation might improve, but the above architecture would  
still probably be better.  Note that the second index can be on the  
same machine.


-Mike




Re: 2D Facet

2008-02-12 Thread evgeniy . strokin
Chris, I'm very interested in implementing generic multidimensional faceting. 
I'm not an expert in Solr, but I'm very good with Java, so I need a little bit 
more direction if you don't mind. I promise to share my code, and if you're Ok 
with it you are welcome to use it.
So, let's say I have a parameter facet.field=STATE. For example we'll take 3D 
faceting, so I'll need 2 more facet fields related to the first one. Should we 
do something like this:
facet.field=STATE&f.STATE.facet.matrix=NAME&f.STATE.facet.matrix=INCOME
Or, for example, maybe like this:
facet.matrix=STATE,NAME,INCOME
What would you suggest is better?
Also, where in Solr I could find something similar to take it as an example? 
Where all this logic should be placed?
 
Thank you
Gene


- Original Message 
From: Chris Hostetter [EMAIL PROTECTED]
To: Solr User solr-user@lucene.apache.org
Sent: Thursday, January 17, 2008 1:12:32 AM
Subject: Re: 2D Facet

: 
: Hello, is this possible to do in one query: I have a query which returns 
: 1000 documents with names and addresses. I can run facet on state field 
: and see how many addresses I have in each state. But also I need to see 
: how many families lives in each state. So as a result I need a matrix of 
: states on top and Last Names on right. After my first query, knowing 
: which states I have I can run queries on each state using facet field 
: Last_Name. But I guess this is not an efficient way. Is this possible to 
: get in one query? Or may be some other way?

if you set rows=0 on all of those queries it won't be horribly inefficient 
... the DocSets for each state and lastname should wind up in the 
filterCache, so most of the queries will just be simple DocSet 
intersections with only the HTTP overhead (which if you use persistent 
connections should be fairly minor)

The idea of generic multidimensional faceting is actually pretty 
interesting ... it could be done fairly simply -- imagine if for every 
facet.field=foo param, solr checked for f.foo.facet.matrix params, and 
once the top facet.limit terms were found for field foo it then 
computed the top facet counts for each f.foo.facet.matrix field 
with an implicit fq=foo:term.

that would be pretty cool.
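The DocSet-intersection idea can be illustrated with plain Python sets. This is a toy model, not Solr internals: each facet value maps to the set of matching doc ids, and every matrix cell is just an intersection size (the part Solr's filterCache would make cheap after the first query).

```python
# Toy model of 2D faceting as DocSet intersections.
# Each facet value maps to the set of matching doc ids ("DocSets").
state_sets = {"NJ": {1, 2, 3}, "NY": {4, 5}}
name_sets = {"Smith": {1, 4}, "Jones": {2, 5}}

# For every (state, name) cell, the count is an intersection size.
matrix = {
    (state, name): len(state_docs & name_docs)
    for state, state_docs in state_sets.items()
    for name, name_docs in name_sets.items()
}
print(matrix[("NJ", "Smith")])  # 1
```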


-Hoss

Re: upgrading to lucene 2.3

2008-02-12 Thread Grant Ingersoll

See:

https://issues.apache.org/jira/browse/SOLR-330

https://issues.apache.org/jira/browse/SOLR-342

for various solutions around taking advantage of Lucene's new  
capabilities.


-Grant

On Feb 12, 2008, at 1:15 PM, Yonik Seeley wrote:


On Feb 12, 2008 1:06 PM, Lance Norskog [EMAIL PROTECTED] wrote:

What will this improve?


Text analysis may be slower since Solr won't have the changes to use
the faster Token APIs.
Indexing overall should still be faster.
Querying should see little change.

-Yonik





Re: upgrading to lucene 2.3

2008-02-12 Thread Yonik Seeley
On Feb 12, 2008 1:06 PM, Lance Norskog [EMAIL PROTECTED] wrote:
 What will this improve?

Text analysis may be slower since Solr won't have the changes to use
the faster Token APIs.
Indexing overall should still be faster.
Querying should see little change.

-Yonik


RE: upgrading to lucene 2.3

2008-02-12 Thread Lance Norskog
What will this improve?

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley
Sent: Tuesday, February 12, 2008 6:48 AM
To: solr-user@lucene.apache.org
Subject: Re: upgrading to lucene 2.3

On Feb 12, 2008 9:25 AM, Robert Young [EMAIL PROTECTED] wrote:
 ok, and to do the change I just replace the jar directly in
 solr/WEB-INF/lib and restart tomcat?

That should work.

-Yonik



Re: SolrJ and Unique Doc ID

2008-02-12 Thread Grant Ingersoll


On Feb 12, 2008, at 2:10 PM, Chris Hostetter wrote:

:  Honestly: i can't think of a single use case where client code  
would care
:  about what the uniqueKey field is, unless it already *knew* what  
the

:  uniqueKey field is.
:
: :-)  Abstractions allow one to use different implementations.  My
: client/display doesn't know about Solr, it just knows it can  
search and the
: Solr implementation part of it can be pointed at any Solr instance  
(or other
: search engines as well), thus it needs to be able to reflect on  
Solr.  The
: unique key is a pretty generally useful thing across  
implementations.


but why does your client/display care which field is the uniqueKey  
field?
knowing which fields it might query or ask for in the fl list sure  
-- but

why need to know about the uniqueKey field specifically?


How do I generate URLs to retrieve a document against any given Solr  
instance that I happen to be pointing at without knowing which field  
is the document id?   At any rate, the problem is solved in SOLR-478  
in less than 10 lines of code and doesn't introduce back-compat.  
issues.  I invoke this on instantiation of my client, get the field  
and then keep it around for use later.





I could have an index of people where i document that the SSN field is
unique, and never even tell you that it's not the 'uniqueKey' field --
that could be some completely unrelated field i don't want you to know
about called customerId -- but that doesn't affect you as a client, you
can still query on whatever you want, get back whatever docs you want,
etc...  the only thing you can't do is delete by id (since you can't be
sure which field is the uniqueKey) but you can always delete by query.

: In fact, I wish all the ReqHandlers had an introspection option,  
where one

: could see what params are supported as well.

you and me both -- but the introspection shouldn't be intrinsic to the
RequestHandler - as the Solr admin i may not want to expose all of
those options to my clients...

http://wiki.apache.org/solr/MakeSolrMoreSelfService


+1


Re: Strange behavior

2008-02-12 Thread Yonik Seeley
On Feb 12, 2008 9:50 AM, Traut [EMAIL PROTECTED] wrote:
 Thank you, it works. Stemming filter works only with lowercased words?

I've never tried it in the order you have it.
You could try the analysis admin page and report back what happens...

-Yonik


 On Feb 12, 2008 4:29 PM, Yonik Seeley [EMAIL PROTECTED] wrote:

  Try putting the stemmer after the lowercase filter.
  -Yonik
 
  On Feb 12, 2008 9:15 AM, Traut [EMAIL PROTECTED] wrote:
   Hi all
  
   Please take a look at this strange behavior (connected with stemming I
   suppose):
  
  
   type:
  
   <fieldtype name="customTextField" class="solr.TextField" indexed="true"
   stored="false">
     <analyzer type="query">
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
       <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
     <analyzer type="index">
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
       <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
   </fieldtype>
  
   field:
  
   <field name="name" type="customTextField" indexed="true" stored="false"/>
  
  
  
   I'm adding a document:
  
   <add><doc><field name="id">99</field><field name="name">Apple</field></doc></add>
  
   <commit/>
  
  
    Querying name:apple - 0 results. Searching name:Apple - 1 result.
  But
   name:appl* - 1 result
  
  
   Adding next document:
  
   <add><doc><field name="id">8</field><field name="name">Somenamele</field></doc></add>
  
   <commit/>
  
  
   Searching for name:somenamele - 1 result, for name:Somenamele - 1
  result
  
  
   What is the problem with Apple ? Maybe StandardTokenizer understands
  it as
   trademark :) ?
  
  
    Thank you in advance
  
  
   --
   Best regards,
   Traut
  
 



 --
 Best regards,
 Traut



Re: Setting the schema files

2008-02-12 Thread Ryan McKinley

Aditi Goyal wrote:

Hi,

I am using SOLR for searching in my project. I am actually a little bit
confused about how the schema works.
Can you please point me to the documentation where I can define how my
query should work?
For example, I want that a, and, the, etc. should not be searched. Also, it
should not split on case change, and it should not look for sub-words - I
mean it should match the complete word, not a partial one.



all docs are pointed to from the Documentation link on the left of:
http://lucene.apache.org/solr/

perhaps the most important one is:
http://wiki.apache.org/solr/


Specifically, it looks like you are looking for the StopFilterFactory:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-9e6f07472dbdf0facc966ac61c25145be1ae0d5d
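A hedged sketch of a field type matching those requirements (whole-word matching, stopwords dropped, no splitting on case changes or sub-words). The type name and file names here are illustrative assumptions, not from the original mail:

```xml
<!-- Illustrative schema.xml fragment: WhitespaceTokenizer keeps words whole
     (no case-change or sub-word splitting, unlike WordDelimiter-style types);
     StopFilter drops "a", "and", "the", etc. listed in stopwords.txt. -->
<fieldtype name="plainText" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
```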


ryan



Re: Performance help for heavy indexing workload

2008-02-12 Thread Walter Underwood
On 2/12/08 7:40 AM, Ken Krugler [EMAIL PROTECTED] wrote:

 In general immediate updating of an index with a continuous stream of
 new content, and fast search results, work in opposition. The
 searcher's various caches are getting continuously flushed to avoid
 stale content, which can easily kill your performance.

One approach is to have a big, rarely-updated index and a small index
for new or changed content. Once a day, add everything from the small
index into the big one. You may need external bookkeeping for deleted
documents.

Another trick from Infoseek.

wunder



Re: Performance help for heavy indexing workload

2008-02-12 Thread Walter Underwood
That does seem really slow. Is the index on NFS-mounted storage?

wunder

On 2/12/08 7:04 AM, Erick Erickson [EMAIL PROTECTED] wrote:

 Well, the *first* sort to the underlying Lucene engine is expensive since
 it builds up the terms to sort. I wonder if you're closing and opening the
 underlying searcher for every request? This is a definite limiter.
 
 Disclaimer: I mostly do Lucene, not SOLR (yet), so don't *even* ask
 me how to change this behavior G. But your comment about
 frequent updates to the index prompted this question
 
 Best
 Erick
 
 On Feb 12, 2008 3:54 AM, James Brady [EMAIL PROTECTED] wrote:
 
 Hi again,
 More analysis showed that the extraordinarily long query times only
 appeared when I specify a sort. A concrete example:
 
  For a querystring such as: ?indent=on&version=2.2&q=apache+user_id%3A39&start=0&rows=1&fl=*%2Cscore&qt=standard&wt=standard&explainOther=
  The QTime is ~500ms.
  For a querystring such as: ?indent=on&version=2.2&q=apache+user_id%3A39&start=0&rows=1&fl=*%2Cscore&qt=standard&wt=standard&explainOther=&sort=date_added%20asc
 The QTime is ~75s
 
 I.e. I am using the StandardRequestHandler to search for a user
 entered term (apache above) and filtering by a user_id field.
 
 This seems to be the case for every sort option except score asc and
 score desc. Please tell me Solr doesn't sort all matching documents
 before applying boolean filters?
 
 James
 
 Begin forwarded message:
 
 From: James Brady [EMAIL PROTECTED]
 Date: 11 February 2008 23:38:16 GMT-08:00
 To: solr-user@lucene.apache.org
 Subject: Performance help for heavy indexing workload
 
 Hello,
 I'm looking for some configuration guidance to help improve
 performance of my application, which tends to do a lot more
 indexing than searching.
 
 At present, it needs to index around two documents / sec - a
 document being the stripped content of a webpage. However,
 performance was so poor that I've had to disable indexing of the
 webpage content as an emergency measure. In addition, some search
 queries take an inordinate length of time - regularly over 60 seconds.
 
 This is running on a medium sized EC2 instance (2 x 2GHz Opterons
 and 8GB RAM), and there's not too much else going on on the box. In
 total, there are about 1.5m documents in the index.
 
 I'm using a fairly standard configuration - the things I've tried
 changing so far have been parameters like maxMergeDocs, mergeFactor
 and the autoCommit options. I'm only using the
 StandardRequestHandler, no faceting. I have a scheduled task
 causing a database commit every 15 seconds.
 
 Obviously, every workload varies, but could anyone comment on
 whether this sort of hardware should, with proper configuration, be
 able to manage this sort of workload?
 
  I can't see signs of Solr being IO-bound, CPU-bound or memory-bound,
  although my scheduled commit operation, or perhaps GC, does
  spike up the CPU utilisation at intervals.
 
 Any help appreciated!
 James
 
 



Re: Fwd: Performance help for heavy indexing workload

2008-02-12 Thread Ken Krugler

Hi James,

I'm looking for some configuration guidance to help improve 
performance of my application, which tends to do a lot more 
indexing than searching.


At present, it needs to index around two documents / sec - a 
document being the stripped content of a webpage. However, 
performance was so poor that I've had to disable indexing of the 
webpage content as an emergency measure. In addition, some search 
queries take an inordinate length of time - regularly over 60 
seconds.


In general immediate updating of an index with a continuous stream of 
new content, and fast search results, work in opposition. The 
searcher's various caches are getting continuously flushed to avoid 
stale content, which can easily kill your performance.


This issue was one of the more interesting topics discussed during 
the Lucene BoF meeting at ApacheCon. You're not alone in wanting to 
have it both ways, but it's clear this is A Hard Problem.


If you can relax the need for immediate updates to the index, and 
accept some level of lag time between receiving new content and this 
showing up in the index, then I'd suggest splitting the two 
processes. Have a backend system that deals with updates, and then at 
some slower interval update the search index.
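To sketch what "some slower interval" can mean in practice: instead of the client committing every 15 seconds, let Solr batch commits itself via the autoCommit block in solrconfig.xml. This is an illustrative sketch, not a tuned recommendation; the element exists in DirectUpdateHandler2, but the threshold below is an assumption to test against your own workload.

```xml
<!-- solrconfig.xml sketch: server-side commit batching.
     The maxDocs value is an illustrative assumption. -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <!-- commit only after this many pending documents accumulate,
         rather than on a fixed 15-second client-side schedule -->
    <maxDocs>10000</maxDocs>
  </autoCommit>
</updateHandler>
```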


-- Ken



This is running on a medium sized EC2 instance (2 x 2GHz Opterons 
and 8GB RAM), and there's not too much else going on on the box. In 
total, there are about 1.5m documents in the index.


I'm using a fairly standard configuration - the things I've tried 
changing so far have been parameters like maxMergeDocs, mergeFactor 
and the autoCommit options. I'm only using the 
StandardRequestHandler, no faceting. I have a scheduled task 
causing a database commit every 15 seconds.


Obviously, every workload varies, but could anyone comment on 
whether this sort of hardware should, with proper configuration, be 
able to manage this sort of workload?


I can't see signs of Solr being IO-bound, CPU-bound or 
memory-bound, although my scheduled commit operation, or perhaps 
GC, does spike up the CPU utilisation at intervals.


Any help appreciated!
James



--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
If you can't find it, you can't fix it


wildcard query question

2008-02-12 Thread Alessandro Senserini
I have indexed a field called courseTitle of 'text' type (as in the
schema.xml but without the stemming factory) that contains

 

COBOL: Data Structure

 

Searching with a wildcard query like

 

courseTitle:cobol\:*  AND courseTitle:data* AND courseTitle:structure*

 

(the colon character : is escaped) the record is not found.  If the
search is

 

courseTitle:cobol*  AND courseTitle:data* AND courseTitle:structure*

 

the record is found.  I was wondering how the colon character affects
the search, and if there is another way to write a wildcard query.

 

Thanks.

 





RE: upgrading to lucene 2.3

2008-02-12 Thread Fuad Efendi
I did the same:

Stopped SOLR-1.2, replaced Lucene jars, started SOLR-1.2

No problems at all.


 -Original Message-
 From: Robert Young [mailto:[EMAIL PROTECTED] 
 Sent: Tuesday, February 12, 2008 9:25 AM
 To: solr-user@lucene.apache.org
 Subject: Re: upgrading to lucene 2.3
 
 
  ok, and to do the change I just replace the jar directly in
  solr/WEB-INF/lib and restart tomcat?
 
 Thanks
 Rob
 
 On Feb 12, 2008 1:55 PM, Grant Ingersoll [EMAIL PROTECTED] wrote:
  Solr Trunk is using the latest Lucene version.  Also note 
 there are a
  couple edge cases in Lucene 2.3 that are causing problems if you use
  SOLR-342 with luceneAutoCommit == false.
 
  But, yes, you should be able to drop in 2.3, as that is one of the
  back-compatible goals for Lucene minor releases.
 
  -Grant
 
 
  On Feb 12, 2008, at 8:06 AM, Robert Young wrote:
 
   I have heard that upgrading to lucene 2.3 in Solr 1.2 is 
 as simple as
   replacing the lucene jar and restarting. Is this the 
 case? Has anyone
   had any experience with upgrading lucene to 2.3? Did you have any
   problems? Is there anything I should be looking out for?
  
   Thanks
   Rob
 
 
 
 



Re: upgrading to lucene 2.3

2008-02-12 Thread Robert Young
ok, and to do the change I just replace the jar directly in
solr/WEB-INF/lib and restart tomcat?

Thanks
Rob

On Feb 12, 2008 1:55 PM, Grant Ingersoll [EMAIL PROTECTED] wrote:
 Solr Trunk is using the latest Lucene version.  Also note there are a
 couple edge cases in Lucene 2.3 that are causing problems if you use
 SOLR-342 with luceneAutoCommit == false.

 But, yes, you should be able to drop in 2.3, as that is one of the
 back-compatible goals for Lucene minor releases.

 -Grant


 On Feb 12, 2008, at 8:06 AM, Robert Young wrote:

  I have heard that upgrading to lucene 2.3 in Solr 1.2 is as simple as
  replacing the lucene jar and restarting. Is this the case? Has anyone
  had any experience with upgrading lucene to 2.3? Did you have any
  problems? Is there anything I should be looking out for?
 
  Thanks
  Rob




Strange behavior

2008-02-12 Thread Traut
Hi all

Please take a look at this strange behavior (connected with stemming I
suppose):


type:

<fieldtype name="customTextField" class="solr.TextField" indexed="true"
stored="false">
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>

field:

<field name="name" type="customTextField" indexed="true" stored="false"/>



I'm adding a document:

<add><doc><field name="id">99</field><field name="name">Apple</field></doc></add>

<commit/>


Querying name:apple - 0 results. Searching name:Apple - 1 result. But
name:appl* - 1 result


Adding next document:

<add><doc><field name="id">8</field><field name="name">Somenamele</field></doc></add>

<commit/>


Searching for name:somenamele - 1 result, for name:Somenamele - 1 result


What is the problem with Apple? Maybe StandardTokenizer understands it as a
trademark :)?


Thank you in advance


-- 
Best regards,
Traut


Re: upgrading to lucene 2.3

2008-02-12 Thread Grant Ingersoll
Solr Trunk is using the latest Lucene version.  Also note there are a  
couple edge cases in Lucene 2.3 that are causing problems if you use  
SOLR-342 with luceneAutoCommit == false.


But, yes, you should be able to drop in 2.3, as that is one of the  
back-compatible goals for Lucene minor releases.


-Grant

On Feb 12, 2008, at 8:06 AM, Robert Young wrote:


I have heard that upgrading to lucene 2.3 in Solr 1.2 is as simple as
replacing the lucene jar and restarting. Is this the case? Has anyone
had any experience with upgrading lucene to 2.3? Did you have any
problems? Is there anything I should be looking out for?

Thanks
Rob




RE: Commit performance problem

2008-02-12 Thread Jae Joo
Or, if you have multiple files to be updated, make sure to index all of the
files and commit once at the end of indexing.

Jae

-Original Message-
From: Jae Joo [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 12, 2008 10:50 AM
To: solr-user@lucene.apache.org
Subject: RE: Commit performance problem

I have the same experience. I have a 6.5GB index and update it daily.
Have you ever checked whether the updated file contains no documents and
then tried a commit? I don't know why, but it takes so long - more than 10
minutes.

Jae Joo

-Original Message-
From: Ken Krugler [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 12, 2008 10:34 AM
To: solr-user@lucene.apache.org
Subject: Re: Commit performance problem

I have a large solr index that is currently about 6 GB and is suffering from
severe performance problems during updates. A commit can take over 10
minutes to complete. I have tried to increase max memory to the JVM to
over
6 GB, but without any improvement. I have also tried to turn off
waitSearcher and waitFlush, which do significantly improve the commit
speed.
However, the max number of searchers is then quickly reached.

If you have a large index, then I'd recommend having a separate Solr 
installation that you use to update/commit changes, after which you 
use snappuller or equivalent to swap it in to the live (search) 
system.

Would a switch to another container (currently using Jetty) make any
difference?

Very unlikely.

Does anyone have any other tip for improving the performance?

Switch to Lucene 2.3, and tune the new parameters that control memory 
usage during updating.

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
If you can't find it, you can't fix it


RE: Commit performance problem

2008-02-12 Thread Jae Joo
I have the same experience. I have a 6.5GB index and update it daily.
Have you ever checked whether the updated file contains no documents and
then tried a commit? I don't know why, but it takes so long - more than 10
minutes.

Jae Joo

-Original Message-
From: Ken Krugler [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 12, 2008 10:34 AM
To: solr-user@lucene.apache.org
Subject: Re: Commit performance problem

I have a large solr index that is currently about 6 GB and is suffering from
severe performance problems during updates. A commit can take over 10
minutes to complete. I have tried to increase max memory to the JVM to
over
6 GB, but without any improvement. I have also tried to turn off
waitSearcher and waitFlush, which do significantly improve the commit
speed.
However, the max number of searchers is then quickly reached.

If you have a large index, then I'd recommend having a separate Solr 
installation that you use to update/commit changes, after which you 
use snappuller or equivalent to swap it in to the live (search) 
system.

Would a switch to another container (currently using Jetty) make any
difference?

Very unlikely.

Does anyone have any other tip for improving the performance?

Switch to Lucene 2.3, and tune the new parameters that control memory 
usage during updating.

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
If you can't find it, you can't fix it


Re: Commit performance problem

2008-02-12 Thread Ken Krugler

I have a large solr index that is currently about 6 GB and is suffering from
severe performance problems during updates. A commit can take over 10
minutes to complete. I have tried to increase max memory to the JVM to over
6 GB, but without any improvement. I have also tried to turn off
waitSearcher and waitFlush, which do significantly improve the commit speed.
However, the max number of searchers is then quickly reached.


If you have a large index, then I'd recommend having a separate Solr 
installation that you use to update/commit changes, after which you 
use snappuller or equivalent to swap it in to the live (search) 
system.



Would a switch to another container (currently using Jetty) make any
difference?


Very unlikely.


Does anyone have any other tip for improving the performance?


Switch to Lucene 2.3, and tune the new parameters that control memory 
usage during updating.
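For reference, a sketch of where that tuning lives in solrconfig.xml. The ramBufferSizeMB knob assumes a Solr build recent enough to expose Lucene 2.3's RAM-based flushing (which replaces document-count flushing as the primary control); the values are starting points, not recommendations.

```xml
<indexDefaults>
  <!-- Lucene 2.3: flush buffered index data once this much RAM
       is used, instead of flushing every N documents -->
  <ramBufferSizeMB>32</ramBufferSizeMB>
  <mergeFactor>10</mergeFactor>
</indexDefaults>
```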


-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
If you can't find it, you can't fix it


Commit performance problem

2008-02-12 Thread Anders Arpteg
I have a large solr index that is currently about 6 GB and is suffering from
severe performance problems during updates. A commit can take over 10
minutes to complete. I have tried to increase max memory to the JVM to over
6 GB, but without any improvement. I have also tried to turn off
waitSearcher and waitFlush, which do significantly improve the commit speed.
However, the max number of searchers is then quickly reached.

 

Would a switch to another container (currently using Jetty) make any
difference? Does anyone have any other tips for improving the performance?

 

TIA,

Anders

 

 



Re: Strange behavior

2008-02-12 Thread Traut
Thank you, it works. Does the stemming filter only work with lowercased words?

On Feb 12, 2008 4:29 PM, Yonik Seeley [EMAIL PROTECTED] wrote:

 Try putting the stemmer after the lowercase filter.
 -Yonik

 On Feb 12, 2008 9:15 AM, Traut [EMAIL PROTECTED] wrote:
  Hi all
 
  Please take a look at this strange behavior (connected with stemming I
  suppose):
 
 
  type:
 
  <fieldtype name="customTextField" class="solr.TextField" indexed="true"
  stored="false">
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldtype>
 
  field:
 
  <field name="name" type="customTextField" indexed="true" stored="false"/>
 
 
 
  I'm adding a document:
 
  <add><doc><field name="id">99</field><field name="name">Apple</field></doc></add>

  <commit/>
 
 
  Querying name:apple - 0 results. Searching name:Apple - 1 result. But
  name:appl* - 1 result
 
 
  Adding next document:
 
  <add><doc><field name="id">8</field><field name="name">Somenamele</field></doc></add>

  <commit/>
 
 
  Searching for name:somenamele - 1 result, for name:Somenamele - 1
 result
 
 
  What is the problem with Apple? Maybe StandardTokenizer understands it as a
  trademark :)?
 
 
  Thank you in advance
 
 
  --
  Best regards,
  Traut
 




-- 
Best regards,
Traut


Re: Strange behavior

2008-02-12 Thread Yonik Seeley
Try putting the stemmer after the lowercase filter.
-Yonik
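Applied to the schema quoted below, that reordering would look roughly like this (index-time chain shown as a sketch; the query-time analyzer should be reordered the same way):

```xml
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt"/>
  <!-- lowercase BEFORE stemming: the Porter stemmer expects
       lowercase input, which is why "Apple" was not being
       stemmed while "apple" would have been -->
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.EnglishPorterFilterFactory"
          protected="protwords.txt"/>
</analyzer>
```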

On Feb 12, 2008 9:15 AM, Traut [EMAIL PROTECTED] wrote:
 Hi all

 Please take a look at this strange behavior (connected with stemming I
 suppose):


 type:

 <fieldtype name="customTextField" class="solr.TextField" indexed="true"
 stored="false">
   <analyzer type="query">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
   <analyzer type="index">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
 </fieldtype>

 field:

 <field name="name" type="customTextField" indexed="true" stored="false"/>



 I'm adding a document:

 <add><doc><field name="id">99</field><field name="name">Apple</field></doc></add>

 <commit/>


 Querying name:apple - 0 results. Searching name:Apple - 1 result. But
 name:appl* - 1 result


 Adding next document:

 <add><doc><field name="id">8</field><field name="name">Somenamele</field></doc></add>

 <commit/>


 Searching for name:somenamele - 1 result, for name:Somenamele - 1 result


 What is the problem with Apple? Maybe StandardTokenizer understands it as a
 trademark :)?


 Thank you in advance


 --
 Best regards,
 Traut



Re: SolrJ and Unique Doc ID

2008-02-12 Thread Grant Ingersoll


On Feb 11, 2008, at 11:24 PM, Chris Hostetter wrote:

: Another option is to add it to the responseHeader  Or it could  
be a quick
: add to the LukeRH.  The former has the advantage that we wouldn't  
have to make


adding the info to LukeRequestHandler makes sense.

Honestly: i can't think of a single use case where client code would  
care

about what the uniqueKey field is, unless it already *knew* what the
uniqueKey field is.


:-)  Abstractions allow one to use different implementations.  My  
client/display doesn't know about Solr, it just knows it can search  
and the Solr implementation part of it can be pointed at any Solr  
instance (or other search engines as well), thus it needs to be able  
to reflect on Solr.  The unique key is a pretty generally useful  
thing across implementations.


In fact, I wish all the ReqHandlers had an introspection option, where  
one could see what params are supported as well.





: Of course, it probably would be useful to be able to request the  
schema from
: the server and build an IndexSchema object on the client side.   
This could be

: added to the LukeRH as well.

somebody was working on that at some point ... but i may be thinking of
the Ruby client ... no i'm pretty sure i remember it coming up in the
context of Java because i remember discussion that a full IndexSchema
was too much because it required the client to have the class files for
all of the analysis chain and fieldtype classes.


It may be reasonable, as a compromise, to just have metadata about  
these things.  Sort of like BeanInfo provides.


-Grant


Setting the schema files

2008-02-12 Thread Aditi Goyal
Hi,

I am using Solr for searching in my project. I am actually a little bit
confused about how the schema works.
Can you please point me to the documentation where I can define how my
queries should work?
For example, I want a, and, the, etc. not to be searched. Also, it should
not split on case change. And it should not look for sub-words. I mean it
should match the complete word, not a partial match.

Thanks for the help.

Regards,
Aditi


upgrading to lucene 2.3

2008-02-12 Thread Robert Young
I have heard that upgrading to lucene 2.3 in Solr 1.2 is as simple as
replacing the lucene jar and restarting. Is this the case? Has anyone
had any experience with upgrading lucene to 2.3? Did you have any
problems? Is there anything I should be looking out for?

Thanks
Rob


Re: upgrading to lucene 2.3

2008-02-12 Thread Yonik Seeley
On Feb 12, 2008 9:25 AM, Robert Young [EMAIL PROTECTED] wrote:
 ok, and to do the change I just replace the jar directly in
 solr/WEB-INF/lib and restart tomcat?

That should work.

-Yonik


Re: Performance help for heavy indexing workload

2008-02-12 Thread Erick Erickson
Well, the *first* sort on a field is expensive for the underlying Lucene engine,
since it builds up the terms to sort on. Are you closing and opening the
underlying searcher for every request? That is a definite limiter.

Disclaimer: I mostly do Lucene, not SOLR (yet), so don't *even* ask
me how to change this behavior G. But your comment about
frequent updates to the index prompted this question

Best
Erick
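In Solr terms, one standard way to pay that first-sort cost before user traffic arrives (a sketch using the stock QuerySenderListener in solrconfig.xml; the warming query itself is hypothetical) is to warm each new searcher with the sorts you actually use:

```xml
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- hypothetical warming query: forces the sort field's
         term data to be loaded before real queries hit the
         newly opened searcher -->
    <lst>
      <str name="q">solr</str>
      <str name="start">0</str>
      <str name="rows">10</str>
      <str name="sort">date_added asc</str>
    </lst>
  </arr>
</listener>
```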

On Feb 12, 2008 3:54 AM, James Brady [EMAIL PROTECTED] wrote:

 Hi again,
 More analysis showed that the extraordinarily long query times only
 appeared when I specify a sort. A concrete example:

 For a querystring such as: ?indent=on&version=2.2&q=apache+user_id%3A39&start=0&rows=1&fl=*%2Cscore&qt=standard&wt=standard&explainOther=
 The QTime is ~500ms.
 For a querystring such as: ?indent=on&version=2.2&q=apache+user_id%3A39&start=0&rows=1&fl=*%2Cscore&qt=standard&wt=standard&explainOther=&sort=date_added%20asc
 The QTime is ~75s

 I.e. I am using the StandardRequestHandler to search for a user
 entered term (apache above) and filtering by a user_id field.

 This seems to be the case for every sort option except score asc and
 score desc. Please tell me Solr doesn't sort all matching documents
 before applying boolean filters?

 James

 Begin forwarded message:

  From: James Brady [EMAIL PROTECTED]
  Date: 11 February 2008 23:38:16 GMT-08:00
  To: solr-user@lucene.apache.org
  Subject: Performance help for heavy indexing workload
 
  Hello,
  I'm looking for some configuration guidance to help improve
  performance of my application, which tends to do a lot more
  indexing than searching.
 
  At present, it needs to index around two documents / sec - a
  document being the stripped content of a webpage. However,
  performance was so poor that I've had to disable indexing of the
  webpage content as an emergency measure. In addition, some search
  queries take an inordinate length of time - regularly over 60 seconds.
 
  This is running on a medium sized EC2 instance (2 x 2GHz Opterons
  and 8GB RAM), and there's not too much else going on on the box. In
  total, there are about 1.5m documents in the index.
 
  I'm using a fairly standard configuration - the things I've tried
  changing so far have been parameters like maxMergeDocs, mergeFactor
  and the autoCommit options. I'm only using the
  StandardRequestHandler, no faceting. I have a scheduled task
  causing a database commit every 15 seconds.
 
  Obviously, every workload varies, but could anyone comment on
  whether this sort of hardware should, with proper configuration, be
  able to manage this sort of workload?
 
  I can't see signs of Solr being IO-bound, CPU-bound or memory-
  bound, although my scheduled commit operation, or perhaps GC, does
  spike up the CPU utilisation at intervals.
 
  Any help appreciated!
  James




Re: Filter Query

2008-02-12 Thread Shalin Shekhar Mangar
Using q=NAME:Smith&fq=AGE:30 would be better because filter queries
are cached separately and can be re-used regardless of the NAME query.
So if you expect your filter queries to be re-used, you should use fq;
otherwise performance would probably be the same for both NAME:Smith
AND AGE:30 and q=NAME:Smith&fq=AGE:30.
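To make the caching point concrete, a hypothetical pair of requests (host, port and handler path are assumptions):

```text
# First request: computes the AGE:30 document set and caches it
# in the filterCache
http://localhost:8983/solr/select?q=NAME:Smith&fq=AGE:30

# Second request: different main query, same filter; the AGE:30
# set is reused from the cache rather than recomputed
http://localhost:8983/solr/select?q=NAME:Jones&fq=AGE:30
```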

On Feb 13, 2008 1:31 AM, Evgeniy Strokin [EMAIL PROTECTED] wrote:
 Hello,.. Lets say I have one query like this:
 NAME:Smith
 I need to restrict the result and I'm doing this:
 NAME:Smith AND AGE:30
 Also, I can do this using fq parameter:
 q=NAME:Smith&fq=AGE:30
 The result of second and third queries should be the same, right?
 But why should I use fq then? In which cases this is better? Can you give me 
 example to better understand the problem?

 Thank you
 Gene



-- 
Regards,
Shalin Shekhar Mangar.