Fwd: Performance help for heavy indexing workload
Hi again, More analysis showed that the extraordinarily long query times only appeared when I specify a sort. A concrete example:

For a query string such as:
?indent=on&version=2.2&q=apache+user_id%3A39&start=0&rows=1&fl=*%2Cscore&qt=standard&wt=standard&explainOther=
the QTime is ~500ms.

For a query string such as:
?indent=on&version=2.2&q=apache+user_id%3A39&start=0&rows=1&fl=*%2Cscore&qt=standard&wt=standard&explainOther=&sort=date_added%20asc
the QTime is ~75s.

I.e. I am using the StandardRequestHandler to search for a user-entered term (apache above) and filtering by a user_id field. This seems to be the case for every sort option except score asc and score desc. Please tell me Solr doesn't sort all matching documents before applying boolean filters? James

Begin forwarded message: From: James Brady [EMAIL PROTECTED] Date: 11 February 2008 23:38:16 GMT-08:00 To: solr-user@lucene.apache.org Subject: Performance help for heavy indexing workload

Hello, I'm looking for some configuration guidance to help improve performance of my application, which tends to do a lot more indexing than searching. At present, it needs to index around two documents / sec - a document being the stripped content of a webpage. However, performance was so poor that I've had to disable indexing of the webpage content as an emergency measure. In addition, some search queries take an inordinate length of time - regularly over 60 seconds. This is running on a medium sized EC2 instance (2 x 2GHz Opterons and 8GB RAM), and there's not too much else going on on the box. In total, there are about 1.5m documents in the index. I'm using a fairly standard configuration - the things I've tried changing so far have been parameters like maxMergeDocs, mergeFactor and the autoCommit options. I'm only using the StandardRequestHandler, no faceting. I have a scheduled task causing a database commit every 15 seconds.
Obviously, every workload varies, but could anyone comment on whether this sort of hardware should, with proper configuration, be able to manage this sort of workload? I can't see signs of Solr being IO-bound, CPU-bound or memory-bound, although my scheduled commit operation, or perhaps GC, does spike up the CPU utilisation at intervals. Any help appreciated! James
Re: SolrJ and Unique Doc ID
: Honestly: i can't think of a single use case where client code would care : about what the uniqueKey field is, unless it already *knew* what the : uniqueKey field is. : : :-) Abstractions allow one to use different implementations. My : client/display doesn't know about Solr, it just knows it can search and the : Solr implementation part of it can be pointed at any Solr instance (or other : search engines as well), thus it needs to be able to reflect on Solr. The : unique key is a pretty generally useful thing across implementations. but why does your client/display care which field is the uniqueKey field? knowing which fields it might query or ask for in the fl list, sure -- but why does it need to know about the uniqueKey field specifically? I could have an index of people where I document that the SSN field is unique, and never even tell you that it's not the 'uniqueKey' field -- that could be some completely unrelated field I don't want you to know about called customerId -- but that doesn't affect you as a client, you can still query on whatever you want, get back whatever docs you want, etc... the only thing you can't do is delete by id (since you can't be sure which field is the uniqueKey), but you can always delete by query. : In fact, I wish all the ReqHandlers had an introspection option, where one : could see what params are supported as well. you and me both -- but the introspection shouldn't be intrinsic to the RequestHandler - as the Solr admin I may not want to expose all of those options to my clients... http://wiki.apache.org/solr/MakeSolrMoreSelfService -Hoss
Filter Query
Hello, Let's say I have one query like this: NAME:Smith I need to restrict the result and I'm doing this: NAME:Smith AND AGE:30 Also, I can do this using the fq parameter: q=NAME:Smith&fq=AGE:30 The results of the second and third queries should be the same, right? But why should I use fq then? In which cases is this better? Can you give me an example to better understand the problem? Thank you Gene
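For illustration, the equivalence Gene asks about can be sketched in a few lines of Python, assuming the filterCache behaviour Hoss describes elsewhere in this digest: the fq DocSet is computed once, cached, and cheaply intersected with any main query, so the matching documents are identical but repeated queries get faster. Plain sets and dicts stand in for Solr's DocSets and filterCache; all names here are made up.

```python
# Toy model: fq vs. adding the clause to q. Same results, but the fq
# DocSet is cached and reused across different main queries.

docs = {1: {"NAME": "Smith", "AGE": 30},
        2: {"NAME": "Smith", "AGE": 45},
        3: {"NAME": "Jones", "AGE": 30}}

filter_cache = {}

def docset(field, value):
    key = (field, value)
    if key not in filter_cache:            # computed once, reused afterwards
        filter_cache[key] = {i for i, d in docs.items() if d[field] == value}
    return filter_cache[key]

def search(q_field, q_value, fq=None):
    result = docset(q_field, q_value)
    if fq:
        result = result & docset(*fq)      # cheap cached DocSet intersection
    return sorted(result)

print(search("NAME", "Smith", fq=("AGE", 30)))  # [1]
```

A q of NAME:Smith AND AGE:30 would return the same document, but would not leave the AGE:30 DocSet cached for the next query that filters on it.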
Re: Performance help for heavy indexing workload
On 11-Feb-08, at 11:38 PM, James Brady wrote: Hello, I'm looking for some configuration guidance to help improve performance of my application, which tends to do a lot more indexing than searching. At present, it needs to index around two documents / sec - a document being the stripped content of a webpage. However, performance was so poor that I've had to disable indexing of the webpage content as an emergency measure. In addition, some search queries take an inordinate length of time - regularly over 60 seconds. This is running on a medium sized EC2 instance (2 x 2GHz Opterons and 8GB RAM), and there's not too much else going on on the box. In total, there are about 1.5m documents in the index. I'm using a fairly standard configuration - the things I've tried changing so far have been parameters like maxMergeDocs, mergeFactor and the autoCommit options. I'm only using the StandardRequestHandler, no faceting. I have a scheduled task causing a database commit every 15 seconds. By database commit do you mean a Solr commit? If so, that is far too frequent if you are sorting on big fields. I use Solr to serve queries for ~10m docs on a medium size EC2 instance. This is an optimized configuration where highlighting is broken off into a separate index, and load balanced into two subindices of 5m docs apiece. I do a good deal of faceting but no sorting. The only reason that this is possible is that the index is only updated every few days. On another box we have a several-hundred-thousand-document index which is updated relatively frequently (autocommit time: 20s). These are merged with the static-er index to create an illusion of real-time index updates. When Lucene supports efficient, reopen()able FieldCache updates, this situation might improve, but the above architecture would still probably be better. Note that the second index can be on the same machine. -Mike
Re: SolrJ and Unique Doc ID
On Feb 12, 2008, at 3:44 PM, Grant Ingersoll wrote: On Feb 12, 2008, at 2:10 PM, Chris Hostetter wrote: : Honestly: i can't think of a single use case where client code would care : about what the uniqueKey field is, unless it already *knew* what the : uniqueKey field is. : : :-) Abstractions allow one to use different implementations. My : client/display doesn't know about Solr, it just knows it can search and the : Solr implementation part of it can be pointed at any Solr instance (or other : search engines as well), thus it needs to be able to reflect on Solr. The : unique key is a pretty generally useful thing across implementations. but why does your client/display care which field is the uniqueKey field? knowing which fields it might query or ask for in the fl list, sure -- but why does it need to know about the uniqueKey field specifically? How do I generate URLs to retrieve a document against any given Solr instance that I happen to be pointing at without knowing which field is the document id? One cool technique, not instead of your change to the Luke RH (a needed change IMO) but another way to go about it - we have a DocumentRequestHandler that takes a uniqueKey parameter that retrieves and returns that single document without having to specify the field name explicitly. Erik
RE: Performance help for heavy indexing workload
1) Autowarming: it means that if you have a cached query or similar, and do a commit, it then reloads each cached query. This is in solrconfig.xml.
2) Sorting is a pig. A sort creates an array of N integers where N is the size of the index, not the query. If the sorted field is anything but an integer, a second array of size N is created with a copy of the field's contents. If you want a field to sort fast, you have to make it an int or make an integer-format shadow field.
3) Large query return sets cause out-of-memory exceptions. If the Solr is only doing queries, this is OK: the instance keeps working. We find that if the Solr is also indexing when you hit an out-of-memory, the instance is unusable until you restart the Java container. This is with Tomcat 5 and Linux RHEL4 with the standard Linux file system.
4) This can also be done by having one index. You do a mass delete on stuff from 8 days ago. There is a larger IT commitment in running multiple Solrs or Lucene files. This is not Oracle or MySQL, where it is well-behaved and you get cute little UIs to run everything. A large Solr index with continuous indexing is not a turnkey application.
5) Be sure to check out 'filters'. These are really useful for trimming queries if you have commonly used subsets of the index, like language = English.
We were new to Solr and Lucene and transferred over a several-million-record index from FAST in 3 weeks. There is a learning curve, but it is an impressive app. Lance
-Original Message- From: James Brady [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 12, 2008 12:41 PM To: solr-user@lucene.apache.org Subject: Re: Performance help for heavy indexing workload Hi - thanks to everyone for their responses. A couple of extra pieces of data which should help me optimise - documents are very rarely updated once in the index, and I can throw away index data older than 7 days.
So, based on advice from Mike and Walter, it seems my best option will be to have seven separate indices. 6 indices will never change and hold data from the six previous days. One index will change and will hold data from the current day. Deletions and updates will be handled by effectively storing a revocation list in the mutable index. In this way, I will only need to perform Solr commits (yes, I did mean Solr commits rather than database commits below - my apologies) on the current day's index, and closing and opening new searchers for these commits shouldn't be as painful as it is currently. To do this, I need to work out how to do the following: - parallel multi search through Solr - move to a new index on a scheduled basis (probably commit and optimise the index at this point) - ideally, properly warm new searchers in the background to further improve search performance on the changing index Does that sound like a reasonable strategy in general, and has anyone got advice on the specific points I raise above? Thanks, James On 12 Feb 2008, at 11:45, Mike Klaas wrote: On 11-Feb-08, at 11:38 PM, James Brady wrote: Hello, I'm looking for some configuration guidance to help improve performance of my application, which tends to do a lot more indexing than searching. At present, it needs to index around two documents / sec - a document being the stripped content of a webpage. However, performance was so poor that I've had to disable indexing of the webpage content as an emergency measure. In addition, some search queries take an inordinate length of time - regularly over 60 seconds. This is running on a medium sized EC2 instance (2 x 2GHz Opterons and 8GB RAM), and there's not too much else going on on the box. In total, there are about 1.5m documents in the index. I'm using a fairly standard configuration - the things I've tried changing so far have been parameters like maxMergeDocs, mergeFactor and the autoCommit options. 
I'm only using the StandardRequestHandler, no faceting. I have a scheduled task causing a database commit every 15 seconds. By database commit do you mean a Solr commit? If so, that is far too frequent if you are sorting on big fields. I use Solr to serve queries for ~10m docs on a medium size EC2 instance. This is an optimized configuration where highlighting is broken off into a separate index, and load balanced into two subindices of 5m docs apiece. I do a good deal of faceting but no sorting. The only reason that this is possible is that the index is only updated every few days. On another box we have a several-hundred-thousand-document index which is updated relatively frequently (autocommit time: 20s). These are merged with the static-er index to create an illusion of real-time index updates. When Lucene supports efficient, reopen()able FieldCache updates, this situation might improve, but the above architecture would still probably be better. Note that the second index can be on the same machine. -Mike
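Lance's point (2) above can be put into rough numbers. The following back-of-the-envelope sketch assumes the array sizes he describes (one int per document, plus a copy of the field contents for non-integer fields); the byte counts are illustrative assumptions, not measurements of Lucene's actual FieldCache.

```python
# Rough estimate of sort (FieldCache) memory: arrays are sized by the
# whole index, not by the result set, per Lance's description.

def sort_cache_bytes(num_docs, field_type="int", avg_term_bytes=0):
    """Approximate memory needed to sort on one field across num_docs docs."""
    ints = num_docs * 4                      # one 4-byte int per document
    if field_type == "int":
        return ints
    # Non-integer fields also cache a copy of the field's contents.
    return ints + num_docs * avg_term_bytes

# 1.5m docs (the index size from this thread), sorting on a date stored
# as a ~20-byte string:
print(sort_cache_bytes(1_500_000, "string", 20) / 1e6, "MB")  # 36.0 MB
```

Even at these assumed sizes, the cost is paid per sorted field per searcher, which is why frequent commits plus sorting hurt so much: every reopened searcher rebuilds these arrays from scratch.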
Using embedded Solr with admin GUI
Hi all, We're moving towards embedding multiple Solr cores, versus using multiple Solr webapps, as a way of simplifying our build/deploy and also getting more control over the startup/update process. But I'd hate to lose that handy GUI for inspecting the schema and (most importantly) trying out queries with explain turned on. Has anybody tried this dual-mode method of operation? Thoughts on whether it's workable, and what the issues would be? I've taken a quick look at the .jsp and supporting Java code, and have some ideas on what would be needed, but I'm hoping there's an easy(er) approach than just whacking at the admin support code. Thanks, -- Ken -- Ken Krugler Krugle, Inc. +1 530-210-6378 If you can't find it, you can't fix it
Re: what is searcher
A Searcher is the main search abstraction in Lucene. It defines the methods used for querying an underlying index (or indexes). See: http://lucene.apache.org/java/2_3_0/api/org/apache/lucene/search/Searcher.html On Feb 12, 2008 10:33 PM, Mochamad bahri nurhabbibi [EMAIL PROTECTED] wrote: Hello all, I have been learning Solr for the last 2 days. I have to give a training/presentation about Solr to the rest of my colleagues in my company. My question is: what is a searcher? This term seems to be found everywhere, but there's no exact definition of it either in Google or the Solr wiki. Anyone please help me.. thank you regards - habibi- -- Conscious decisions by conscious minds are what make reality real
what is searcher
Hello all, I have been learning Solr for the last 2 days. I have to give a training/presentation about Solr to the rest of my colleagues in my company. My question is: what is a searcher? This term seems to be found everywhere, but there's no exact definition of it either in Google or the Solr wiki. Anyone please help me.. thank you regards - habibi-
Re: Performance help for heavy indexing workload
Hi - thanks to everyone for their responses. A couple of extra pieces of data which should help me optimise - documents are very rarely updated once in the index, and I can throw away index data older than 7 days. So, based on advice from Mike and Walter, it seems my best option will be to have seven separate indices. 6 indices will never change and hold data from the six previous days. One index will change and will hold data from the current day. Deletions and updates will be handled by effectively storing a revocation list in the mutable index. In this way, I will only need to perform Solr commits (yes, I did mean Solr commits rather than database commits below - my apologies) on the current day's index, and closing and opening new searchers for these commits shouldn't be as painful as it is currently. To do this, I need to work out how to do the following: - parallel multi search through Solr - move to a new index on a scheduled basis (probably commit and optimise the index at this point) - ideally, properly warm new searchers in the background to further improve search performance on the changing index Does that sound like a reasonable strategy in general, and has anyone got advice on the specific points I raise above? Thanks, James On 12 Feb 2008, at 11:45, Mike Klaas wrote: On 11-Feb-08, at 11:38 PM, James Brady wrote: Hello, I'm looking for some configuration guidance to help improve performance of my application, which tends to do a lot more indexing than searching. At present, it needs to index around two documents / sec - a document being the stripped content of a webpage. However, performance was so poor that I've had to disable indexing of the webpage content as an emergency measure. In addition, some search queries take an inordinate length of time - regularly over 60 seconds. This is running on a medium sized EC2 instance (2 x 2GHz Opterons and 8GB RAM), and there's not too much else going on on the box. 
In total, there are about 1.5m documents in the index. I'm using a fairly standard configuration - the things I've tried changing so far have been parameters like maxMergeDocs, mergeFactor and the autoCommit options. I'm only using the StandardRequestHandler, no faceting. I have a scheduled task causing a database commit every 15 seconds. By database commit do you mean a Solr commit? If so, that is far too frequent if you are sorting on big fields. I use Solr to serve queries for ~10m docs on a medium size EC2 instance. This is an optimized configuration where highlighting is broken off into a separate index, and load balanced into two subindices of 5m docs apiece. I do a good deal of faceting but no sorting. The only reason that this is possible is that the index is only updated every few days. On another box we have a several-hundred-thousand-document index which is updated relatively frequently (autocommit time: 20s). These are merged with the static-er index to create an illusion of real-time index updates. When Lucene supports efficient, reopen()able FieldCache updates, this situation might improve, but the above architecture would still probably be better. Note that the second index can be on the same machine. -Mike
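James's plan above (several immutable day-indices plus one mutable index carrying a revocation list for deleted or updated documents) can be sketched as follows. Plain dicts stand in for the separate Solr indices, and the search loop stands in for a parallel multi-search; the structure and names are illustrative, not SolrJ or any real Solr API.

```python
# Toy model of the static-plus-delta architecture: search every index,
# but drop any doc whose id appears on the mutable index's revocation list.

def multi_search(term, day_indices, current_index, revoked_ids):
    """Search all indices (conceptually in parallel) and merge results,
    skipping revoked document ids."""
    hits = []
    for index in day_indices + [current_index]:
        for doc_id, text in index.items():
            if term in text and doc_id not in revoked_ids:
                hits.append(doc_id)
    return hits

day1 = {1: "apache web server", 2: "solr search"}
day2 = {3: "apache ant build"}
today = {4: "apache httpd update"}
revoked = {3}                       # doc 3 was deleted or re-indexed today

print(multi_search("apache", [day1, day2], today, revoked))  # [1, 4]
```

The key property, as in the thread, is that only the small current-day index ever takes commits, so only its (cheap) searcher is reopened while the six static indices keep their warm caches.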
Re: 2D Facet
Chris, I'm very interested to implement generic multidimensional faceting. I'm not an expert in Solr, but I'm very good with Java, so I need a little bit more direction if you don't mind. I promise to share my code, and if you're OK with it you are welcome to use it. So, let's say I have a parameter facet.field=STATE. For example we'll take 3D faceting, so I'll need 2 more facet fields related to the first one. Should we do something like this: facet.field=STATE&f.STATE.facet.matrix=NAME&f.STATE.facet.matrix=INCOME Or, for example, we could have it like this: facet.matrix=STATE,NAME,INCOME What would you suggest is better? Also, where in Solr could I find something similar to take as an example? Where should all this logic be placed? Thank you Gene - Original Message From: Chris Hostetter [EMAIL PROTECTED] To: Solr User solr-user@lucene.apache.org Sent: Thursday, January 17, 2008 1:12:32 AM Subject: Re: 2D Facet : : Hello, is this possible to do in one query: I have a query which returns : 1000 documents with names and addresses. I can run facet on state field : and see how many addresses I have in each state. But also I need to see : how many families live in each state. So as a result I need a matrix of : states on top and Last Names on the right. After my first query, knowing : which states I have I can run queries on each state using facet field : Last_Name. But I guess this is not an efficient way. Is this possible to : get in one query? Or may be some other way? if you set rows=0 on all of those queries it won't be horribly inefficient ... the DocSets for each state and lastname should wind up in the filterCache, so most of the queries will just be simple DocSet intersections with only the HTTP overhead (which if you use persistent connections should be fairly minor) The idea of generic multidimensional faceting is actually pretty interesting ...
it could be done fairly simply -- imagine if for every facet.field=foo param, Solr checked for f.foo.facet.matrix params, and once the top facet.limit terms were found for field foo it then computed the top facet counts for each f.foo.facet.matrix field with an implicit fq=foo:term. that would be pretty cool. -Hoss
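Hoss's f.foo.facet.matrix idea can be sketched with plain Python sets standing in for DocSets: for each top term of the primary facet field, re-facet the matrix field against that term's DocSet by intersection, exactly the implicit fq=foo:term he describes. The data and function names below are made up for illustration.

```python
# Sketch of matrix faceting via DocSet intersection: counts for each
# (STATE term, NAME term) pair come from intersecting the two DocSets.

from collections import Counter

docs = [
    {"id": 1, "STATE": "CA", "NAME": "Smith"},
    {"id": 2, "STATE": "CA", "NAME": "Jones"},
    {"id": 3, "STATE": "NY", "NAME": "Smith"},
    {"id": 4, "STATE": "CA", "NAME": "Smith"},
]

def docset(field, term):
    return {d["id"] for d in docs if d[field] == term}

def facet_matrix(primary, matrix_field):
    """For each primary term, count matrix-field terms by DocSet intersection."""
    result = {}
    for term in {d[primary] for d in docs}:
        base = docset(primary, term)        # the implicit fq=primary:term
        counts = Counter()
        for m_term in {d[matrix_field] for d in docs}:
            counts[m_term] = len(base & docset(matrix_field, m_term))
        result[term] = dict(counts)
    return result

print(facet_matrix("STATE", "NAME"))
# e.g. {'CA': {'Smith': 2, 'Jones': 1}, 'NY': {'Smith': 1, 'Jones': 0}}
```

This is also why Hoss's rows=0 workaround is not terrible in practice: once each per-term DocSet sits in the filterCache, the matrix reduces to these cheap intersections.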
Re: upgrading to lucene 2.3
See: https://issues.apache.org/jira/browse/SOLR-330 https://issues.apache.org/jira/browse/SOLR-342 for various solutions around taking advantage of Lucene's new capabilities. -Grant On Feb 12, 2008, at 1:15 PM, Yonik Seeley wrote: On Feb 12, 2008 1:06 PM, Lance Norskog [EMAIL PROTECTED] wrote: What will this improve? Text analysis may be slower since Solr won't have the changes to use the faster Token APIs. Indexing overall should still be faster. Querying should see little change. -Yonik
Re: upgrading to lucene 2.3
On Feb 12, 2008 1:06 PM, Lance Norskog [EMAIL PROTECTED] wrote: What will this improve? Text analysis may be slower since Solr won't have the changes to use the faster Token APIs. Indexing overall should still be faster. Querying should see little change. -Yonik
RE: upgrading to lucene 2.3
What will this improve? -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley Sent: Tuesday, February 12, 2008 6:48 AM To: solr-user@lucene.apache.org Subject: Re: upgrading to lucene 2.3 On Feb 12, 2008 9:25 AM, Robert Young [EMAIL PROTECTED] wrote: ok, and to do the change I just replace the jar directly in solr/WEB-INF/lib and restart tomcat? That should work. -Yonik
Re: SolrJ and Unique Doc ID
On Feb 12, 2008, at 2:10 PM, Chris Hostetter wrote: : Honestly: i can't think of a single use case where client code would care : about what the uniqueKey field is, unless it already *knew* what the : uniqueKey field is. : : :-) Abstractions allow one to use different implementations. My : client/display doesn't know about Solr, it just knows it can search and the : Solr implementation part of it can be pointed at any Solr instance (or other : search engines as well), thus it needs to be able to reflect on Solr. The : unique key is a pretty generally useful thing across implementations. but why does your client/display care which field is the uniqueKey field? knowing which fields it might query or ask for in the fl list, sure -- but why does it need to know about the uniqueKey field specifically? How do I generate URLs to retrieve a document against any given Solr instance that I happen to be pointing at without knowing which field is the document id? At any rate, the problem is solved in SOLR-478 in less than 10 lines of code and doesn't introduce back-compat. issues. I invoke this on instantiation of my client, get the field and then keep it around for use later. I could have an index of people where i document that the SSN field is unique, and never even tell you that it's not the 'uniqueKey' field -- that could be some completely unrelated field i don't want you to know about called customerId -- but that doesn't affect you as a client, you can still query on whatever you want, get back whatever docs you want, etc... the only thing you can't do is delete by id (since you can't be sure which field is the uniqueKey) but you can always delete by query. : In fact, I wish all the ReqHandlers had an introspection option, where one : could see what params are supported as well. you and me both -- but the introspection shouldn't be intrinsic to the RequestHandler - as the Solr admin i may not want to expose all of those options to my clients...
http://wiki.apache.org/solr/MakeSolrMoreSelfService +1
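The client-startup flow Grant describes (discover the uniqueKey once via the Luke request handler, then keep it around) can be sketched as below. The canned XML snippet is an assumed response shape rather than a verbatim Solr capture, and a real client would fetch /admin/luke?show=schema over HTTP instead of using a string.

```python
# Sketch of client-side uniqueKey discovery along the lines of SOLR-478.

import xml.etree.ElementTree as ET

# Assumed (illustrative) shape of a Luke show=schema response.
luke_response = """<response>
  <lst name="index"><int name="numDocs">1500000</int></lst>
  <lst name="schema">
    <str name="uniqueKeyField">id</str>
    <str name="defaultSearchField">text</str>
  </lst>
</response>"""

def unique_key_field(xml_text):
    """Pull the uniqueKey field name out of a Luke-style schema response."""
    root = ET.fromstring(xml_text)
    node = root.find('.//lst[@name="schema"]/str[@name="uniqueKeyField"]')
    return node.text if node is not None else None

print(unique_key_field(luke_response))  # id
```

Doing this once at instantiation keeps the client decoupled from any particular schema, which is the portability Grant is after.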
Re: Strange behavior
On Feb 12, 2008 9:50 AM, Traut [EMAIL PROTECTED] wrote: Thank you, it works. Does the stemming filter work only with lowercased words? I've never tried it in the order you have it. You could try the analysis admin page and report back what happens... -Yonik On Feb 12, 2008 4:29 PM, Yonik Seeley [EMAIL PROTECTED] wrote: Try putting the stemmer after the lowercase filter. -Yonik On Feb 12, 2008 9:15 AM, Traut [EMAIL PROTECTED] wrote: Hi all, please take a look at this strange behavior (connected with stemming, I suppose). The type:

<fieldtype name="customTextField" class="solr.TextField" indexed="true" stored="false">
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>

The field:

<field name="name" type="customTextField" indexed="true" stored="false"/>

I'm adding a document:

<add><doc><field name="id">99</field><field name="name">Apple</field></doc></add>
<commit/>

Querying name:apple - 0 results. Searching name:Apple - 1 result. But name:appl* - 1 result. Adding the next document:

<add><doc><field name="id">8</field><field name="name">Somenamele</field></doc></add>
<commit/>

Searching for name:somenamele - 1 result, for name:Somenamele - 1 result. What is the problem with Apple? Maybe StandardTokenizer understands it as a trademark :) ? Thank you in advance -- Best regards, Traut -- Best regards, Traut
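The fix Yonik suggests (put the stemmer after the lowercase filter) can be illustrated with a toy stemmer that, like the Porter stemmer, only handles lowercase input. The single stemming rule below is made up purely for illustration and is not the real EnglishPorterFilter.

```python
# Toy illustration of the analyzer-order bug: with the stemmer before
# the lowercase filter, "Apple" indexes as "apple" but the query "apple"
# stems to "appl", so the two terms never match.

def toy_stem(token):
    # Fake one-rule stemmer: stems only all-lowercase tokens ("apple" -> "appl").
    if token.islower() and token.endswith("le"):
        return token[:-1]
    return token

def analyze(token, stem_first):
    if stem_first:                       # the broken order from the thread
        return toy_stem(token).lower()
    return toy_stem(token.lower())       # fixed order: lowercase, then stem

# Broken order: indexed term and query term diverge.
print(analyze("Apple", stem_first=True), analyze("apple", stem_first=True))   # apple appl
# Fixed order: both sides produce the same term.
print(analyze("Apple", stem_first=False), analyze("apple", stem_first=False)) # appl appl
```

This also explains why name:appl* matched in the broken setup: the indexed term "apple" happens to start with the unanalyzed prefix "appl".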
Re: Setting the schema files
Aditi Goyal wrote: Hi, I am using Solr searching in my project. I am actually a little bit confused about how the schema works. Can you please point me to the documentation where I can define how my query should work? For example, I want 'a', 'and', 'the', etc. not to be searched. Also, it should not split on case change. And it should not look for subwords; I mean it should match the complete word and not partial words. all docs are pointed to from the Documentation link on the left of: http://lucene.apache.org/solr/ perhaps the most important one is: http://wiki.apache.org/solr/ Specifically, it looks like you are looking for the StopFilterFactory: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-9e6f07472dbdf0facc966ac61c25145be1ae0d5d ryan
Re: Performance help for heavy indexing workload
On 2/12/08 7:40 AM, Ken Krugler [EMAIL PROTECTED] wrote: In general immediate updating of an index with a continuous stream of new content, and fast search results, work in opposition. The searcher's various caches are getting continuously flushed to avoid stale content, which can easily kill your performance. One approach is to have a big, rarely-updated index and a small index for new or changed content. Once a day, add everything from the small index into the big one. You may need external bookkeeping for deleted documents. Another trick from Infoseek. wunder
Re: Performance help for heavy indexing workload
That does seem really slow. Is the index on NFS-mounted storage? wunder

On 2/12/08 7:04 AM, Erick Erickson [EMAIL PROTECTED] wrote: Well, the *first* sort to the underlying Lucene engine is expensive since it builds up the terms to sort. I wonder if you're closing and opening the underlying searcher for every request? This is a definite limiter. Disclaimer: I mostly do Lucene, not SOLR (yet), so don't *even* ask me how to change this behavior <G>. But your comment about frequent updates to the index prompted this question. Best Erick

On Feb 12, 2008 3:54 AM, James Brady [EMAIL PROTECTED] wrote: Hi again, More analysis showed that the extraordinarily long query times only appeared when I specify a sort. A concrete example:

For a query string such as:
?indent=on&version=2.2&q=apache+user_id%3A39&start=0&rows=1&fl=*%2Cscore&qt=standard&wt=standard&explainOther=
the QTime is ~500ms.

For a query string such as:
?indent=on&version=2.2&q=apache+user_id%3A39&start=0&rows=1&fl=*%2Cscore&qt=standard&wt=standard&explainOther=&sort=date_added%20asc
the QTime is ~75s.

I.e. I am using the StandardRequestHandler to search for a user-entered term (apache above) and filtering by a user_id field. This seems to be the case for every sort option except score asc and score desc. Please tell me Solr doesn't sort all matching documents before applying boolean filters? James

Begin forwarded message: From: James Brady [EMAIL PROTECTED] Date: 11 February 2008 23:38:16 GMT-08:00 To: solr-user@lucene.apache.org Subject: Performance help for heavy indexing workload Hello, I'm looking for some configuration guidance to help improve performance of my application, which tends to do a lot more indexing than searching. At present, it needs to index around two documents / sec - a document being the stripped content of a webpage. However, performance was so poor that I've had to disable indexing of the webpage content as an emergency measure.
In addition, some search queries take an inordinate length of time - regularly over 60 seconds. This is running on a medium sized EC2 instance (2 x 2GHz Opterons and 8GB RAM), and there's not too much else going on on the box. In total, there are about 1.5m documents in the index. I'm using a fairly standard configuration - the things I've tried changing so far have been parameters like maxMergeDocs, mergeFactor and the autoCommit options. I'm only using the StandardRequestHandler, no faceting. I have a scheduled task causing a database commit every 15 seconds. Obviously, every workload varies, but could anyone comment on whether this sort of hardware should, with proper configuration, be able to manage this sort of workload? I can't see signs of Solr being IO-bound, CPU-bound or memory-bound, although my scheduled commit operation, or perhaps GC, does spike up the CPU utilisation at intervals. Any help appreciated! James
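One way to take the sting out of Erick's point about the expensive *first* sort is a newSearcher warming listener in solrconfig.xml, so the sort field's cache is built before the searcher takes live traffic. A sketch, with the query term and sort field (apache, date_added) borrowed from this thread and to be adapted to the real schema:

```xml
<!-- Sketch of a warming listener for solrconfig.xml. Running a sorted
     query against each new searcher loads the date_added FieldCache in
     the background instead of on the first user request. -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">apache</str>
      <str name="sort">date_added asc</str>
    </lst>
  </arr>
</listener>
```

With commits every 15 seconds this still rebuilds the cache constantly (Mike's objection elsewhere in the thread), so warming helps latency but not the underlying commit-frequency problem.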
Re: Fwd: Performance help for heavy indexing workload
Hi James, I'm looking for some configuration guidance to help improve performance of my application, which tends to do a lot more indexing than searching. At present, it needs to index around two documents / sec - a document being the stripped content of a webpage. However, performance was so poor that I've had to disable indexing of the webpage content as an emergency measure. In addition, some search queries take an inordinate length of time - regularly over 60 seconds. In general immediate updating of an index with a continuous stream of new content, and fast search results, work in opposition. The searcher's various caches are getting continuously flushed to avoid stale content, which can easily kill your performance. This issue was one of the more interesting topics discussed during the Lucene BoF meeting at ApacheCon. You're not alone in wanting to have it both ways, but it's clear this is A Hard Problem. If you can relax the need for immediate updates to the index, and accept some level of lag time between receiving new content and this showing up in the index, then I'd suggest splitting the two processes. Have a backend system that deals with updates, and then at some slower interval update the search index. -- Ken This is running on a medium sized EC2 instance (2 x 2GHz Opterons and 8GB RAM), and there's not too much else going on on the box. In total, there are about 1.5m documents in the index. I'm using a fairly standard configuration - the things I've tried changing so far have been parameters like maxMergeDocs, mergeFactor and the autoCommit options. I'm only using the StandardRequestHandler, no faceting. I have a scheduled task causing a database commit every 15 seconds. Obviously, every workload varies, but could anyone comment on whether this sort of hardware should, with proper configuration, be able to manage this sort of workload? 
I can't see signs of Solr being IO-bound, CPU-bound or memory-bound, although my scheduled commit operation, or perhaps GC, does spike up the CPU utilisation at intervals. Any help appreciated! James -- Ken Krugler Krugle, Inc. +1 530-210-6378 If you can't find it, you can't fix it
wildcard query question
I have indexed a field called courseTitle of 'text' type (as in the schema.xml, but without the stemming factory) that contains COBOL: Data Structure. Searching with a wildcard query like courseTitle:cobol\:* AND courseTitle:data* AND courseTitle:structure* (the colon character : is escaped), the record is not found. If the search is courseTitle:cobol* AND courseTitle:data* AND courseTitle:structure*, the record is found. I was wondering how the colon character affects the search, and if there is another way to write a wildcard query. Thanks.
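A toy sketch of what is likely happening here, on two assumptions worth checking: a tokenizer like StandardTokenizer drops the colon at index time, so no indexed term ever starts with "cobol:", and prefix (wildcard) terms are matched against raw index terms without analysis. The regex tokenizer below is a stand-in for illustration, not the real StandardTokenizer.

```python
# Why cobol\:* matches nothing: the colon never made it into the index,
# and the escaped wildcard term is compared against index terms literally.

import re

def tokenize(text):
    # Keep word characters only, lowercased - punctuation never reaches the index.
    return [t.lower() for t in re.findall(r"\w+", text)]

index_terms = tokenize("COBOL: Data Structure")   # ['cobol', 'data', 'structure']

def prefix_match(prefix):
    return [t for t in index_terms if t.startswith(prefix)]

print(prefix_match("cobol:"))   # [] - no indexed term contains the colon
print(prefix_match("cobol"))    # ['cobol']
```

Under these assumptions the unescaped form courseTitle:cobol* is the right way to write the query; escaping only controls query-parser syntax, it cannot resurrect a character the analyzer discarded.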
RE: upgrading to lucene 2.3
I did the same: stopped Solr 1.2, replaced the Lucene jars, started Solr 1.2. No problems at all.

-----Original Message-----
From: Robert Young [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 12, 2008 9:25 AM
To: solr-user@lucene.apache.org
Subject: Re: upgrading to lucene 2.3

OK, and to do the change I just replace the jar directly in solr/WEB-INF/lib and restart Tomcat?

Thanks
Rob

On Feb 12, 2008 1:55 PM, Grant Ingersoll [EMAIL PROTECTED] wrote:
> Solr trunk is using the latest Lucene version. Also note there are a couple of edge cases in Lucene 2.3 that are causing problems if you use SOLR-342 with luceneAutoCommit == false. But, yes, you should be able to drop in 2.3, as that is one of the back-compatibility goals for Lucene minor releases.
>
> -Grant
>
> On Feb 12, 2008, at 8:06 AM, Robert Young wrote:
>> I have heard that upgrading to Lucene 2.3 in Solr 1.2 is as simple as replacing the Lucene jar and restarting. Is this the case? Has anyone had any experience with upgrading Lucene to 2.3? Did you have any problems? Is there anything I should be looking out for?
>> Thanks
>> Rob
Re: upgrading to lucene 2.3
OK, and to do the change I just replace the jar directly in solr/WEB-INF/lib and restart Tomcat?

Thanks
Rob

On Feb 12, 2008 1:55 PM, Grant Ingersoll [EMAIL PROTECTED] wrote:
> Solr trunk is using the latest Lucene version. Also note there are a couple of edge cases in Lucene 2.3 that are causing problems if you use SOLR-342 with luceneAutoCommit == false. But, yes, you should be able to drop in 2.3, as that is one of the back-compatibility goals for Lucene minor releases.
>
> -Grant
>
> On Feb 12, 2008, at 8:06 AM, Robert Young wrote:
>> I have heard that upgrading to Lucene 2.3 in Solr 1.2 is as simple as replacing the Lucene jar and restarting. Is this the case? Has anyone had any experience with upgrading Lucene to 2.3? Did you have any problems? Is there anything I should be looking out for?
>> Thanks
>> Rob
Strange behavior
Hi all,

Please take a look at this strange behavior (connected with stemming, I suppose). The type:

<fieldtype name="customTextField" class="solr.TextField" indexed="true" stored="false">
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>

The field:

<field name="name" type="customTextField" indexed="true" stored="false"/>

I'm adding a document:

<add><doc><field name="id">99</field><field name="name">Apple</field></doc></add>
<commit/>

Querying name:apple - 0 results. Searching name:Apple - 1 result. But name:appl* - 1 result.

Adding the next document:

<add><doc><field name="id">8</field><field name="name">Somenamele</field></doc></add>
<commit/>

Searching for name:somenamele - 1 result; for name:Somenamele - 1 result.

What is the problem with "Apple"? Maybe StandardTokenizer understands it as a trademark :) ?

Thank you in advance
--
Best regards,
Traut
Re: upgrading to lucene 2.3
Solr trunk is using the latest Lucene version. Also note there are a couple of edge cases in Lucene 2.3 that are causing problems if you use SOLR-342 with luceneAutoCommit == false. But, yes, you should be able to drop in 2.3, as that is one of the back-compatibility goals for Lucene minor releases.

-Grant

On Feb 12, 2008, at 8:06 AM, Robert Young wrote:
> I have heard that upgrading to Lucene 2.3 in Solr 1.2 is as simple as replacing the Lucene jar and restarting. Is this the case? Has anyone had any experience with upgrading Lucene to 2.3? Did you have any problems? Is there anything I should be looking out for?
> Thanks
> Rob
RE: Commit performance problem
Or, if you have multiple files to be updated, please make sure to index all the files and commit once at the end of indexing.

Jae

-----Original Message-----
From: Jae Joo [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 12, 2008 10:50 AM
To: solr-user@lucene.apache.org
Subject: RE: Commit performance problem

I have the same experience. I have a 6.5GB index and update it daily. Have you ever checked whether the updated file contains no documents and then tried a commit? I don't know why, but it takes so long - more than 10 minutes.

Jae Joo

-----Original Message-----
From: Ken Krugler [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 12, 2008 10:34 AM
To: solr-user@lucene.apache.org
Subject: Re: Commit performance problem

> I have a large Solr index that is currently about 6 GB and is suffering from severe performance problems during updates. A commit can take over 10 minutes to complete. I have tried to increase max memory for the JVM to over 6 GB, but without any improvement. I have also tried to turn off waitSearcher and waitFlush, which does significantly improve the commit speed. However, the max number of searchers is then quickly reached.

If you have a large index, then I'd recommend having a separate Solr installation that you use to update/commit changes, after which you use snappuller or equivalent to swap it into the live (search) system.

> Would a switch to another container (currently using Jetty) make any difference?

Very unlikely.

> Does anyone have any other tip for improving the performance?

Switch to Lucene 2.3, and tune the new parameters that control memory usage during updating.

-- Ken

--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"
RE: Commit performance problem
I have the same experience. I have a 6.5GB index and update it daily. Have you ever checked whether the updated file contains no documents and then tried a commit? I don't know why, but it takes so long - more than 10 minutes.

Jae Joo

-----Original Message-----
From: Ken Krugler [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 12, 2008 10:34 AM
To: solr-user@lucene.apache.org
Subject: Re: Commit performance problem

> I have a large Solr index that is currently about 6 GB and is suffering from severe performance problems during updates. A commit can take over 10 minutes to complete. I have tried to increase max memory for the JVM to over 6 GB, but without any improvement. I have also tried to turn off waitSearcher and waitFlush, which does significantly improve the commit speed. However, the max number of searchers is then quickly reached.

If you have a large index, then I'd recommend having a separate Solr installation that you use to update/commit changes, after which you use snappuller or equivalent to swap it into the live (search) system.

> Would a switch to another container (currently using Jetty) make any difference?

Very unlikely.

> Does anyone have any other tip for improving the performance?

Switch to Lucene 2.3, and tune the new parameters that control memory usage during updating.

-- Ken

--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"
Re: Commit performance problem
> I have a large Solr index that is currently about 6 GB and is suffering from severe performance problems during updates. A commit can take over 10 minutes to complete. I have tried to increase max memory for the JVM to over 6 GB, but without any improvement. I have also tried to turn off waitSearcher and waitFlush, which does significantly improve the commit speed. However, the max number of searchers is then quickly reached.

If you have a large index, then I'd recommend having a separate Solr installation that you use to update/commit changes, after which you use snappuller or equivalent to swap it into the live (search) system.

> Would a switch to another container (currently using Jetty) make any difference?

Very unlikely.

> Does anyone have any other tip for improving the performance?

Switch to Lucene 2.3, and tune the new parameters that control memory usage during updating.

-- Ken

--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"
Commit performance problem
I have a large Solr index that is currently about 6 GB and is suffering from severe performance problems during updates. A commit can take over 10 minutes to complete. I have tried to increase max memory for the JVM to over 6 GB, but without any improvement. I have also tried to turn off waitSearcher and waitFlush, which does significantly improve the commit speed. However, the max number of searchers is then quickly reached.

Would a switch to another container (currently using Jetty) make any difference? Does anyone have any other tip for improving the performance?

TIA,
Anders
Re: Strange behavior
Thank you, it works. Does the stemming filter work only with lowercased words?

On Feb 12, 2008 4:29 PM, Yonik Seeley [EMAIL PROTECTED] wrote:
> Try putting the stemmer after the lowercase filter.
>
> -Yonik
>
> On Feb 12, 2008 9:15 AM, Traut [EMAIL PROTECTED] wrote:
>> Hi all, please take a look at this strange behavior (connected with stemming, I suppose). The type:
>>
>> <fieldtype name="customTextField" class="solr.TextField" indexed="true" stored="false">
>>   <analyzer type="query">
>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>   </analyzer>
>>   <analyzer type="index">
>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>   </analyzer>
>> </fieldtype>
>>
>> The field:
>>
>> <field name="name" type="customTextField" indexed="true" stored="false"/>
>>
>> I'm adding a document:
>>
>> <add><doc><field name="id">99</field><field name="name">Apple</field></doc></add>
>> <commit/>
>>
>> Querying name:apple - 0 results. Searching name:Apple - 1 result. But name:appl* - 1 result.
>>
>> Adding the next document:
>>
>> <add><doc><field name="id">8</field><field name="name">Somenamele</field></doc></add>
>> <commit/>
>>
>> Searching for name:somenamele - 1 result; for name:Somenamele - 1 result.
>>
>> What is the problem with "Apple"? Maybe StandardTokenizer understands it as a trademark :) ?
>>
>> Thank you in advance
>> --
>> Best regards,
>> Traut

--
Best regards,
Traut
Re: Strange behavior
Try putting the stemmer after the lowercase filter.

-Yonik

On Feb 12, 2008 9:15 AM, Traut [EMAIL PROTECTED] wrote:
> Hi all, please take a look at this strange behavior (connected with stemming, I suppose). The type:
>
> <fieldtype name="customTextField" class="solr.TextField" indexed="true" stored="false">
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldtype>
>
> The field:
>
> <field name="name" type="customTextField" indexed="true" stored="false"/>
>
> I'm adding a document:
>
> <add><doc><field name="id">99</field><field name="name">Apple</field></doc></add>
> <commit/>
>
> Querying name:apple - 0 results. Searching name:Apple - 1 result. But name:appl* - 1 result.
>
> Adding the next document:
>
> <add><doc><field name="id">8</field><field name="name">Somenamele</field></doc></add>
> <commit/>
>
> Searching for name:somenamele - 1 result; for name:Somenamele - 1 result.
>
> What is the problem with "Apple"? Maybe StandardTokenizer understands it as a trademark :) ?
>
> Thank you in advance
> --
> Best regards,
> Traut
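For reference, Yonik's suggestion applied to the schema in question would look like the following sketch: LowerCaseFilterFactory is moved ahead of EnglishPorterFilterFactory, and the same reordering is applied to both the index and query analyzers so the two chains stay symmetrical.

```xml
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
</analyzer>
```

With this order the stemmer only ever sees lowercased tokens, so "Apple" and "apple" both stem identically and name:apple matches.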
Re: SolrJ and Unique Doc ID
On Feb 11, 2008, at 11:24 PM, Chris Hostetter wrote:
: Another option is to add it to the responseHeader Or it could be a quick
: add to the LukeRH. The former has the advantage that we wouldn't have to make

adding the info to LukeRequestHandler makes sense.

: Honestly: i can't think of a single use case where client code would care
: about what the uniqueKey field is, unless it already *knew* what the
: uniqueKey field is. :-)

Abstractions allow one to use different implementations. My client/display doesn't know about Solr; it just knows it can search, and the Solr implementation part of it can be pointed at any Solr instance (or other search engines as well), thus it needs to be able to reflect on Solr. The unique key is a pretty generally useful thing across implementations. In fact, I wish all the ReqHandlers had an introspection option, where one could see what params are supported as well.

: Of course, it probably would be useful to be able to request the schema from
: the server and build an IndexSchema object on the client side. This could be
: added to the LukeRH as well.

somebody was working on that at some point ... but i may be thinking of the Ruby client ... no i'm pretty sure i remember it coming up in the context of Java because i remember discussion that a full IndexSchema was too much because it required the client to have the class files for all of the analysis chain and fieldtype classes.

It may be reasonable, as a compromise, to just have metadata about these things. Sort of like BeanInfo provides.

-Grant
Setting the schema files
Hi,

I am using Solr searching in my project. I am actually a little bit confused about how the schema works. Can you please point me to the documentation where I can define how my queries should work? For example, I want "a", "and", "the", etc. not to be searched. Also, it should not split on case change, and it should not look for sub-words - I mean it should match the complete word, not a partial one.

Thanks for the help.

Regards,
Aditi
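A field type along these lines would cover the requirements described above. This is only a sketch using factories that ship with Solr's example schema.xml; the name text_whole is made up for illustration. StopFilterFactory removes "a", "and", "the", etc. (listed in stopwords.txt), and because the chain contains no WordDelimiterFilterFactory and no stemming filter, terms are neither split on case changes nor matched as sub-words.

```xml
<fieldType name="text_whole" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split on whitespace only; no intra-word splitting on case changes -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- drop common words like "a", "and", "the" -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <!-- case-insensitive whole-word matching, no stemming -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```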
upgrading to lucene 2.3
I have heard that upgrading to lucene 2.3 in Solr 1.2 is as simple as replacing the lucene jar and restarting. Is this the case? Has anyone had any experience with upgrading lucene to 2.3? Did you have any problems? Is there anything I should be looking out for? Thanks Rob
Re: upgrading to lucene 2.3
On Feb 12, 2008 9:25 AM, Robert Young [EMAIL PROTECTED] wrote:
> OK, and to do the change I just replace the jar directly in solr/WEB-INF/lib and restart Tomcat?

That should work.

-Yonik
Re: Performance help for heavy indexing workload
Well, the *first* sort on the underlying Lucene engine is expensive, since it builds up the terms to sort by. I wonder if you're closing and opening the underlying searcher for every request? That would be a definite limiter.

Disclaimer: I mostly do Lucene, not Solr (yet), so don't *even* ask me how to change this behavior <G>. But your comment about frequent updates to the index prompted this question.

Best
Erick

On Feb 12, 2008 3:54 AM, James Brady [EMAIL PROTECTED] wrote:
> Hi again,
> More analysis showed that the extraordinarily long query times only appeared when I specify a sort. A concrete example - for a querystring such as:
>
> ?indent=on&version=2.2&q=apache+user_id%3A39&start=0&rows=1&fl=*%2Cscore&qt=standard&wt=standard&explainOther=
>
> the QTime is ~500ms. For a querystring such as:
>
> ?indent=on&version=2.2&q=apache+user_id%3A39&start=0&rows=1&fl=*%2Cscore&qt=standard&wt=standard&explainOther=&sort=date_added%20asc
>
> the QTime is ~75s. I.e. I am using the StandardRequestHandler to search for a user-entered term ("apache" above) and filtering by a user_id field. This seems to be the case for every sort option except score asc and score desc. Please tell me Solr doesn't sort all matching documents before applying boolean filters?
> James
>
> Begin forwarded message:
> From: James Brady [EMAIL PROTECTED]
> Date: 11 February 2008 23:38:16 GMT-08:00
> To: solr-user@lucene.apache.org
> Subject: Performance help for heavy indexing workload
>
> Hello, I'm looking for some configuration guidance to help improve performance of my application, which tends to do a lot more indexing than searching. At present, it needs to index around two documents / sec - a document being the stripped content of a webpage. However, performance was so poor that I've had to disable indexing of the webpage content as an emergency measure. In addition, some search queries take an inordinate length of time - regularly over 60 seconds.
>
> This is running on a medium sized EC2 instance (2 x 2GHz Opterons and 8GB RAM), and there's not too much else going on on the box. In total, there are about 1.5m documents in the index. I'm using a fairly standard configuration - the things I've tried changing so far have been parameters like maxMergeDocs, mergeFactor and the autoCommit options. I'm only using the StandardRequestHandler, no faceting. I have a scheduled task causing a database commit every 15 seconds.
>
> Obviously, every workload varies, but could anyone comment on whether this sort of hardware should, with proper configuration, be able to manage this sort of workload? I can't see signs of Solr being IO-bound, CPU-bound or memory-bound, although my scheduled commit operation, or perhaps GC, does spike up the CPU utilisation at intervals.
>
> Any help appreciated!
> James
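A toy model of the behaviour Erick describes, in plain Python (this is not Lucene's actual implementation): the first sorted query against a given index reader pays to load every document's sort-field value into a cache, and a commit that opens a new searcher throws that work away, so frequent commits mean paying the full cost again and again.

```python
field_cache = {}

def sort_values(reader_id, docs, field):
    # The expensive load of all per-document values happens only on the
    # first sorted query against a given reader; later sorts reuse it.
    key = (reader_id, field)
    if key not in field_cache:
        field_cache[key] = [doc[field] for doc in docs]  # O(maxDoc) work
    return field_cache[key]

def sorted_search(reader_id, docs, matching_ids, field):
    values = sort_values(reader_id, docs, field)
    return sorted(matching_ids, key=lambda doc_id: values[doc_id])

docs = [{"date_added": 3}, {"date_added": 1}, {"date_added": 2}]
sorted_search(0, docs, [0, 1, 2], "date_added")  # first sort: builds the cache
sorted_search(0, docs, [2, 0], "date_added")     # cheap: cache is reused
sorted_search(1, docs, [0, 1, 2], "date_added")  # new reader after a commit: rebuilt
```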
Re: Filter Query
Using q=NAME:Smith&fq=AGE:30 would be better, because filter queries are cached separately and can be re-used regardless of the NAME query. So if you expect your filter queries to be re-used, you should use fq; otherwise performance would probably be the same for both NAME:Smith AND AGE:30 and q=NAME:Smith&fq=AGE:30.

On Feb 13, 2008 1:31 AM, Evgeniy Strokin [EMAIL PROTECTED] wrote:
> Hello,.. let's say I have one query like this: NAME:Smith. I need to restrict the result and I'm doing this: NAME:Smith AND AGE:30. Also, I can do this using the fq parameter: q=NAME:Smith&fq=AGE:30. The result of the second and third queries should be the same, right? But why should I use fq then? In which cases is this better? Can you give me an example to better understand the problem?
> Thank you
> Gene

--
Regards,
Shalin Shekhar Mangar.
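A toy sketch in plain Python (not Solr's implementation) of the caching behaviour Shalin describes: each fq= clause is cached as a set of matching document ids, independently of the main q= query, and that set is reused by later queries with a different q but the same fq.

```python
filter_cache = {}

def filter_docs(field, value, index):
    # Each fq= clause is cached as a set of matching doc ids (filterCache)
    # and reused across queries, independently of the main q= query.
    key = (field, value)
    if key not in filter_cache:
        filter_cache[key] = {i for i, doc in index.items() if doc.get(field) == value}
    return filter_cache[key]

def search(q, fq, index):
    q_field, q_value = q
    main = {i for i, doc in index.items() if doc.get(q_field) == q_value}
    return main & filter_docs(*fq, index)  # intersect with the cached filter set

index = {1: {"NAME": "Smith", "AGE": 30},
         2: {"NAME": "Smith", "AGE": 25},
         3: {"NAME": "Jones", "AGE": 30}}

search(("NAME", "Smith"), ("AGE", 30), index)  # {1}
search(("NAME", "Jones"), ("AGE", 30), index)  # {3}; AGE:30 set comes from the cache
```

By contrast, a single query NAME:Smith AND AGE:30 computes one combined result that nothing else can reuse.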