Re: advice on creating a solr index when data source is from many unrelated db tables
Hi Ahmed, fields that are empty do not impact the index. It's different from a database. I have text fields for different languages, and per document only one of the languages is ever set (the text fields for the other languages are empty/not set). It all works very well and fast. I wonder more about what you describe as unrelated data - why would you want to put unrelated data into a single index? If you want to search on all the data and return mixed results, there surely must be some kind of relation between the documents? Chantal On Thu, 2010-07-29 at 21:33 +0200, S Ahmed wrote: I understand (and it's straightforward) when you want to create an index for something simple like Products. But how do you go about creating a Solr index when you have data coming from 10-15 database tables, and the tables have unrelated data? The issue is that you would then have many 'columns' in your index, and they will be NULL for much of the data, since you are trying to shove 15 db tables into a single Solr/Lucene index. This must be a common problem - what are the potential solutions?
Re: advice on creating a solr index when data source is from many unrelated db tables
On Thu, 29 Jul 2010 15:33:42 -0400 S Ahmed sahmed1...@gmail.com wrote: [...] The issue is then you would have many 'columns' in your index, and they will be NULL for much of the data since you are trying to shove 15 db tables into a single Solr/Lucene index. [...] This should not be a problem. With the Solr DataImportHandler, any NULL values for a given record will simply be ignored, i.e., the Solr index for that document will not contain an entry for that field. Regards, Gora
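The same principle applies if you build documents in your own indexing code rather than through the DataImportHandler: just leave NULL columns out of the document before posting it. A minimal sketch (the row contents, field names, and the 'type' tag are hypothetical, not from the thread):

```python
# Sketch: drop NULL (None) columns so the Solr document only carries fields
# that are actually set - empty fields then cost nothing in the index.
def row_to_doc(row, doc_type):
    """Convert a DB row dict to a Solr document dict, skipping NULLs."""
    doc = {k: v for k, v in row.items() if v is not None}
    doc["type"] = doc_type  # tag the source table so results can be filtered
    return doc

row = {"id": 7, "title": "Invoice #7", "notes": None, "amount": None}
doc = row_to_doc(row, "sales")
# 'notes' and 'amount' never reach the index at all.
```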
Re: Get unique values
2010/7/28 Rafal Bluszcz Zawadzki ra...@headnet.dk Hi, In my schema I have (inter alia) the fields CollectionID and CollectionName. These two values always match together, which means that for every value of CollectionID there is a matching value of CollectionName. I am interested in a query which allows me to get the unique values of CollectionID with their matching CollectionNames (the rest of the fields are not of interest to me in this query). Finally I decided to store the values in one indexed field (Collections), and the query below did the trick: q=*:*&rows=0&facet=on&facet.field=Collections -- Rafał Zawadzki http://dev.bluszcz.net/
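For reference, the same faceting query can be assembled and URL-encoded programmatically; a sketch in Python (the host, port, and core path are assumptions):

```python
from urllib.parse import urlencode

# Build the "unique values via faceting" query from the thread:
# match all documents, return no rows, and facet on the Collections field.
params = {
    "q": "*:*",
    "rows": 0,            # we only want facet counts, not documents
    "facet": "on",
    "facet.field": "Collections",
}
query_string = urlencode(params)
url = "http://localhost:8983/solr/select?" + query_string
# Each distinct Collections value comes back as a facet count in the response.
```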
Re: wildcard and proximity searches
What approach should I use to perform wildcard and proximity searches? Like: "solr mail*"~10, for getting docs where solr is within 10 words of mailing, for instance? You can do it with the plug-in described here: https://issues.apache.org/jira/browse/SOLR-1604 It would be great if you could test it and give feedback.
Solr and Lucene in South Africa
Hi to all Solr/Lucene users... Our team had a discussion today regarding the Solr/Lucene community closer to home. I am hereby putting out an SOS to all Solr/Lucene users in the South African market and wish to organize a meet-up (or user support group) if at all possible. It would be great to share some triumphs and pitfalls that were experienced. * Sorry for hogging the user mailing list with a non-technical question, but I think this is the easiest way to get it done :) Jaco Olivier Web Specialist
RE: wildcard and proximity searches
Hi Ahmet, Thank you. I'll be happy to test it if I manage to install it OK. I'm a newbie at Solr, but I'm going to try the instructions in the thread to load it. Some other doubts I have about wildcard searches: a) I think wildcard search is case sensitive by default? Is there a way to make it case insensitive? b) I have about 6000 queries to run (they could have wildcards, proximity searches, or just normal queries). I discovered that the normal query type doesn't work with wildcards, and so I'm using the Filter Query to query these. Is this field slower? I notice that using this field my queries are much slower (I have some queries like *word* or *word1* or *word2* that take about one minute to perform). Is there a way to optimize these queries (without removing the wildcards :))? c) Is there a way to do phrase queries with wildcards? Like "solr* mail*"? Because in the tests I made, when using quotes I think the wildcards are ignored. d) How exactly do the pf (phrase fields) and ps (phrase slop) parameters work, and what's the difference from proximity searches (ex: "word word2"~20)? Sorry for the long email and thank you for your help... Frederico
RE: wildcard and proximity searches
a) I think wildcard search is case sensitive by default? Is there a way to make it case insensitive? Wildcard searches are not analyzed. For a case-insensitive search you can lowercase the query terms on the client side (while using a lowercase filter at index time), e.g. Mail* => mail* I discovered that the normal query type doesn't work with wildcards and so I'm using the Filter Query to query these. I don't understand this. Wildcard search works with the q parameter, if that is what you are asking: q=mail* ... my queries are much slower (I have some queries like *word* or *word1* or *word2* that take about one minute to perform). Is there a way to optimize these queries (without removing the wildcards :))? It is normal for a leading wildcard search to be slow. Using ReversedWildcardFilterFactory at index time can speed it up. But it is unusual to use both the leading and trailing * operator. Why are you doing this? c) Is there a way to do phrase queries with wildcards? Like "solr* mail*"? Because in the tests I made, when using quotes I think the wildcards are ignored. By default it is not supported. With SOLR-1604 it is possible. d) How exactly do the pf (phrase fields) and ps (phrase slop) parameters work, and what's the difference from proximity searches (ex: "word word2"~20)? These parameters are specific to the dismax query parser. http://wiki.apache.org/solr/DisMaxQParserPlugin
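The client-side lowercasing Ahmet describes can be sketched like this (a hypothetical helper on the client, not part of Solr):

```python
def lowercase_wildcard_term(term):
    """Lowercase a wildcard query term without disturbing the wildcards.

    Wildcard queries bypass the index analyzer, so if the field was indexed
    with a lowercase filter, the client must lowercase the term itself.
    str.lower() leaves '*' and '?' unchanged, so it is sufficient here.
    """
    return term.lower()

# e.g. the thread's example: Mail* must be sent to Solr as mail*
assert lowercase_wildcard_term("Mail*") == "mail*"
```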
Re: Can't find org.apache.solr.client.solrj.embedded
Sorry, I had inspected the ...core.jar three times without recognizing the package. I was really blind. =8-) thanks Uwe On 26.07.2010 20:48, Chris Hostetter wrote: : where is a Jar containing org.apache.solr.client.solrj.embedded? Classes in the embedded package are useless w/o the rest of the Solr internal core classes, so they are included directly in apache-solr-core-1.4.1.jar. -Hoss
RE: wildcard and proximity searches
Hi Ahmet, a) I think wildcard search is case sensitive by default? Is there a way to make it case insensitive? Wildcard searches are not analyzed. For a case-insensitive search you can lowercase the query terms on the client side (using a lowercase filter at index time), e.g. Mail* => mail* I discovered that the normal query type doesn't work with wildcards and so I'm using the Filter Query to query these. I don't understand this. Wildcard search works with the q parameter if that is what you are asking: q=mail* For the two points above, my bad. I'm already using the lowercase filter, but I was not lowercasing the query with wildcards (the others are lowered by the analyzer). So it's working fine now! In my tests yesterday I was probably testing q=Mail* and fq=mail* (and didn't notice the difference), and I had read somewhere that it wasn't possible (probably in an older Solr version), so I reached the wrong conclusion that it wasn't working. But it is unusual to use both the leading and trailing * operator. Why are you doing this? Yes, I know, but I have a few queries that need this. I'll try the ReversedWildcardFilterFactory. By default it is not supported. With SOLR-1604 it is possible. OK then. I guess SOLR-1604 is the answer to most of my problems. I'm going to give it a try and then I'll share some feedback. Thanks for your help and sorry for my newbie confusions. :) Frederico
Re: Customize order field list ???
I believe they come back alphabetically sorted (not sure if this is language-specific or not), so a quick way might be to change the name from createdate to zz_createdate or something like that. Generally, though, with XML you should not be worried about order. It's usually a sign of a design issue somewhere if the order of the fields matters.
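A consumer that currently depends on field order can instead read the Solr XML response by field name, which sidesteps the ordering question entirely. A sketch against a stripped-down sample response (real responses have more wrapping; the field names are made up):

```python
import xml.etree.ElementTree as ET

# A minimal, hypothetical Solr XML response for illustration.
sample = """
<response>
  <result numFound="1" start="0">
    <doc>
      <str name="id">42</str>
      <date name="createdate">2010-07-30T00:00:00Z</date>
    </doc>
  </result>
</response>
"""

def field_value(doc_elem, field_name):
    """Look a field up by its name attribute, ignoring document order."""
    for child in doc_elem:
        if child.get("name") == field_name:
            return child.text
    return None

doc = ET.fromstring(sample).find(".//doc")
assert field_value(doc, "createdate") == "2010-07-30T00:00:00Z"
```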
Re: Using Solr to perform range queries in Dspace
Thank you for your reply. Here is some background on what I am trying to achieve. I want to be able to perform a search across numeric index ranges and get the results in logical ordering instead of lexicographic ordering, using DSpace. Currently, if I do a search using the query var:[10 TO 50], and there are any values with index 1000, 100, or a float, say 10.x, the result returns all of these values plus any other values that fall within the lexicographic range. A similar result is returned if I enter any other numeric data type. In Solr I see that TrieDoubleField, TrieLongField, SortableIntField, etc. can be used to perform numeric range queries and return the results in logical ordering. I was thinking about using either the TrieField classes for int, double, etc. and/or the SortableIntField and SortableLongField classes defined in Solr to perform range query searches in DSpace. -Mckeane
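The mis-ordering is easy to reproduce: string comparison sorts numbers character by character, which is what a plain string field does during a range query. A quick illustration:

```python
values = ["10", "100", "1000", "50", "9"]

# Lexicographic (string) ordering - what a plain string field gives you:
assert sorted(values) == ["10", "100", "1000", "50", "9"]

# Numeric ordering - what Trie/Sortable field types give you:
assert sorted(values, key=float) == ["9", "10", "50", "100", "1000"]

# So a string range [10 TO 50] wrongly includes "100" and "1000"
# (both compare <= "50" character by character) while excluding "9".
assert "10" <= "100" <= "50"
```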
Re: Solr searching performance issues, using large documents
Highlighting time is mainly spent on retrieving the field which you want to highlight and tokenizing this field (if you don't store term vectors). You can check what's wrong. 2010/7/30 Peter Spam ps...@mac.com: If I don't do highlighting, it's really fast. Optimize has no effect. -Peter On Jul 29, 2010, at 11:54 AM, dc tech wrote: Are you storing the entire log file text in Solr? That's almost 3GB of text that you are storing in Solr. Try to: 1) Is this first-time performance or on repeat queries with the same fields? 2) Optimize the index and test performance again. 3) Index without storing the text and see what the performance looks like. On 7/29/10, Peter Spam ps...@mac.com wrote: Any ideas? I've got 5000 documents with an average size of 850k each, and it sometimes takes 2 minutes for a query to come back when highlighting is turned on! Help! -Pete On Jul 21, 2010, at 2:41 PM, Peter Spam wrote: From the mailing list archive, Koji wrote: 1. Provide another field for highlighting and use copyField to copy plainText to the highlighting field. and Lance wrote: http://www.mail-archive.com/solr-user@lucene.apache.org/msg35548.html If you want to highlight field X, adding the termOffsets/termPositions/termVectors options will make highlighting that field faster. You should make a separate field and apply these options to that field. Now: doing a copyField adds a value to a multiValued field. For a text field, you get a multi-valued text field. You should only copy one value to the highlighted field, so just copyField the document to your special field. To enforce this, I would add multiValued=false to that field, just to avoid mistakes. So, all_text should be indexed without the term* attributes, and should not be stored. Then your document is stored in a separate field that you use for highlighting and that has the term* attributes.
I've been experimenting with this, and here's what I've tried:
<field name="body" type="text_pl" indexed="true" stored="false" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
<field name="body_all" type="text_pl" indexed="false" stored="true" multiValued="true" />
<copyField source="body" dest="body_all"/>
... but it's still very slow (10+ seconds). Why is it better to have two fields (one indexed but not stored, and the other not indexed but stored) rather than just one field that's both indexed and stored? From the Perf wiki page http://wiki.apache.org/solr/SolrPerformanceFactors: If you aren't always using all the stored fields, then enabling lazy field loading can be a huge boon, especially if compressed fields are used. What does this mean? How do you load a field lazily? Thanks for your time, guys - this has started to become frustrating, since it works so well, but is very slow! -Pete On Jul 20, 2010, at 5:36 PM, Peter Spam wrote: Data set: About 4,000 log files (will eventually grow to millions). Average log file is 850k. Largest log file (so far) is about 70MB. Problem: When I search for common terms, the query time goes from under 2-3 seconds to about 60 seconds. TermVectors etc. are enabled. When I disable highlighting, performance improves a lot, but is still slow for some queries (7 seconds). Thanks in advance for any ideas! -Peter - 4GB RAM server % java -Xms2048M -Xmx3072M -jar start.jar - schema.xml changes:
<fieldType name="text_pl" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
  </analyzer>
</fieldType>
...
<field name="body" type="text_pl" indexed="true" stored="true" multiValued="false" termVectors="true" termPositions="true" termOffsets="true" />
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
<field name="version" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="device" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="filename" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="filesize" type="long" indexed="true" stored="true" multiValued="false"/>
<field name="pversion" type="int" indexed="true" stored="true" multiValued="false"/>
<field name="first2md5" type="string" indexed="false" stored="true" multiValued="false"/>
<field name="ckey" type="string" indexed="true" stored="true" multiValued="false"/>
...
<dynamicField name="*" type="ignored" multiValued="true" />
<defaultSearchField>body</defaultSearchField>
<solrQueryParser defaultOperator="AND"/>
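For reference, the highlighting request itself is driven by hl parameters on the search URL; a sketch of such a request (host, port, and the query term are assumptions):

```python
from urllib.parse import urlencode

# Build a search request with highlighting enabled on the body field.
params = urlencode({
    "q": "body:error",  # whatever common term is being searched for
    "hl": "true",       # turn highlighting on
    "hl.fl": "body",    # field(s) to generate snippets from
    "rows": 10,
})
url = "http://localhost:8983/solr/select?" + params
# It is this snippet generation over very large stored fields that the
# thread identifies as the expensive step.
```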
Re: question about relevance
Hi, Thanks a lot for the info and your time. I think field collapsing will work for us. I looked at https://issues.apache.org/jira/browse/SOLR-236, but which file should I use for the patch? We use Solr 1.3. Thanks Bharat Jain On Fri, Jul 30, 2010 at 12:53 AM, Chris Hostetter hossman_luc...@fucit.org wrote: : 1. There are user records of type A, B, C etc. (the userId field in the index is : common to all records) : 2. A user can have any number of A, B, C etc. (e.g., think of A being a : language; then a user can know many languages, like French, English, German, etc.) : 3. Records are currently stored as documents in the index. : 4. A given query can match multiple records for the user. : 5. If more records are matched for a user (e.g., if he knows both French and : German), then he is more relevant and should come up top in the UI. This is the : reason I wanted to add Lucene scores, assuming a greater score means more : relevance. if your goal is to get back users from each search, then you should probably change your indexing strategy so that each user has a single document -- fields like language can be multivalued, etc. ... then a search for language:en language:fr will return users who speak English or French, and the ones that speak both will score higher. if you really can't change the index structure, then essentially what you are looking for is a field collapsing solution on the userId field, where you want each collapsed group to get a cumulative score. i don't know if the existing field collapsing patches support this -- if you are already willing/capable to do it in the client then that may be the simplest thing to support moving forward. Adding the scores is certainly one metric you could use -- it's generally suspicious to try and imply too much meaning to scores in lucene/solr, but that's because people typically try to imply broader absolute meaning. in the case of a single query the scores are relative to each other, and adding up all the scores for a given userId is approximately what would happen in my example above -- except that there is also a coord factor that would penalize documents that only match one clause ... it's complicated, but as an approximation adding the scores might give you what you are looking for -- only you can know for sure based on your specific data. -Hoss
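Done on the client side, the cumulative-score collapsing Hoss describes might look like this sketch (the hit list is a simplified stand-in for real search results; the scores are made up):

```python
from collections import defaultdict

# Simplified results: one record-document per match, as in the thread.
hits = [
    {"userId": "u1", "score": 1.2},  # u1 matches language:en
    {"userId": "u1", "score": 0.9},  # u1 also matches language:fr
    {"userId": "u2", "score": 1.5},  # u2 matches only one clause
]

# Collapse on userId, summing scores so multi-record users rank higher.
collapsed = defaultdict(float)
for hit in hits:
    collapsed[hit["userId"]] += hit["score"]

ranking = sorted(collapsed, key=collapsed.get, reverse=True)
assert ranking == ["u1", "u2"]  # the two-match user wins despite lower
                                # individual scores
```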
Document Boost with Solr Extraction - SolrContentHandler
We are using the Solr Extract Handler (/update/extract) for indexing document metadata with attachments. However, the SolrContentHandler doesn't seem to support the index-time document boost attribute. Probably document.setDocumentBoost(Float.parseFloat(boost)) is missing. Regards, Jayendra
Re: advice on creating a solr index when data source is from many unrelated db tables
So I have tables like this: Users UserSales UserHistory UserAddresses UserNotes ClientAddress CalenderEvent Articles Blogs It just seems odd to me, jamming all these tables into a single index. But I guess the idea of using a 'type' field to qualify exactly what I am searching is a good one, in case I need to filter for only 'articles' or blogs or contacts etc. But there might be 50 fields if I do this, no? On Fri, Jul 30, 2010 at 4:01 AM, Chantal Ackermann chantal.ackerm...@btelligent.de wrote: [...]
Re: Solr searching performance issues, using large documents
I do store term vectors: <field name="body" type="text_pl" indexed="true" stored="true" multiValued="false" termVectors="true" termPositions="true" termOffsets="true" /> -Pete On Jul 30, 2010, at 7:30 AM, Li Li wrote: Highlighting time is mainly spent on retrieving the field which you want to highlight and tokenizing this field (if you don't store term vectors). You can check what's wrong. [...]
Programmatically retrieving numDocs (or any other statistic)
I want to programmatically retrieve the number of indexed documents, i.e., get the value of numDocs. The only two ways I've come up with are searching for *:* and reporting the hit count, or sending an HTTP GET to http://xxx.xx.xxx.xxx:8080/solr/admin/stats.jsp#core and searching for <stat name="numDocs"> ... </stat> in the response. Both seem to be overkill. Is there an easier way to ask SolrIndexSearcher what numDocs is? (I'm doing this in Python, using Pysolr, if that matters.) Thanks!
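The *:* approach is cheaper than it sounds if you ask for zero rows: the numFound attribute of the result then carries the document count without any documents being fetched. A sketch of building the request and pulling the count out of a sample response (host, port, and the count value are illustrative):

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Build the request: match all documents but return none of them.
params = urlencode({"q": "*:*", "rows": 0})
url = "http://localhost:8080/solr/select?" + params  # host/port assumed

# Parsing the count from a (sample) XML response - numFound is the count.
sample_response = '<response><result numFound="1234" start="0"/></response>'
num_docs = int(ET.fromstring(sample_response).find("result").get("numFound"))
assert num_docs == 1234
```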
Re: Help with schema design
I'd just index the eventtype, eventby and eventtime as separate fields. Then query with something like: eventtype:update AND eventtime:[targettime TO *]. Similarly, for events updated by pramod, the query would be something like: eventby:pramod AND eventtype:update HTH Erick On Wed, Jul 28, 2010 at 11:05 PM, Pramod Goyal pramod.go...@gmail.com wrote: Hi, I have a use case where I get a document and a list of events that have happened on the document. For example: First document: Some text content Events:
Event Type    Event By    Event Time
Update        Pramod      06062010 2:30:00
Update        Raj         06062010 2:30:00
View          Rahul       07062010 1:30:00
I would like to support queries like get all documents where Event Type = ? and Event Time is greater than ?, and also queries like get all the documents updated by Pramod. How should I design my schema to support this use case? Thanks, Regards, Pramod Goyal
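With the three event fields indexed per event-document as Erick suggests, the two example queries can be assembled like this sketch (the timestamp below assumes an ISO-style Solr date format, which differs from the thread's 06062010 notation; the field names come from the thread):

```python
# Sketch: building the two query strings Erick describes.
def updates_after(target_time):
    """All update events at or after target_time (open-ended range)."""
    return "eventtype:update AND eventtime:[%s TO *]" % target_time

def updates_by(user):
    """All update events performed by the given user."""
    return "eventby:%s AND eventtype:update" % user

q1 = updates_after("2010-06-06T02:30:00Z")
q2 = updates_by("pramod")
assert q2 == "eventby:pramod AND eventtype:update"
```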
Re: Solr using 1500 threads - is that normal?
Glad to help. Do be aware that there are several config values that influence the commit frequency; they might also be relevant. Best Erick On Thu, Jul 29, 2010 at 5:11 AM, Christos Constantinou ch...@simpleweb.co.uk wrote: Erick, Thank you very much for the indicators! I had a closer look at the commit intervals and it seems that the application is gradually increasing the commits to almost once per second after some time - something that was hidden in the massive amount of queries in the log file. I have changed the code to use commitWithin rather than commit, and everything looks much better now. I believe that might have solved the problem, so thanks again. Christos On 29 Jul 2010, at 01:44, Erick Erickson wrote: Your commits are very suspect. How often are you making changes to your index? Do you have autocommit on? Do you commit when updating each document? Committing too often and consequently firing off warmup queries is the first place I'd look. But I agree with dc tech, 1,500 is way more than I would expect. Best Erick On Wed, Jul 28, 2010 at 6:53 AM, Christos Constantinou ch...@simpleweb.co.uk wrote: Hi, Solr seems to be crashing after a JVM exception that new threads cannot be created. I am writing in hope of advice from someone who has experienced this before. The exception that is causing the problem is: Exception in thread btpool0-5 java.lang.OutOfMemoryError: unable to create new native thread The memory that is allocated to Solr is 3072MB, which should be enough for a ~6GB data set. The documents are not big either; they have around 10 fields, of which only one stores large text, ranging between 1k-50k. The top command at the time of the crash shows Solr using around 1500 threads, which I assume is not normal. Could it be that the threads are crashing one by one and new ones are created to cope with the queries? In the log file, right after the exception, there are several thousand commits before the server stalls completely.
Normally, the log file would report 20-30 document existence queries per second, then 1 commit per 5-30 seconds, and some more infrequent faceted document searches on the data. However after the exception, there are only commits until the end of the log file. I am wondering if anyone has experienced this before or if it is some sort of known bug from Solr 1.4? Is there a way to increase the details of the exception in the logfile? I am attaching the output of a grep Exception command on the logfile. Jul 28, 2010 8:19:31 AM org.apache.solr.common.SolrException log SEVERE: org.apache.solr.common.SolrException: Error opening new searcher. exceeded limit of maxWarmingSearchers=2, try again later. Jul 28, 2010 8:19:31 AM org.apache.solr.common.SolrException log SEVERE: org.apache.solr.common.SolrException: Error opening new searcher. exceeded limit of maxWarmingSearchers=2, try again later. Jul 28, 2010 8:19:31 AM org.apache.solr.common.SolrException log SEVERE: org.apache.solr.common.SolrException: Error opening new searcher. exceeded limit of maxWarmingSearchers=2, try again later. Jul 28, 2010 8:19:32 AM org.apache.solr.common.SolrException log SEVERE: org.apache.solr.common.SolrException: Error opening new searcher. exceeded limit of maxWarmingSearchers=2, try again later. Jul 28, 2010 8:20:18 AM org.apache.solr.common.SolrException log SEVERE: org.apache.solr.common.SolrException: Error opening new searcher. exceeded limit of maxWarmingSearchers=2, try again later. Jul 28, 2010 8:20:48 AM org.apache.solr.common.SolrException log SEVERE: org.apache.solr.common.SolrException: Error opening new searcher. exceeded limit of maxWarmingSearchers=2, try again later. Jul 28, 2010 8:22:43 AM org.apache.solr.common.SolrException log SEVERE: org.apache.solr.common.SolrException: Error opening new searcher. exceeded limit of maxWarmingSearchers=2, try again later. 
Jul 28, 2010 8:27:53 AM org.apache.solr.common.SolrException log SEVERE: org.apache.solr.common.SolrException: Error opening new searcher. exceeded limit of maxWarmingSearchers=2, try again later. Jul 28, 2010 8:27:53 AM org.apache.solr.common.SolrException log SEVERE: org.apache.solr.common.SolrException: Error opening new searcher. exceeded limit of maxWarmingSearchers=2, try again later. Jul 28, 2010 8:27:53 AM org.apache.solr.common.SolrException log SEVERE: org.apache.solr.common.SolrException: Error opening new searcher. exceeded limit of maxWarmingSearchers=2, try again later. Jul 28, 2010 8:28:50 AM org.apache.solr.common.SolrException log SEVERE: org.apache.solr.common.SolrException: Error opening new searcher. exceeded limit of maxWarmingSearchers=2, try again later. Jul 28, 2010 8:33:19 AM org.apache.solr.common.SolrException log SEVERE: org.apache.solr.common.SolrException: Error opening new
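Christos's fix above - commitWithin instead of explicit commits - is expressed as an attribute on the XML update message, letting Solr batch commits on its own schedule. A sketch of building such a message (the 10-second window and the document fields are made-up examples):

```python
import xml.etree.ElementTree as ET

# Build an <add> update message asking Solr to commit within 10 seconds
# of receiving the document, instead of the client committing every batch.
add = ET.Element("add", commitWithin="10000")  # milliseconds
doc = ET.SubElement(add, "doc")
field = ET.SubElement(doc, "field", name="id")
field.text = "doc-1"

payload = ET.tostring(add, encoding="unicode")  # POST this to /update
```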
Re: Solr Indexing slows down
See the subject about 1500 threads. The first place I'd look is how often you're committing. If you're committing before the warmup queries from the previous commit have done their magic, you might be getting into a death spiral. HTH Erick On Thu, Jul 29, 2010 at 7:02 AM, Peter Karich peat...@yahoo.de wrote: Hi, I am indexing a Solr 1.4.0 core and committing gets slower and slower. Starting from 3-5 seconds for ~200 documents and ending with over 60 seconds after 800 commits. Then, if I reload the index, it is as fast as before! And today I read a similar thread [1], and indeed: if I set autowarming for the caches to 0, the slowdown disappears. BUT at the same time I would like to offer searching on that core, which would be dramatically slowed down (due to no autowarming). Does someone know a better solution to avoid index slow-down? Regards, Peter. [1] http://www.mail-archive.com/solr-user@lucene.apache.org/msg20785.html
Problems running on tomcat
Hi, I'm new with Solr and I'm doing my first installation under Tomcat. I followed the documentation at http://wiki.apache.org/solr/SolrTomcat#Installing_Tomcat_6 but there are some problems. http://localhost:8080/solr/admin works fine, but in some cases, for example when viewing my schema.xml from the admin console, the error below happens: HTTP Status 404 - /solr/admin/file/index.jsp. Has anybody already seen this? Is there some trick to it? Tks -- Claudio Devecchi
Re: Solr Indexing slows down
Hi Erick! thanks for the response! I will answer your questions ;-) How often are you making changes to your index? Every 30-60 seconds. Too heavy? Do you have autocommit on? No. Do you commit when updating each document? No. I commit after a batch update of 200 documents. Committing too often and consequently firing off warmup queries is the first place I'd look. Why is committing firing warmup queries? Is there any documentation about this subject? How can I be sure that the previous commit has done its magic? there are several config values that influence the commit frequency I now know the autowarm and the mergeFactor config. What else? Is this documentation complete: http://wiki.apache.org/lucene-java/ImproveIndexingSpeed ? Regards, Peter. See the subject about 1500 threads. The first place I'd look is how often you're committing. If you're committing before the warmup queries from the previous commit have done their magic, you might be getting into a death spiral. HTH Erick On Thu, Jul 29, 2010 at 7:02 AM, Peter Karich peat...@yahoo.de wrote: Hi, I am indexing a solr 1.4.0 core and committing gets slower and slower. Starting from 3-5 seconds for ~200 documents and ending with over 60 seconds after 800 commits. Then, if I reloaded the index, it is as fast as before! And today I have read a similar thread [1] and indeed: if I set autowarming for the caches to 0 the slowdown disappears. BUT at the same time I would like to offer searching on that core, which would be dramatically slowed down (due to no autowarming). Does someone know a better solution to avoid index-slow-down? Regards, Peter. [1] http://www.mail-archive.com/solr-user@lucene.apache.org/msg20785.html
Re: Programmatically retrieving numDocs (or any other statistic)
Both approaches are ok, I think. (although I don't know the python API) BTW: If you query q=*:* then add rows=0 to avoid some traffic. Regards, Peter. I want to programmatically retrieve the number of indexed documents. I.e., get the value of numDocs. The only two ways I've come up with are searching for *:* and reporting the hit count, or sending an Http GET to http://xxx.xx.xxx.xxx:8080/solr/admin/stats.jsp#core and searching for stat name=numDocs /stat in the response. Both seem to be overkill. Is there an easier way to ask SolrIndexSearcher, what's numDocs? (I'm doing this in Python, using Pysolr, if that matters.) Thanks!
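For the curious, the *:* plus rows=0 approach boils down to parsing numFound out of the response. A minimal sketch in Python — the canned response string below is made up for illustration; a real call would GET /solr/select?q=*:*&rows=0&wt=json:

```python
import json

# Hypothetical helper: pull the document count out of a Solr select
# response body. rows=0 keeps the payload down to the header and
# the numFound count, so no document data travels over the wire.
def num_docs(response_body):
    data = json.loads(response_body)
    return data["response"]["numFound"]

# Canned response standing in for what Solr would return.
sample = '{"responseHeader":{"status":0},"response":{"numFound":1542,"start":0,"docs":[]}}'
print(num_docs(sample))  # -> 1542
```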
Re: Solr searching performance issues, using large documents
Hi Peter :-), did you already try other values for hl.maxAnalyzedChars=2147483647 ? Also regular expression highlighting is more expensive, I think. What does the 'fuzzy' variable mean? If you use this to query via ~someTerm instead someTerm then you should try the trunk of solr which is a lot faster for fuzzy or other wildcard search. Regards, Peter. Data set: About 4,000 log files (will eventually grow to millions). Average log file is 850k. Largest log file (so far) is about 70MB. Problem: When I search for common terms, the query time goes from under 2-3 seconds to about 60 seconds. TermVectors etc are enabled. When I disable highlighting, performance improves a lot, but is still slow for some queries (7 seconds). Thanks in advance for any ideas! -Peter - 4GB RAM server % java -Xms2048M -Xmx3072M -jar start.jar - schema.xml changes: fieldType name=text_pl class=solr.TextField analyzer tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.WordDelimiterFilterFactory generateWordParts=0 generateNumberParts=0 catenateWords=0 catenateNumbers=0 catenateAll=0 splitOnCaseChange=0/ /analyzer /fieldType ... field name=body type=text_pl indexed=true stored=true multiValued=false termVectors=true termPositions=true termOffsets=true / field name=timestamp type=date indexed=true stored=true default=NOW multiValued=false/ field name=version type=string indexed=true stored=true multiValued=false/ field name=device type=string indexed=true stored=true multiValued=false/ field name=filename type=string indexed=true stored=true multiValued=false/ field name=filesize type=long indexed=true stored=true multiValued=false/ field name=pversion type=int indexed=true stored=true multiValued=false/ field name=first2md5 type=string indexed=false stored=true multiValued=false/ field name=ckey type=string indexed=true stored=true multiValued=false/ ... 
dynamicField name=* type=ignored multiValued=true / defaultSearchFieldbody/defaultSearchField solrQueryParser defaultOperator=AND/ - solrconfig.xml changes: maxFieldLength2147483647/maxFieldLength ramBufferSizeMB128/ramBufferSizeMB - The query: rowStr = rows=10 facet = facet=truefacet.limit=10facet.field=devicefacet.field=ckeyfacet.field=version fields = fl=id,score,filename,version,device,first2md5,filesize,ckey termvectors = tv=trueqt=tvrhtv.all=true hl = hl=truehl.fl=bodyhl.snippets=1hl.fragsize=400 regexv = (?m)^.*\n.*\n.*$ hl_regex = hl.regex.pattern= + CGI::escape(regexv) + hl.regex.slop=1hl.fragmenter=regexhl.regex.maxAnalyzedChars=2147483647hl.maxAnalyzedChars=2147483647 justq = 'q=' + CGI::escape('body:' + fuzzy + p['q'].to_s.gsub(/\\/, '').gsub(/([:~!=])/,'\1') + fuzzy + minLogSizeStr) thequery = '/solr/select?timeAllowed=5000wt=ruby' + (p['fq'].empty? ? '' : ('fq='+p['fq'].to_s) ) + justq + rowStr + facet + fields + termvectors + hl + hl_regex baseurl = '/cgi-bin/search.rb?q=' + CGI::escape(p['q'].to_s) + 'rows=' + p['rows'].to_s + 'minLogSize=' + p['minLogSize'].to_s -- http://karussell.wordpress.com/
Good list of English words that get butchered by Porter Stemmer
Hello, I'm looking for a list of English words that, when stemmed by Porter stemmer, end up in the same stem as some similar, but unrelated words. Below are some examples: # this gets stemmed to iron, so if you search for ironic, you'll get iron matches ironic # same stem as animal anime animated animation animations I imagine such a list could be added to the example protwords.txt Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/
Re: Solr Indexing slows down
Peter, there are events in solrconfig where you define warm up queries when a new searcher is opened. There are also cache settings that play a role here. 30-60 seconds is pretty frequent for Solr. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Peter Karich peat...@yahoo.de To: solr-user@lucene.apache.org Sent: Fri, July 30, 2010 4:06:48 PM Subject: Re: Solr Indexing slows down Hi Erick! thanks for the response! I will answer your questions ;-) How often are you making changes to your index? Every 30-60 seconds. Too heavy? Do you have autocommit on? No. Do you commit when updating each document? No. I commit after a batch update of 200 documents Committing too often and consequently firing off warmup queries is the first place I'd look. Why is commiting firing warmup queries? Is there any documentation about this subject? How can I be sure that the previous commit has done its magic? there are several config values that influence the commit frequency I now know the autowarm and the mergeFactor config. What else? Is this documentation complete: http://wiki.apache.org/lucene-java/ImproveIndexingSpeed ? Regards, Peter. See the subject about 1500 threads. The first place I'd look is how often you're committing. If you're committing before the warmup queries from the previous commit have done their magic, you might be getting into a death spiral. HTH Erick On Thu, Jul 29, 2010 at 7:02 AM, Peter Karich peat...@yahoo.de wrote: Hi, I am indexing a solr 1.4.0 core and commiting gets slower and slower. Starting from 3-5 seconds for ~200 documents and ending with over 60 seconds after 800 commits. Then, if I reloaded the index, it is as fast as before! And today I have read a similar thread [1] and indeed: if I set autowarming for the caches to 0 the slowdown disappears. 
BUT at the same time I would like to offer searching on that core, which would be dramatically slowed down (due to no autowarming). Does someone know a better solution to avoid index-slow-down? Regards, Peter. [1] http://www.mail-archive.com/solr-user@lucene.apache.org/msg20785.html
Re: Programmatically retrieving numDocs (or any other statistic)
I suppose you could write a component that just gets this info from SolrIndexSearcher and write that in the response? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: John DeRosa jo...@ipstreet.com To: solr-user@lucene.apache.org Sent: Fri, July 30, 2010 1:39:03 PM Subject: Programmatically retrieving numDocs (or any other statistic) I want to programmatically retrieve the number of indexed documents. I.e., get the value of numDocs. The only two ways I've come up with are searching for *:* and reporting the hit count, or sending an Http GET to http://xxx.xx.xxx.xxx:8080/solr/admin/stats.jsp#core and searching for stat name=numDocs /stat in the response. Both seem to be overkill. Is there an easier way to ask SolrIndexSearcher, what's numDocs? (I'm doing this in Python, using Pysolr, if that matters.) Thanks!
Re: question about relevance
May I suggest looking at some of the related issues, say SOLR-1682 This issue is related to: SOLR-1682 Implement CollapseComponent SOLR-1311 pseudo-field-collapsing LUCENE-1421 Ability to group search results by field SOLR-1773 Field Collapsing (lightweight version) SOLR-237 Field collapsing Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Bharat Jain bharat.j...@gmail.com To: solr-user@lucene.apache.org Sent: Fri, July 30, 2010 10:40:19 AM Subject: Re: question about relevance Hi, Thanks a lot for the info and your time. I think field collapse will work for us. I looked at https://issues.apache.org/jira/browse/SOLR-236 but which file should I use for the patch? We use solr-1.3. Thanks Bharat Jain On Fri, Jul 30, 2010 at 12:53 AM, Chris Hostetter hossman_luc...@fucit.org wrote: : 1. There are user records of type A, B, C etc. (userId field in index is : common to all records) : 2. A user can have any number of A, B, C etc (e.g. think of A being a : language then user can know many languages like french, english, german etc) : 3. Records are currently stored as a document in index. : 4. A given query can match multiple records for the user : 5. If for a user more records are matched (e.g. if he knows both french and : german) then he is more relevant and should come top in UI. This is the : reason I wanted to add lucene scores assuming the greater score means more : relevance. if your goal is to get back users from each search, then you should probably change your indexing strategy so that each user has a single document -- fields like language can be multivalued, etc... then a search for language:en language:fr will return users who speak english or french, and the ones that speak both will score higher.
if you really can't change the index structure, then essentially what you are looking for is a field collapsing solution on the userId field, where you want each collapsed group to get a cumulative score. i don't know if the existing field collapsing patches support this -- if you are already willing/capable to do it in the client then that may be the simplest thing to support moving forward. Adding the scores is certainly one metric you could use -- it's generally suspicious to try and imply too much meaning to scores in lucene/solr but that's because people typically try to imply broader absolute meaning. in the case of a single query the scores are relative to each other, and adding up all the scores for a given userId is approximately what would happen in my example above -- except that there is also a coord factor that would penalize documents that only match one clause ... it's complicated, but as an approximation adding the scores might give you what you are looking for -- only you can know for sure based on your specific data. -Hoss
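Hoss's client-side suggestion -- summing scores per userId and ranking users by the total -- can be sketched in a few lines of Python. The doc dicts and scores below are illustrative, not real Solr output:

```python
from collections import defaultdict

# Client-side field-collapsing sketch: sum the Lucene score of every
# matching record that shares a userId, then rank users by the total.
def rank_users(docs):
    totals = defaultdict(float)
    for doc in docs:
        totals[doc["userId"]] += doc["score"]
    # highest cumulative score first
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

hits = [
    {"userId": "u1", "score": 1.0},   # matched language:fr
    {"userId": "u1", "score": 0.5},   # matched language:en
    {"userId": "u2", "score": 1.2},   # matched language:fr only
]
print(rank_users(hits))  # u1 (total 1.5) ranks above u2 (1.2)
```

As Hoss notes, this ignores the coord factor and other subtleties, so it is an approximation of what a single multi-valued document would score.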
Re: Good list of English words that get butchered by Porter Stemmer
Some collisions are listed here: http://www.attivio.com/blog/34-attivio-blog/333-doing-things-with-words-part-three-stemming-and-lemmatization.html Have you asked Martin Porter? You can find his e-mail here: http://tartarus.org/~martin/ wunder On Jul 30, 2010, at 1:41 PM, Otis Gospodnetic wrote: Hello, I'm looking for a list of English words that, when stemmed by Porter stemmer, end up in the same stem as some similar, but unrelated words. Below are some examples: # this gets stemmed to iron, so if you search for ironic, you'll get iron matches ironic # same stem as animal anime animated animation animations I imagine such a list could be added to the example protwords.txt Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/
Stem collision, word protection, synonym hack
Hello, I'm wondering if anyone has good ideas for handling the following (Porter) stemming problem. The word city gets stemmed to citi. But citi is short for citibank, so we have a conflict - the stems of both city and citi are citi, so when you search for city, you will get matches that are really about citi(bank). Now, we could put citi in the do not stem list (protwords.txt), but it will be of no use because citi is already in the fully stemmed form. This leaves the option of not stemming cities or city (and perhaps making city a synonym for cities as a work around) by adding those words to protwords.txt, but this feels like a kluge. Are there more elegant solutions for cases like this one? Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/
Re: Stem collision, word protection, synonym hack
Otis, https://issues.apache.org/jira/browse/LUCENE-2055 may be of some help. cheers On 7/30/10 2:18 PM, Otis Gospodnetic wrote: Hello, I'm wondering if anyone has good ideas for handling the following (Porter) stemming problem. The word city gets stemmed to citi. But citi is short for citibank, so we have a conflict - the stems of both city and citi are citi, so when you search for city, you will get matches that are really about citi(bank). Now, we could put citi in the do not stem list (protwords.txt), but it will be of no use because citi is already in the fully stemmed form. This leaves the option of not stemming cities or city (and perhaps making city a synonym for cities as a work around) by adding those words to protwords.txt, but this feels like a kluge. Are there more elegant solutions for cases like this one? Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/
Re: Programmatically retrieving numDocs (or any other statistic)
Thanks! On Jul 30, 2010, at 1:11 PM, Peter Karich wrote: Both approaches are ok, I think. (although I don't know the python API) BTW: If you query q=*:* then add rows=0 to avoid some traffic. Regards, Peter. I want to programmatically retrieve the number of indexed documents. I.e., get the value of numDocs. The only two ways I've come up with are searching for *:* and reporting the hit count, or sending an Http GET to http://xxx.xx.xxx.xxx:8080/solr/admin/stats.jsp#core and searching for stat name=numDocs /stat in the response. Both seem to be overkill. Is there an easier way to ask SolrIndexSearcher, what's numDocs? (I'm doing this in Python, using Pysolr, if that matters.) Thanks!
Re: Good list of English words that get butchered by Porter Stemmer
On Fri, Jul 30, 2010 at 4:41 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: I'm looking for a list of English words that, when stemmed by Porter stemmer, end up in the same stem as some similar, but unrelated words. Below are some examples: # this gets stemmed to iron, so if you search for ironic, you'll get iron matches ironic # same stem as animal anime animated animation animations I imagine such a list could be added to the example protwords.txt +1 No reason to make everyone come up with their own list. Unless a good list already exists out there... we could semi-automate it by running a large corpus through the stemmer and then for each stem, list the original words. The manual part would be looking at the output to see the collisions (unless someone has a better idea). -Yonik http://www.lucidimagination.com
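Yonik's semi-automation idea might look like this in Python. The stem() here is a deliberately crude stand-in that strips a few suffixes, NOT the real Porter algorithm; in practice you would plug in an actual Porter implementation and run a large corpus through it:

```python
from collections import defaultdict

# Crude stand-in stemmer for illustration only -- strips a handful of
# suffixes. Replace with a real Porter stemmer for actual use.
def stem(word):
    for suffix in ("ations", "ation", "ated", "ic", "e", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Group words by stem; stems reached by more than one distinct surface
# form are the collision candidates a human would then review.
def collisions(words):
    groups = defaultdict(set)
    for w in words:
        groups[stem(w)].add(w)
    return {s: sorted(ws) for s, ws in groups.items() if len(ws) > 1}

print(collisions(["ironic", "iron", "anime", "animated", "animation"]))
```

The manual step Yonik mentions is then just scanning the printed groups for pairs that should not conflate.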
Some basic DataImportHandler questions
Just starting with DataImportHandler and had a few simple questions. Is there a location for more in depth documentation other than http://wiki.apache.org/solr/DataImportHandler? Specifically I was looking for a detailed document outlining data-config.xml, the fields and attributes and how they are used. * Is there a way to dynamically generate field elements from the supplied sql statement? Example: Suppose one has a table of 100 fields. Entering this manually for each field is not very efficient. ie, if table has only 3 columns this is easy enough... entity name=item query=select * from item field column=ID name=id / field column=NAME name=name / field column=MANU name=manu / entity What are the options if ITEM table has dozens or hundreds? * Is there a way to apply insert logic based on the value of the incoming field? My specific use case would be, if the incoming value is null, do not add to Solr. ie Record is : ID : 50 NAME : Blahblah MANU : null entity name=item query=select * from item field column=ID name=id / field column=NAME name=name / field column=MANU name=manu / entity Using the following in data-config.xml is there are way to ignore null fields altogether? I see some special commands listed such as $skipRecord, is there some type of $skipField operation? Thanks
Re: Solr Indexing slows down
Hi Otis, does it mean that a new searcher is opened after I commit? I thought only on startup...(?) Regards, Peter. Peter, there are events in solrconfig where you define warm up queries when a new searcher is opened. There are also cache settings that play a role here. 30-60 seconds is pretty frequent for Solr. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Peter Karich peat...@yahoo.de To: solr-user@lucene.apache.org Sent: Fri, July 30, 2010 4:06:48 PM Subject: Re: Solr Indexing slows down Hi Erick! thanks for the response! I will answer your questions ;-) How often are you making changes to your index? Every 30-60 seconds. Too heavy? Do you have autocommit on? No. Do you commit when updating each document? No. I commit after a batch update of 200 documents Committing too often and consequently firing off warmup queries is the first place I'd look. Why is commiting firing warmup queries? Is there any documentation about this subject? How can I be sure that the previous commit has done its magic? there are several config values that influence the commit frequency I now know the autowarm and the mergeFactor config. What else? Is this documentation complete: http://wiki.apache.org/lucene-java/ImproveIndexingSpeed ? Regards, Peter. See the subject about 1500 threads. The first place I'd look is how often you're committing. If you're committing before the warmup queries from the previous commit have done their magic, you might be getting into a death spiral. HTH Erick On Thu, Jul 29, 2010 at 7:02 AM, Peter Karich peat...@yahoo.de wrote: Hi, I am indexing a solr 1.4.0 core and commiting gets slower and slower. Starting from 3-5 seconds for ~200 documents and ending with over 60 seconds after 800 commits. Then, if I reloaded the index, it is as fast as before! 
And today I have read a similar thread [1] and indeed: if I set autowarming for the caches to 0 the slowdown disappears. BUT at the same time I would like to offer searching on that core, which would be dramatically slowed down (due to no autowarming). Does someone know a better solution to avoid index-slow-down? Regards, Peter. [1] http://www.mail-archive.com/solr-user@lucene.apache.org/msg20785.html -- http://karussell.wordpress.com/
RE: Good list of English words that get butchered by Porter Stemmer
A good starting place might be the list of stemming errors for the original Porter stemmer in this article that describes k-stem: Krovetz, R. (1993). Viewing morphology as an inference process. In Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 191-202). Pittsburgh, Pennsylvania, United States: ACM. doi:10.1145/160688.160718 I don't know if the current porter stemmer is different. I do see that on the snowball page there is a porter and a porter2 stemmer and this explanation is linked from the porter2 stemmer page: http://snowball.tartarus.org/algorithms/english/stemmer.html Tom Burton-West http://www.hathitrust.org/blogs/large-scale-search -Original Message- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Friday, July 30, 2010 4:42 PM To: solr-user@lucene.apache.org Subject: Good list of English words that get butchered by Porter Stemmer Hello, I'm looking for a list of English words that, when stemmed by Porter stemmer, end up in the same stem as some similar, but unrelated words. Below are some examples: # this gets stemmed to iron, so if you search for ironic, you'll get iron matches ironic # same stem as animal anime animated animation animations I imagine such a list could be added to the example protwords.txt Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/
Re: Good list of English words that get butchered by Porter Stemmer
Otis, I think this is a great idea. you could also go even further by making a better example for StemmerOverrideFilter (stemdict.txt) ( http://wiki.apache.org/solr/LanguageAnalysis#solr.StemmerOverrideFilterFactory ) for example: animated tab animate animation tab animation animations tab animation this might be a bit better (but more work!) than protected words since then you could let animation and animations conflate, rather than just forcing them to be all unchanged. i wouldnt go crazy and worry about animator matching animation etc, but would at least let plural forms match the singular, without screwing other things up. On Fri, Jul 30, 2010 at 4:41 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hello, I'm looking for a list of English words that, when stemmed by Porter stemmer, end up in the same stem as some similar, but unrelated words. Below are some examples: # this gets stemmed to iron, so if you search for ironic, you'll get iron matches ironic # same stem as animal anime animated animation animations I imagine such a list could be added to the example protwords.txt Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ -- Robert Muir rcm...@gmail.com
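Rendered as actual config, Robert's suggestion might look like the following sketch (the field type is illustrative; stemdict.txt entries are tab-separated, per the wiki page linked above):

```xml
<!-- schema.xml sketch: apply the override dictionary BEFORE the stemmer,
     so listed words bypass Porter while everything else still conflates. -->
<fieldType name="text_en" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StemmerOverrideFilterFactory"
            dictionary="stemdict.txt" ignoreCase="false"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<!-- stemdict.txt (tab-separated: surface form, then the stem to emit):
     animated	animate
     animation	animation
     animations	animation -->
```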
Re: Solr searching performance issues, using large documents
Wait- how much text are you highlighting? You say these logfiles are X big- how big are the actual documents you are storing? On Fri, Jul 30, 2010 at 1:16 PM, Peter Karich peat...@yahoo.de wrote: Hi Peter :-), did you already try other values for hl.maxAnalyzedChars=2147483647 ? Also regular expression highlighting is more expensive, I think. What does the 'fuzzy' variable mean? If you use this to query via ~someTerm instead someTerm then you should try the trunk of solr which is a lot faster for fuzzy or other wildcard search. Regards, Peter. Data set: About 4,000 log files (will eventually grow to millions). Average log file is 850k. Largest log file (so far) is about 70MB. Problem: When I search for common terms, the query time goes from under 2-3 seconds to about 60 seconds. TermVectors etc are enabled. When I disable highlighting, performance improves a lot, but is still slow for some queries (7 seconds). Thanks in advance for any ideas! -Peter - 4GB RAM server % java -Xms2048M -Xmx3072M -jar start.jar - schema.xml changes: fieldType name=text_pl class=solr.TextField analyzer tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.WordDelimiterFilterFactory generateWordParts=0 generateNumberParts=0 catenateWords=0 catenateNumbers=0 catenateAll=0 splitOnCaseChange=0/ /analyzer /fieldType ... 
field name=body type=text_pl indexed=true stored=true multiValued=false termVectors=true termPositions=true termOffsets=true / field name=timestamp type=date indexed=true stored=true default=NOW multiValued=false/ field name=version type=string indexed=true stored=true multiValued=false/ field name=device type=string indexed=true stored=true multiValued=false/ field name=filename type=string indexed=true stored=true multiValued=false/ field name=filesize type=long indexed=true stored=true multiValued=false/ field name=pversion type=int indexed=true stored=true multiValued=false/ field name=first2md5 type=string indexed=false stored=true multiValued=false/ field name=ckey type=string indexed=true stored=true multiValued=false/ ... dynamicField name=* type=ignored multiValued=true / defaultSearchFieldbody/defaultSearchField solrQueryParser defaultOperator=AND/ - solrconfig.xml changes: maxFieldLength2147483647/maxFieldLength ramBufferSizeMB128/ramBufferSizeMB - The query: rowStr = rows=10 facet = facet=truefacet.limit=10facet.field=devicefacet.field=ckeyfacet.field=version fields = fl=id,score,filename,version,device,first2md5,filesize,ckey termvectors = tv=trueqt=tvrhtv.all=true hl = hl=truehl.fl=bodyhl.snippets=1hl.fragsize=400 regexv = (?m)^.*\n.*\n.*$ hl_regex = hl.regex.pattern= + CGI::escape(regexv) + hl.regex.slop=1hl.fragmenter=regexhl.regex.maxAnalyzedChars=2147483647hl.maxAnalyzedChars=2147483647 justq = 'q=' + CGI::escape('body:' + fuzzy + p['q'].to_s.gsub(/\\/, '').gsub(/([:~!=])/,'\1') + fuzzy + minLogSizeStr) thequery = '/solr/select?timeAllowed=5000wt=ruby' + (p['fq'].empty? ? '' : ('fq='+p['fq'].to_s) ) + justq + rowStr + facet + fields + termvectors + hl + hl_regex baseurl = '/cgi-bin/search.rb?q=' + CGI::escape(p['q'].to_s) + 'rows=' + p['rows'].to_s + 'minLogSize=' + p['minLogSize'].to_s -- http://karussell.wordpress.com/ -- Lance Norskog goks...@gmail.com
Re: Solr Indexing slows down
As you make changes to your index, you probably want to see the new/modified documents in your search results. In order to do that, the new searcher needs to be reopened, and this happens on commit. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Peter Karich peat...@yahoo.de To: solr-user@lucene.apache.org Sent: Fri, July 30, 2010 6:19:03 PM Subject: Re: Solr Indexing slows down Hi Otis, does it mean that a new searcher is opened after I commit? I thought only on startup...(?) Regards, Peter. Peter, there are events in solrconfig where you define warm up queries when a new searcher is opened. There are also cache settings that play a role here. 30-60 seconds is pretty frequent for Solr. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Peter Karich peat...@yahoo.de To: solr-user@lucene.apache.org Sent: Fri, July 30, 2010 4:06:48 PM Subject: Re: Solr Indexing slows down Hi Erick! thanks for the response! I will answer your questions ;-) How often are you making changes to your index? Every 30-60 seconds. Too heavy? Do you have autocommit on? No. Do you commit when updating each document? No. I commit after a batch update of 200 documents Committing too often and consequently firing off warmup queries is the first place I'd look. Why is commiting firing warmup queries? Is there any documentation about this subject? How can I be sure that the previous commit has done its magic? there are several config values that influence the commit frequency I now know the autowarm and the mergeFactor config. What else? Is this documentation complete: http://wiki.apache.org/lucene-java/ImproveIndexingSpeed ? Regards, Peter. See the subject about 1500 threads. The first place I'd look is how often you're committing. 
If you're committing before the warmup queries from the previous commit have done their magic, you might be getting into a death spiral. HTH Erick On Thu, Jul 29, 2010 at 7:02 AM, Peter Karich peat...@yahoo.de wrote: Hi, I am indexing a solr 1.4.0 core and commiting gets slower and slower. Starting from 3-5 seconds for ~200 documents and ending with over 60 seconds after 800 commits. Then, if I reloaded the index, it is as fast as before! And today I have read a similar thread [1] and indeed: if I set autowarming for the caches to 0 the slowdown disappears. BUT at the same time I would like to offer searching on that core, which would be dramatically slowed down (due to no autowarming). Does someone know a better solution to avoid index-slow-down? Regards, Peter. [1] http://www.mail-archive.com/solr-user@lucene.apache.org/msg20785.html -- http://karussell.wordpress.com/
Re: Some basic DataImportHandler questions
On Sat, Jul 31, 2010 at 3:40 AM, Harry Smith harrysmith...@gmail.com wrote: Just starting with DataImportHandler and had a few simple questions. Is there a location for more in depth documentation other than http://wiki.apache.org/solr/DataImportHandler? Umm, no, but let us know what is not covered well and it can be added. Specifically I was looking for a detailed document outlining data-config.xml, the fields and attributes and how they are used. * Is there a way to dynamically generate field elements from the supplied sql statement? Example: Suppose one has a table of 100 fields. Entering this manually for each field is not very efficient. ie, if table has only 3 columns this is easy enough... entity name=item query=select * from item field column=ID name=id / field column=NAME name=name / field column=MANU name=manu / entity What are the options if ITEM table has dozens or hundreds? Yes! You do not need to specify the column names as long as your Solr schema defines the same field names. See http://wiki.apache.org/solr/DataImportHandler#A_shorter_data-config * Is there a way to apply insert logic based on the value of the incoming field? My specific use case would be, if the incoming value is null, do not add to Solr. DIH Transformers are the way. However, in this particular case, you do not need to worry because nulls are not inserted into the index. -- Regards, Shalin Shekhar Mangar.
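The shorter data-config Shalin points to might look like this sketch (driver class and JDBC URL are placeholders):

```xml
<!-- data-config.xml sketch: with no <field> elements, DIH maps each
     result-set column straight onto the schema field of the same name,
     however many columns the table has. -->
<dataConfig>
  <dataSource driver="com.example.jdbc.Driver"
              url="jdbc:example://localhost/db" user="sa"/>
  <document>
    <entity name="item" query="SELECT * FROM item"/>
  </document>
</dataConfig>
```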
Re: Programmatically retrieving numDocs (or any other statistic)
: I want to programmatically retrieve the number of indexed documents. I.e., get the value of numDocs. Index level stats like this can be fetched from the LukeRequestHandler in any recent version of Solr... http://localhost:8983/solr/admin/luke?numTerms=0 In future releases (ie: already in trunk and branch 3x) there is also the SolrInfoMBeanRequestHandler which will replace registry.jsp and stats.jsp https://issues.apache.org/jira/browse/SOLR-1750 -Hoss