Anyway to know changed documents?
Hi everyone, I have two servers whose indexes should be synchronized. I update A's index by sending document objects over HTTP. Is there any config or plug-in that lets Solr know which documents changed and push them to B? Any suggestion will be appreciated. Thanks :)
Re: Anyway to know changed documents?
I think you should look at the index-time timestamp field. There are examples in the wiki. paul On 1 June 2011 at 08:07, 京东 wrote: Hi everyone, I have two servers whose indexes should be synchronized. I update A's index by sending document objects over HTTP. Is there any config or plug-in that lets Solr know which documents changed and push them to B? Any suggestion will be appreciated. Thanks :)
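For reference, the timestamp field Paul mentions is declared like this in the example schema shipped with Solr (a sketch; the field name is arbitrary, the important part is the NOW default):

```xml
<!-- Automatically records the index time of each document; documents
     changed since the last sync can then be found with a range query,
     e.g. q=timestamp:[2011-06-01T00:00:00Z TO *] -->
<field name="timestamp" type="date" indexed="true" stored="true"
       default="NOW" multiValued="false"/>
```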
Re: Solr vs ElasticSearch
Thanks Shashi, this is oddly coincidental with another issue being put into Solr (SOLR-2193) to help solve some of the NRT issues; the timing is impeccable. At base, however, Solr uses Lucene, as does ES. I think the main advantage of ES is the auto-sharding etc. I think it uses a gossip protocol to capitalize on this however... Hmm... On Tue, May 31, 2011 at 10:01 PM, Shashi Kant sk...@sloan.mit.edu wrote: Here is a very interesting comparison http://engineering.socialcast.com/2011/05/realtime-search-solr-vs-elasticsearch/ -Original Message- From: Mark Sent: May-31-11 10:33 PM To: solr-user@lucene.apache.org Subject: Solr vs ElasticSearch I've been hearing more and more about ElasticSearch. Can anyone give me a rough overview of how these two technologies differ? What are the strengths/weaknesses of each? Why would one choose one over the other? Thanks
Re: Solr vs ElasticSearch
Well, I recently chose it for a personal project, and the deciding factor for me was that it had nice integration with CouchDB. Thanks, Bryan Rasmussen On Wed, Jun 1, 2011 at 4:33 AM, Mark static.void@gmail.com wrote: I've been hearing more and more about ElasticSearch. Can anyone give me a rough overview of how these two technologies differ? What are the strengths/weaknesses of each? Why would one choose one over the other? Thanks
Query problem in Solr
Hi all, We're using Solr to search on a Shop index and a Product index. Currently a Shop has a field `shop_keyword` which also contains the keywords of the products assigned to it. The shop keywords are separated by a space. Consequently, if there is a product with the keyword apple and another with orange, a search for shops matching `Apple AND Orange` returns the shop for these products. However, this is incorrect: we want a search for `Apple AND Orange` to return only shop(s) having a product with both apple and orange as keywords. We tried solving this by making shop keywords multi-valued and assigning the keywords of every product of the shop as a new value in shop keywords. However, as was confirmed in another post http://markmail.org/thread/xce4qyzs5367yplo#query:+page:1+mid:76eerw5yqev2aanu+state:results, Solr does not support requiring that all words match within the same value of a multi-valued field. (Hope I explained myself well.) How can we go about this? Ideally, we shouldn't have to change our search infrastructure dramatically. Thanks! Krt_Malta
Synonyms valid only in specific categories of data
Hello to all, I have a collection of text phrases in more than 20 languages that I'm indexing in solr. Each phrase belongs to one of about 30 different phrase categories. I have specified different fields for each language and added a synonym filter at query time. I would however like the synonym filter to take into account the category as well. So, a specific synonym should be valid and used only in one or more categories per language. (the category is indexed in another field). Is this somehow possible in the current SynonymFilterFactory implementation? Hope it makes sense. Thank you, Spyros
Re: Synonyms valid only in specific categories of data
I don't think you can assign a synonyms file dynamically to a field. You would need to create a separate field for each language/category combination, each referencing its own synonyms file. That would be a lot of fields. On 1 June 2011 09:59, Spyros Kapnissis ska...@yahoo.com wrote: Hello to all, I have a collection of text phrases in more than 20 languages that I'm indexing in solr. Each phrase belongs to one of about 30 different phrase categories. I have specified different fields for each language and added a synonym filter at query time. I would however like the synonym filter to take into account the category as well. So, a specific synonym should be valid and used only in one or more categories per language. (the category is indexed in another field). Is this somehow possible in the current SynonymFilterFactory implementation? Hope it makes sense. Thank you, Spyros
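A sketch of what that per-field approach would look like in schema.xml, with hypothetical type and file names (one type per language/category pair):

```xml
<fieldType name="text_en_news" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- one synonyms file per language/category combination -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_en_news.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
```

With roughly 20 languages and 30 categories, this multiplies out to some 600 field types, which is why it quickly becomes unwieldy.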
Re: Anyway to know changed documents?
If your index size is smaller (a few 100 MBs), you can consider Solr's operational script tools, provided with the distribution, to sync indexes from the master to the slave servers. They copy the latest index snapshot from the master to the slave(s). The Solr wiki provides good info on how to set them up as cron jobs, so no manual intervention is required. BTW, Solr 1.4+ also has a feature where only the changed segments get synched (but then the index need not be optimized) -- View this message in context: http://lucene.472066.n3.nabble.com/Anyway-to-know-changed-documents-tp3009527p3010015.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Re: Anyway to know changed documents?
Thanks pravesh ^_^ You said Solr 1.4+ also has a feature where only the changed segments get synched. Can you give me a document or some more detailed information please? I've looked through the online documents but didn't find anything. Thanks very much. From: pravesh Sent: 2011-06-01 17:44:55 To: solr-user Subject: Re: Anyway to know changed documents? If your index size is smaller (a few 100 MBs), you can consider Solr's operational script tools, provided with the distribution, to sync indexes from the master to the slave servers. They copy the latest index snapshot from the master to the slave(s). The Solr wiki provides good info on how to set them up as cron jobs, so no manual intervention is required. BTW, Solr 1.4+ also has a feature where only the changed segments get synched (but then the index need not be optimized)
Re: Query problem in Solr
> We're using Solr to search on a Shop index and a Product index
Do you have 2 separate indexes (using distributed shard search)? I suspect you actually have only a single index.
> Currently a Shop has a field `shop_keyword` which also contains the keywords of the products assigned to it.
You mean, for a shop, you first concatenate the keywords of all its products and then save them in the shop_keyword field? In that case there is no way you can identify which keyword occurs in which product in your index. You might need to change the index structure: when you post documents, post a single document per product (with fields like title, price, shop-id, etc.) instead of a single document per shop. Hope I make myself clear
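A sketch of what posting one document per product might look like (field names here are hypothetical); a query like keywords:(apple AND orange) would then only match products that carry both keywords, and the shop can be recovered from the shop_id field:

```xml
<add>
  <doc>
    <field name="id">product-1</field>
    <field name="shop_id">shop-42</field>
    <field name="keywords">apple</field>
  </doc>
  <doc>
    <field name="id">product-2</field>
    <field name="shop_id">shop-42</field>
    <field name="keywords">orange</field>
  </doc>
</add>
```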
Re: Re: Anyway to know changed documents?
The Solr wiki will provide help on this. You might be interested in the pure Java-based replication too. I'm not sure whether the Solr operational scripts have this feature (synching only changed segments). You might need to change the configuration in solrconfig.xml
Re: Solr vs ElasticSearch
On Tue, 31 May 2011 19:38 -0700, Jason Rutherglen jason.rutherg...@gmail.com wrote: Mark, Nice email address. I personally have no idea, maybe ask Shay Banon to post an answer? I think it's possible to make Solr more elastic, eg, it's currently difficult to make it move cores between servers without a lot of manual labor. I'm likely to try playing with moving cores between hosts soon. In theory it shouldn't be hard. We'll see what the practice is like! Upayavira --- Enterprise Search Consultant at Sourcesense UK, Making Sense of Open Source
Re: Obtaining query AST?
That's pretty awesome. Thanks Renaud! On Tue, 2011-05-31 at 22:56 +0100, Renaud Delbru wrote: Hi, have a look at the flexible query parser of Lucene (contrib package) [1]. It provides a framework to easily create different parsing logic. You should be able to access the AST and to modify as you want how it is translated into a Lucene query (look at processors and pipeline processors). Once you have your own query parser, it is straightforward to plug it into Solr. [1] http://lucene.apache.org/java/3_1_0/api/contrib-queryparser/index.html
Re: Problem with caps and star symbol
Thanks for your point. I was really tripping over that issue. But now I need a bit more help. As far as I have noticed, in the case of a value like role_delete, WordDelimiterFilterFactory indexes two words, role and delete, and a search with either the term role or delete will include that document. Now, in the case of a value like role_delete, I want to index all four terms: [role_delete, roledelete, role, delete]. In total, both the original value and the words produced by WordDelimiterFilterFactory would be indexed. Is that possible? Can some additional filter combined with WordDelimiterFilterFactory do that, or can any other filter perform such an operation? On Tue, May 31, 2011 at 8:07 PM, Erick Erickson erickerick...@gmail.com wrote: I think you're tripping over the issue that wildcards aren't analyzed, they don't go through your analysis chain. So the casing matters. Try lowercasing the input and I believe you'll see more like what you expect... Best Erick On Mon, May 30, 2011 at 12:07 AM, Saumitra Chowdhury saumi...@smartitengineering.com wrote: I am sending some xml to understand the scenario.
Indexed term = ROLE_DELETE, Search Term = roledelete

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">4</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">name : roledelete</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="creationDate">Mon May 30 13:09:14 BDST 2011</str>
      <str name="displayName">Global Role for Deletion</str>
      <str name="id">role:9223372036854775802</str>
      <str name="lastModifiedDate">Mon May 30 13:09:14 BDST 2011</str>
      <str name="name">ROLE_DELETE</str>
    </doc>
  </result>
</response>

Indexed term = ROLE_DELETE, Search Term = role

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">5</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">name : role</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="creationDate">Mon May 30 13:09:14 BDST 2011</str>
      <str name="displayName">Global Role for Deletion</str>
      <str name="id">role:9223372036854775802</str>
      <str name="lastModifiedDate">Mon May 30 13:09:14 BDST 2011</str>
      <str name="name">ROLE_DELETE</str>
    </doc>
  </result>
</response>

Indexed term = ROLE_DELETE, Search Term = role*

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">4</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">name : role*</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="creationDate">Mon May 30 13:09:14 BDST 2011</str>
      <str name="displayName">Global Role for Deletion</str>
      <str name="id">role:9223372036854775802</str>
      <str name="lastModifiedDate">Mon May 30 13:09:14 BDST 2011</str>
      <str name="name">ROLE_DELETE</str>
    </doc>
  </result>
</response>

Indexed term = ROLE_DELETE, Search Term = Role*

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">4</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">name : Role*</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
</response>
Indexed term = ROLE_DELETE, Search Term = ROLE_DELETE*

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">4</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">name : ROLE_DELETE*</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
</response>

I am also adding an analysis html. On Mon, May 30, 2011 at 7:19 AM, Erick Erickson erickerick...@gmail.com wrote: I'd start by looking at the analysis page from the Solr admin page. That will give you an idea of the transformations the various steps carry out, it's invaluable! Best Erick On May 26, 2011 12:53 AM, Saumitra Chowdhury saumi...@smartitengineering.com wrote: Hi all, In my schema.xml I am using WordDelimiterFilterFactory, LowerCaseFilterFactory, StopFilterFactory for the index analyzer and an extra SynonymFilterFactory for the query analyzer. I am indexing a field named 'name'. Now if a value with all caps like NAME_BILL is indexed, I am able to get this document as a search result with the terms name_bill, NAME_BILL, namebill, namebill*, nameb* ... But for terms like the following: NAME_BILL*, name_bill*, namebill*, NAME* the result does not show this document. Can anyone please explain why this is happening? In fact star * is not giving any result in many cases, especially if it is used after the full value of a field. A portion of my schema is given below.

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer
Re: Solr memory consumption
My OS is also CentOS (5.4). If it were 10gb all the time it would be OK, but it grows to 13-15gb and hurts other services =\ It could be environment-specific (the specifics of your top implementation, OS, etc.). I have, on CentOS, 2986m of virtual memory showing although -Xmx2g; you have 10g virtual although -Xmx6g. Don't trust it too much... the top command may count OS buffers for opened files, network sockets, the JVM DLLs themselves, etc. (which is outside Java GC responsibility) in addition to JVM memory... it counts all memory, I'm not sure... If you don't have big values for %wa (which means I/O wait - disk swap usage), everything is fine... -Original Message- From: Denis Kuzmenok Sent: May-31-11 4:18 PM To: solr-user@lucene.apache.org Subject: Solr memory consumption I run multi-core Solr with the flags -Xms3g -Xmx6g -D64, but I see this in top after 6-8 hours, and it's still rising: 17485 test214 10.0g 7.4g 9760 S 308.2 31.3 448:00.75 java -Xms3g -Xmx6g -D64 -Dsolr.solr.home=/home/test/solr/example/multicore/ -jar start.jar Are there any ways to limit memory for sure? Thanks
London open source search social - 13th June
Hi guys, Just to let you know we're meeting up to talk all-things-search on Monday 13th June. There's usually a good mix of backgrounds and experience levels, so if you're free and in the London area it'd be good to see you there. Details: 7pm - The Elgin - 96 Ladbrooke Grove http://www.meetup.com/london-search-social/events/20387881/ Greetings search geeks! We've booked the next meetup for the 13th June. As usual, the plan is to meet up and geek out over a friendly beer. I know my co-organiser René has been working on some interesting search projects, and I've recently left Empora to work on my own project, so by June I should hopefully have some war stories about using @elasticsearch in production. The format is completely open though, so please bring your own topics if you've got them. Hope to see you there! -- Richard Marr
Re: Anyway to know changed documents?
You may be interested in Solr's replication feature? http://wiki.apache.org/solr/SolrReplication On 6/1/2011 2:07 AM, wrote: Hi everyone, I have two servers whose indexes should be synchronized. I update A's index by sending document objects over HTTP. Is there any config or plug-in that lets Solr know which documents changed and push them to B? Any suggestion will be appreciated. Thanks :)
Re: Anyway to know changed documents?
On 6/1/2011 6:12 AM, pravesh wrote: The Solr wiki will provide help on this. You might be interested in the pure Java-based replication too. I'm not sure whether the Solr operational scripts have this feature (synching only changed segments). You might need to change the configuration in solrconfig.xml Yes, this feature has been in the Java/HTTP-based replication since Solr 1.4
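For reference, the Java/HTTP replication Shawn mentions is configured through a ReplicationHandler in solrconfig.xml on both sides; a minimal sketch (the host name and poll interval below are placeholders):

```xml
<!-- On the master -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
  </lst>
</requestHandler>

<!-- On the slave: polls the master and fetches only the changed segments -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
```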
Re: Bulk indexing, UpdateProcessor overwriteDupes and poor IO performances
Lee, Thank you very much for your answer. Using the signature field as the uniqueKey is effectively what I was doing, so the overwriteDupes=true parameter in my solrconfig was somewhat redundant, although I wasn't aware of it! =D In practice it works perfectly, and that's the nice part. By the way, I wonder what happens when we enter the following code snippet when the id field is the same as the signature field, from addDoc@DirectUpdateHandler2(AddUpdateCommand):

if (del) { // ensure id remains unique
    BooleanQuery bq = new BooleanQuery();
    bq.add(new BooleanClause(new TermQuery(updateTerm), Occur.MUST_NOT));
    bq.add(new BooleanClause(new TermQuery(idTerm), Occur.MUST));
    writer.deleteDocuments(bq);
}

Maybe all my problems started from here... When I have some time, I'll try to reproduce using a different uniqueKey field and turning overwriteDupes back on, to see if the problem came from the signature field being the same as the uniqueKey field *and* having overwriteDupes on. If so, maybe a simple configuration check should be performed to avoid the issue. Otherwise it means that having overwriteDupes turned on simply doesn't scale, and that should be added to the wiki's Deduplication page, IMHO. Thank you again. Regards, -- Tanguy On 31/05/2011 14:58, lee carroll wrote: Tanguy, you might have tried this already, but can you set overwriteDupes to false and set the signature key to be the id? That way Solr will manage updates. From the wiki http://wiki.apache.org/solr/Deduplication:

<!-- An example dedup update processor that creates the id field on the fly based on the hash code of some other fields. This example has overwriteDupes set to false since we are using the id field as the signatureField and Solr will maintain uniqueness based on that anyway. -->
-- HTH Lee On 30 May 2011 08:32, Tanguy Moal tanguy.m...@gmail.com wrote: Hello, Sorry for re-posting this but it seems my message got lost in the mailing list's message stream without catching anyone's attention... =D Shortly: has anyone already experienced dramatic indexing slowdowns during large bulk imports with overwriteDupes turned on and a fairly high duplicate rate (around 4-8x)? It seems to produce a lot of deletions, which in turn appear to make the merging of segments pretty slow, by noticeably increasing the number of small read operations occurring simultaneously with the regular large write operations of the merge. Added to the poor IO performance of a commodity SATA drive, indexing takes ages. I temporarily bypassed that limitation by disabling the overwriting of duplicates, but that changes the way I query the index, requiring me to turn on field collapsing at search time. Is this a known limitation? Does anyone have a few hints on how to optimize the handling of index-time deduplication? More details on my setup and the state of my understanding are in my previous message hereafter. Thank you very much in advance. Regards, Tanguy On 05/25/11 15:35, Tanguy Moal wrote: Dear list, I'm posting here after some unsuccessful investigations. In my setup I push documents to Solr using the StreamingUpdateSolrServer. I'm sending a comfortable initial amount of documents (~250M) and wished to perform overwriting of duplicated documents at index time, during the update, taking advantage of the UpdateProcessorChain. At the beginning of the indexing stage, everything is quite fast; documents arrive at a rate of about 1000 doc/s. The only extra processing during the import is the computation of a couple of hashes used to identify documents uniquely given their content, using both stock (MD5Signature) and custom (derived from Lookup3Signature) update processors. I send a commit command to the server every 500k documents sent.
During a first period, the server is CPU bound. After a short while (~10 minutes), the rate at which documents are received starts to fall dramatically, the server becoming IO bound. At first I thought this was a normal speed decrease during the commit, while my push client waits for the flush to occur. That would have been a normal slowdown. What caught my attention was that, unexpectedly, the server was performing a lot of small reads, far more than the number of writes, which seem to be larger. The combination of the many small reads with the constant amount of bigger writes seems to create a lot of IO contention on my commodity SATA drive, and the ETA of my built index started to increase scarily =D I then restarted the JVM with JMX enabled so I could investigate a little more. I then realized that the UpdateHandler was performing many reads while processing the update request. Are there any known limitations around the UpdateProcessorChain when overwriteDupes is set to true? I turned that off, which of course breaks the
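For reference, the deduplication setup discussed in this thread lives in an update processor chain in solrconfig.xml; a sketch along the lines of the wiki example Lee quotes (the fields list and signature class here are illustrative):

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- the signature is written into the uniqueKey field and
         overwriteDupes stays false: Solr then maintains uniqueness
         through its normal overwrite-by-id behaviour -->
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```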
Re: Index vs. Query Time Aware Filters
Could you post one of your pairs of definitions? I don't recognize queryMode and a web search doesn't turn anything up, so I'm puzzled. Best Erick On Wed, Jun 1, 2011 at 1:13 AM, Mike Schultz mike.schu...@gmail.com wrote: We have very long schema files for each of our language-dependent query shards. One thing that is doubling the configuration length of our main text processing field definition is that we have to repeat the exact same filter chain for the query-time version EXCEPT with a queryMode=true parameter. Is there a way for a filter to figure out whether it's the index-time or query-time version? A similar wish would be for the filter to be able to figure out the name of the field currently being indexed. This would allow a filter to set a parameter at runtime based on the field name, instead of boilerplate-copying the same filter chain definition in schema.xml EXCEPT for one parameter. The motivation is again to reduce errors and increase readability of the schema file.
Re: Query problem in Solr
If I read this correctly, one approach is to specify an increment gap on a multiValued field, then search for phrases with a slop less than that increment gap. I.e. positionIncrementGap="100" in your field definition, and search for "apple orange"~99. If this is gibberish, please post some examples and we'll try something else. Best Erick On Wed, Jun 1, 2011 at 4:21 AM, Kurt Sultana kurtanat...@gmail.com wrote: Hi all, We're using Solr to search on a Shop index and a Product index. Currently a Shop has a field `shop_keyword` which also contains the keywords of the products assigned to it. The shop keywords are separated by a space. Consequently, if there is a product with the keyword apple and another with orange, a search for shops matching `Apple AND Orange` returns the shop for these products. However, this is incorrect: we want a search for `Apple AND Orange` to return only shop(s) having a product with both apple and orange as keywords. We tried solving this by making shop keywords multi-valued and assigning the keywords of every product of the shop as a new value in shop keywords. However, as was confirmed in another post http://markmail.org/thread/xce4qyzs5367yplo#query:+page:1+mid:76eerw5yqev2aanu+state:results, Solr does not support requiring that all words match within the same value of a multi-valued field. (Hope I explained myself well.) How can we go about this? Ideally, we shouldn't have to change our search infrastructure dramatically. Thanks! Krt_Malta
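A sketch of Erick's suggestion (the type and field names are hypothetical): with a positionIncrementGap of 100 between successive values of the multiValued field, a phrase query with a slop of 99 cannot straddle two products' keyword lists, so matches stay within one value:

```xml
<fieldType name="text_gap" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="shop_keywords" type="text_gap" indexed="true" stored="true"
       multiValued="true"/>
```

The query would then be shop_keywords:"apple orange"~99.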
Re: Problem with caps and star symbol
Take a look here: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory I think you want generateWordParts=1, catenateWords=1 and preserveOriginal=1, but check it out with the admin/analysis page. Oh, and your index-time and query-time patterns for WDFF will probably be different, see the example schema. Best Erick On Wed, Jun 1, 2011 at 7:40 AM, Saumitra Chowdhury saumi...@smartitengineering.com wrote: Thanks for your point. I was really tripping over that issue. But now I need a bit more help. As far as I have noticed, in the case of a value like role_delete, WordDelimiterFilterFactory indexes two words, role and delete, and a search with either the term role or delete will include that document. Now, in the case of a value like role_delete, I want to index all four terms: [role_delete, roledelete, role, delete]. In total, both the original value and the words produced by WordDelimiterFilterFactory would be indexed. Is that possible? Can some additional filter combined with WordDelimiterFilterFactory do that, or can any other filter perform such an operation? On Tue, May 31, 2011 at 8:07 PM, Erick Erickson erickerick...@gmail.com wrote: I think you're tripping over the issue that wildcards aren't analyzed, they don't go through your analysis chain. So the casing matters. Try lowercasing the input and I believe you'll see more like what you expect... Best Erick On Mon, May 30, 2011 at 12:07 AM, Saumitra Chowdhury saumi...@smartitengineering.com wrote: I am sending some xml to understand the scenario.
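Erick's suggested settings would look something like this in the index-time analyzer chain (a sketch to verify on the admin/analysis page):

```xml
<!-- For role_delete: generateWordParts yields role, delete;
     catenateWords yields roledelete; preserveOriginal keeps role_delete -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" catenateWords="1" preserveOriginal="1"/>
```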
Re: collapse component with pivot faceting
You might have more luck going the other way, applying the field collapsing patch to trunk. This is currently being worked on, see: https://issues.apache.org/jira/browse/SOLR-2564 Best Erick On Wed, Jun 1, 2011 at 12:22 AM, Isha Garg isha.g...@orkash.com wrote: Hi, Actually I am currently using Solr version 3.0. I applied the field collapsing patch. Field collapsing works fine with collapse.facet=after for any facet.field, but when I try a facet.pivot query after collapse.facet=after it does not show any results. Also, pivot faceting is not present in Solr 3.0. So which pivot faceting patch should I use with Solr 3.0? Solr 4.0 supports pivot faceting but does not have the field collapsing feature. Can anyone guide me on which Solr version supports both field collapsing and pivot faceting? Thanks in Advance! Isha Garg On Tuesday 31 May 2011 07:39 PM, Erick Erickson wrote: Please provide a more detailed request. This is so general that it's hard to respond. What is the use-case you're trying to understand/implement? You might review: http://wiki.apache.org/solr/UsingMailingLists Best Erick On Mon, May 30, 2011 at 4:31 AM, Isha Garg isha.g...@orkash.com wrote: Hi All! Can anyone tell me how pivot faceting works in combination with field collapsing? Please guide me in this respect. Thanks! Isha Garg
Re: Edgengram
Hi Tomás, Thank you very much for your suggestion. I took another crack at it using your recommendation and it worked ideally. The only thing I had to change was

<analyzer type="query">
  <tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>

to

<analyzer type="query">
  <tokenizer class="solr.LowerCaseTokenizerFactory"/>
</analyzer>

The first did not produce any results but the second worked beautifully. Thanks! Brian Lamb 2011/5/31 Tomás Fernández Löbbe tomasflo...@gmail.com: ...or also use the LowerCaseTokenizerFactory at query time for consistency, but not the edge ngram filter. 2011/5/31 Tomás Fernández Löbbe tomasflo...@gmail.com: Hi Brian, I don't know if I understand what you are trying to achieve. You want the term query abcdefg to have an idf of 1 instead of 7? I think using the KeywordTokenizerFactory at query time should work. It would be something like:

<fieldType name="edgengram" class="solr.TextField" positionIncrementGap="1000">
  <analyzer type="index">
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

This way, at query time abcdefg won't be turned into a ab abc abcd abcde abcdef abcdefg. At index time it will. Regards, Tomás On Tue, May 31, 2011 at 1:07 PM, Brian Lamb brian.l...@journalexperts.com wrote:

<fieldType name="edgengram" class="solr.TextField" positionIncrementGap="1000">
  <analyzer>
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front"/>
  </analyzer>
</fieldType>

I believe I used that link when I initially set up the field and it worked great (and I'm still using it in other places). In this particular example however it does not appear to be practical for me. I mentioned that I have a similarity class that returns 1 for the idf, and in the case of an edgengram it returns 1 * length of the search string.
Thanks, Brian Lamb

On Tue, May 31, 2011 at 11:34 AM, bmdakshinamur...@gmail.com wrote: Can you specify the analyzer you are using for your queries? Maybe you could use a KeywordAnalyzer for your queries so you don't end up matching parts of your query. http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/ This should help you.

On Tue, May 31, 2011 at 8:24 PM, Brian Lamb brian.l...@journalexperts.com wrote: In this particular case, I will be doing a solr search based on user preferences. So I will not be depending on the user to type abcdefg. That will be automatically generated based on user selections. The contents of the field do not contain spaces, and since I am creating the search parameters, case isn't important either. Thanks, Brian Lamb

On Tue, May 31, 2011 at 9:44 AM, Erick Erickson erickerick...@gmail.com wrote: That'll work for your case, although be aware that string types aren't analyzed at all, so case matters, as do spaces etc. What is the use-case here? If you explain it a bit there might be better answers. Best Erick

On Fri, May 27, 2011 at 9:17 AM, Brian Lamb brian.l...@journalexperts.com wrote: For this, I ended up just changing it to string and using abcdefg* to match. That seems to work so far. Thanks, Brian Lamb

On Wed, May 25, 2011 at 4:53 PM, Brian Lamb brian.l...@journalexperts.com wrote: Hi all, I'm running into some confusion with the way edgengram works. I have the field set up as:

<fieldType name="edgengram" class="solr.TextField" positionIncrementGap="1000">
  <analyzer>
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="100" side="front"/>
  </analyzer>
</fieldType>

I've also set up my own similarity class that returns 1 as the idf score.
What I've found this does is if I match a string abcdefg against a field containing abcdefghijklmnop, then the idf will score that as a 7: 7.0 = idf(myfield: a=51 ab=23 abc=2 abcd=2 abcde=2 abcdef=2 abcdefg=2) I get why that's happening, but is there a way to avoid that? Do I need to do a new field type to achieve the desired effect? Thanks, Brian Lamb -- Thanks and Regards, DakshinaMurthy BM
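To make the scoring above concrete, here is a plain-Python sketch (not Lucene code) of the front edge n-grams the filter emits. With the n-gram filter in the query analysis chain, abcdefg expands to seven terms, which is why a similarity returning idf=1 per term still sums to 7; with KeywordTokenizerFactory and no n-gram filter at query time, the query stays one term.

```python
# Plain-Python sketch (not Lucene code) of what EdgeNGramFilterFactory
# with side="front" does to a single term at analysis time.
def edge_ngrams(term, min_gram=1, max_gram=25):
    """Return the front edge n-grams of a term, shortest first."""
    return [term[:n] for n in range(min_gram, min(max_gram, len(term)) + 1)]

# "abcdefg" becomes 7 query terms, so idf=1 per matching term sums to 7.
print(edge_ngrams("abcdefg"))
```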
Re: Synonyms valid only in specific categories of data
Yes that would probably be a lot of fields.. I guess a way would be to extend the SynonymFilter and change the format of the synonyms.txt file to take the categories into account. Thanks again for your answer. From: lee carroll lee.a.carr...@googlemail.com To: solr-user@lucene.apache.org Sent: Wednesday, June 1, 2011 12:23 PM Subject: Re: Synonyms valid only in specific categories of data I don't think you can assign a synonyms file dynamically to a field. you would need to create multiple fields for each lang / cat phrases and have their own synonyms file referenced for each field. that would be a lot of fields. On 1 June 2011 09:59, Spyros Kapnissis ska...@yahoo.com wrote: Hello to all, I have a collection of text phrases in more than 20 languages that I'm indexing in solr. Each phrase belongs to one of about 30 different phrase categories. I have specified different fields for each language and added a synonym filter at query time. I would however like the synonym filter to take into account the category as well. So, a specific synonym should be valid and used only in one or more categories per language. (the category is indexed in another field). Is this somehow possible in the current SynonymFilterFactory implementation? Hope it makes sense. Thank you, Spyros
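The per-category idea Spyros describes, extending the synonym lookup to take the category into account, could look roughly like this. This is a plain-Python sketch with invented names and an invented file layout, purely to illustrate the lookup a category-aware SynonymFilter would perform; it is not Solr code.

```python
# Hypothetical sketch: synonym expansion keyed on (language, category)
# instead of language alone. The data layout is invented for illustration.
SYNONYMS = {
    ("en", "lodging"): {"inn": ["hotel", "guesthouse"]},
    ("en", "dining"):  {"inn": ["tavern"]},  # same term, different category
}

def expand(term, lang, category):
    """Return the term plus any synonyms valid for this language/category."""
    return [term] + SYNONYMS.get((lang, category), {}).get(term, [])

print(expand("inn", "en", "lodging"))
print(expand("inn", "en", "dining"))
```

In Solr terms, the category (already indexed in another field) would have to reach the filter at query time, which is why the thread concludes that either many fields or a custom filter is needed.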
Re: Solr vs ElasticSearch
I'm likely to try playing with moving cores between hosts soon. In theory it shouldn't be hard. We'll see what the practice is like! Right, in theory it's quite simple, in practice I've setup a master, then a slave, then had to add replication to both, then call create core, then replicate, then unload core on the master. It's nightmarish to setup. The problem is, it freezes each core into a respective role, so if I wanted to then 'move' the slave, I can't because it's still setup as a slave. On Wed, Jun 1, 2011 at 4:14 AM, Upayavira u...@odoko.co.uk wrote: On Tue, 31 May 2011 19:38 -0700, Jason Rutherglen jason.rutherg...@gmail.com wrote: Mark, Nice email address. I personally have no idea, maybe ask Shay Banon to post an answer? I think it's possible to make Solr more elastic, eg, it's currently difficult to make it move cores between servers without a lot of manual labor. I'm likely to try playing with moving cores between hosts soon. In theory it shouldn't be hard. We'll see what the practice is like! Upayavira --- Enterprise Search Consultant at Sourcesense UK, Making Sense of Open Source
Re: Solr vs ElasticSearch
On Wed, 01 Jun 2011 07:52 -0700, Jason Rutherglen jason.rutherg...@gmail.com wrote: I'm likely to try playing with moving cores between hosts soon. In theory it shouldn't be hard. We'll see what the practice is like! Right, in theory it's quite simple, in practice I've setup a master, then a slave, then had to add replication to both, then call create core, then replicate, then unload core on the master. It's nightmarish to setup. The problem is, it freezes each core into a respective role, so if I wanted to then 'move' the slave, I can't because it's still setup as a slave.

Yep, I'm expecting it to require some changes to both the CoreAdminHandler and the ReplicationHandler. Probably the ReplicationHandler would need a 'one-off' replication command. And some way to delete the core when it has been transferred. Upayavira --- Enterprise Search Consultant at Sourcesense UK, Making Sense of Open Source
Re: Solr vs ElasticSearch
And some way to delete the core when it has been transferred. Right, I manually added that to CoreAdminHandler. I opened an issue to try to solve this problem: SOLR-2569
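The manual sequence described in this thread (create a core on the target, pull the index from the source, then unload the source core) maps onto CoreAdmin and ReplicationHandler HTTP calls. The sketch below only builds the URLs involved; host names and paths are placeholders, and the parameter names are the 1.4/3.x-era ones, so treat this as an illustration of the sequence rather than a working migration tool.

```python
# Sketch of the manual "move a core" sequence from the thread, expressed as
# the HTTP endpoints involved. Hosts/paths are placeholders.
from urllib.parse import urlencode

def create_core(host, name, instance_dir):
    """CoreAdmin CREATE on the destination host."""
    return f"http://{host}/solr/admin/cores?" + urlencode(
        {"action": "CREATE", "name": name, "instanceDir": instance_dir})

def pull_index(host, core, master_url):
    """One-off replication pull; masterUrl overrides the configured master."""
    return f"http://{host}/solr/{core}/replication?" + urlencode(
        {"command": "fetchindex", "masterUrl": master_url})

def unload_core(host, name):
    """CoreAdmin UNLOAD on the source host once the index has moved."""
    return f"http://{host}/solr/admin/cores?" + urlencode(
        {"action": "UNLOAD", "core": name})
```

As the thread notes, the catch is that `fetchindex` is only accepted by a core configured as a slave, which is what freezes the roles.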
Re: Edgengram
Be a little careful here. LowerCaseTokenizerFactory is different from KeywordTokenizerFactory. LowerCaseTokenizerFactory will give you more than one term. e.g. the string Intelligence can't be MeaSurEd will give you 5 terms, any of which may match, i.e. intelligence, can, t, be, measured. Whereas KeywordTokenizerFactory followed by, say, LowerCaseFilter would give you exactly one token: intelligence can't be measured. So searching for measured would get a hit in the first case but not in the second. Searching for intellig* would hit both. Neither is better, just make sure they do what you want! This page will help a lot: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LowerCaseTokenizerFactory as will the admin/analysis page. Best Erick

On Wed, Jun 1, 2011 at 10:43 AM, Brian Lamb brian.l...@journalexperts.com wrote: Hi Tomás, Thank you very much for your suggestion. I took another crack at it using your recommendation and it worked ideally.
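Erick's distinction between the two tokenizers can be seen with a toy re-implementation. This is plain Python, not Lucene code, and the letter-matching is an ASCII-only approximation of the real LowerCaseTokenizer.

```python
import re

# Toy illustration (not Lucene code) of the two analysis chains discussed:
# LowerCaseTokenizer splits on anything that isn't a letter and lowercases;
# KeywordTokenizer + a lowercase filter keeps the whole input as one token.
def lowercase_tokenizer(text):
    # ASCII-only approximation of LowerCaseTokenizerFactory
    return re.findall(r"[a-z]+", text.lower())

def keyword_then_lowercase(text):
    # KeywordTokenizerFactory followed by a lowercase filter
    return [text.lower()]

s = "Intelligence can't be MeaSurEd"
print(lowercase_tokenizer(s))     # ['intelligence', 'can', 't', 'be', 'measured']
print(keyword_then_lowercase(s))  # ["intelligence can't be measured"]
```

Running both on the same input makes it obvious why "measured" matches only the first chain.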
Re: Solr vs ElasticSearch
On 6/1/2011 10:52 AM, Jason Rutherglen wrote: nightmarish to setup. The problem is, it freezes each core into a respective role, so if I wanted to then 'move' the slave, I can't because it's still setup as a slave. Don't know if this helps or not, but you CAN set up a core as both a master and a slave. Normally this is to make it a repeater, still always taking from the same upstream and sending downstream. But there might be a way to hack it for your needs without actually changing Java code: a core _can_ be both a master and slave simultaneously, and there might be a way to change its masterURL (where it pulls from when acting as a slave) without restarting the core too. You can supply a 'custom' (not configured) masterURL in a manual 'pull' command (over HTTP), but of course usually slaves poll rather than be directed by manual 'pull' commands.
Re: Solr vs ElasticSearch
On 6/1/2011 11:26 AM, Upayavira wrote: Probably the ReplicationHandler would need a 'one-off' replication command... It's got one already, if you mean a command you can issue to a slave to tell it to pull replication right now. The thing is, you can only issue this command if the core is configured as a slave. You can turn off polling though. You can include a custom masterURL in the one-off pull command, which overrides whatever masterURL is configured in the core --- but you still need a masterURL configured in the core, or Solr will complain on startup if the core is configured as slave without a masterURL. (And if it's not configured as a slave, you can't issue the one-off pull command.) This is all from my experience on 1.4; don't know if things change in 3.1, probably not.
Re: Solr vs ElasticSearch
Jonathan, This is all true, however it ends up being hacky (this is from experience) and the core on the source needs to be deleted. Feel free to post to the issue. Jason
Re: What's your query result cache's stats?
On 5/31/2011 3:02 PM, Markus Jelsma wrote: Hi, I've seen the stats page many times, of quite a few installations and even more servers. There's one issue that keeps bothering me: the cumulative hit ratio of the query result cache, it's almost never higher than 50%. What are your stats? How do you deal with it?

Below are my stats. I will be lowering my warm counts dramatically when I respin for 3.1. The 28 second warm time is too high for me. I don't think it's going to make a lot of difference in performance. I think most of the warming benefit is realized after the first few queries.

queryResultCache: Concurrent LRU Cache(maxSize=1024, initialSize=1024, minSize=921, acceptableSize=972, cleanupThread=true, autowarmCount=64, regenerator=org.apache.solr.search.SolrIndexSearcher$3@60c0c8b5)
lookups : 932
hits : 528
hitratio : 0.56
inserts : 403
evictions : 0
size : 449
warmupTime : 28198
cumulative_lookups : 980357
cumulative_hits : 622726
cumulative_hitratio : 0.63
cumulative_inserts : 369692
cumulative_evictions : 83711

documentCache: LRU Cache(maxSize=16384, initialSize=4096)
lookups : 68543
hits : 57286
hitratio : 0.83
inserts : 11357
evictions : 0
size : 11357
warmupTime : 0
cumulative_lookups : 219118491
cumulative_hits : 179119106
cumulative_hitratio : 0.81
cumulative_inserts : 3385
cumulative_evictions : 32833254

filterCache: LRU Cache(maxSize=512, initialSize=512, autowarmCount=32, regenerator=org.apache.solr.search.SolrIndexSearcher$2@6910b640)
lookups : 859
hits : 464
hitratio : 0.54
inserts : 465
evictions : 0
size : 464
warmupTime : 27747
cumulative_lookups : 682600
cumulative_hits : 355130
cumulative_hitratio : 0.52
cumulative_inserts : 327479
cumulative_evictions : 161624
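As a sanity check on the numbers above: hitratio is just hits divided by lookups, and the stats page appears to truncate (not round) to two decimal places. A plain-Python recomputation, not Solr code:

```python
# Recompute the hitratio figures from the stats above. The displayed value
# appears to be truncated, not rounded, to two decimal places.
def hit_ratio(hits, lookups):
    return int(hits * 100 // lookups) / 100

print(hit_ratio(528, 932))         # queryResultCache, current searcher: 0.56
print(hit_ratio(622726, 980357))   # queryResultCache, cumulative: 0.63
print(hit_ratio(355130, 682600))   # filterCache, cumulative: 0.52
```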
Re: What's your query result cache's stats?
I believe you need SOME query cache even with low hit counts, for things like a user paging through results. You want the query to still be in the cache when they go to the next page or what have you. Other operations like this may depend on the query cache too for good performance. So even with a low hit rate, you still want enough query cache that all the current queries, all the queries someone is in the middle of doing something with and may do more with, can stay in the cache (what things those are can depend on your particular client interface). So the cache hit count may not actually be a good guide to sizing your query cache. Correct me if I'm wrong, but this is what I've been thinking.
Re: Solr memory consumption
Here is output after about 24 hours running solr. Maybe there is some way to limit memory consumption? :(

test@d6 ~/solr/example $ java -Xms3g -Xmx6g -D64 -Dsolr.solr.home=/home/test/solr/example/multicore/ -jar start.jar
2011-05-31 17:05:14.265:INFO::Logging to STDERR via org.mortbay.log.StdErrLog
2011-05-31 17:05:14.355:INFO::jetty-6.1-SNAPSHOT
2011-05-31 17:05:16.447:INFO::Started SocketConnector@0.0.0.0:4900
#
# A fatal error has been detected by the Java Runtime Environment:
#
# java.lang.OutOfMemoryError: requested 32744 bytes for ChunkPool::allocate. Out of swap space?
#
# Internal Error (allocation.cpp:117), pid=17485, tid=1090320704
# Error: ChunkPool::allocate
#
# JRE version: 6.0_17-b17
# Java VM: OpenJDK 64-Bit Server VM (14.0-b16 mixed mode linux-amd64 )
# Derivative: IcedTea6 1.7.5
# Distribution: Custom build (Wed Oct 13 13:04:40 EDT 2010)
# An error report file with more information is saved as:
# /mnt/data/solr/example/hs_err_pid17485.log
#
# If you would like to submit a bug report, please include
# instructions how to reproduce the bug and visit:
# http://icedtea.classpath.org/bugzilla
#
Aborted

I run multiple-core solr with flags: -Xms3g -Xmx6g -D64, but i see this in top after 6-8 hours and still raising:

17485 test214 10.0g 7.4g 9760 S 308.2 31.3 448:00.75 java -Xms3g -Xmx6g -D64 -Dsolr.solr.home=/home/test/solr/example/multicore/ -jar start.jar

Are there any ways to limit memory for sure? Thanks
Re: Solr vs ElasticSearch
On Wed, 01 Jun 2011 11:47 -0400, Jonathan Rochkind rochk...@jhu.edu wrote: It's got one already, if you mean a command you can issue to a slave to tell it to pull replication right now. The thing is, you can only issue this command if the core is configured as a slave.

Right, but this wouldn't be a slave - so I'd want to wire the destination core so that it can accept a 'pull request' without being correctly configured. Stuff to look at. Upayavira
Re: Index vs. Query Time Aware Filters
I should have explained that the queryMode parameter is for our own custom filter. So the result is that we have 8 filters in our field definition. All the filter parameters (30 or so) of the query time and index time are identical EXCEPT for our one custom filter, which needs to know if it's in query time or index time mode. If we could determine inside our custom code whether we're indexing or querying, then we could omit the query time definition entirely, save about 50 lines of configuration, and be much less error prone. One possible solution would be if we could get at the SolrCore from within a filter. Then at init time we could iterate through the filter chains and determine when we find a factory == this. (I've done this in other places where it's useful to know the name of a ValueSourceParser, for example.)
Re: K-Stemmer for Solr 3.1
Thanks. I'll have to create a Jira account to vote I guess. We are already using KStemmer in 1.4.2 production and I would like to upgrade to 3.1. In the meantime, what is another stemmer I could use out of the box that would behave similarly to KStemmer? Thanks

On 5/28/11 10:02 AM, Steven A Rowe wrote: Hi Mark, Yonik Seeley indicated on LUCENE-152 that he is considering contributing Lucid's KStemmer version to Lucene: https://issues.apache.org/jira/browse/LUCENE-152?focusedCommentId=13035647&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13035647 You can vote on the issue to communicate your interest. Steve

-----Original Message----- From: Mark [mailto:static.void@gmail.com] Sent: Friday, May 27, 2011 7:31 PM To: solr-user@lucene.apache.org Subject: Re: K-Stemmer for Solr 3.1 Where can one find the KStemmer source for 4.0?

On 5/12/11 11:28 PM, Bernd Fehling wrote: I backported a Lucid KStemmer version from solr 4.0 which I found somewhere. Just changed from import org.apache.lucene.analysis.util.CharArraySet; // solr4.0 to import org.apache.lucene.analysis.CharArraySet; // solr3.1 Bernd

On 12.05.2011 16:32, Mark wrote: java.lang.AbstractMethodError: org.apache.lucene.analysis.TokenStream.incrementToken()Z Would you mind explaining your modifications? Thanks

On 5/11/11 11:14 PM, Bernd Fehling wrote: On 12.05.2011 02:05, Mark wrote: It appears that the older version of the Lucid Works KStemmer is incompatible with Solr 3.1. Has anyone been able to get this to work? If not, what are you using as an alternative? Thanks Lucid KStemmer works nice with Solr 3.1 after some minor mods to KStemFilter.java and KStemFilterFactory.java. What problems do you have? Bernd
Re: Solr memory consumption
Are you in fact out of swap space, as the java error suggested? The way JVMs work, always: if you tell it -Xmx6g, it WILL use all 6g eventually. The JVM doesn't garbage collect until it's going to run out of heap space, until it gets to your Xmx. It will keep using RAM until it reaches your Xmx. If your Xmx is set so high you don't have enough RAM available, that will be a problem; you don't want to set Xmx like this. Ideally you don't even want to swap, but normally the OS will swap to give you enough RAM if necessary. If you don't have swap space for it to do that, to give the JVM the 6g you've configured it to take, well, that seems to be what the Java error message is telling you. Of course sometimes error messages are misleading. But yes, if you set Xmx to 6G, the process WILL use all 6G eventually. This is just how the JVM works.

On 6/1/2011 12:15 PM, Denis Kuzmenok wrote: # java.lang.OutOfMemoryError: requested 32744 bytes for ChunkPool::allocate. Out of swap space? Are there any ways to limit memory for sure?
Re: Solr vs ElasticSearch
You _could_ configure it as a slave, if you plan to sometimes use it as a slave. It can be configured as both a master and a slave. You can configure it as a slave, but turn off automatic polling. And then issue one-off replicate commands whenever you want. But yeah, it gets messy, your use case is definitely not what ReplicationHandler is expecting, definitely some Java improvements would be nice, agreed.

On 6/1/2011 12:20 PM, Upayavira wrote: Right, but this wouldn't be a slave - so I'd want to wire the destination core so that it can accept a 'pull request' without being correctly configured. Stuff to look at. Upayavira
Re: Solr memory consumption
So what should i do to avoid that error? I can use 10G on server, now i try to run with flags: java -Xms6G -Xmx6G -XX:MaxPermSize=1G -XX:PermSize=512M -D64. Or should i set Xmx to lower numbers, and what about other params? Sorry, I don't know much about java/jvm =(

Wednesday, June 1, 2011, 7:29:50 PM, you wrote: Are you in fact out of swap space, as the java error suggested?
best way to update custom fieldcache after index commit?
Hi, We use solr and the lucene fieldcache like this: static DocTerms myfieldvalues = org.apache.lucene.search.FieldCache.DEFAULT.getTerms(reader, myField); which is initialized at first use and will stay in memory for fast retrieval of field values based on DocID. The problem is after an index/commit, the lucene fieldcache is reloaded in the new searcher, but this static list needs to be updated as well. What is the best way to handle this? Basically we want to update those custom fieldcaches whenever there is a commit. The possible solutions I can think of: 1) manually call a request handler to clean up those custom caches after commit, which is a hack and ugly. 2) use some listener event (not sure whether I can use the newSearcher event listener in Solr); also there seems to be a lucene ticket (https://issues.apache.org/jira/browse/LUCENE-2474, Allow to plug in a Cache Eviction Listener to IndexReader to eagerly clean custom caches that use the IndexReader (getFieldCacheKey)), though it's not clear to me how to use it. Any of your suggestions/comments is much appreciated. Thanks! oleole
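A common pattern for this problem is to stop holding the value in a static and instead key the cache on the reader it was built from (which is what getFieldCacheKey enables on the Lucene side): a commit opens a new reader, the lookup misses, and the value is rebuilt while stale entries are dropped. The sketch below is a toy Python illustration of that pattern, not Solr/Lucene API code.

```python
# Toy illustration (not Solr/Lucene code) of keying a custom cache on the
# reader it was built from, so a new searcher after a commit naturally
# triggers a rebuild and stale entries are evicted.
class PerReaderCache:
    def __init__(self, loader):
        self.loader = loader  # function(reader_key) -> expensive value
        self.cache = {}       # reader_key -> value

    def get(self, reader_key):
        if reader_key not in self.cache:
            self.cache.clear()  # drop values built from stale readers
            self.cache[reader_key] = self.loader(reader_key)
        return self.cache[reader_key]

loads = []
cache = PerReaderCache(lambda r: loads.append(r) or f"terms@{r}")
cache.get("reader1")
cache.get("reader1")  # cached, no reload
cache.get("reader2")  # new reader after commit -> rebuild
print(loads)  # ['reader1', 'reader2']
```

In Solr this maps naturally onto a newSearcher listener or onto LUCENE-2474's eviction listener, as the original mail suggests.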
Re: Solr memory consumption
There is no simple answer. All I can say is that you don't usually want an Xmx larger than the RAM you actually have available, and you _can't_ use more than you have in RAM+swap; the Java error seems to be suggesting you are using more than is available in RAM+swap. That may not be what's going on; JVM memory issues are indeed confusing. Why don't you start smaller and see what happens? But if you end up needing more RAM for your Solr than you have available on the server, then you're just going to need more RAM. You may have to learn something about Java/the JVM to do memory tuning for Solr. Or just start with the default parameters from the Solr example Jetty, and if you don't run into any problems, great. Starting with the example Jetty shipped with Solr would be the easiest way to get started for someone who doesn't know much about Java/the JVM.

On 6/1/2011 12:37 PM, Denis Kuzmenok wrote: So what should I do to avoid that error? I can use 10G on the server; now I try to run with flags: java -Xms6G -Xmx6G -XX:MaxPermSize=1G -XX:PermSize=512M -D64. Or should I set Xmx to lower numbers, and what about the other params? Sorry, I don't know much about Java/JVM =(

Wednesday, June 1, 2011, 7:29:50 PM, you wrote: Are you in fact out of swap space, as the Java error suggested? The way JVMs work, if you tell it -Xmx6g, it WILL use all 6g eventually. The JVM doesn't garbage collect until it's about to run out of heap space, i.e. until it gets to your Xmx; it will keep using RAM until it reaches your Xmx. If your Xmx is set so high that you don't have enough RAM available, that will be a problem; you don't want to set Xmx like this. Ideally you don't even want to swap, but normally the OS will swap to give you enough RAM if necessary. If you don't have swap space for it to do that, to give the JVM the 6g you've configured it to take, well, that seems to be what the Java error message is telling you. Of course, sometimes error messages are misleading.

But yes, if you set Xmx to 6G, the process WILL use all 6G eventually. This is just how the JVM works.
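As a concrete sketch of the "start smaller, from the example defaults" advice above (the 2g heap below is an illustrative assumption, not a recommendation for any particular box):

```shell
# Check what is actually free before picking an Xmx value
free -m

# Start the example Jetty shipped with Solr with a deliberately modest,
# fixed-size heap; raise it only if you hit OutOfMemoryError under real load.
cd apache-solr-1.4.0/example
java -Xms2g -Xmx2g -jar start.jar
```

Setting Xms equal to Xmx just avoids heap-resizing pauses; the important part is keeping Xmx below the RAM actually free on the machine.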
Re: Solr memory consumption
Overall memory on the server is 24G, with 24G of swap; most of the time the swap is free and not used at all, which is why "no free swap" sounds strange to me.
Re: Solr memory consumption
PermSize and MaxPermSize don't need to be higher than 64M. You should read up on JVM tuning; the permanent generation is only used for the code that's being executed.

So what should I do to avoid that error? I can use 10G on the server; now I try to run with flags: java -Xms6G -Xmx6G -XX:MaxPermSize=1G -XX:PermSize=512M -D64. Or should I set Xmx to lower numbers, and what about the other params? Sorry, I don't know much about Java/JVM =(
Re: Solr memory consumption
Could be related to your crazy high MaxPermSize, like Marcus said. I'm no JVM tuning expert either; few people are, it's confusing. So if you don't understand it either, why are you trying to throw in very non-standard parameters you don't understand? Just start with whatever the Solr example Jetty has, and only change things if you have a reason to (that you understand).

On 6/1/2011 1:19 PM, Denis Kuzmenok wrote: Overall memory on the server is 24G, with 24G of swap; most of the time the swap is free and not used at all, which is why "no free swap" sounds strange to me.
Limit data stored from fmap.content with Solr cell
Hello everyone, I have just gotten into extracting information from files with Solr Cell. Some of the files we are indexing are large and have a lot of content. I would like to limit the amount of data I index to a specified number of characters (for example, 300 chars), which I will use as a document preview. Is it possible to set this as a parameter with the fmap.content param, or must I index it all and then do a copyField, but with just a specified number of characters? Thanks in advance Greg
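One way to get the preview without limiting the extraction itself (a schema.xml sketch; field names and types here are assumptions) is to let Solr Cell fill the full-content field and copy only the first 300 characters into a stored preview field, since copyField supports a maxChars attribute:

```xml
<!-- schema.xml sketch: "content" receives the full Tika/Solr Cell extraction;
     "preview" stores only the first 300 characters for display. -->
<field name="content" type="text" indexed="true" stored="false"/>
<field name="preview" type="text" indexed="false" stored="true"/>
<copyField source="content" dest="preview" maxChars="300"/>
```

Note maxChars truncates what is copied, not what is indexed in the source field.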
Re: Solr memory consumption
There were no parameters at all, and Java hit out-of-memory almost every day; then I tried adding parameters but nothing changed. Xms/Xmx didn't solve the problem either. Now I'm trying MaxPermSize, because it's the last thing I haven't tried yet :(

Wednesday, June 1, 2011, 9:00:56 PM, you wrote: Could be related to your crazy high MaxPermSize, like Marcus said. I'm no JVM tuning expert either; few people are, it's confusing.
Newbie question: how to deal with different # of search results per page due to pagination then grouping
Apologize if this question has already been raised. I tried searching but couldn't find the relevant posts. We've indexed a bunch of documents by different authors. Then for search results, we'd like to show the authors that have 1 or more documents matching the search keywords. The problem is right now our solr search method first paginates results to 100 documents per page, then we take the results and group by authors. This results in different number of authors per page. (Some authors may only have one matching document and others 5 or 10.) How do we change it to somehow show the same number of authors (say 25) per page? I mean alternatively we could just show all the documents themselves ordered by author, but it's not the user experience we're looking for. Thanks so much. And please let me know if you need more details not provided here. B -- View this message in context: http://lucene.472066.n3.nabble.com/Newbie-question-how-to-deal-with-different-of-search-results-per-page-due-to-pagination-then-grouping-tp3012168p3012168.html Sent from the Solr - User mailing list archive at Nabble.com.
Change default scoring formula
Hi All, I need to change the default scoring formula of Solr. How should I hack the code to do so? Also, is there any way to stop Solr from doing its default scoring and sorting? Thanks, Gaurav -- View this message in context: http://lucene.472066.n3.nabble.com/Change-default-scoring-formula-tp3012196p3012196.html Sent from the Solr - User mailing list archive at Nabble.com.
Debugging a Solr/Jetty Hung Process
About once a day a Solr/Jetty process gets hung on my server, consuming 100% of one of the CPUs. Once this happens the server no longer responds to requests. I've looked through the logs to try to see if anything stands out, but so far I've found nothing out of the ordinary. My current remedy is to log in and just kill the single process that's hung. Once that happens everything goes back to normal and I'm good for a day or so. I'm currently running the following: solr-jetty-1.4.0+ds1-1ubuntu1, which is comprised of Solr 1.4.0 and Jetty 6.1.22, on Ubuntu 10.10. I'm pretty new to managing a Jetty/Solr instance, so at this point I'm just looking for advice on how I should go about troubleshooting this problem. Chris
Re: Debugging a Solr/Jetty Hung Process
Taking a thread dump will tell you what's going on. Bill

On Wed, Jun 1, 2011 at 3:04 PM, Chris Cowan chrisco...@plus3network.com wrote: About once a day a Solr/Jetty process gets hung on my server, consuming 100% of one of the CPUs. Once this happens the server no longer responds to requests.
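For reference, a thread dump can be taken from a shell while the process is hung; these are standard JDK mechanisms, shown as a sketch (replace <pid> with the Jetty process id):

```shell
# Find the Solr/Jetty process id
ps aux | grep start.jar

# SIGQUIT makes the JVM print a full thread dump to its stdout/log
# without killing the process
kill -3 <pid>

# Alternatively, jstack (shipped with the JDK) writes the dump to a file
jstack <pid> > /tmp/solr-threads.txt
```

Nothing needs to be configured ahead of time; both work on an already-running JVM.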
Re: CLOSE_WAIT after connecting to multiple shards from a primary shard
Hi Otis, Sending to the solr-user mailing list. We see these CLOSE_WAIT connections even when I do a simple HTTP request via curl, that is, even a simple curl using a primary and secondary shard query, e.g.:

curl http://primaryshardhost:8180/solr/core0/select?q=*%3A*&shards=secondaryshardhost1:8090/solr/appgroup1_11053000_11053100

While fetching data it is in ESTABLISHED state:

-sh-3.2$ netstat | grep ESTABLISHED | grep 8090
tcp 0 0 primaryshardhost:36805 secondaryshardhost1:8090 ESTABLISHED

After the request has come back, it is in CLOSE_WAIT state:

-sh-3.2$ netstat | grep CLOSE_WAIT | grep 8090
tcp 1 0 primaryshardhost:36805 secondaryshardhost1:8090 CLOSE_WAIT

Why does Solr keep the connection to the shards in CLOSE_WAIT? Is this a feature of Solr? If we modify an OS property (I don't know how) to clean up the CLOSE_WAITs, will it cause an issue with subsequent searches? Can someone help me please? thanks, Mukunda

On Mon, May 30, 2011 at 5:59 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi, A few things: 1) why not send this to the Solr list? 2) you talk about searching, but the code sample is about optimizing the index. 3) I don't have the SolrJ API in front of me, but isn't there a CommonsHttpSolrServer ctor that takes in a URL instead of an HttpClient instance? Try that one. Otis - Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/

- Original Message From: Mukunda Madhava mukunda...@gmail.com To: gene...@lucene.apache.org Sent: Mon, May 30, 2011 1:54:07 PM Subject: CLOSE_WAIT after connecting to multiple shards from a primary shard

Hi, We have a primary Solr shard and multiple secondary shards. We query data from the secondary shards by specifying the shards param in the query params. But we found that after receiving the data, there are a large number of CLOSE_WAITs on the secondary shards from the primary shards, e.g.:

tcp 1 0 primaryshardhost:56109 secondaryshardhost1:8090 CLOSE_WAIT
tcp 1 0 primaryshardhost:51049 secondaryshardhost1:8090 CLOSE_WAIT
tcp 1 0 primaryshardhost:49537 secondaryshardhost1:8089 CLOSE_WAIT
tcp 1 0 primaryshardhost:44109 secondaryshardhost2:8090 CLOSE_WAIT
tcp 1 0 primaryshardhost:32041 secondaryshardhost2:8090 CLOSE_WAIT
tcp 1 0 primaryshardhost:48533 secondaryshardhost2:8089 CLOSE_WAIT

We open the Solr connections as below:

SimpleHttpConnectionManager cm = new SimpleHttpConnectionManager(true);
cm.closeIdleConnections(0L);
HttpClient httpClient = new HttpClient(cm);
solrServer = new CommonsHttpSolrServer(url, httpClient);
solrServer.optimize();

But still we see these issues. Any ideas? -- Thanks, Mukunda
Re: Debugging a Solr/Jetty Hung Process
I'm pretty green... is that something I can do while the event is happening, or is there something I need to configure ahead of time to capture the dump? I've tried to reproduce the problem by putting the server under load, but that doesn't seem to be the trigger. Chris

On Jun 1, 2011, at 12:06 PM, Bill Au wrote: Taking a thread dump will tell you what's going on. Bill
Re: Debugging a Solr/Jetty Hung Process
Sorry... I just found it. I will try that next time. I have a feeling it won't work, since the server usually stops accepting connections. Chris

On Jun 1, 2011, at 12:12 PM, Chris Cowan wrote: I'm pretty green... is that something I can do while the event is happening, or is there something I need to configure ahead of time to capture the dump?
Re: Edgengram
I think in my case LowerCaseTokenizerFactory will be sufficient, because there will never be spaces in this particular field. But thank you for the useful link! Thanks, Brian Lamb

On Wed, Jun 1, 2011 at 11:44 AM, Erick Erickson erickerick...@gmail.com wrote: Be a little careful here. LowerCaseTokenizerFactory is different from KeywordTokenizerFactory. LowerCaseTokenizerFactory will give you more than one term, e.g. the string "Intelligence can't be MeaSurEd" will give you 5 terms, any of which may match, i.e. intelligence, can, t, be, measured. Whereas KeywordTokenizerFactory followed by, say, LowerCaseFilter would give you exactly one token: "intelligence can't be measured". So searching for measured would get a hit in the first case but not in the second. Searching for intellig* would hit both. Neither is better; just make sure they do what you want! This page will help a lot: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LowerCaseTokenizerFactory as will the admin/analysis page. Best Erick

On Wed, Jun 1, 2011 at 10:43 AM, Brian Lamb brian.l...@journalexperts.com wrote: Hi Tomás, Thank you very much for your suggestion. I took another crack at it using your recommendation and it worked ideally. The only thing I had to change was

<analyzer type="query">
  <tokenizer class="solr.KeywordTokenizerFactory" />
</analyzer>

to

<analyzer type="query">
  <tokenizer class="solr.LowerCaseTokenizerFactory" />
</analyzer>

The first did not produce any results but the second worked beautifully. Thanks! Brian Lamb

2011/5/31 Tomás Fernández Löbbe tomasflo...@gmail.com: ...or also use the LowerCaseTokenizerFactory at query time for consistency, but not the edge ngram filter.

2011/5/31 Tomás Fernández Löbbe tomasflo...@gmail.com: Hi Brian, I don't know if I understand what you are trying to achieve. You want the term query abcdefg to have an idf of 1 instead of 7? I think using the KeywordTokenizerFactory at query time should work.
It would be something like:

<fieldType name="edgengram" class="solr.TextField" positionIncrementGap="1000">
  <analyzer type="index">
    <tokenizer class="solr.LowerCaseTokenizerFactory" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory" />
  </analyzer>
</fieldType>

This way, at query time abcdefg won't be turned into a ab abc abcd abcde abcdef abcdefg. At index time it will. Regards, Tomás

On Tue, May 31, 2011 at 1:07 PM, Brian Lamb brian.l...@journalexperts.com wrote:

<fieldType name="edgengram" class="solr.TextField" positionIncrementGap="1000">
  <analyzer>
    <tokenizer class="solr.LowerCaseTokenizerFactory" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front" />
  </analyzer>
</fieldType>

I believe I used that link when I initially set up the field and it worked great (and I'm still using it in other places). In this particular example, however, it does not appear to be practical for me. I mentioned that I have a similarity class that returns 1 for the idf, and in the case of an edgengram, it returns 1 * length of the search string. Thanks, Brian Lamb

On Tue, May 31, 2011 at 11:34 AM, bmdakshinamur...@gmail.com wrote: Can you specify the analyzer you are using for your queries? Maybe you could use a KeywordAnalyzer for your queries so you don't end up matching parts of your query. http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/ This should help you.

On Tue, May 31, 2011 at 8:24 PM, Brian Lamb brian.l...@journalexperts.com wrote: In this particular case, I will be doing a Solr search based on user preferences, so I will not be depending on the user to type abcdefg; that will be automatically generated based on user selections. The contents of the field do not contain spaces, and since I am creating the search parameters, case isn't important either.
Thanks, Brian Lamb

On Tue, May 31, 2011 at 9:44 AM, Erick Erickson erickerick...@gmail.com wrote: That'll work for your case, although be aware that string types aren't analyzed at all, so case matters, as do spaces, etc. What is the use case here? If you explain it a bit there might be better answers. Best Erick

On Fri, May 27, 2011 at 9:17 AM, Brian Lamb brian.l...@journalexperts.com wrote: For this, I ended up just changing it to string and using abcdefg* to match. That seems to work so far. Thanks, Brian Lamb On Wed, May
Re: Change default scoring formula
Hi Gaurav, not sure what your use case is (and if no sorting at all is ever required, are Solr/Lucene what you need?). You can certainly sort by a field (or more) in descending or ascending order by using the sort parameter. You can customize the scoring algorithm by overriding the DefaultSimilarity class, but first make sure that this is what you need, as most use cases can be implemented with the default similarity plus queries / filter queries / function queries, etc. Regards, Tomás

On Wed, Jun 1, 2011 at 4:02 PM, ngaurav2005 ngaurav2...@gmail.com wrote: Hi All, I need to change the default scoring formula of Solr. How should I hack the code to do so? Also, is there any way to stop Solr from doing its default scoring and sorting? Thanks, Gaurav
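As an illustrative sketch of the override route (the method signatures below are from the Lucene 2.9/3.x API that this generation of Solr uses; the class and package names are assumptions, not a drop-in implementation):

```java
package com.example;

import org.apache.lucene.search.DefaultSimilarity;

// Sketch: neutralize tf and idf so that neither term frequency nor term
// rarity influences the score. Registered in schema.xml with:
//   <similarity class="com.example.FlatSimilarity"/>
public class FlatSimilarity extends DefaultSimilarity {
    @Override
    public float idf(int docFreq, int numDocs) {
        return 1.0f; // every term weighted equally, regardless of rarity
    }

    @Override
    public float tf(float freq) {
        return freq > 0 ? 1.0f : 0.0f; // match / no-match only
    }
}
```

Compile it against the Lucene jar shipped with your Solr version and drop the resulting jar in Solr's lib directory.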
Re: Newbie question: how to deal with different # of search results per page due to pagination then grouping
There's no great way to do that. One approach would be using facets, but that will just get you the author names (as stored in fields), and not the documents under them. If you really only want to show the author names, facets could work. One issue with facets, though, is that Solr won't tell you the total number of facet values for your query, so it's tricky to provide next/prev paging through them.

There is also a 'field collapsing' feature that I think is not in a released Solr, but may be in the Solr repo. I'm not sure it will quite do what you want either, although it's related and worth a look: http://wiki.apache.org/solr/FieldCollapsing

Another vaguely related thing that is also not yet in a released Solr is a 'join' function. That could possibly be used to do what you want, although it'd be tricky too: https://issues.apache.org/jira/browse/SOLR-2272

Jonathan

On 6/1/2011 2:56 PM, beccax wrote: We've indexed a bunch of documents by different authors. Then for search results, we'd like to show the authors that have 1 or more documents matching the search keywords. ... How do we change it to somehow show the same number of authors (say 25) per page?
Searching using a PDF
Is it possible to do a search based on a PDF file? I know it's possible to update the index with a PDF, but can you do just a regular search with it? Thanks, Brian Lamb
Re: Debugging a Solr/Jetty Hung Process
First guess (and it really is just a guess) would be Java garbage collection taking over. There are some JVM parameters you can use to tune the GC process; especially if the machine is multi-core, making sure GC happens in a separate thread is helpful. But figuring out exactly what's going on requires confusing JVM debugging, at which I am no expert either.

On 6/1/2011 3:04 PM, Chris Cowan wrote: About once a day a Solr/Jetty process gets hung on my server, consuming 100% of one of the CPUs. Once this happens the server no longer responds to requests.
Re: best way to update custom fieldcache after index commit?
How are you implementing your custom cache? If you're defining it in the solrconfig, couldn't you implement the regenerator? See: http://wiki.apache.org/solr/SolrCaching#User.2BAC8-Generic_Caches Best Erick

On Wed, Jun 1, 2011 at 12:38 PM, oleole oleol...@gmail.com wrote: Hi, We use the Solr and Lucene FieldCache like this:

static DocTerms myfieldvalues = org.apache.lucene.search.FieldCache.DEFAULT.getTerms(reader, myField);

which is initialized at first use and stays in memory for fast retrieval of field values based on docID. The problem is that after an index/commit, the Lucene FieldCache is reloaded in the new searcher, but this static list needs to be updated as well. What is the best way to handle this? Basically we want to update these custom fieldcaches whenever there is a commit. The possible solutions I can think of: 1) manually call a request handler to clean up the custom caches after commit, which is a hack and ugly; 2) use some listener event (not sure whether I can use the newSearcher event listener in Solr); there also seems to be a Lucene ticket (https://issues.apache.org/jira/browse/LUCENE-2474, "Allow to plug in a Cache Eviction Listener to IndexReader to eagerly clean custom caches that use the IndexReader (getFieldCacheKey)"), though it's not clear to me how to use it. Any suggestions/comments are much appreciated. Thanks! oleole
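For reference, the user-cache-with-regenerator approach Erick points at looks roughly like this in solrconfig.xml (the cache name and regenerator class here are assumptions for illustration):

```xml
<!-- solrconfig.xml sketch: a generic user cache with a custom regenerator.
     When a commit opens a new searcher, autowarming invokes the regenerator,
     which can rebuild entries against the new IndexReader. -->
<cache name="myFieldValueCache"
       class="solr.LRUCache"
       size="4096"
       initialSize="1024"
       autowarmCount="1024"
       regenerator="com.example.MyCacheRegenerator"/>
```

This avoids the static-field problem entirely: the cache is tied to the searcher's lifecycle rather than to the classloader.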
Re: Limit data stored from fmap.content with Solr cell
If you can live with an across-the-board limit, you can set maxFieldLength in your solrconfig.xml file. Note that this is in terms rather than chars, though... Best Erick

On Wed, Jun 1, 2011 at 2:22 PM, Greg Georges greg.geor...@biztree.com wrote: Hello everyone, I have just gotten into extracting information from files with Solr Cell. Some of the files we are indexing are large and have a lot of content. I would like to limit the amount of data I index to a specified number of characters (for example, 300 chars), which I will use as a document preview.
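For reference, the setting lives in the index section of solrconfig.xml; the value below is illustrative:

```xml
<!-- solrconfig.xml: cap how many tokens get indexed per field.
     Note the unit is terms (tokens), not characters. -->
<indexDefaults>
  <maxFieldLength>10000</maxFieldLength>
</indexDefaults>
```

Because the cut-off is counted in tokens after analysis, the number of stored characters it corresponds to varies with the content.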
NRT facet search options comparison
Hi, I need to provide NRT search with faceting. I've been looking at the options out there and wondered if anyone could clarify some questions I have, and perhaps share your NRT experiences. The various NRT options:

1) Solr - Solr doesn't have NRT yet. What is the expected time frame for NRT? Is it a few months, or more like a year? How would Solr faceting work with NRT? My understanding is that faceting in Solr relies on caching, which doesn't go well with NRT updates. When NRT arrives, would facet performance take a huge drop because of this caching issue?

2) ElasticSearch - ES supports NRT, so that's great. Does anyone have experiences with ES that they could share? Does faceting work with NRT in ES? Are any Solr features missing in ES?

3) Solr-RA - I read on this list about Solr-RA, which has NRT support. Has anyone used it? Can you share your experiences? Again, I'm not sure whether faceting would work with Solr-RA NRT. Solr-RA is based on Solr, so faceting in Solr-RA relies on caching, I suppose. Does NRT affect facet performance?

4) Zoie plugin for Solr - Zoie is an NRT search library. I tried but couldn't get the Zoie plugin to work with Solr; I always got an error message about opening too many Searchers. Has anyone gotten this to work?

Any other options? Thanks Andy
Re: Searching using a PDF
I'm not quite sure what you mean by regular search. When you index a PDF (Presumably through Tika or Solr Cell) the text is indexed into your index and you can certainly search that. Additionally, there may be meta data indexed in specific fields (e.g. author, date modified, etc). But what does search based on a PDF file mean in your context? Best Erick On Wed, Jun 1, 2011 at 3:41 PM, Brian Lamb brian.l...@journalexperts.com wrote: Is it possible to do a search based on a PDF file? I know its possible to update the index with a PDF but can you do just a regular search with it? Thanks, Brian Lamb
Re: Change default scoring formula
Thanks Tomás. Well, I am sorting results by a function query. I don't want Solr to spend extra effort calculating the score for each document and eating up my CPU cycles. Also, I need to use an if-condition in the score calculation, which I emulated through the map function, but the map function does not accept a function as one of its values. This forces me to write my own scoring algorithm. Can you help me with the steps, or a link to any post which explains step by step how to override the default scoring algorithm (the DefaultSimilarity class)? Thanks in advance. Gaurav -- View this message in context: http://lucene.472066.n3.nabble.com/Change-default-scoring-formula-tp3012196p3012372.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Spellcheck Phrases
Tanner, I just entered SOLR-2571 to fix the float-parsing-bug that breaks thresholdTokenFrequency. Its just a 1-line code fix so I also included a patch that should cleanly apply to solr 3.1. See https://issues.apache.org/jira/browse/SOLR-2571 for info and patches. This parameter appears absent from the wiki. And as it has always been broken for me, I haven't tested it. However, my understanding it should be set as the minimum percentage of documents in which a term has to occur in order for it to appear in the spelling dictionary. For instance in the config below, a term would have to occur in at least 1% of the documents for it to be part of the spelling dictionary. This might be a good setting for long fields but for the short fields in my application, I was thinking of setting this to something like 1/1000 of 1% ... searchComponent name=spellcheck class=solr.SpellCheckComponent str name=queryAnalyzerFieldTypetext/str lst name=spellchecker str name=namespellchecker/str str name=fieldSpelling_Dictionary/str str name=fieldTypetext/str str name=spellcheckIndexDir./spellchecker/str str name=thresholdTokenFrequency.01/str /lst /searchComponent James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311 -Original Message- From: Tanner Postert [mailto:tanner.post...@gmail.com] Sent: Friday, May 27, 2011 6:04 PM To: solr-user@lucene.apache.org Subject: Re: Spellcheck Phrases are there any updates on this? any third party apps that can make this work as expected? On Wed, Feb 23, 2011 at 12:38 PM, Dyer, James james.d...@ingrambook.comwrote: Tanner, Currently Solr will only make suggestions for words that are not in the dictionary, unless you specifiy spellcheck.onlyMorePopular=true. However, if you do that, then it will try to improve every word in your query, even the ones that are spelled correctly (so while it might change brake to break it might also change leg to log.) 
You might be able to alleviate some of the pain by setting the thresholdTokenFrequency so as to remove misspelled and rarely-used words from your dictionary, although I personally haven't been able to get this parameter to work. It also doesn't seem to be documented on the wiki, but it is in the 1.4.1 source code, in the class IndexBasedSpellChecker. It's also mentioned in Smiley & Pugh's book. I tried setting it like this, but got a ClassCastException on the float value:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">text_spelling</str>
  <lst name="spellchecker">
    <str name="name">spellchecker</str>
    <str name="field">Spelling_Dictionary</str>
    <str name="fieldType">text_spelling</str>
    <str name="buildOnOptimize">true</str>
    <str name="thresholdTokenFrequency">.001</str>
  </lst>
</searchComponent>

I have it on my to-do list to look into this further but haven't yet. If you decide to try it and can get it to work, please let me know how you do it. James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311 -Original Message- From: Tanner Postert [mailto:tanner.post...@gmail.com] Sent: Wednesday, February 23, 2011 12:53 PM To: solr-user@lucene.apache.org Subject: Spellcheck Phrases right now when I search for 'brake a leg', solr returns valid results with no indication of misspelling, which is understandable since all of those terms are valid words and are probably found in a few pieces of our content. My question is: is there any way for it to recognize that the phrase should be "break a leg" and not "brake a leg" and suggest the proper phrase?
RE: Newbie question: how to deal with different # of search results per page due to pagination then grouping
Don't manually group by author from your results, the list will always be incomplete... use faceting instead to show the authors of the books you have found in your search. http://wiki.apache.org/solr/SolrFacetingOverview -Original Message- From: beccax [mailto:bec...@gmail.com] Sent: Wednesday, June 01, 2011 11:56 AM To: solr-user@lucene.apache.org Subject: Newbie question: how to deal with different # of search results per page due to pagination then grouping Apologize if this question has already been raised. I tried searching but couldn't find the relevant posts. We've indexed a bunch of documents by different authors. Then for search results, we'd like to show the authors that have 1 or more documents matching the search keywords. The problem is right now our solr search method first paginates results to 100 documents per page, then we take the results and group by authors. This results in different number of authors per page. (Some authors may only have one matching document and others 5 or 10.) How do we change it to somehow show the same number of authors (say 25) per page? I mean alternatively we could just show all the documents themselves ordered by author, but it's not the user experience we're looking for. Thanks so much. And please let me know if you need more details not provided here. B -- View this message in context: http://lucene.472066.n3.nabble.com/Newbie-question-how-to-deal-with-diff erent-of-search-results-per-page-due-to-pagination-then-grouping-tp30121 68p3012168.html Sent from the Solr - User mailing list archive at Nabble.com.
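For reference, the faceting approach amounts to a request along these lines (a sketch only; the field name `author`, the core URL, and the parameter values are illustrative assumptions, not from the thread):

```
http://localhost:8983/solr/select?q=YOUR_KEYWORDS&rows=0
    &facet=true&facet.field=author&facet.limit=25&facet.mincount=1
```

Setting facet.mincount=1 keeps authors with zero matching documents out of the list, and rows=0 skips fetching the documents themselves when you only want the author counts.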
RE: Newbie question: how to deal with different # of search results per page due to pagination then grouping
I think facet.offset allows facet paging nicely by letting you index into the list of facet values. It is working for me... http://wiki.apache.org/solr/SimpleFacetParameters#facet.offset -Original Message- From: Jonathan Rochkind [mailto:rochk...@jhu.edu] Sent: Wednesday, June 01, 2011 12:41 PM To: solr-user@lucene.apache.org Subject: Re: Newbie question: how to deal with different # of search results per page due to pagination then grouping There's no great way to do that. One approach would be using facets, but that will just get you the author names (as stored in fields), and not the documents under it. If you really only want to show the author names, facets could work. One issue with facets though is Solr won't tell you the total number of facet values for your query, so it's tricky to provide next/prev paging through them. There is also a 'field collapsing' feature that I think is not in a released Solr, but may be in the Solr repo. I'm not sure it will quite do what you want either though, although it's related and worth a look. http://wiki.apache.org/solr/FieldCollapsing Another vaguely related thing that is also not yet in a released Solr, is a 'join' function. That could possibly be used to do what you want, although it'd be tricky too. https://issues.apache.org/jira/browse/SOLR-2272 Jonathan On 6/1/2011 2:56 PM, beccax wrote: Apologize if this question has already been raised. I tried searching but couldn't find the relevant posts. We've indexed a bunch of documents by different authors. Then for search results, we'd like to show the authors that have 1 or more documents matching the search keywords. The problem is right now our solr search method first paginates results to 100 documents per page, then we take the results and group by authors. This results in different number of authors per page. (Some authors may only have one matching document and others 5 or 10.) How do we change it to somehow show the same number of authors (say 25) per page? 
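Concretely, facet.offset combined with facet.limit slices the facet value list into pages (a sketch; the `author` field name is assumed):

```
# page 1 of authors
...&facet=true&facet.field=author&facet.limit=25&facet.offset=0
# page 2 of authors
...&facet=true&facet.field=author&facet.limit=25&facet.offset=25
```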
I mean alternatively we could just show all the documents themselves ordered by author, but it's not the user experience we're looking for. Thanks so much. And please let me know if you need more details not provided here. B -- View this message in context: http://lucene.472066.n3.nabble.com/Newbie-question-how-to-deal-with-diff erent-of-search-results-per-page-due-to-pagination-then-grouping-tp30121 68p3012168.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Newbie question: how to deal with different # of search results per page due to pagination then grouping
How do you know whether to provide a 'next' button, or whether you are at the end of your facet list? On 6/1/2011 4:47 PM, Robert Petersen wrote: I think facet.offset allows facet paging nicely by letting you index into the list of facet values. It is working for me... http://wiki.apache.org/solr/SimpleFacetParameters#facet.offset -Original Message- From: Jonathan Rochkind [mailto:rochk...@jhu.edu] Sent: Wednesday, June 01, 2011 12:41 PM To: solr-user@lucene.apache.org Subject: Re: Newbie question: how to deal with different # of search results per page due to pagination then grouping There's no great way to do that. One approach would be using facets, but that will just get you the author names (as stored in fields), and not the documents under it. If you really only want to show the author names, facets could work. One issue with facets though is Solr won't tell you the total number of facet values for your query, so it's tricky to provide next/prev paging through them. There is also a 'field collapsing' feature that I think is not in a released Solr, but may be in the Solr repo. I'm not sure it will quite do what you want either though, although it's related and worth a look. http://wiki.apache.org/solr/FieldCollapsing Another vaguely related thing that is also not yet in a released Solr, is a 'join' function. That could possibly be used to do what you want, although it'd be tricky too. https://issues.apache.org/jira/browse/SOLR-2272 Jonathan On 6/1/2011 2:56 PM, beccax wrote: Apologize if this question has already been raised. I tried searching but couldn't find the relevant posts. We've indexed a bunch of documents by different authors. Then for search results, we'd like to show the authors that have 1 or more documents matching the search keywords. The problem is right now our solr search method first paginates results to 100 documents per page, then we take the results and group by authors. This results in different number of authors per page.
(Some authors may only have one matching document and others 5 or 10.) How do we change it to somehow show the same number of authors (say 25) per page? I mean alternatively we could just show all the documents themselves ordered by author, but it's not the user experience we're looking for. Thanks so much. And please let me know if you need more details not provided here. B -- View this message in context: http://lucene.472066.n3.nabble.com/Newbie-question-how-to-deal-with-diff erent-of-search-results-per-page-due-to-pagination-then-grouping-tp30121 68p3012168.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Newbie question: how to deal with different # of search results per page due to pagination then grouping
Yes, that is exactly the issue... we're thinking we might just always show a next button, and if you go too far you simply get zero results. The user gets what the user asks for, and could simply back up, if desired, to where the facet still has values. You could also detect an empty facet result on the front end. Another option is to expand one facet at a time and page only the facet pane, not the whole page, using an ajax call. -Original Message- From: Jonathan Rochkind [mailto:rochk...@jhu.edu] Sent: Wednesday, June 01, 2011 2:30 PM To: solr-user@lucene.apache.org Cc: Robert Petersen Subject: Re: Newbie question: how to deal with different # of search results per page due to pagination then grouping How do you know whether to provide a 'next' button, or whether you are at the end of your facet list? On 6/1/2011 4:47 PM, Robert Petersen wrote: I think facet.offset allows facet paging nicely by letting you index into the list of facet values. It is working for me... http://wiki.apache.org/solr/SimpleFacetParameters#facet.offset -Original Message- From: Jonathan Rochkind [mailto:rochk...@jhu.edu] Sent: Wednesday, June 01, 2011 12:41 PM To: solr-user@lucene.apache.org Subject: Re: Newbie question: how to deal with different # of search results per page due to pagination then grouping There's no great way to do that. One approach would be using facets, but that will just get you the author names (as stored in fields), and not the documents under it. If you really only want to show the author names, facets could work. One issue with facets though is Solr won't tell you the total number of facet values for your query, so it's tricky to provide next/prev paging through them. There is also a 'field collapsing' feature that I think is not in a released Solr, but may be in the Solr repo. I'm not sure it will quite do what you want either though, although it's related and worth a look.
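A common workaround for the missing facet total (not from the thread, just a standard sketch): request one more facet value than you display, and use the extra value only as a "has next page" probe. A minimal illustration in Python, independent of any live Solr instance (`author` is an assumed field name):

```python
PAGE_SIZE = 25

def facet_page_params(page, page_size=PAGE_SIZE):
    """Build facet paging params, asking for one extra value to detect a next page."""
    return {
        "facet": "true",
        "facet.field": "author",
        "facet.limit": page_size + 1,       # one extra value as the "has next" probe
        "facet.offset": page * page_size,   # page is 0-based
    }

def split_page(facet_values, page_size=PAGE_SIZE):
    """Given the returned facet values, return (values to display, has_next)."""
    return facet_values[:page_size], len(facet_values) > page_size
```

If `split_page` reports `has_next` as False, you can hide the next button instead of letting the user page into an empty result.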
http://wiki.apache.org/solr/FieldCollapsing Another vaguely related thing that is also not yet in a released Solr, is a 'join' function. That could possibly be used to do what you want, although it'd be tricky too. https://issues.apache.org/jira/browse/SOLR-2272 Jonathan On 6/1/2011 2:56 PM, beccax wrote: Apologize if this question has already been raised. I tried searching but couldn't find the relevant posts. We've indexed a bunch of documents by different authors. Then for search results, we'd like to show the authors that have 1 or more documents matching the search keywords. The problem is right now our solr search method first paginates results to 100 documents per page, then we take the results and group by authors. This results in different number of authors per page. (Some authors may only have one matching document and others 5 or 10.) How do we change it to somehow show the same number of authors (say 25) per page? I mean alternatively we could just show all the documents themselves ordered by author, but it's not the user experience we're looking for. Thanks so much. And please let me know if you need more details not provided here. B -- View this message in context: http://lucene.472066.n3.nabble.com/Newbie-question-how-to-deal-with-diff erent-of-search-results-per-page-due-to-pagination-then-grouping-tp30121 68p3012168.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr memory consumption
Hey Denis, * How big is your index in terms of number of documents and index size? * Is it a production system with many search requests? * Is there any pattern to the OOM errors? I.e., right after you start your Solr app, after some search activity, after specific Solr queries, etc.? * What are 1) cache settings, 2) facet and sort-by fields, 3) commit frequency and warmup queries? etc. Generally you might want to connect to your JVM using the jconsole tool and monitor your heap usage (and other JVM/Solr numbers): * http://java.sun.com/developer/technicalArticles/J2SE/jconsole.html * http://wiki.apache.org/solr/SolrJmx#Remote_Connection_to_Solr_JMX HTH, Alexey 2011/6/1 Denis Kuzmenok forward...@ukr.net: There were no parameters at all, and java hit out-of-memory almost every day; then I tried to add parameters but nothing changed. Xms/Xmx did not solve the problem either. Now I am trying MaxPermSize, because it's the last thing I haven't tried yet :( Wednesday, June 1, 2011, 9:00:56 PM, you wrote: Could be related to your crazy high MaxPermSize like Marcus said. I'm no JVM tuning expert either. Few people are; it's confusing. So if you don't understand it either, why are you trying to throw in very non-standard parameters you don't understand? Just start with whatever the Solr example jetty has, and only change things if you have a reason to (that you understand). On 6/1/2011 1:19 PM, Denis Kuzmenok wrote: Overall memory on the server is 24G, plus 24G of swap; most of the time swap is free and not used at all, which is why "no free swap" sounds strange to me..
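For the jconsole route, the JVM needs JMX enabled at startup. A typical set of flags (standard JVM options, shown here for local testing only since authentication and SSL are disabled; the port number is arbitrary):

```
java -Dcom.sun.management.jmxremote \
     -Dcom.sun.management.jmxremote.port=9010 \
     -Dcom.sun.management.jmxremote.ssl=false \
     -Dcom.sun.management.jmxremote.authenticate=false \
     -jar start.jar
```

Then point jconsole at hostname:9010 and watch heap usage, GC activity, and permgen over time to see which space is actually running out.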
Re: DIH render html entities
Maybe HTMLStripTransformer is what you are looking for: * http://wiki.apache.org/solr/DataImportHandler#HTMLStripTransformer On Tue, May 31, 2011 at 5:35 PM, Erick Erickson erickerick...@gmail.com wrote: Convert them to what? Individual fields in your docs? Text? If the former, you might get some joy from the XPathEntityProcessor. If you want to just strip the markup and index all the content, you might get some joy from the various *html* analyzers listed here: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters Best Erick On Fri, May 27, 2011 at 5:19 AM, anass talby anass.ta...@gmail.com wrote: Sorry, my question was not clear. When I get data from the database, some fields contain HTML special chars, and what I want to do is just convert them automatically. On Fri, May 27, 2011 at 1:00 PM, Gora Mohanty g...@mimirtech.com wrote: On Fri, May 27, 2011 at 3:50 PM, anass talby anass.ta...@gmail.com wrote: Is there any way to render html entities in DIH for a specific field? [...] This does not make too much sense: what do you mean by rendering HTML entities? DIH just indexes, so where would it render HTML to, even if it could? Please take a look at http://wiki.apache.org/solr/UsingMailingLists Regards, Gora -- Anass
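If the goal is just stripping markup from a database column during import, the DIH transformer configuration looks roughly like this (a sketch; the entity name, SQL query, and column name are made up for illustration):

```xml
<entity name="item" transformer="HTMLStripTransformer"
        query="select id, body_html from items">
  <!-- stripHTML="true" tells the transformer to remove HTML tags from this column -->
  <field column="body_html" stripHTML="true"/>
</entity>
```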
Re: Better Spellcheck
I've tried to use a spellcheck dictionary built from my own content, but my content ends up having a lot of misspelled words, so the spellcheck ends up being less than effective. You can try the sp.dictionary.threshold parameter to solve this problem: * http://wiki.apache.org/solr/SpellCheckerRequestHandler#sp.dictionary.threshold It also misses phrases. When someone searches for "Untied States" I would hope the spellcheck would suggest "United States", but it just recognizes that "untied" is a valid word and doesn't suggest anything. So you are asking about an auto-suggest component and not spellcheck, right? These are two different use cases. If you want auto-suggest and you have some search logs for your system, then you can probably use the following solution: * http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/ If you don't have significant search log history and want to populate your auto-suggest dictionary from the index or some text file, you should check: * http://wiki.apache.org/solr/Suggester
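For the index-based Suggester linked above, the wiki-era configuration is roughly as follows (a sketch; the field name and threshold value are illustrative, not prescriptive):

```xml
<searchComponent name="suggest" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">suggest_field</str>
    <!-- drop terms that appear in fewer than 0.5% of documents,
         which filters out most one-off misspellings -->
    <float name="threshold">0.005</float>
  </lst>
</searchComponent>
```

The threshold here addresses exactly the "dictionary full of my own misspellings" complaint: rare terms never make it into the suggest dictionary.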
Re: Documents update
Will it be slow if there are 3-5 million key/value rows? AFAIK it shouldn't affect search time significantly, as Solr caches it in memory after you reload the Solr core / issue a commit. But obviously you need more memory, and commit/reload will take more time.
Re: NRT facet search options comparison
Hi Andy: Here is a white paper that shows screenshots of faceting working with Solr and RankingAlgorithm under NRT: http://solr-ra.tgels.com/wiki/en/Near_Real_Time_Search The implementation (src) is also available with the download and is described in the document below: http://solr-ra.tgels.com/papers/NRT_Solr_RankingAlgorithm.pdf The faceting test was done with the mbartists demo from the book Solr 1.4 Enterprise Search Server and is approximately 390k docs. Regards, - Nagendra Nagarajayya http://solr-ra.tgels.com http://rankingalgorithm.tgels.com On 6/1/2011 12:52 PM, Andy wrote: Hi, I need to provide NRT search with faceting. Been looking at the options out there. Wondered if anyone could clarify some questions I have and perhaps share your NRT experiences. The various NRT options: 1) Solr -Solr doesn't have NRT, yet. What is the expected time frame for NRT? Is it a few months or more like a year? -How would Solr faceting work with NRT? My understanding is that faceting in Solr relies on caching, which doesn't go well with NRT updates. When NRT arrives, would facet performance take a huge drop when using with NRT because of this caching issue? 2) ElasticSearch -ES supports NRT so that's great. Does anyone have experiences with ES that they could share? Does faceting work with NRT in ES? Any Solr features that are missing in ES? 3) Solr-RA -Read in this list about Solr-RA, which has NRT support. Has anyone used it? Can you share your experiences? -Again not sure if facet would work with Solr-RA NRT. Solr-RA is based on Solr, so faceting in Solr-RA relies on caching I suppose. Does NRT affect facet performance? 4) Zoie plugin for Solr -Zoie is a NRT search library. I tried but couldn't get the Zoie plugin to work with Solr. Always got the error message of opening too many Searchers. Has anyone got this to work? Any other options? Thanks Andy
Re: NRT facet search options comparison
Nagendra, Thanks. Can you comment on the performance impact of NRT on facet search? The pages you linked to don't really touch on that. My concern is that with NRT, the facet cache will be constantly invalidated. How will that impact the performance of faceting? Do you have any benchmark comparing the performance of facet search with and without NRT? Thanks Andy --- On Wed, 6/1/11, Nagendra Nagarajayya nnagaraja...@transaxtions.com wrote: From: Nagendra Nagarajayya nnagaraja...@transaxtions.com Subject: Re: NRT facet search options comparison To: solr-user@lucene.apache.org Date: Wednesday, June 1, 2011, 11:29 PM Hi Andy: Here is a white paper that shows screenshots of faceting working with Solr and RankingAlgorithm under NRT: http://solr-ra.tgels.com/wiki/en/Near_Real_Time_Search The implementation (src) is also available with the download and is described in the below document: http://solr-ra.tgels.com/papers/NRT_Solr_RankingAlgorithm.pdf The faceting test was done with the mbartists demo from the book, Solr-14-Enterprise-Search-Server and is approx around 390k docs. Regards, - Nagendra Nagarajayya http://solr-ra.tgels.com http://rankingalgorithm.tgels.com On 6/1/2011 12:52 PM, Andy wrote: Hi, I need to provide NRT search with faceting. Been looking at the options out there. Wondered if anyone could clarify some questions I have and perhaps share your NRT experiences. The various NRT options: 1) Solr -Solr doesn't have NRT, yet. What is the expected time frame for NRT? Is it a few months or more like a year? -How would Solr faceting work with NRT? My understanding is that faceting in Solr relies on caching, which doesn't go well with NRT updates. When NRT arrives, would facet performance take a huge drop when using with NRT because of this caching issue? 2) ElasticSearch -ES supports NRT so that's great. Does anyone have experiences with ES that they could share? Does faceting work with NRT in ES? Any Solr features that are missing in ES? 
3) Solr-RA -Read in this list about Solr-RA, which has NRT support. Has anyone used it? Can you share your experiences? -Again not sure if facet would work with Solr-RA NRT. Solr-RA is based on Solr, so faceting in Solr-RA relies on caching I suppose. Does NRT affect facet performance? 4) Zoie plugin for Solr -Zoie is a NRT search library. I tried but couldn't get the Zoie plugin to work with Solr. Always got the error message of opening too many Searchers. Has anyone got this to work? Any other options? Thanks Andy
How to do custom scoring using query parameters?
Hi All, We need to score documents based on some parameters received in the query string. This was not possible via a function query: we need an if condition, which can be emulated through the map function, but one of the output values of our if condition has to be a function, whereas map only accepts constants. So if I rephrase my requirements: 1. Calculate a score for each document using query parameters (search parameters). 2. Sort the documents based on that score. I know that I can change default scoring by overriding the DefaultSimilarity class, but how can this class receive the query parameters required for the score calculation? Also, once the score is calculated, how can I sort the results by it? Regards, Gaurav -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-do-custom-scoring-using-query-parameters-tp3013788p3013788.html Sent from the Solr - User mailing list archive at Nabble.com.
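For simpler variants of this requirement, Solr 3.1+ can sort directly by a function query that dereferences a request parameter, which avoids custom Similarity code entirely. A sketch (the parameter name `w` and field `popularity` are hypothetical, and this does not cover the if/map limitation described above):

```
select?q=*:*&w=2.5&sort=product($w,log(popularity)) desc
```

Here `$w` is substituted with the value of the `w` request parameter at query time, so the client can tune the weighting per request without any Java code.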
Re: Problem with caps and star symbol
It's working as I was looking for. Thanks, Mr. Erick. On Wed, Jun 1, 2011 at 8:29 PM, Erick Erickson erickerick...@gmail.com wrote: Take a look here: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory I think you want generateWordParts=1, catenateWords=1 and preserveOriginal=1, but check it out with the admin/analysis page. Oh, and your index-time and query-time patterns for WDFF will probably be different, see the example schema. Best Erick On Wed, Jun 1, 2011 at 7:40 AM, Saumitra Chowdhury saumi...@smartitengineering.com wrote: Thanks for your point. I was really tripping over that issue. But now I need a bit more help. As far as I have noticed, in the case of a value like *role_delete*, WordDelimiterFilterFactory indexes two words, *role* and *delete*, and a search with either term will match that document. Now, in the case of a value like *role_delete*, I want to index all four terms: [*role_delete, roledelete, role, delete*]. In total, both the original word and the words produced by WordDelimiterFilterFactory would be indexed. Is that possible? Can any additional filter combined with WordDelimiterFilterFactory do that, or can any other filter do such an operation? On Tue, May 31, 2011 at 8:07 PM, Erick Erickson erickerick...@gmail.com wrote: I think you're tripping over the issue that wildcards aren't analyzed, they don't go through your analysis chain. So the casing matters. Try lowercasing the input and I believe you'll see more like what you expect... Best Erick On Mon, May 30, 2011 at 12:07 AM, Saumitra Chowdhury saumi...@smartitengineering.com wrote: I am sending some XML to understand the scenario.
Indexed term = ROLE_DELETE, Search Term = roledelete

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">4</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">name : roledelete</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="creationDate">Mon May 30 13:09:14 BDST 2011</str>
      <str name="displayName">Global Role for Deletion</str>
      <str name="id">role:9223372036854775802</str>
      <str name="lastModifiedDate">Mon May 30 13:09:14 BDST 2011</str>
      <str name="name">ROLE_DELETE</str>
    </doc>
  </result>
</response>

Indexed term = ROLE_DELETE, Search Term = role

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">5</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">name : role</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="creationDate">Mon May 30 13:09:14 BDST 2011</str>
      <str name="displayName">Global Role for Deletion</str>
      <str name="id">role:9223372036854775802</str>
      <str name="lastModifiedDate">Mon May 30 13:09:14 BDST 2011</str>
      <str name="name">ROLE_DELETE</str>
    </doc>
  </result>
</response>

Indexed term = ROLE_DELETE, Search Term = role*

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">4</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">name : role*</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="creationDate">Mon May 30 13:09:14 BDST 2011</str>
      <str name="displayName">Global Role for Deletion</str>
      <str name="id">role:9223372036854775802</str>
      <str name="lastModifiedDate">Mon May 30 13:09:14 BDST 2011</str>
      <str name="name">ROLE_DELETE</str>
    </doc>
  </result>
</response>

Indexed term = ROLE_DELETE, Search Term = Role*

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">4</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">name : Role*</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
</response>
Indexed term = ROLE_DELETE, Search Term = ROLE_DELETE*

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">4</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">name : ROLE_DELETE*</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
</response>

I am also attaching an analysis HTML. On Mon, May 30, 2011 at 7:19 AM, Erick Erickson erickerick...@gmail.com wrote: I'd start by looking at the analysis page from the Solr admin page. That will give you an idea of the transformations the various steps carry out, it's invaluable! Best Erick On May 26, 2011 12:53 AM, Saumitra Chowdhury saumi...@smartitengineering.com wrote: Hi all, In my schema.xml I am using WordDelimiterFilterFactory, LowerCaseFilterFactory, StopFilterFactory for index
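The WordDelimiterFilterFactory setup Erick describes earlier in this thread (generateWordParts, catenateWords, preserveOriginal) could be sketched in schema.xml like this, so that role_delete yields the tokens role_delete, role, delete, and roledelete at index time (an illustration under assumed names, not the poster's actual schema):

```xml
<fieldType name="text_wdf" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- preserveOriginal keeps role_delete itself alongside the split/catenated parts -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- query side usually avoids catenation to prevent over-matching -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Remember that wildcard terms bypass this analysis chain entirely, which is why Role* found nothing in the responses above while role* matched: lowercase the input yourself before issuing wildcard queries.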