org.apache.http.ParseException: Invalid content type - Solr distributed search 4.10.4
Hi, when doing a distributed query from Solr 4.10.4 we get the exception below:

org.apache.solr.common.SolrException: org.apache.http.ParseException: Invalid content type:
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:311)
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
org.apache.solr.core.SolrCore.execute(SolrCore.java:1976)
ariba.arches.search.ArchesSearcher.invokeSearch(ArchesSearcher.java:306)
ariba.arches.search.ArchesSearcher.search(ArchesSearcher.java:169)
ariba.arches.search.SearchManagerServlet.handleSelect(SearchManagerServlet.java:651)
ariba.arches.search.SearchManagerServlet.service(SearchManagerServlet.java:146)
javax.servlet.http.HttpServlet.service(HttpServlet.java:848)

The query is below:
http://:20042/ /search/select?q=(*:*)=xml=5=SupplierID,MarketPrice=:20042 /search/select/executeS2-63,:20022/ search/select/execute/S1-69

In the code, the method below in SolrCore is used to execute the query:
execute(SolrRequestHandler handler, SolrQueryRequest req, SolrQueryResponse rsp)

Saw the same issue in https://lists.gt.net/lucene/java-dev/242650. If we test the distributed query in a standalone Solr as below, it works:
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:8984/solr&indent=true&q=ipod+solr

Any pointers to resolve this issue, please. Thank you, Raji
Re: Fwd: Issue with SOLR Distributed Search
On 12/18/2014 12:35 AM, rashi gandhi wrote: Also, as per our investigation currently there is work ongoing in SOLR community to support this concept of distributed/Global IDF. But, I wanted to know if there is any solution possible right now to manage/control the score of the documents during distributed search, so that the results seem more relevant. SOLR-1632 covers the distributed IDF issue. Plans right now are to include this in Solr 5.0 when it is released. https://issues.apache.org/jira/browse/SOLR-1632 The only way to have a reasonably accurate distributed score currently is to load your shards as evenly as possible. A good way to do this is to use the hash value of the uniqueKey field as the deciding factor for which shard gets the document. This is what SolrCloud does if you let it handle the routing. Thanks, Shawn
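A minimal sketch of the hash-based routing Shawn describes (not from the thread; SolrJ 3.6+/4.x's HttpSolrServer, the shard URLs, and the "id" uniqueKey field name are illustrative assumptions):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class HashRouter {
    private final HttpSolrServer[] shards;

    public HashRouter(String... shardUrls) {
        shards = new HttpSolrServer[shardUrls.length];
        for (int i = 0; i < shardUrls.length; i++) {
            shards[i] = new HttpSolrServer(shardUrls[i]);
        }
    }

    // Pick the shard from the uniqueKey hash so a given key always lands on the same
    // shard and documents spread evenly, which keeps per-shard IDF statistics comparable.
    public void add(SolrInputDocument doc) throws Exception {
        String id = (String) doc.getFieldValue("id");
        int shard = (id.hashCode() & Integer.MAX_VALUE) % shards.length;
        shards[shard].add(doc);
    }
}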
Fwd: Issue with SOLR Distributed Search
Hi, this is regarding the issue that we are facing with SOLR distributed search. In our application, we are managing multiple shards at the SOLR server to manage the load. But there is a problem with the order of results that we are going to return to the client during the search. For example: currently there are two shards on which data is randomly distributed. When I search something, it was observed that the results from one shard appear first and then the results from the other shard. Moreover, we are ordering results by applying two levels of sorting (also configurable per user): 1. Score 2. Modified Time. I investigated the above scenario and found that it is not necessary that documents coming from one shard will always have the same score as documents coming from the other shard, even if they are identical. I also went through the various SOLR documentation and links, and found that currently there is a limitation in distributed search in SOLR: inverse document frequency (IDF) calculations cannot be distributed, and TF/IDF computations are per shard. This issue is particularly visible when there is a significant difference between the number of documents indexed in each shard (for example, the first shard has 15000 docs and the second shard has 5000). Please review and let me know whether our findings for the above scenario are appropriate or not. Also, as per our investigation there is currently work ongoing in the SOLR community to support this concept of distributed/global IDF. But I wanted to know if there is any solution possible right now to manage/control the score of the documents during distributed search, so that the results seem more relevant. Thanks Rashi
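To see why identical documents can score differently per shard, here is a rough illustration using Lucene's DefaultSimilarity formula idf = 1 + ln(numDocs / (docFreq + 1)), computed from per-shard statistics. The shard sizes match the example above; the docFreq values are assumed for illustration:

public class PerShardIdf {
    // Lucene DefaultSimilarity: idf = 1 + ln(numDocs / (docFreq + 1)), per shard
    static double idf(long numDocs, long docFreq) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        System.out.println("shard1 (15000 docs, docFreq 300): idf = " + idf(15000, 300)); // ~4.91
        System.out.println("shard2 (5000 docs, docFreq 20):   idf = " + idf(5000, 20));   // ~6.47
        // The same term, and therefore the same document, gets a noticeably different
        // score depending on which shard happened to score it.
    }
}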
Re: Solr-Distributed search
Hi, Will this *shards* parameter also work in the near future with Solr 5? With Regards Aman Tandon On Thu, Jun 5, 2014 at 2:59 PM, Mahmoud Almokadem prog.mahm...@gmail.com wrote: Hi, you can search using this sample URL http://localhost:8080/solr/core1/select?q=*:*&shards=localhost:8080/solr/core1,localhost:8080/solr/core2,localhost:8080/solr/core3 Mahmoud Almokadem On Thu, Jun 5, 2014 at 8:13 AM, Anurag Verma vermanur...@gmail.com wrote: Hi, Can you please help me with Solr distributed search in multicore? I would be very happy as I am stuck here. In Java code how do I implement distributed search? -- Thanks Regards Anurag Verma
Re: Solr-Distributed search
On 6/6/2014 6:25 AM, Aman Tandon wrote: Does this *shards* parameter will also work in near future with solr 5? I am not aware of any plan to deprecate or remove the shards parameter. My personal experience is with versions from 1.4.0 through 4.7.2. It works in all of those versions. Without SolrCloud, the shards parameter is the only way you can do a distributed search. Thanks, Shawn
Re: Solr-Distributed search
Thanks Shawn. In my organisation we also want to implement SolrCloud, but the problem is that we are using the master-slave architecture: we do all indexing on the master, and the master's hardware is lower-spec than the slaves. So if we implement SolrCloud in such a way that the master will be the leader and the slaves will be the replicas, then in the case of high load can the leader bear it? I guess every query first goes to the leader and then it distributes the request, as I noticed from the logs and blogs :) Also, the master is in NY and the slaves are in Dallas, which might also cause a latency issue and would instead defeat our purpose of faster query responses. So I thought to use this shards parameter so that we query only the replicas, not the leader, so that the leader just works fine. But we were not sure about this shards parameter. What do you think? What should we do about the latency issue and the shards parameter? With Regards Aman Tandon On Fri, Jun 6, 2014 at 7:24 PM, Shawn Heisey s...@elyograg.org wrote: On 6/6/2014 6:25 AM, Aman Tandon wrote: Does this *shards* parameter will also work in near future with solr 5? I am not aware of any plan to deprecate or remove the shards parameter. My personal experience is with versions from 1.4.0 through 4.7.2. It works in all of those versions. Without SolrCloud, the shards parameter is the only way you can do a distributed search. Thanks, Shawn
Re: Solr-Distributed search
On 6/6/2014 8:31 AM, Aman Tandon wrote: In my organisation we also want to implement the solrcloud, but the problem is that, we are using the master-slave architecture and on master we do all indexing, architecture of master is lower than the slaves. So if we implement the solrcloud in a fashion that master will be the leader, and slaves will be the replicas then in that case, in the case of high load leader can bear it, I guess every query firstly goes to leader then it distributes the request as i noticed from the logs and blogs :) As well as master is in NY and slaves are in Dallas, which also might cause latency issue and it will instead fail our purpose of faster query response. So i thought to use this shards parameter so that we query only from the replicas not to the leader so that leader just work fine. But we were not sure about this shards parameter, what do you think? what should we do with latency issue and shards parameter. SolrCloud does not yet have any way to prefer one set of replicas over the others, so if you just send it requests, they would be sent to both Dallas and New York, affecting search latency. Local replica preference is a desperately needed feature. Old-style distributed search with the shards parameter, combined with master/slave replication, is an effective way to be absolutely sure which servers you are querying. I would actually recommend that you get rid of replication and have your index updating software update each copy of the index independently. This is how I do my Solr install. It opens up a whole new set of possibilities -- you can change the schema and/or config on one set of servers, or upgrade any component -- Solr, Java, etc., without affecting the other set of servers at all. One note: in order for the indexing paradigm I've outlined to be actually effective, you must separately track which inserts/updates/deletes have been done for each server set. If you don't do that, they can get out of sync when you restart a server. Also, if you don't do this, having a server down for an extended period of time might cause all indexing activity to stop on BOTH server sets. Thanks, Shawn
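A rough sketch of the independent-indexing idea Shawn outlines, with per-set progress tracking so a down set can catch up later (the host names and helper methods are assumptions for illustration, not Shawn's code):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class DualIndexer {
    // Two independent server sets; no replication between them.
    private final HttpSolrServer nySet = new HttpSolrServer("http://ny-solr:8080/solr/core1");
    private final HttpSolrServer dallasSet = new HttpSolrServer("http://dallas-solr:8080/solr/core1");

    public void index(SolrInputDocument doc) {
        // Each set gets its own try/catch and its own progress marker, so one set being
        // down does not block the other, and missed updates can be replayed later.
        try { nySet.add(doc); markDone("ny", doc); } catch (Exception e) { recordPending("ny", doc, e); }
        try { dallasSet.add(doc); markDone("dallas", doc); } catch (Exception e) { recordPending("dallas", doc, e); }
    }

    private void markDone(String set, SolrInputDocument doc) { /* persist per-set progress */ }
    private void recordPending(String set, SolrInputDocument doc, Exception e) { /* queue for later replay */ }
}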
Re: Solr-Distributed search
Thanks Shawn, I will try to think in that way too :) With Regards Aman Tandon
Solr-Distributed search
Hi, Can you please help me with Solr distributed search in multicore? I would be very happy as I am stuck here. In Java code how do I implement distributed search? -- Thanks Regards Anurag Verma
Re: Solr-Distributed search
Hi, you can search using this sample URL: http://localhost:8080/solr/core1/select?q=*:*&shards=localhost:8080/solr/core1,localhost:8080/solr/core2,localhost:8080/solr/core3 Mahmoud Almokadem On Thu, Jun 5, 2014 at 8:13 AM, Anurag Verma vermanur...@gmail.com wrote: Hi, Can you please help me with Solr distributed search in multicore? I would be very happy as I am stuck here. In Java code how do I implement distributed search? -- Thanks Regards Anurag Verma
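For the Java part of the question, a minimal SolrJ sketch (assuming SolrJ 3.6+/4.x and the same core names as in Mahmoud's URL) that sets the shards parameter on a query; note the shard entries omit the http:// scheme:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DistributedSearch {
    public static void main(String[] args) throws Exception {
        // Send the request to any one core; the shards parameter fans it out and merges results.
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8080/solr/core1");
        SolrQuery q = new SolrQuery("*:*");
        q.set("shards", "localhost:8080/solr/core1,localhost:8080/solr/core2,localhost:8080/solr/core3");
        QueryResponse rsp = solr.query(q);
        System.out.println("total hits across all cores: " + rsp.getResults().getNumFound());
    }
}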
Re: Solr Distributed Search vs Hadoop
Here is an example of schema design: a PDF file of 5 MB might have maybe 50 KB of actual text. The Solr ExtractingRequestHandler will find that text and index only that. If you set the field to stored=true, the 5 MB will be saved; if stored=false, the PDF is not saved. Instead, you would store a link to it. One problem with indexing is that Solr continually copies data into segments (index parts) while you index, so each 5 MB PDF might get copied 50 times during a full index job. If you can strip the index down to what you really want to search on, terabytes become gigabytes. Solr seems to handle 100-200 GB fine on modern hardware. Lance On Fri, Dec 23, 2011 at 1:54 AM, Nick Vincent n...@vtype.com wrote: For data of this size you may want to look at something like Apache Cassandra, which is made specifically to handle data at this kind of scale across many machines. You can still use Hadoop to analyse and transform the data in a performant manner, however it's probably best to do some research on this on the relevant technical forums for those technologies. Nick -- Lance Norskog goks...@gmail.com
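A sketch of the indexing pattern Lance describes: index only the extracted text (indexed, not stored) and store a small link field instead of the PDF itself. The field names and the extractText() helper are assumptions for illustration:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class LinkOnlyIndexer {
    public static void index(HttpSolrServer solr, String id, String pdfUrl, byte[] pdfBytes) throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("text", extractText(pdfBytes)); // schema: indexed="true" stored="false"
        doc.addField("url", pdfUrl);                 // small stored field pointing at the original PDF
        solr.add(doc);                               // only ~50 KB of text is written, not the 5 MB binary
    }

    // Hypothetical helper; in practice Tika or the ExtractingRequestHandler does the extraction.
    private static String extractText(byte[] pdfBytes) {
        return "";
    }
}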
Re: Solr Distributed Search vs Hadoop
This copying is a bit overstated here because of the way that small segments are merged into larger segments. Those larger segments are then copied much less often than the smaller ones. While you can wind up with lots of copying in certain extreme cases, it is quite rare. In particular, if you have one of the following cases, you won't see very many copies for any particular document: - you don't delete files one at a time (i.e. indexing only without updates or deletion) or - most documents that are going to be deleted are deleted as young documents or - the probability that any particular document will be deleted in a fixed period of time decreases exponentially with the age of the documents Any of these characteristics or many others will prevent a file from being copied very many times because as the document ages, it keeps company with similarly aged documents which are accordingly unlikely to have enough compatriots deleted to make their segment have a small number of live documents in it. Put another way, the intervals between merges that a particular document undergoes will become longer and longer as it ages and thus the total number of copies it can undergo cannot grow very fast. On Wed, Dec 28, 2011 at 7:53 PM, Lance Norskog goks...@gmail.com wrote: ... One problem with indexing is that Solr continally copies data into segments (index parts) while you index. So, each 5MB PDF might get copied 50 times during a full index job. If you can strip the index down to what you really want to search on, terabytes become gigabytes. Solr seems to handle 100g-200g fine on modern hardware.
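A back-of-envelope version of this argument (mergeFactor, flush size, and index size are assumed numbers): with geometric merging a document is rewritten roughly once per merge level, so the copy count grows logarithmically rather than reaching anything like 50:

public class MergeCopies {
    public static void main(String[] args) {
        double docs = 10000000.0;   // documents in the index (assumed)
        double flushSize = 1000.0;  // docs per freshly flushed segment (assumed)
        double mergeFactor = 10.0;  // segments merged at a time (assumed)
        // levels of merging a document passes through on its way into the largest segments
        double copies = Math.log(docs / flushSize) / Math.log(mergeFactor);
        System.out.printf("each doc is copied roughly %.0f times during a full index build%n", copies); // ~4
    }
}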
Re: Solr Distributed Search vs Hadoop
For data of this size you may want to look at something like Apache Cassandra, which is made specifically to handle data at this kind of scale across many machines. You can still use Hadoop to analyse and transform the data in a performant manner, however it's probably best to do some research on this on the relevant technical forums for those technologies. Nick
Solr Distributed Search vs Hadoop
Hi, I have a basic question, let's say we're going to have a very very huge set of data. In a way that for sure we will need many servers (tens or hundreds of servers). We will also need failover. Now the question is, if we should use Hadoop or using Solr Distributed Search with shards would be enough? I've read lots of articles like: http://www.lucidimagination.com/content/scaling-lucene-and-solr http://wiki.apache.org/solr/DistributedSearch But I'm still confused, Solr's distributed search seems to be able to handle splitting the queries and merging the result. So what's the point of using Hadoop? I'm pretty sure I'm missing something here. Can anyone suggest some links regarding this issue? Regards -- Alireza Salimi Java EE Developer
Re: Solr Distributed Search vs Hadoop
You didn't mention how big your data is or how you create it. Hadoop would mostly be used in the preparation of the data or the off-line creation of indexes.
Re: Solr Distributed Search vs Hadoop
Well, actually we haven't started the actual project yet. But it will probably have to handle the data of millions of users, and a rough estimate for each user's data would be something around 5 MB. The other problem is that this data will change very often. I hope I answered your question. Thanks On Tue, Dec 20, 2011 at 4:00 PM, Ted Dunning ted.dunn...@gmail.com wrote: You didn't mention how big your data is or how you create it. Hadoop would mostly be used in the preparation of the data or the off-line creation of indexes. -- Alireza Salimi Java EE Developer
Re: Solr Distributed Search vs Hadoop
Well, that begins to not look so much like a Solr/Lucene problem. The overall data is moderately large (TBs to tens of TBs) for Lucene, and the individual user profiles are distinctly large to be storing in Lucene. If there is part of the profile that you might want to search, that would be appropriate for Lucene. If you can split the user data into several components that are updated independently, then HBase might be appropriate, with different components in different column families. You aren't going to get a definitive answer on a mailing list, however. You are going to need somebody with a bit of experience to advise you directly and/or you are going to need to prototype test cases. On Tue, Dec 20, 2011 at 1:07 PM, Alireza Salimi alireza.sal...@gmail.com wrote: Well, actually we haven't started the actual project yet. But it will probably have to handle the data of millions of users, and a rough estimate for each user's data would be something around 5 MB. The other problem is that this data will change very often. I hope I answered your question. Thanks -- Alireza Salimi Java EE Developer
Re: Huge Performance: Solr distributed search
Interesting info. You should look into using Solid State Drives. I moved my search engine to SSD and saw dramatic improvements.
Re: Huge Performance: Solr distributed search
Hi all again. Thanks to all for your replies. Over the weekend I made some interesting tests, and I would like to share them with you.

First of all I made a speed test of my HDD:
root@LSolr:~# hdparm -t /dev/sda9
/dev/sda9: Timing buffered disk reads: 146 MB in 3.01 seconds = 48.54 MB/sec

Then with iperf I tested my network:
[ 4] 0.0-18.7 sec 2.00 GBytes 917 Mbits/sec

Then I tried to post my queries using the shards parameter with one shard, so my queries were like:
http://localhost:8080/solr1/select/?q=(test)&qt=requestShards
where requestShards is:
<requestHandler name="requestShards" class="solr.SearchHandler" default="false">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="shards">127.0.0.1:8080/solr1</str>
  </lst>
</requestHandler>

Maybe it's not correct, but:
INFO: [] webapp=/solr1 path=/select/ params={fl=*,score&ident=true&start=0&q=(genuflections)&qt=requestShards&rows=2000} status=0 QTime=6525
INFO: [] webapp=/solr1 path=/select/ params={fl=*,score&ident=true&start=0&q=(tunefulness)&qt=requestShards&rows=2000} status=0 QTime=20170
INFO: [] webapp=/solr1 path=/select/ params={fl=*,score&ident=true&start=0&q=(societal)&qt=requestShards&rows=2000} status=0 QTime=44958
INFO: [] webapp=/solr1 path=/select/ params={fl=*,score&ident=true&start=0&q=(euchre's)&qt=requestShards&rows=2000} status=0 QTime=32161
INFO: [] webapp=/solr1 path=/select/ params={fl=*,score&ident=true&start=0&q=(monogram's)&qt=requestShards&rows=2000} status=0 QTime=85252

When I posted similar queries directly to solr1 without requestShards I had:
INFO: [] webapp=/solr1 path=/select/ params={fl=*,score&ident=true&start=0&q=(reopening)&rows=2000} hits=712 status=0 QTime=10
INFO: [] webapp=/solr1 path=/select/ params={fl=*,score&ident=true&start=0&q=(housemothers)&rows=2000} hits=0 status=0 QTime=446
INFO: [] webapp=/solr1 path=/select/ params={fl=*,score&ident=true&start=0&q=(harpooners)&rows=2000} hits=76 status=0 QTime=399
INFO: [] webapp=/solr1 path=/select/ params={fl=*,score&ident=true&start=0&q=(coaxing)&rows=2000} hits=562 status=0 QTime=2820
INFO: [] webapp=/solr1 path=/select/ params={fl=*,score&ident=true&start=0&q=(superstar's)&rows=2000} hits=4748 status=0 QTime=672
INFO: [] webapp=/solr1 path=/select/ params={fl=*,score&ident=true&start=0&q=(sedateness's)&rows=2000} hits=136 status=0 QTime=923
INFO: [] webapp=/solr1 path=/select/ params={fl=*,score&ident=true&start=0&q=(petrolatum)&rows=2000} hits=8 status=0 QTime=6183
INFO: [] webapp=/solr1 path=/select/ params={fl=*,score&ident=true&start=0&q=(everlasting's)&rows=2000} hits=1522 status=0 QTime=2625

And finally I found a bug: https://issues.apache.org/jira/browse/SOLR-1524 Why is there no activity on it? Is it not actual?

Today I wrote a bash script:
#!/bin/bash
ds=$(date +%s.%N)
echo "START: $ds" >> ./data/east_2000
curl "http://127.0.0.1:8080/solr1/select/?fl=*,score&ident=true&start=0&q=(east)&rows=2000" -s -H 'Content-type:text/xml; charset=utf-8' >> ./data/east_2000
de=$(date +%s.%N)
ddf=$(echo "$de - $ds" | bc)
echo "END: $de" >> ./data/east_2000
echo "DIFF: $ddf" >> ./data/east_2000

Before running Tomcat I dropped the cache:
root@LSolr:~# echo 3 > /proc/sys/vm/drop_caches

Then I started Tomcat and ran the script. The result is below:
START: 1322476131.783146691
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">125</int><lst name="params"><str name="fl">*,score</str><str name="ident">true</str><str name="start">0</str><str name="q">(east)</str><str name="rows">2000</str></lst></lst><result name="response" numFound="21439" start="0" maxScore="4.387605"> ... </response>
END: 1322476180.262770244
DIFF: 48.479623553

The file size is:
root@LSolr:~# ls -l | grep east
-rw-r--r-- 1 root root 1063579 Nov 28 12:29 east_2000

I'm using nmon to monitor HDD activity. It was near 100% when I ran the script. But when I tried to run it again the result was:
DIFF: .063678709
and not much HDD activity in nmon.

I can't understand one thing: is it my hardware, such as a slow HDD, or is it a Solr problem? And why has there been no activity on bug https://issues.apache.org/jira/browse/SOLR-1524 since 27/Oct/09 07:19?

On 11/25/2011 10:02 AM, Dmitry Kan wrote: 45 000 000 per shard approx, Tomcat, caching was tweaked in solrconfig and shard given 12GB of RAM max.
Re: Huge Performance: Solr distributed search
Problem has been resolved. My disk subsystem was the bottleneck for quick search. I put my indexes in RAM and I see very nice QTimes :) Sorry for your time, guys.
Re: Huge Performance: Solr distributed search
45 000 000 per shard approx, Tomcat, caching was tweaked in solrconfig and the shard given 12GB of RAM max.

<!-- Filter Cache: Cache used by SolrIndexSearcher for filters (DocSets), unordered sets of *all* documents that match a query. When a new searcher is opened, its caches may be prepopulated or "autowarmed" using data from caches in the old searcher. autowarmCount is the number of items to prepopulate. For LRUCache, the autowarmed items will be the most recently accessed items. Parameters: class - the SolrCache implementation (LRUCache or FastLRUCache); size - the maximum number of entries in the cache; initialSize - the initial capacity (number of entries) of the cache (see java.util.HashMap); autowarmCount - the number of entries to prepopulate from an old cache. -->
<filterCache class="solr.FastLRUCache" size="1200" initialSize="1200" autowarmCount="128"/>

<!-- Query Result Cache: Caches results of searches - ordered lists of document ids (DocList) based on a query, a sort, and the range of documents requested. -->
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>

<!-- Document Cache: Caches Lucene Document objects (the stored fields for each document). Since Lucene internal document ids are transient, this cache will not be autowarmed. -->
<documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>

<!-- Field Value Cache: Cache used to hold field values that are quickly accessible by document id. The fieldValueCache is created by default even if not configured here. -->
<!-- <fieldValueCache class="solr.FastLRUCache" size="512" autowarmCount="128" showItems="32"/> -->

<!-- Custom Cache: Example of a generic cache. These caches may be accessed by name through SolrIndexSearcher.getCache(), cacheLookup(), and cacheInsert(). The purpose is to enable easy caching of user/application level data. The regenerator argument should be specified as an implementation of solr.CacheRegenerator if autowarming is desired. -->
<!-- <cache name="myUserCache" class="solr.LRUCache" size="4096" initialSize="1024" autowarmCount="1024" regenerator="com.mycompany.MyRegenerator"/> -->

<!-- Lazy Field Loading: If true, stored fields that are not requested will be loaded lazily. This can result in a significant speed improvement if the usual case is to not load all stored fields, especially if the skipped fields are large compressed text fields. -->
<enableLazyFieldLoading>true</enableLazyFieldLoading>

<!-- Use Filter For Sorted Query: A possible optimization that attempts to use a filter to satisfy a search. If the requested sort does not include score, then the filterCache will be checked for a filter matching the query. If found, the filter will be used as the source of document ids, and then the sort will be applied to that. For most situations, this will not be useful unless you frequently get the same search repeatedly with different sort options, and none of them ever use score. -->
<!-- <useFilterForSortedQuery>true</useFilterForSortedQuery> -->

<!-- Result Window Size: An optimization for use with the queryResultCache. When a search is requested, a superset of the requested number of document ids are collected. For example, if a search for a particular query requests matching documents 10 through 19, and queryWindowSize is 50, then documents 0 through 49 will be collected and cached. Any further requests in that range can be satisfied via the cache. -->
<queryResultWindowSize>50</queryResultWindowSize>

<!-- Maximum number of documents to cache for any entry in the queryResultCache. -->
<queryResultMaxDocsCached>200</queryResultMaxDocsCached>

In your case I would first check if the network throughput is a bottleneck. It would be nice if you could check the timestamps of completing a request on each of the shards and the arrival time (via some HTTP sniffer) at the frontend SOLR servers. Then you will see if it is the frontend taking so much time or whether it was a network issue. Are your shards, btw, well balanced?
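One way to do the per-shard timing check Dmitry suggests is to query each shard directly and compare the reported QTime with the wall-clock time seen by the client; a large gap points at the network or the aggregator. A sketch assuming SolrJ and the shard URLs mentioned in the thread:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ShardTiming {
    public static void main(String[] args) throws Exception {
        String[] shardUrls = {
            "http://192.168.1.85:8080/solr1", "http://192.168.1.85:8080/solr2" // ... all 30 shards
        };
        SolrQuery q = new SolrQuery("superstar");
        q.setRows(2000);
        for (String url : shardUrls) {
            HttpSolrServer shard = new HttpSolrServer(url);
            long start = System.currentTimeMillis();
            QueryResponse rsp = shard.query(q);
            long wall = System.currentTimeMillis() - start;
            // QTime is the shard-side processing time; wall includes network and response transfer.
            System.out.println(url + " QTime=" + rsp.getQTime() + "ms wall=" + wall + "ms");
        }
    }
}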
Re: Huge Performance: Solr distributed search
On 11/25/2011 3:13 AM, Mark Miller wrote: When you search each shard, are you positive that you are using all of the same parameters? You are sure you are hitting request handlers that are configured exactly the same and sending exactly the same queries? In my experience, the overhead for distrib search is usually very low. What types of queries are you trying? I'm using simple queries like this:
http://192.168.1.90:9090/solr/select/?fl=*,score&start=0&q=(superstar)&qt=requestShards&rows=2000
The requestShards handler is defined as:
<requestHandler name="requestShards" class="solr.SearchHandler" default="false">
  <lst name="defaults">
    <str name="shards">192.168.1.85:8080/solr1,192.168.1.85:8080/solr2,...,192.168.1.85:8080/solr6, 192.168.1.86:8080/solr7,192.168.1.86:8080/solr8,...,192.168.1.86:8080/solr12, ..., 192.168.1.89:8080/solr25,192.168.1.89:8080/solr26,...,192.168.1.89:8080/solr30</str>
    <int name="rows">10</int>
  </lst>
</requestHandler>
-- Best regards, Artem Lokotosh mailto:arco...@gmail.com
Re: Huge Performance: Solr distributed search
In general terms, when your Java heap is so large, it is beneficial to set -Xmx and -Xms to the same size.
Re: Huge Performance: Solr distributed search
Can you merge, e.g. 3 shards together or is it much effort for your team? Yes, we can merge. We'll try to do this and review how it works. Merging does not help :( I've tried to merge two shards into one, and three shards into one, but the results are similar to those of the first configuration with 30 shards. This solution also has one big minus: the optimization process may take more time. In our setup we currently have 16 shards with ~30GB each, but we rarely search in all of them at once. How many documents per shard are in your setup? Any difference between Tomcat, Jetty or other? Have you configured your servlet container more specifically than the default configuration? -- Best regards, Artem Lokotosh mailto:arco...@gmail.com
Re: Huge Performance: Solr distributed search
How big are the documents you return (how many fields, avg KB per doc, etc.)? I have the following schema in my Solr configuration:
<fields>
  <field name="field1" type="text" indexed="true" stored="false"/>
  <field name="field2" type="text" indexed="true" stored="true"/>
  <field name="field3" type="text" indexed="true" stored="true"/>
  <field name="field4" type="tlong" indexed="true" stored="true"/>
  <field name="field5" type="tdate" indexed="true" stored="true"/>
  <field name="field6" type="text" indexed="true" stored="true"/>
  <field name="field7" type="text" indexed="true" stored="true"/>
  <field name="field8" type="tlong" indexed="true" stored="true"/>
  <field name="field9" type="text" indexed="true" stored="true"/>
  <field name="field10" type="tdate" indexed="true" stored="true"/>
  <field name="field11" type="text" indexed="true" stored="true"/>
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
</fields>
27M–30M docs and 12-15 GB for each shard, 0.5 KB per doc.
Does performance get much better if you only request the top 100, or top 10 documents instead of the top 1000?
rows         |   10 |   100 |  1000 |   2000
-------------|------|-------|-------|-------
MIN          |  124 |   146 |   237 |    747
AVG          |  832 |  4666 | 16130 |  72542
MAX          | 3602 | 30197 | 57339 | 159482
QUERIES/5MIN |   75 |    73 |    49 |     51
What if you only request a couple of fields, instead of fl=*? What if you only search 10 shards instead of 30? Results are similar to the table above; btw, I need to receive all fields from the shards.
Another problem: I use solrmeter or a simple bash script to check the search speed. I get QTime from 16K to 24K for the first ~20 queries, from 50K to 100K for the next ~20 queries, and so on until the servlet goes down.
On Wed, Nov 23, 2011 at 5:55 PM, Robert Stewart bstewart...@gmail.com wrote: If you request 1000 docs from each shard, then the aggregator is really fetching 30,000 total documents, which it must then merge (re-sort results, and take the top 1000 to return to the client). It's possible that SOLR's merging implementation needs to be optimized, but it does not seem like it could be that slow. How big are the documents you return (how many fields, avg KB per doc, etc.)? I would take a look at the network to make sure that is not some bottleneck, and also to make sure there is not some underlying issue making 30 concurrent HTTP requests from the aggregator. I am not an expert in Java, but under .NET there is a setting that limits concurrent out-going HTTP requests from a process that must be over-ridden via configuration, otherwise by default it is very limiting. Does performance get much better if you only request top 100, or top 10 documents instead of top 1000? What if you only request a couple fields, instead of fl=*? What if you only search 10 shards instead of 30? I would collect those numbers and try to determine if time increases linearly or not as you increase shards and/or # of docs.
On Wed, Nov 23, 2011 at 9:55 AM, Artem Lokotosh arco...@gmail.com wrote: If the response time from each shard shows decent figures, then the aggregator seems to be a bottleneck. Do you btw have a lot of concurrent users? For now it is not a problem, but we expect from 1K to 10K concurrent users and maybe more. On Wed, Nov 23, 2011 at 4:43 PM, Dmitry Kan dmitry@gmail.com wrote: If the response time from each shard shows decent figures, then the aggregator seems to be a bottleneck. Do you btw have a lot of concurrent users? -- Best regards, Artem Lokotosh mailto:arco...@gmail.com
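A sketch of the experiment Robert Stewart suggests, varying rows and the field list against the aggregator to see how much of the latency comes from merging 30 x rows documents (SolrJ assumed; the host and handler name are taken from the thread):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class RowsExperiment {
    public static void main(String[] args) throws Exception {
        HttpSolrServer aggregator = new HttpSolrServer("http://192.168.1.90:9090/solr");
        for (int rows : new int[]{10, 100, 1000, 2000}) {
            SolrQuery q = new SolrQuery("superstar");
            q.set("qt", "requestShards"); // handler that fans the query out to the 30 shards
            q.setRows(rows);
            q.setFields("id", "score");   // compare against fl=* to see how much is transfer/merge cost
            long qtime = aggregator.query(q).getQTime();
            System.out.println("rows=" + rows + " QTime=" + qtime + "ms");
        }
    }
}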
Re: Huge Performance: Solr distributed search
When you search each shard, are you positive that you are using all of the same parameters? You are sure you are hitting request handlers that are configured exactly the same and sending exactly the same queries? In my experience, the overhead for distrib search is usually very low. What types of queries are you trying? -- - Mark http://www.lucidimagination.com
Huge Performance: Solr distributed search
Hi! * Data: - Solr 3.4; - 30 shards ~ 13GB, 27-29M docs each shard. * Machine parameters (Ubuntu 10.04 LTS): user@Solr:~$ uname -a Linux Solr 2.6.32-31-server #61-Ubuntu SMP Fri Apr 8 19:44:42 UTC 2011 x86_64 GNU/Linux user@Solr:~$ cat /proc/cpuinfo processor : 0 - 3 vendor_id : GenuineIntel cpu family : 6 model : 44 model name : Intel(R) Xeon(R) CPU X5690 @ 3.47GHz stepping: 2 cpu MHz : 3458.000 cache size : 12288 KB fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology tsc_reliable nonstop_tsc aperfmperf pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt aes hypervisor lahf_lm ida arat bogomips: 6916.00 clflush size: 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: user@Solr:~$ cat /proc/meminfo MemTotal: 16992680 kB MemFree: 110424 kB Buffers:9976 kB Cached: 11588380 kB SwapCached:41952 kB Active: 9860764 kB Inactive:6198668 kB Active(anon):4062144 kB Inactive(anon): 398972 kB Active(file):5798620 kB Inactive(file): 5799696 kB Unevictable: 0 kB Mlocked: 0 kB SwapTotal: 46873592 kB SwapFree: 46810712 kB Dirty:36 kB Writeback: 0 kB AnonPages: 4424756 kB Mapped: 940660 kB Shmem:40 kB Slab: 362344 kB SReclaimable: 350372 kB SUnreclaim:11972 kB KernelStack:2488 kB PageTables:68568 kB NFS_Unstable: 0 kB Bounce:0 kB WritebackTmp: 0 kB CommitLimit:55369932 kB Committed_AS:5740556 kB VmallocTotal: 34359738367 kB VmallocUsed: 350532 kB VmallocChunk: 34359384964 kB HardwareCorrupted: 0 kB HugePages_Total: 0 HugePages_Free:0 HugePages_Rsvd:0 HugePages_Surp:0 Hugepagesize: 2048 kB DirectMap4k: 10240 kB DirectMap2M:17299456 kB - Apache Tomcat 6.0.32: !-- java arguments -- -XX:+DisableExplicitGC -XX:PermSize=512M -XX:MaxPermSize=512M -Xmx12G -Xms3G -XX:NewSize=128M -XX:MaxNewSize=128M -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:CMSInitiatingOccupancyFraction=50 -XX:GCTimeRatio=9 -XX:MinHeapFreeRatio=25 -XX:MaxHeapFreeRatio=25 -verbose:gc -XX:+PrintGCTimeStamps -Xloggc:/opt/search/tomcat/logs/gc.log Out search schema is: - 5 servers with configuration above; - one tomcat6 application on each server with 6 solr applications. - Full addresses are: 1) http://192.168.1.85:8080/solr1,http://192.168.1.85:8080/solr2,...,http://192.168.1.85:8080/solr6 2) http://192.168.1.86:8080/solr7,http://192.168.1.86:8080/solr8,...,http://192.168.1.86:8080/solr12 ... 5) http://192.168.1.89:8080/solr25,http://192.168.1.89:8080/solr26,...,http://192.168.1.89:8080/solr30 - At another server there is a additional common application with shards paramerter: requestHandler name=search class=solr.SearchHandler default=true lst name=defaults str name=echoParamsexplicit/str str name=shards192.168.1.85:8080/solr1,192.168.1.85:8080/solr2,...,192.168.1.89:8080/solr30/str int name=rows10/int /lst /requestHandler - schema and solrconfig are identical for all shards, for first shard see attach; - on these servers are only search, indexation is on another (optimized to 2 segments shards replicate with ssh/rsync scripts). So now the major problem is huge performance on distributed search. 
Take a look, for example, at these logs. This is on 30 shards:
INFO: [] webapp=/solr path=/select/ params={fl=*,score&ident=true&start=0&q=(barium)&rows=2000} status=0 QTime=40712
INFO: [] webapp=/solr path=/select/ params={fl=*,score&ident=true&start=0&q=(pittances)&rows=2000} status=0 QTime=36097
INFO: [] webapp=/solr path=/select/ params={fl=*,score&ident=true&start=0&q=(reliability)&rows=2000} status=0 QTime=75756
INFO: [] webapp=/solr path=/select/ params={fl=*,score&ident=true&start=0&q=(blessing's)&rows=2000} status=0 QTime=30342
INFO: [] webapp=/solr path=/select/ params={fl=*,score&ident=true&start=0&q=(reiterated)&rows=2000} status=0 QTime=55690
Sometimes QTime is more than 15. But when we run identical queries on one shard separately, QTime is between 200 and 1500. Is distributed Solr search really this slow, or is our architecture non-optimal? Or do we maybe need to use some third-party applications? Thanks for any replies. -- Best regards, Artem
Re: Huge Performance: Solr distributed search
Hello, Is this log from the frontend SOLR (aggregator) or from a shard? Can you merge, e.g. 3 shards together or is it much effort for your team? In our setup we currently have 16 shards with ~30GB each, but we rarely search in all of them at once. Best, Dmitry On Wed, Nov 23, 2011 at 3:12 PM, Artem Lokotosh arco...@gmail.com wrote: Hi! * Data: - Solr 3.4; - 30 shards ~ 13GB, 27-29M docs each shard. * Machine parameters (Ubuntu 10.04 LTS): user@Solr:~$ uname -a Linux Solr 2.6.32-31-server #61-Ubuntu SMP Fri Apr 8 19:44:42 UTC 2011 x86_64 GNU/Linux user@Solr:~$ cat /proc/cpuinfo processor : 0 - 3 vendor_id : GenuineIntel cpu family : 6 model : 44 model name : Intel(R) Xeon(R) CPU X5690 @ 3.47GHz stepping: 2 cpu MHz : 3458.000 cache size : 12288 KB fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology tsc_reliable nonstop_tsc aperfmperf pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt aes hypervisor lahf_lm ida arat bogomips: 6916.00 clflush size: 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: user@Solr:~$ cat /proc/meminfo MemTotal: 16992680 kB MemFree: 110424 kB Buffers:9976 kB Cached: 11588380 kB SwapCached:41952 kB Active: 9860764 kB Inactive:6198668 kB Active(anon):4062144 kB Inactive(anon): 398972 kB Active(file):5798620 kB Inactive(file): 5799696 kB Unevictable: 0 kB Mlocked: 0 kB SwapTotal: 46873592 kB SwapFree: 46810712 kB Dirty:36 kB Writeback: 0 kB AnonPages: 4424756 kB Mapped: 940660 kB Shmem:40 kB Slab: 362344 kB SReclaimable: 350372 kB SUnreclaim:11972 kB KernelStack:2488 kB PageTables:68568 kB NFS_Unstable: 0 kB Bounce:0 kB WritebackTmp: 0 kB CommitLimit:55369932 kB Committed_AS:5740556 kB VmallocTotal: 34359738367 kB VmallocUsed: 350532 kB VmallocChunk: 34359384964 kB HardwareCorrupted: 0 kB HugePages_Total: 0 HugePages_Free:0 HugePages_Rsvd:0 HugePages_Surp:0 Hugepagesize: 2048 kB DirectMap4k: 10240 kB DirectMap2M:17299456 kB - Apache Tomcat 6.0.32: !-- java arguments -- -XX:+DisableExplicitGC -XX:PermSize=512M -XX:MaxPermSize=512M -Xmx12G -Xms3G -XX:NewSize=128M -XX:MaxNewSize=128M -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:CMSInitiatingOccupancyFraction=50 -XX:GCTimeRatio=9 -XX:MinHeapFreeRatio=25 -XX:MaxHeapFreeRatio=25 -verbose:gc -XX:+PrintGCTimeStamps -Xloggc:/opt/search/tomcat/logs/gc.log Out search schema is: - 5 servers with configuration above; - one tomcat6 application on each server with 6 solr applications. - Full addresses are: 1) http://192.168.1.85:8080/solr1,http://192.168.1.85:8080/solr2,..., http://192.168.1.85:8080/solr6 2) http://192.168.1.86:8080/solr7,http://192.168.1.86:8080/solr8,..., http://192.168.1.86:8080/solr12 ... 5) http://192.168.1.89:8080/solr25,http://192.168.1.89:8080/solr26,..., http://192.168.1.89:8080/solr30 - At another server there is a additional common application with shards paramerter: requestHandler name=search class=solr.SearchHandler default=true lst name=defaults str name=echoParamsexplicit/str str name=shards192.168.1.85:8080/solr1,192.168.1.85:8080/solr2,..., 192.168.1.89:8080/solr30/str int name=rows10/int /lst /requestHandler - schema and solrconfig are identical for all shards, for first shard see attach; - on these servers are only search, indexation is on another (optimized to 2 segments shards replicate with ssh/rsync scripts). 
So now the major problem is the huge performance cost of distributed search. Take a look at, for example, these logs. This is on 30 shards:
INFO: [] webapp=/solr path=/select/ params={fl=*,score&ident=true&start=0&q=(barium)&rows=2000} status=0 QTime=40712
INFO: [] webapp=/solr path=/select/ params={fl=*,score&ident=true&start=0&q=(pittances)&rows=2000} status=0 QTime=36097
INFO: [] webapp=/solr path=/select/ params={fl=*,score&ident=true&start=0&q=(reliability)&rows=2000} status=0 QTime=75756
INFO: [] webapp=/solr path=/select/ params={fl=*,score&ident=true&start=0&q=(blessing's)&rows=2000} status=0 QTime=30342
INFO: [] webapp=/solr path=/select/ params={fl=*,score&ident=true&start=0&q=(reiterated)&rows=2000} status=0 QTime=55690
Sometimes QTime is more than 15. But when we run identical queries on one shard separately, QTime is between 200 and 1500. Is distributed solr search really this slow, or is it our architecture
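To narrow down where the time goes, one way to compare is to send the same query both to a single shard directly and to the aggregating application; the aggregator host below is a placeholder (it is not named in the post), while the shard address is taken from the setup above:

    http://192.168.1.85:8080/solr1/select?q=(barium)&fl=*,score&rows=2000
    http://<aggregator-host>:8080/solr/select?q=(barium)&fl=*,score&rows=2000

If the first call returns in a few hundred milliseconds and the second in tens of seconds, the extra time is being spent on the fan-out, on merging up to 30 x 2000 per-shard hits on the aggregator, or on transferring them over the network.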
Re: Huge Performance: Solr distributed search
Is this log from the frontend SOLR (aggregator) or from a shard?
From the aggregator.
Can you merge, e.g. 3 shards together or is it much effort for your team?
Yes, we can merge. We'll try to do this and review how it works. Thanks, Dmitry. Any other ideas?
On Wed, Nov 23, 2011 at 4:01 PM, Dmitry Kan dmitry@gmail.com wrote: [...]
Re: Huge Performance: Solr distributed search
If the response time from each shard shows decent figures, then the aggregator seems to be the bottleneck. Do you btw have a lot of concurrent users?
On Wed, Nov 23, 2011 at 4:38 PM, Artem Lokotosh arco...@gmail.com wrote: [...]
Re: Huge Performance: Solr distributed search
If the response time from each shard shows decent figures, then the aggregator seems to be the bottleneck. Do you btw have a lot of concurrent users?
For now it is not a problem, but we expect from 1K to 10K concurrent users, and maybe more.
On Wed, Nov 23, 2011 at 4:43 PM, Dmitry Kan dmitry@gmail.com wrote: [...]
-- Best regards, Artem Lokotosh mailto:arco...@gmail.com
Re: Huge Performance: Solr distributed search
If you request 1000 docs from each shard, then the aggregator is really fetching 30,000 total documents, which it must then merge (re-sort the results and take the top 1000 to return to the client). It's possible that SOLR's merging implementation needs optimizing, but it does not seem like it could be that slow. How big are the documents you return (how many fields, avg KB per doc, etc.)? I would take a look at the network to make sure that is not a bottleneck, and also make sure there is not some underlying issue with making 30 concurrent HTTP requests from the aggregator. I am not an expert in Java, but under .NET there is a setting that limits concurrent outgoing HTTP requests from a process and must be overridden via configuration, otherwise the default is very limiting. Does performance get much better if you only request the top 100, or top 10 documents instead of the top 1000? What if you only request a couple of fields, instead of fl=*? What if you only search 10 shards instead of 30? I would collect those numbers and try to determine whether the time increases linearly or not as you increase the number of shards and/or the number of docs.
On Wed, Nov 23, 2011 at 9:55 AM, Artem Lokotosh arco...@gmail.com wrote: [...]
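A sketch of the measurements suggested above, again with a placeholder aggregator host and placeholder field names; each variation changes one factor (rows, field list, number of shards) so their effects can be judged separately:

    http://<aggregator-host>:8080/solr/select?q=(barium)&fl=*,score&rows=10
    http://<aggregator-host>:8080/solr/select?q=(barium)&fl=*,score&rows=100
    http://<aggregator-host>:8080/solr/select?q=(barium)&fl=id,score&rows=1000
    http://<aggregator-host>:8080/solr/select?q=(barium)&fl=id,score&rows=1000&shards=192.168.1.85:8080/solr1,192.168.1.85:8080/solr2

Passing an explicit shards parameter on the request overrides the default configured in solrconfig.xml, so a subset of shards can be measured without changing the configuration.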
About solr distributed search
Hi all, Now I'm doing research on solr distributed search, and it is said that distributed search becomes reasonable at more than one million documents. So I want to know: does anyone have test results (such as time cost) comparing a single index and distributed search with more than one million documents? I need the test results very urgently, thanks in advance! Best Regards, Pengkai
RE: About solr distributed search
I am no expert, but here is my take and our situation. Firstly, are you asking what the minimum number of documents is before it makes *any* sense at all to use a distributed search, or are you asking what the maximum number of documents is before a distributed search is essentially required? The answers would be different. I get the feeling you are asking the second question, so I'll proceed under that assumption. I expect that in part the answer is "it depends". I expect that it is mostly a function of the size of the index (and the interaction between that and memory and search performance), which depends on both the number of documents and how much is stored for the documents. It would also depend upon your update load. If the documents are small and/or the amount of stuff you store per document is small, then until the number of documents and/or updates gets truly enormous a single machine will probably be fine. But if your documents (the amount stored per document) are very large, then at some point the index files get so large that performance on a single machine isn't adequate. Alternatively, if your update load is very, very large, you might need to spread that load among multiple servers to handle it without crippling your ability to respond to queries. As for a specific instance, we have a single index of 7 million documents (going on 28 million), with maybe 512 bytes of data stored for each document and maybe 26 or so indexed fields (we have a *lot* of copyField operations in order to index the data the way we want it, yet preserve the original data to return), and did not need to use distributed search. JRJ
-Original Message- From: Pengkai Qin [mailto:qin19890...@163.com] Sent: Thursday, September 29, 2011 5:15 AM To: solr-user@lucene.apache.org; d...@lucene.apache.org Subject: About solr distributed search [...]
Re: About solr distributed search
Hi Pengkai, my experience is based on http://www.findfiles.net/ which holds 700 million documents, each about 2kb in size. A single index containing that kind of data should hold fewer than 80 million documents. In case you have complex queries with lots of facets, sorting, or function queries, then even 50 million documents per index could be your upper limit. On very fast hardware and with a warmed index you might deliver results on average within 1 second. For documents above 5kb in size those numbers might not necessarily be the same. Try to test with your documents by creating (NOT COPYING) and indexing them in vast numbers. After every 10 million documents, test the average response time with the caches switched off. If the average response time hits your threshold, then the number of documents in the index is your limit per index. Scaling up is no problem. AFAIK 20 to 50 indexes should be fine within a distributed production system. Kind Regards Gregor
On 09/29/2011 12:14 PM, Pengkai Qin wrote: [...]
-- How to find files on the Internet? FindFiles.net http://findfiles.net!
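For the cache-off measurements described above, one way (a sketch against the stock example solrconfig.xml, not the only option) is to comment out the cache sections so the searcher runs without them; the sizes shown are just the example defaults:

    <!-- caches disabled for raw response-time testing
    <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
    <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
    <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
    -->

The core has to be reloaded or restarted after editing solrconfig.xml for the change to take effect.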
Re: About solr distributed search
hi, I suggest you just set up an environment and test it yourself; one million documents is not a problem at all.
2011/9/30 秦鹏凯 qinpeng...@yahoo.cn: [...]
Re: solr distributed search don't work
<requestHandler name="MYREQUESTHANDLER" class="solr.SearchHandler">
  <!-- default values for query parameters -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="facet.method">enum</str>
    <str name="facet.mincount">1</str>
    <str name="facet.limit">10</str>
    <str name="shards">192.168.1.6/solr/,192.168.1.7/solr/</str>
  </lst>
</requestHandler>
2011/8/19 Li Li fancye...@gmail.com: could you please show me your configuration in solrconfig.xml? [...]
solr distributed search don't work
hi all, I followed the wiki http://wiki.apache.org/solr/SpellCheckComponent but there is something wrong. The url given by the wiki is http://solr:8983/solr/select?q=*:*&spellcheck=true&spellcheck.build=true&spellcheck.q=toyata&qt=spell&shards.qt=spell&shards=solr-shard1:8983/solr,solr-shard2:8983/solr but it does not work. I traced the code and found that qt=spell&shards.qt=spell should be qt=/spell&shards.qt=/spell. After modifying the url, it returns all documents but nothing about spell check. I debugged it and found that AbstractLuceneSpellChecker.getSuggestions() is called.
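For reference, the handler registration matters here: the value passed as qt / shards.qt has to match the name under which the spellcheck-aware handler is registered in solrconfig.xml. A typical registration, sketched along the lines of the wiki example (names and dictionary may differ in a given setup), looks like:

    <requestHandler name="/spell" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="spellcheck.dictionary">default</str>
      </lst>
      <arr name="last-components">
        <str>spellcheck</str>
      </arr>
    </requestHandler>

With a handler registered as "/spell", qt=/spell and shards.qt=/spell address it; if it were registered simply as "spell", qt=spell would be the matching form, which is what the posters go back and forth about in the replies below.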
Re: solr distributed search don't work
Hi, I do not use spell but I do use distributed search; using qt=spell is correct, you should not use qt=/spell. For shards, I specify it in solrconfig directly, not in the url, but it should work the same. Maybe there is an issue in your spell request handler.
2011/8/19 Li Li fancye...@gmail.com: [...]
Re: solr distributed search don't work
could you please show me your configuration in solrconfig.xml?
On Fri, Aug 19, 2011 at 5:31 PM, olivier sallou olivier.sal...@gmail.com wrote: [...]
Re: a bug of solr distributed search
On Tue, 2010-10-26 at 15:48 +0200, Ron Mayer wrote: And a third potential reason - it's arguably a feature instead of a bug for some applications. Depending on how I organize my shards, give me the most relevant document from each shard for this search seems like it could be useful. You can get that even if the shards scored equally, so it is a limitation, not a feature. I hope to find the time later this week to read some of the papers Andrzej was kind enough to point out, but it seems like I really need to do the heavy lifting of setting up comparisons for our own material. The problem is of course to judge the quality of the outputs, but setting the single index as the norm and plotting the differences in document positions in the result sets might provide some insight. Regards, Toke Eskildsen
Re: a bug of solr distributed search
Andrzej Bialecki wrote: On 2010-10-25 11:22, Toke Eskildsen wrote: On Thu, 2010-07-22 at 04:21 +0200, Li Li wrote: But itshows a problem of distrubted search without common idf. A doc will get different score in different shard. Bingo. I really don't understand why this fundamental problem with sharding isn't mentioned more often. Every time the advice use sharding is given, it should be followed with a but be aware that it will make relevance ranking unreliable. The reason is twofold, I think: And a third potential reason - it's arguably a feature instead of a bug for some applications. Depending on how I organize my shards, give me the most relevant document from each shard for this search seems like it could be useful. * there is an exact solution to this problem, namely to make two distributed calls instead of one (first call to collect per-shard IDFs for given query terms, second call to submit a query rewritten with the global IDF-s). This solution is implemented in SOLR-1632, with some caching to reduce the cost for common queries. However, this means that now for every query you need to make two calls instead of one, which potentially doubles the time to return results (for simple common queries - for rare complex queries the time will be still dominated by the query runtime on shard servers). * another reason is that in many many cases the difference between using exact global IDF and per-shard IDFs is not that significant. If shards are more or less homogenous (e.g. you assign documents to shards by hash(docId)) then term distributions will be also similar. So then the question is whether you can accept an N% variance in scores across shards, or whether you want to bear the cost of an additional distributed RPC for every query... To summarize, I would qualify your statement with: ...if the composition of your shards is drastically different. Otherwise the cost of using global IDF is not worth it, IMHO.
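A schematic of the two-call approach quoted above, as hedged illustration code rather than the actual SOLR-1632 implementation; Shard, TermStats, termStats() and query() are made-up placeholder names standing in for whatever transport the real patch uses:

    // Placeholder types, not Solr APIs.
    class TermStats { Map<String, Long> df; long maxDoc; }
    interface Shard {
        TermStats termStats(Set<String> queryTerms);
        void query(String q, Map<String, Long> globalDf, long globalMaxDoc);
    }

    // pass 1: gather per-shard document frequencies and doc counts
    Map<String, Long> globalDf = new HashMap<String, Long>();
    long globalMaxDoc = 0;
    for (Shard shard : shards) {
        TermStats stats = shard.termStats(queryTerms);
        for (Map.Entry<String, Long> e : stats.df.entrySet()) {
            Long prev = globalDf.get(e.getKey());
            globalDf.put(e.getKey(), prev == null ? e.getValue() : prev + e.getValue());
        }
        globalMaxDoc += stats.maxDoc;
    }
    // pass 2: re-issue the query with the merged (global) statistics attached,
    // so every shard scores with the same IDF values
    for (Shard shard : shards) {
        shard.query(q, globalDf, globalMaxDoc);
    }

The extra pass is exactly what doubles the number of distributed round trips per query, which is the cost weighed in this thread.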
Re: a bug of solr distributed search
On Thu, 2010-07-22 at 04:21 +0200, Li Li wrote: But itshows a problem of distrubted search without common idf. A doc will get different score in different shard. Bingo. I really don't understand why this fundamental problem with sharding isn't mentioned more often. Every time the advice use sharding is given, it should be followed with a but be aware that it will make relevance ranking unreliable. Regards, Toke Eskildsen
Re: a bug of solr distributed search
On 2010-10-25 11:22, Toke Eskildsen wrote: On Thu, 2010-07-22 at 04:21 +0200, Li Li wrote: But itshows a problem of distrubted search without common idf. A doc will get different score in different shard. Bingo. I really don't understand why this fundamental problem with sharding isn't mentioned more often. Every time the advice use sharding is given, it should be followed with a but be aware that it will make relevance ranking unreliable. The reason is twofold, I think: * there is an exact solution to this problem, namely to make two distributed calls instead of one (first call to collect per-shard IDFs for given query terms, second call to submit a query rewritten with the global IDF-s). This solution is implemented in SOLR-1632, with some caching to reduce the cost for common queries. However, this means that now for every query you need to make two calls instead of one, which potentially doubles the time to return results (for simple common queries - for rare complex queries the time will be still dominated by the query runtime on shard servers). * another reason is that in many many cases the difference between using exact global IDF and per-shard IDFs is not that significant. If shards are more or less homogenous (e.g. you assign documents to shards by hash(docId)) then term distributions will be also similar. So then the question is whether you can accept an N% variance in scores across shards, or whether you want to bear the cost of an additional distributed RPC for every query... To summarize, I would qualify your statement with: ...if the composition of your shards is drastically different. Otherwise the cost of using global IDF is not worth it, IMHO. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
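A minimal sketch of the homogeneous-shard assignment mentioned above (hashing the unique document id to pick a shard); this is plain illustration code, not a Solr API:

    // Deterministically route a document to one of numShards shards by its uniqueKey.
    // Masking with 0x7fffffff keeps the value non-negative even for Integer.MIN_VALUE.
    int shardFor(String uniqueKey, int numShards) {
        return (uniqueKey.hashCode() & 0x7fffffff) % numShards;
    }

Because every shard then holds an essentially random sample of the collection, per-shard term statistics stay close to the global ones, which is why the per-shard IDF error tends to be small under this scheme.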
Re: a bug of solr distributed search
On Mon, 2010-10-25 at 11:50 +0200, Andrzej Bialecki wrote: * there is an exact solution to this problem, namely to make two distributed calls instead of one (first call to collect per-shard IDFs for given query terms, second call to submit a query rewritten with the global IDF-s). This solution is implemented in SOLR-1632, with some caching to reduce the cost for common queries. I must admit that I have not tried the patch myself. Looking at https://issues.apache.org/jira/browse/SOLR-1632 i see that the last comment is from LiLi with a failed patch, but as there are no further comments it is unclear if the problem is general or just with LiLi's setup. I might be a bit harsh here, but the other comments for the JIRA issue also indicate that one would have to be somewhat adventurous to run this in production. * another reason is that in many many cases the difference between using exact global IDF and per-shard IDFs is not that significant. If shards are more or less homogenous (e.g. you assign documents to shards by hash(docId)) then term distributions will be also similar. While I agree on the validity of the solution, it does put some serious constraints on the shard-setup. To summarize, I would qualify your statement with: ...if the composition of your shards is drastically different. Otherwise the cost of using global IDF is not worth it, IMHO. Do you know of any studies of the differences in ranking with regard to indexing-distribution by hashing, logical grouping and distributed IDF? Regards, Toke Eskildsen
Re: a bug of solr distributed search
On 2010-10-25 13:37, Toke Eskildsen wrote: On Mon, 2010-10-25 at 11:50 +0200, Andrzej Bialecki wrote: * there is an exact solution to this problem, namely to make two distributed calls instead of one (first call to collect per-shard IDFs for given query terms, second call to submit a query rewritten with the global IDF-s). This solution is implemented in SOLR-1632, with some caching to reduce the cost for common queries. I must admit that I have not tried the patch myself. Looking at https://issues.apache.org/jira/browse/SOLR-1632 i see that the last comment is from LiLi with a failed patch, but as there are no further comments it is unclear if the problem is general or just with LiLi's setup. I might be a bit harsh here, but the other comments for the JIRA issue also indicate that one would have to be somewhat adventurous to run this in production. Oh, definitely this is not production quality yet - there are known bugs, for example, that I need to fix, and then it needs to be forward-ported to trunk. It shouldn't be too much work to bring it back into a usable state. * another reason is that in many many cases the difference between using exact global IDF and per-shard IDFs is not that significant. If shards are more or less homogenous (e.g. you assign documents to shards by hash(docId)) then term distributions will be also similar. While I agree on the validity of the solution, it does put some serious constraints on the shard-setup. True. But this is the simplest setup that just may be enough. To summarize, I would qualify your statement with: ...if the composition of your shards is drastically different. Otherwise the cost of using global IDF is not worth it, IMHO. Do you know of any studies of the differences in ranking with regard to indexing-distribution by hashing, logical grouping and distributed IDF? Unfortunately, this information is surprisingly scarce - research predating year 2000 is often not applicable, and most current research concentrates on P2P systems, which are really a different ball of wax. Here are a few papers that I found that are related to this issue:
* Global Term Weights in Distributed Environments, H. Witschel, 2007 (Elsevier)
* KLEE: A Framework for Distributed Top-k Query Algorithms, S. Michel, P. Triantafillou, G. Weikum, VLDB'05 (ACM)
* Exploring the Stability of IDF Term Weighting, Xin Fu and Miao Chen, 2008 (Springer Verlag)
* A Comparison of Techniques for Estimating IDF Values to Generate Lexical Signatures for the Web, M. Klein, M. Nelson, WIDM'08 (ACM)
* Comparison of different Collection Fusion Models in Distributed Information Retrieval, Alexander Steidinger - this paper gives a nice comparison framework for different strategies for joining partial results; apparently we use the most primitive strategy explained there, based on raw scores...
These papers likely don't fully answer your question, but at least they provide a broader picture of the issue...
-- Best regards, Andrzej Bialecki (Information Retrieval, Semantic Web; Embedded Unix, System Integration) http://www.sigram.com Contact: info at sigram dot com
Re: a bug of solr distributed search
Good morning, https://issues.apache.org/jira/browse/SOLR-1632 - Mitch
Li Li wrote: where is the link of this patch? [...]
Re: a bug of solr distributed search
where is the link of this patch?
2010/7/24 Yonik Seeley yo...@lucidimagination.com: [...]
Re: a bug of solr distributed search
the solr version I used is 1.4
2010/7/26 Li Li fancye...@gmail.com: [...]
Re: a bug of solr distributed search
Okay, but then LiLi did something wrong, right? I mean, if the document exists only at one shard, it should get the same score whenever one requests it, no? Of course, this only applies if nothing gets changed between the requests. The only remaining problem here would be that you need distributed IDF (like in the mentioned JIRA issue) to normalize your results' scoring. But the problem mentioned in this mailing-list posting has nothing to do with that... Regards - Mitch
Re: a bug of solr distributed search
Yonik, why do we not send the output of TermsComponent from every node in the cluster to a Hadoop instance? Since TermsComponent does the map-part of the map-reduce concept, Hadoop only needs to reduce the stuff. Maybe we do not even need Hadoop for this. After reducing, every node in the cluster gets the current values to compute the idf. We can store this information in a HashMap-based SolrCache (or something like that) to provide constant-time access. To keep the values up to date, we can repeat that every x minutes. If we got that, it would not matter whether we use doc_X from shard_A or shard_B, since they will all have got the same scores. Even if we have large indices with 10 million or more unique terms, this will only need a few megabytes of network traffic. Kind regards, - Mitch
Yonik Seeley-2-2 wrote: As the comments suggest, it's not a bug, but just the best we can do for now since our priority queues don't support removal of arbitrary elements. I guess we could rebuild the current priority queue if we detect a duplicate, but that will have an obvious performance impact. Any other suggestions? -Yonik http://www.lucidimagination.com
Re: a bug of solr distributed search
On Fri, Jul 23, 2010 at 2:23 PM, MitchK mitc...@web.de wrote: why do we do not send the output of TermsComponent of every node in the cluster to a Hadoop instance? Since TermsComponent does the map-part of the map-reduce concept, Hadoop only needs to reduce the stuff. Maybe we even do not need Hadoop for this. After reducing, every node in the cluster gets the current values to compute the idf. We can store this information in a HashMap-based SolrCache (or something like that) to provide constant-time access. To keep the values up to date, we can repeat that after every x minutes. There's already a patch in JIRA that does distributed IDF. Hadoop wouldn't be the right tool for that anyway... it's for batch oriented systems, not low-latency queries. If we got that, it does not care whereas we use doc_X from shard_A or shard_B, since they will all have got the same scores. That only works if the docs are exactly the same - they may not be. -Yonik http://www.lucidimagination.com
Re: a bug of solr distributed search
... Additionally to my previous posting: To keep this in sync we could do two things: wait for every server to make sure that everyone uses the same values to compute the score, and then apply them. Or: let's say that we collect the new values every 15 minutes. To merge and send them over the network, we declare that this will need 3 additional minutes (we want to keep the network traffic for such actions very low, so we do not send everything instantly). Okay, and now we add 2 more minutes, in case 3 were not enough or something needs a little bit more time than we thought. After those 2 minutes, every node has to apply the new values. Pro: If one node gets broken, we do not delay the application of the new values. Con: We need two HashMaps and both will have roughly the same size. That means we will waste some RAM for this operation, if we do not write the values to disk (which I do not suggest). Thoughts? - Mitch
MitchK wrote: [...]
Re: a bug of solr distributed search
That only works if the docs are exactly the same - they may not be. Ahm, what? Why? If the uniqueID is the same, the docs *should* be the same, shouldn't they?
Re: a bug of solr distributed search
On Fri, Jul 23, 2010 at 2:40 PM, MitchK mitc...@web.de wrote: That only works if the docs are exactly the same - they may not be. Ahm, what? Why? If the uniqueID is the same, the docs *should* be the same, don't they? Documents aren't supposed to be duplicated across shards... so the presence of multiple docs with the same id is a bug anyway. We've chosen to try and handle it gracefully rather than fail hard. Some people have treated this as a feature - and that's OK as long as expectations are set appropriately. -Yonik http://www.lucidimagination.com
Re: a bug of solr distributed search
As the comments suggest, it's not a bug, but just the best we can do for now since our priority queues don't support removal of arbitrary elements. I guess we could rebuild the current priority queue if we detect a duplicate, but that will have an obvious performance impact. Any other suggestions? -Yonik http://www.lucidimagination.com
On Wed, Jul 21, 2010 at 3:13 AM, Li Li fancye...@gmail.com wrote: [...]
Re: a bug of solr distributed search
: As the comments suggest, it's not a bug, but just the best we can do : for now since our priority queues don't support removal of arbitrary FYI: I updated the DistributedSearch wiki to be more clear about this -- it previously didn't make it explicitly clear that docIds were supposed to be unique across all shards, and suggested that there was specific well-defined behavior when they weren't. -Hoss
a bug of solr distributed search
in QueryComponent.mergeIds. It will remove documents that have a duplicated uniqueKey. In the current implementation, it uses the first one encountered.
    String prevShard = uniqueDoc.put(id, srsp.getShard());
    if (prevShard != null) {
      // duplicate detected
      numFound--;
      collapseList.remove(id + "");
      docs.set(i, null);  // remove it
      // For now, just always use the first encountered since we can't currently
      // remove the previous one added to the priority queue. If we switched
      // to the Java5 PriorityQueue, this would be easier.
      continue;
      // make which duplicate is used deterministic based on shard
      // if (prevShard.compareTo(srsp.shard) >= 0) {
      //   TODO: remove previous from priority queue
      //   continue;
      // }
    }
It iterates over ShardResponse by
    for (ShardResponse srsp : sreq.responses)
But sreq.responses may come in a different order each time. That is, shard1's result and shard2's result may interchange positions. So when a uniqueKey (such as url) occurs in both shard1 and shard2, which one will be used is unpredictable. But the scores of these 2 docs are different because of different idf. So the same query will get different results. One possible solution is to sort ShardResponse srsp by shard name.
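A hedged sketch of that last suggestion (sorting the shard responses by shard name before merging, so which duplicate wins no longer depends on response arrival order); this is illustration only, meant to sit before the existing loop, not a tested patch:

    // Sort responses by shard name so the duplicate that is kept is always
    // the one from the lexicographically smallest shard name.
    Collections.sort(sreq.responses, new Comparator<ShardResponse>() {
        public int compare(ShardResponse a, ShardResponse b) {
            return a.getShard().compareTo(b.getShard());
        }
    });
    // then iterate as before:
    // for (ShardResponse srsp : sreq.responses) { ... }

This makes the result reproducible between identical queries, but the kept copy can still be the lower-scoring one, so the ranking concern discussed further down in the thread remains.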
Re: a bug of solr distributed search
Li Li, this is the intended behaviour, not a bug. Otherwise you could get back the same record several times in one response, which may not be intended by the user. Kind regards, - Mitch
Re: a bug of solr distributed search
But users will think there is something wrong with it when they search the same query but get different results.
2010/7/21 MitchK mitc...@web.de: [...]
Re: a bug of solr distributed search
Ah, okay. I understand your problem. Why should doc x be at position 1 when searching for the first time, and then occur at position 8 when I search for the 2nd time - right? I am not sure, but I think you can't prevent this without custom coding or making a document's occurrence unique. Kind regards, - Mitch
Re: a bug of solr distributed search
Yes. This will make users think our search engine has some bug. From the comments in the code, there is more to do:
    if (prevShard != null) {
      // For now, just always use the first encountered since we can't currently
      // remove the previous one added to the priority queue. If we switched
      // to the Java5 PriorityQueue, this would be easier.
      continue;
      // make which duplicate is used deterministic based on shard
      // if (prevShard.compareTo(srsp.shard) >= 0) {
      //   TODO: remove previous from priority queue
      //   continue;
      // }
    }
2010/7/21 MitchK mitc...@web.de: [...]
Re: a bug of solr distributed search
I don't know much about the code. Maybe you can tell me which file you are referring to? However, from the comments one can see that the problem is known, but it was decided to let it happen because of the Java version requirements. - Mitch
Re: a bug of solr distributed search
How about sorting over the score? Would that be possible?
On Jul 21, 2010, at 12:13 AM, Li Li wrote: [...]
Re: a bug of solr distributed search
It already was sorted by score. The problem here is the following: Shard_A and shard_B both contain doc_X. If you are querying for something, doc_X could have a score of 1.0 at shard_A and a score of 12.0 at shard_B. You can never be sure which doc Solr sees first. In the bad case, Solr sees doc_X first at shard_A and ignores it at shard_B. That means that the doc might occur at page 10 of the pagination, although it *should* occur at page 1 or 2. Kind regards, - Mitch
Re: a bug of solr distributed search
I think what Siva means is that when there are docs with the same url, keep the doc whose score is larger. This is the right solution. But it shows a problem of distributed search without common idf: a doc will get a different score in different shards.
2010/7/22 MitchK mitc...@web.de: [...]
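A rough sketch of that keep-the-larger-score idea (not what Solr does today; Doc here is just a stand-in holder with an id and a score, not a Solr class, and docsFromAllShards is hypothetical):

    class Doc { Object id; float score; }

    // Keep, for every uniqueKey, the copy with the highest score seen so far.
    Map<Object, Doc> best = new HashMap<Object, Doc>();
    for (Doc d : docsFromAllShards) {
        Doc prev = best.get(d.id);
        if (prev == null || d.score > prev.score) {
            best.put(d.id, d);
        }
    }

Even with this rule the two copies are still scored with their own shard's IDF, so the underlying per-shard scoring difference discussed above does not go away.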
Solr Distributed Search throws org.apache.solr.common.SolrException: Form_too_large Exception
Hi All, I am trying to do a distributed search and getting the below error. Please let me know if you know how to solve this issue.
18:20:28,200 ERROR [STDERR] org.apache.solr.client.solrj.SolrServerException: Error executing query
18:20:28,200 ERROR [STDERR] at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:96)
18:20:28,200 ERROR [STDERR] at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:109)
...
18:20:28,202 ERROR [STDERR] Caused by: org.apache.solr.common.SolrException: Form_too_large__javalangIllegalStateException_Form_too_large__at_orgmortbayjettyRequestextractParametersRequestjava1273__at_orgmortbayjettyRequestgetParameterMapRequestjava650__at_orgapachesolrrequestServletSolrParamsinitServletSolrParamsjava29__at_orgapachesolrservletStandardRequestParserparseParamsAndFillStreamsSolrRequestParsersjava392__at_orgapachesolrservletSolrRequestParsersparseSolrRequestParsersjava113__at_orgapachesolrservletSolrDispatchFilterdoFilterSolrDispatchFilterjava200__at_orgmortbayjettyservletServletHandler$CachedChaindoFilterServletHandlerjava1089__at_orgmortbayjettyservletServletHandlerhandleServletHandlerjava365__at_orgmortbayjettysecuritySecurityHandlerhandleSecurityHandlerjava216__at_orgmortbayjettyservletSessionHandlerhandleSessionHandlerjava181__at_orgmortbayjettyhandlerContextHandlerhandleContextHandlerjava712__at_orgmortbayjettywebappWebAppContexthandleWebAppContextjava405__at_orgmortbayjettyhandlerContextHandlerCollectionhandleContextHandlerCollectionjava211__at_orgmortbayjettyhandlerHandlerCollectionhandleHandlerCollectionjava114__at_orgmortbayjettyhandlerHandlerWrapperhandleHandlerWrapperjava139__at_orgmortbayjettyServerhandleServerjava285__at_orgmortbayjettyHttpConnectionhandleRequestHttpConnectionjava502__at_orgmortbayjettyHttpConnection$RequestHandlercontentHttpConnectionjava835__at_orgmortbayjettyHttpParserparseNextHttpParserjava641__at_orgmortbayjettyHttpParserparseAvailableHttpParserjava202__at_orgmortbayjettyHttpConnectionhandleHttpConnectionjava378__at_orgmortbayjettybioSocketConnector$ConnectionrunSocketConnectorjava226__at_orgmortbaythreadBoundedThreadPool$PoolThreadrunBoundedThreadPooljava442___Form_too_large__javalangIllegalStateException_Form_too_large__at_orgmortbayjettyRequestextractParametersRequestjava1273__at_orgmortbayjettyRequestgetParameterMapRequestjava650__at_orgapachesolrrequestServletSolrParamsinitServletSolrParamsjava29__at_orgapachesolrservletStandardRequestParserparseParamsAndFillStreamsSolrRequestParsersjava392__at_orgapachesolrservletSolrRequestParserspa
My code:
    String SOLR_SHARD1 = "ap1.corp.org.com:8983/solr/";
    String SOLR_SHARD2 = "ap2.corp.org.com:8983/solr/";
    String SOLR_SHARDS = SOLR_SHARD1 + "," + SOLR_SHARD2;
    QueryResponse response = null;
    SolrServer solr = new CommonsHttpSolrServer("http://ap1.corp.org.com:8983/solr/");
    String queryStr = ...;
    SolrQuery query = new SolrQuery();
    query.setQuery(queryStr);
    response = solr.query(query);
    SolrDocumentList docs = response.getResults();
    long docNum = docs.getNumFound();
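One detail in the snippet above: SOLR_SHARDS is built but never attached to the query. In SolrJ the shards list is just another request parameter, so presumably it would be passed along the lines of the following sketch (variable names taken from the code above):

    SolrQuery query = new SolrQuery();
    query.setQuery(queryStr);
    // SolrQuery extends ModifiableSolrParams, so arbitrary parameters can be set;
    // note that a very long shards list makes the request itself correspondingly large.
    query.set("shards", SOLR_SHARDS);
    QueryResponse response = solr.query(query);

If the shards list is configured in the handler defaults in solrconfig.xml instead (as mentioned elsewhere in this archive), it does not have to travel with every client request.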
Re: solr distributed search example - exception
Hi Mark,
I actually got this error because I was using an old version of Java; with a newer JVM the problem is solved. Thanks anyway.
Raakhi
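For anyone hitting the same startup failure: the libgcj.so frames in the trace below indicate the example was launched with the GCJ runtime rather than a standard JDK (Solr 1.3 requires Java 1.5 or later). A quick way to confirm which JVM Jetty is actually using is to print the runtime properties; this is a convenience sketch, not something from the thread:

// Minimal check of which JVM is running; not from the thread, just a convenience.
public class WhichJvm {
    public static void main(String[] args) {
        // On the failing setup this would report a GCJ/libgcj VM instead of a HotSpot VM.
        System.out.println(System.getProperty("java.vm.name"));
        System.out.println(System.getProperty("java.version"));
        System.out.println(System.getProperty("java.vm.vendor"));
    }
}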
Re: solr distributed search example - exception
Thanks for bringing closure to this, Raakhi.
- Mark
solr distributed search example - exception
Hi,
I was executing a simple example that demonstrates distributed search, provided at the following link: http://wiki.apache.org/solr/DistributedSearch. However, when I start up the servers on both ports (8983 and 7574), I get the following exception:

SEVERE: Could not start SOLR. Check solr/home property
java.lang.ClassCastException: java.util.ArrayList cannot be cast to org.w3c.dom.NodeList
    at org.apache.solr.search.CacheConfig.getMultipleConfigs(CacheConfig.java:61)
    at org.apache.solr.core.SolrConfig.init(SolrConfig.java:131)
    at org.apache.solr.core.SolrConfig.init(SolrConfig.java:70)
    at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:117)
    at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:69)
    at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
    at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:594)
    at org.mortbay.jetty.servlet.Context.startContext(Context.java:139)
    at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1218)
    at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:500)
    at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:448)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
    at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
    at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:161)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
    at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
    at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:117)
    at org.mortbay.jetty.Server.doStart(Server.java:210)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
    at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:929)
    at java.lang.reflect.Method.invoke(libgcj.so.7rh)
    at org.mortbay.start.Main.invokeMain(Main.java:183)
    at org.mortbay.start.Main.start(Main.java:497)
    at org.mortbay.start.Main.main(Main.java:115)

2009-06-08 18:36:28.016::WARN: failed SolrRequestFilter
java.lang.NoClassDefFoundError: org.apache.solr.core.SolrCore
    at java.lang.Class.initializeClass(libgcj.so.7rh)
    at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:77)
    at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
    at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:594)
    at org.mortbay.jetty.servlet.Context.startContext(Context.java:139)
    at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1218)
    at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:500)
    at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:448)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
    at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
    at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:161)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
    at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
    at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:117)
    at org.mortbay.jetty.Server.doStart(Server.java:210)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
    at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:929)
    at java.lang.reflect.Method.invoke(libgcj.so.7rh)
    at org.mortbay.start.Main.invokeMain(Main.java:183)
    at org.mortbay.start.Main.start(Main.java:497)
    at org.mortbay.start.Main.main(Main.java:115)
Caused by: java.lang.ClassNotFoundException: org.apache.solr.core.JmxMonitoredMap not found in StartLoader[file:/home/ithurs/apache-solr-1.3.0/example7574/, file:/home/ithurs/apache-solr-1.3.0/example7574/lib/jetty-6.1.3.jar, file:/home/ithurs/apache-solr-1.3.0/example7574/lib/jetty-util-6.1.3.jar, file:/home/ithurs/apache-solr-1.3.0/example7574/lib/servlet-api-2.5-6.1.3.jar]
    at java.net.URLClassLoader.findClass(libgcj.so.7rh)
    at java.lang.ClassLoader.loadClass(libgcj.so.7rh)
    at java.lang.ClassLoader.loadClass(libgcj.so.7rh)
    at org.mortbay.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:375)
    at org.mortbay.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:337)
    at java.lang.Class.forName(libgcj.so.7rh)
    at java.lang.Class.initializeClass(libgcj.so.7rh)
    ... 22 more
2009-06-08
Re: solr distributed search example - exception
Hi Mark,
Yes, I would like to open a JIRA issue for it. How do I go about that?
Regards,
Raakhi

On Mon, Jun 8, 2009 at 7:58 PM, Mark Miller markrmil...@gmail.com wrote:
That is a very odd cast exception to get. Do you want to open a JIRA issue for this? It looks like an odd exception because the call is:

    NodeList nodes = (NodeList) solrConfig.evaluate(configPath, XPathConstants.NODESET); // cast exception if we get an ArrayList rather than NodeList

which leads to:

    Object o = xpath.evaluate(xstr, doc, type);

where type = XPathConstants.NODESET. So you get back an Object based on the XPathConstant passed. There does not appear to be a value that would return an ArrayList; using XPathConstants.NODESET gets you a NodeList according to the XPath API. I'm not sure what could cause this to happen.
- Mark
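To illustrate Mark's point: when the return type is XPathConstants.NODESET, the JAXP API specifies that evaluate() returns an org.w3c.dom.NodeList, so the cast in SolrConfig should never fail on a standards-compliant JVM. A minimal, self-contained sketch (the class name and sample XML are made up for illustration; only the XPath calls mirror what the Solr code does):

import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Minimal sketch of the XPath usage described above; this is not Solr code.
public class XPathNodeSetDemo {
    public static void main(String[] args) throws Exception {
        String xml = "<config><cache name='a'/><cache name='b'/></config>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // With XPathConstants.NODESET the JAXP contract is to return a NodeList,
        // so this cast is safe on a conforming implementation.
        NodeList nodes = (NodeList) xpath.evaluate("/config/cache", doc, XPathConstants.NODESET);
        System.out.println("matched nodes: " + nodes.getLength());
    }
}

The libgcj frames in the stack trace suggest the example was running on the GCJ runtime, whose XML stack may not match this JAXP behaviour, which is consistent with Raakhi's resolution of switching to a newer JVM.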
Re: Question on Solr Distributed Search
On Fri, Apr 10, 2009 at 7:50 AM, vivek sar vivex...@gmail.com wrote:
Just an update. I changed the schema to store the unique id field, but I still get the connection reset exception. I did notice that if there is no data in the core then it returns 0 results (no exception), but if there is data and you search using the shards parameter I get the connection reset exception. Can anyone provide some tips on where I can look for this problem?

Did you re-index after changing the field to stored?
--
Regards,
Shalin Shekhar Mangar.
Re: Question on Solr Distributed Search
Yes - these are all new indexes. I can search them individually, but adding shards throws a Connection Reset error. Is there any way I can debug this, or any other pointers?
-vivek

On Fri, Apr 10, 2009 at 4:49 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:
Did you re-index after changing the field to stored?
--
Regards,
Shalin Shekhar Mangar.
Question on Solr Distributed Search
Hi,
I have another thread on multi-core distributed search, but I just wanted to put a simple question here on distributed search to get some response. I have a search query,

http://etsx19.co.com:8080/solr/20090409_9/select?q=usa

which returns 10 results. Now if I add the shards parameter to it,

http://etsx19.co.com:8080/solr/20090409_9/select?shards=etsx19.co.com:8080/solr/20090409_9&q=usa

it fails with org.apache.solr.client.solrj.SolrServerException: java.net.SocketException: Connection reset:

org.apache.solr.common.SolrException: org.apache.solr.client.solrj.SolrServerException: java.net.SocketException: Connection reset
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:282)
    at ..
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
    at java.lang.Thread.run(Thread.java:637)
Caused by: org.apache.solr.client.solrj.SolrServerException: java.net.SocketException: Connection reset
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:473)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:242)
    at org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:422)
    ..
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:168)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
    at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78)
    at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106)
    at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116)
    at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1413)
    at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973)
    at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735)

Attached is my solrconfig.xml. Do I need a special RequestHandler for sharding? I haven't been able to make any distributed search work successfully. Any help is appreciated.
Note: I'm indexing using SolrJ - not sure if that makes any difference to the search part.
Thanks,
-vivek

<?xml version="1.0" ?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements. See the NOTICE file distributed with this
 work for additional information regarding copyright ownership. The ASF
 licenses this file to You under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance with the License.
 You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 License for the specific language governing permissions and limitations
 under the License.
-->
<config>
  <!-- Used to specify an alternate directory to hold all index data other
       than the default ./data under the Solr home. If replication is in use,
       this should match the replication configuration. -->
  <!-- <dataDir>./solr/data</dataDir> -->
  <indexDefaults>
    <!-- Values here affect all index writers and act as a default unless overridden. -->
    <useCompoundFile>true</useCompoundFile>
    <mergeFactor>100</mergeFactor>
    <!-- <maxBufferedDocs>1</maxBufferedDocs> -->
    <ramBufferSizeMB>64</ramBufferSizeMB>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>1</maxFieldLength>
    <writeLockTimeout>1000</writeLockTimeout>
    <commitLockTimeout>1</commitLockTimeout>
    <lockType>single</lockType>
  </indexDefaults>
  <mainIndex>
    <!-- options specific to the main on-disk lucene index -->
    <useCompoundFile>true</useCompoundFile>
    <mergeFactor>100</mergeFactor>
    <!-- <maxBufferedDocs>1000</maxBufferedDocs> -->
    <!-- Tell Lucene when to flush documents to disk. Giving Lucene more memory
         for indexing means faster indexing at the cost of more RAM. If both
         ramBufferSizeMB and maxBufferedDocs is set, then Lucene will flush
         based on whichever limit is hit first. -->
    <ramBufferSizeMB>64</ramBufferSizeMB>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>1</maxFieldLength>
    <!-- If true, unlock any held write or commit locks on startup. This
         defeats the locking mechanism that allows multiple processes to
         safely access a lucene index, and should be used with care. -->
    <unlockOnStartup>true</unlockOnStartup>
    <lockType>single</lockType>
  </mainIndex>
  <!-- the
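For reference, in Solr 1.3/1.4 a distributed query does not need a special request handler: the standard search handler accepts a shards parameter listing host:port/path entries without the http:// prefix, joined to the other parameters with '&'. A minimal SolrJ sketch under that assumption (the host names and core name below are placeholders, not vivek's actual setup):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

// Illustrative only: hypothetical hosts and core name.
public class DistributedQueryDemo {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://host1:8080/solr/core_20090409_9");

        SolrQuery query = new SolrQuery("usa");
        // Comma-separated shard list; note there is no http:// prefix on the entries,
        // and on a raw URL the parameter must be separated from q with '&'.
        query.set("shards",
                "host1:8080/solr/core_20090409_9,host2:8080/solr/core_20090409_9");

        QueryResponse rsp = server.query(query);
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}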
Re: Question on Solr Distributed Search
I think the reason behind the connection reset is the following. Looking at the code, it points to QueryComponent.mergeIds():

    resultIds.put(shardDoc.id.toString(), shardDoc);

It looks like the document unique id is coming back null. I'm not sure how that is possible, as it is a required field. Right now my unique id is not stored (only indexed) - does it have to be stored for distributed search?

HTTP Status 500 - null
java.lang.NullPointerException
    at org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:432)
    at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:276)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:290)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1333)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:845)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
    at java.lang.Thread.run(Thread.java:637)
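vivek's reading of QueryComponent.mergeIds() matches the documented requirement for distributed search in this Solr version: the uniqueKey field must be stored (not just indexed) so the coordinating node can read it back from each shard's response when merging results. A minimal schema.xml fragment, assuming the field is named id (the actual field name is not shown in the thread):

<!-- schema.xml fragment; illustrative only, the real field name may differ. -->
<fields>
  <!-- The uniqueKey field must be stored so the coordinating node can read it
       back from each shard's response in QueryComponent.mergeIds(). -->
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
</fields>

<uniqueKey>id</uniqueKey>

After changing stored="false" to stored="true" the documents have to be re-indexed, which is exactly what Shalin asks about elsewhere in this thread.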
Re: Question on Solr Distributed Search
Just an update. I changed the schema to store the unique id field, but I still get the connection reset exception. I did notice that if there is no data in the core then it returns 0 results (no exception), but if there is data and you search using the shards parameter I get the connection reset exception. Can anyone provide some tips on where I can look for this problem?

Apr 10, 2009 3:16:04 AM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: org.apache.solr.client.solrj.SolrServerException: java.net.SocketException: Connection reset
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:282)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1333)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:845)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
    at java.lang.Thread.run(Thread.java:637)
Caused by: org.apache.solr.client.solrj.SolrServerException: java.net.SocketException: Connection reset
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:473)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:242)
    at org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:422)
    at org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:395)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
    ... 1 more
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:168)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
    at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78)
    at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106)
    at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116)
    at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1413)
    at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973)
    at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735)
    at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098)
    at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)