Slow ReadProcessor read fields Warnings - Ideas to investigate?
Hello User Group,

we run Solr with HDFS and are getting a lot of the following warning:

Slow ReadProcessor read fields took 15093ms (threshold=1ms); ack: seqno: 3 reply: SUCCESS reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 798309 flag: 0 flag: 0 flag: 0, targets: [DatanodeInfoWithStorage[xxx.xxx.xxx.xxx:50010,DS-xx,DISK], DatanodeInfoWithStorage[xxx.xxx.xxx.xxx:50010,DS-xx,DISK], DatanodeInfoWithStorage[xxx.xxx.xxx.xxx:50010,DS-xx,DISK]]

We started with the default threshold of 30 seconds, but since even 10 seconds is far too long for a query, we lowered the warning threshold to 10 seconds. That resulted in a flood of warnings and uncovered slow HDFS read performance. The HDFS statistics look quite good and stable, so we are not sure how to investigate the cause or what we can improve to resolve the issue. Has anybody had similar issues?

Kind regards
David Winter
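For reference, that warning's threshold is a client-side HDFS setting. A sketch of the relevant hdfs-site.xml entry, assuming a Hadoop version where `dfs.client.slow.io.warning.threshold.ms` exists with its 30000 ms default (verify the property name against your version's hdfs-default.xml):

```xml
<!-- hdfs-site.xml on the Solr (HDFS client) side: lower the slow-I/O
     warning threshold from the 30 s default. Property name assumed from
     the HDFS client configuration; check your Hadoop release. -->
<property>
  <name>dfs.client.slow.io.warning.threshold.ms</name>
  <value>10000</value> <!-- warn when a read/ack takes longer than 10 s -->
</property>
```

Lowering the threshold only changes when the warning fires; the flood of warnings at 10 s is the symptom of the underlying slow I/O, not the cause.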
Re: Looking for design ideas
Steve,

Does a document have a different URL when it is in a personal DB? I suspect the easiest solution is to use just one index. You can have a field containing an integer identifying the personal DB; for public documents, set this to zero. Call it DBid. Update the doc to change this field (and the URL) when the user starts editing. Then the query contains the userid, and you boost on this field. Or something like that.

Cheers -- Rick

On March 18, 2018 11:13:49 AM EDT, Steven White <swhite4...@gmail.com> wrote:
>Hi everyone,
>
>I have a design problem that I'm not sure how to solve best, so I
>figured I'd share it here and see what ideas others may have.
>[...]
-- Sorry for being brief. Alternate email is rickleir at yahoo dot com
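Rick's one-index idea boils down to a filter plus a boost in the Solr query parameters. A minimal Python sketch, assuming a hypothetical integer field `DBid` (0 = public, otherwise the owning user's id) and the edismax `bq` boost parameter; field name and boost factor are illustrative:

```python
def build_params(user_query: str, user_db_id: int) -> dict:
    """Build Solr query params for a single-index public/personal scheme.

    All docs live in one index; DBid=0 marks public docs, DBid=<user>
    marks that user's checked-out copies. The filter hides other users'
    personal copies, and the boost ranks the user's own copies above
    the public versions. (DBid and the ^10 boost are assumptions.)
    """
    return {
        "q": user_query,
        "defType": "edismax",
        # Only public docs or this user's personal copies are visible.
        "fq": f"DBid:(0 OR {user_db_id})",
        # Personal copies rank above the public version of the same doc.
        "bq": f"DBid:{user_db_id}^10",
    }

params = build_params("checked out report", 42)
```

Because everything is one index, ranking stays consistent and there is no cross-index merge problem at all.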
Re: Looking for design ideas
I've worked on something similar - the data set was 100M documents with thousands of users. Ranking is relative within each index: what ranks #1, #2, #3 is only 1, 2, 3 in that index. Your challenge will be in the result display: how to merge results so that relevant results are shown before non-relevant ones. There are numerous ways to merge - you could even retrieve, merge, re-index, and query again - but computing power aside, that's not efficient.

You could consider two indexes not as public and private, but as a metadata index (data indexed only, not stored) and a data index (indexed / stored values). This way you get your ranking without having to compromise. Once you have your doc ids, you can retrieve them from a read-only Solr data cluster or a scalable persistent store (Cassandra, Mongo, etc.) that would scale far better than Solr itself for thousands if not millions of users (please, let's not start a debate about this). This way your users get relevant results and fast access to the index, and the data is protected - if you filter by the doc owner id as an "or" clause in addition to doc owner id = 'public'. What you lose by not getting the document data from the initial query you can retrieve asynchronously, or maybe "join" with another collection - which I've not done, but I know it's possible.

Also, you may want to consider the CQRS pattern for the document check-in / check-out actions to keep indexing / query time scalable. It may be more work, but it's more scalable. Go big or go home. ;)

Hope it helps
--
Rahul Singh
rahul.si...@anant.us
Anant Corporation

On Mar 18, 2018, 11:14 AM -0400, Steven White <swhite4...@gmail.com>, wrote:
> Hi everyone,
>
> I have a design problem that I'm not sure how to solve best, so I figured
> I'd share it here and see what ideas others may have.
>
> I have a DB that holds documents (over 1 million and growing). This is
> known as the "Public" DB that holds documents visible to all of my end
> users.
> [...]
>
> Thanks
>
> Steve
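The access filter Rahul describes (owner is this user OR 'public') is a single filter query against the stored-nothing metadata index, so ranking comes from one index and no cross-index merge is needed. A hedged Python sketch; the field names `owner_id`, `id` and the page size are assumptions:

```python
def metadata_query(user_query: str, user_id: str) -> dict:
    """Query the metadata-only index for matching doc ids and scores.

    No stored document fields live in this index; the documents
    themselves are fetched afterwards, by id, from the data index or
    an external store (Cassandra, Mongo, etc.).
    """
    return {
        "q": user_query,
        # "or" filter: the doc is public, or it belongs to this user.
        "fq": f'owner_id:("public" OR "{user_id}")',
        "fl": "id,score",   # ids only; hydrate the documents elsewhere
        "rows": 20,
    }

q = metadata_query("quarterly report", "user-123")
```

The asynchronous hydration step (or a cross-collection join) then turns the id list into displayable documents.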
Looking for design ideas
Hi everyone,

I have a design problem that I'm not sure how to solve best, so I figured I'd share it here and see what ideas others may have.

I have a DB that holds documents (over 1 million and growing). This is known as the "Public" DB and holds documents visible to all of my end users.

My application lets users "check out" one or more documents at a time from this "Public" DB, edit them, and "check in" back into the "Public" DB. When a document is checked out, it goes into a "Personal" DB for that user (and the document in the "Public" DB is flagged as such to alert other users). The owner of this checked-out document can make changes to it and save it back into the "Personal" DB as often as he wants to. Sometimes the document lives in the "Personal" DB for a few minutes before it is checked back into the "Public" DB, and sometimes it can live in the "Personal" DB for a day or a month. When a document is saved into the "Personal" DB, only the owner of that document can see it.

Currently there are 100 users, but this will grow to at least 500 or maybe even 1000.

I'm looking for a solution that enables full-text search on those documents, in both the "Public" and "Personal" DB, so that:

1) Documents in the "Public" DB are searchable by all users. This is the easy part.

2) Documents in the "Personal" DB of each user are searchable by the owner of that "Personal" DB. This is easy too.

3) A user can search both the "Public" and "Personal" DB at any time, but if a document is in the "Personal" DB, we will not search it in the "Public" DB -- i.e., whatever is in the "Personal" DB takes precedence over what's in the "Public" DB.

Item #3 is important and is what I'm trying to solve. The goal is to give users hits on documents they are editing (in their "Personal" DB) instead of those in the "Public" DB.

The way I'm thinking to solve this problem is to create 2 Solr indexes (do we call those "cores"?):

1) The "Public" DB is indexed into the "Public" Solr index.

2) The "Personal" DB is indexed into the "Personal" Solr index with a field indicating the owner of each document.

With the above 2 indexes, I can now send the user's search to both indexes, but for the "Public" index I will also send a list of IDs (the documents in the user's "Personal" DB) to exclude from the result set. This way, a user searches both the "Public" and "Personal" DB such that the documents in the "Personal" DB are included in the search and their public counterparts are excluded.

Did I make sense? If so, is this doable? Will ranking be affected given that I'm searching 2 indexes?

Let me know what issues I might be overlooking with this solution.

Thanks

Steve
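Steve's exclusion scheme can be sketched as two queries: collect the user's personal doc ids, then subtract them from the public query with a negative filter. A minimal sketch; the field names `id` and `owner` are assumptions, and a real implementation would have to cap or chunk the exclusion list as users check out more documents:

```python
def personal_query(user_query: str, user_id: str) -> dict:
    """Search the 'Personal' index, restricted to this user's docs."""
    return {"q": user_query, "fq": f"owner:{user_id}"}

def public_query(user_query: str, personal_ids: list) -> dict:
    """Search the 'Public' index, excluding docs checked out by this user.

    With many checked-out docs the exclusion clause grows large, which
    is one practical limit of the two-index design.
    """
    params = {"q": user_query}
    if personal_ids:
        # Negative filter: drop the public copies of checked-out docs.
        params["fq"] = "-id:(" + " OR ".join(personal_ids) + ")"
    return params

p = public_query("design spec", ["doc-7", "doc-9"])
```

Note that scores from the two indexes are not directly comparable (term statistics differ per index), so merging the two result lists by raw score is the part this sketch deliberately leaves open.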
Re: Different ideas for querying unique and non-unique records
Susheel,

Just a guess, but carrot2.org might be useful -- though it might be overkill.

Cheers -- Rick

On August 30, 2017 7:40:08 AM MDT, Susheel Kumar <susheel2...@gmail.com> wrote:
>Hello,
>
>I am looking for different ideas/suggestions to solve the use case I am
>working on.
>[...]
-- Sorry for being brief. Alternate email is rickleir at yahoo dot com
Different ideas for querying unique and non-unique records
Hello,

I am looking for different ideas/suggestions to solve the use case I am working on.

We have a couple of fields in the schema along with id: business_email and personal_email. We need to return all records based on unique business and personal emails.

The criterion for a unique record is that neither its business nor its personal email is repeated in any other record. The criterion for non-unique records is that if a business or personal email occurs in other records, then all of those records are non-unique.

E.g., considering the documents below:
- for unique records, only id=1 should be returned (since john.doe is not present in any other record's personal or business email)
- for non-unique records, id=2,3 should be returned (since isabel.dora is present in multiple records; it doesn't matter whether it appears in the business or the personal email)

Documents
===
{id:1,business_email_s:john@abc.com,personal_email_s:john@abc.com}
{id:2,business_email_s:isabel.d...@abc.com}
{id:3,personal_email_s:isabel.d...@abc.com}

I am able to solve this using a Streaming Expression query, but I am not sure whether performance will become a bottleneck, as the streaming expression is quite big. So I am looking for different ideas, like using de-dupe or doing it during ingestion/pre-processing, etc., without impacting performance much.

Thanks,
Susheel
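One of the suggested alternatives (computing uniqueness at ingestion/pre-processing time rather than with a large streaming expression) comes down to counting which records each email appears in. A sketch of that logic in Python, as an offline pre-process over the raw records (not Solr code); the sample data mirrors the example in the post, with a made-up full address standing in for the elided one:

```python
from collections import defaultdict

def split_unique(docs):
    """Partition records into unique and non-unique id sets.

    A record is non-unique if any of its emails (business or personal)
    also appears in some *other* record; an email repeated only within
    the same record does not count against it.
    """
    email_to_ids = defaultdict(set)
    for doc in docs:
        for field in ("business_email_s", "personal_email_s"):
            if field in doc:
                email_to_ids[doc[field]].add(doc["id"])

    unique, non_unique = set(), set()
    for doc in docs:
        emails = {doc[f] for f in ("business_email_s", "personal_email_s") if f in doc}
        if any(len(email_to_ids[e]) > 1 for e in emails):
            non_unique.add(doc["id"])
        else:
            unique.add(doc["id"])
    return unique, non_unique

docs = [
    {"id": 1, "business_email_s": "john@abc.com", "personal_email_s": "john@abc.com"},
    {"id": 2, "business_email_s": "isabel.d@abc.com"},
    {"id": 3, "personal_email_s": "isabel.d@abc.com"},
]
uniq, dup = split_unique(docs)
```

The resulting flag could then be indexed as a boolean field, turning both queries into cheap filter queries at search time, at the cost of recomputing flags when records change.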
Re: Ideas
Writing a query component would be pretty easy, no? It would throw an exception if crazy numbers are requested. I can provide a simple example of a Maven project for a query component.

Paul

William Bell wrote:
> We have some Denial of service attacks on our web site. SOLR threads are
> going crazy.
>
> Basically someone is hitting start=15 + and rows=20. The start is crazy
> large.
>
> And then they jump around. start=15 then start=213030 etc.
>
> Any ideas for how to stop this besides blocking these IPs?
>
> Sometimes it is Google doing it even though these search results are set
> with No-index and No-Follow on these pages.
>
> Thoughts? Ideas?
Re: Ideas
Hi Bill,

the classical way would be to have a reverse proxy in front of the application that catches such cases. A decent reverse proxy or even an application firewall will allow you to define limits on bandwidth and sessions per time unit. Some even recognize specific denial-of-service patterns.

Of course, you could also simply limit the ranges of parameters accepted over the Internet - unless these wild ranges may actually occur in valid scenarios.

A bit more complex is the third alternative, which requires valid sessions and permits paging only in one direction or the other. This way, start and offset values would not be exposed; only functions for next page/previous page, or maybe some larger steps, would be supported. Stepping to an offset would only be permitted if you come from a proper previous page, and initial requests (in new sessions) would have to start at offset 1. Constraints on the parameters in subsequent requests within a session are a bit harder to handle.

Cheers,
--Jürgen

On 21.09.2015 19:28, William Bell wrote:
> We have some Denial of service attacks on our web site. SOLR threads are
> going crazy.
>
> Basically someone is hitting start=15 + and rows=20. The start is crazy
> large.
>
> And then they jump around. start=15 then start=213030 etc.
>
> Any ideas for how to stop this besides blocking these IPs?
>
> Sometimes it is Google doing it even though these search results are set
> with No-index and No-Follow on these pages.
>
> Thoughts? Ideas?
>
> Thanks

Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С уважением
*i.A. Jürgen Wagner*
Head of Competence Center "Intelligence" & Senior Cloud Consultant
Devoteam GmbH, Industriestr.
3, 70565 Stuttgart, Germany Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543 E-Mail: juergen.wag...@devoteam.com <mailto:juergen.wag...@devoteam.com>, URL: www.devoteam.de <http://www.devoteam.de/> Managing Board: Jürgen Hatzipantelis (CEO) Address of Record: 64331 Weiterstadt, Germany; Commercial Register: Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071
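Jürgen's second option (limit the accepted parameter ranges before they reach Solr) can be sketched as a tiny validation step in whatever front end or proxy sits in front of Solr. The caps below are illustrative, in the spirit of Walter's "nobody gets more than 50 pages" rule elsewhere in this thread:

```python
MAX_START = 50 * 20   # at most 50 pages of 20 results; illustrative caps
MAX_ROWS = 100

def validate_paging(start: int, rows: int):
    """Clamp paging params so deep-paging abuse never reaches Solr.

    Requests beyond the permitted window are served the last allowed
    page instead of being forwarded verbatim; malformed values are
    rejected outright.
    """
    if start < 0 or rows < 1:
        raise ValueError("bad paging parameters")
    return min(start, MAX_START), min(rows, MAX_ROWS)

start, rows = validate_paging(213030, 20)   # hostile deep-page request
```

Because the clamp happens before the request reaches Solr, hostile `start=213030`-style requests cost the cluster nothing beyond serving one ordinary page.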
Re: Ideas
The nginx reverse proxy we use blocks ridiculous start and rows values: https://github.com/o19s/solr_nginx

Another silly thing I've noticed is that you can pass sleep() as a function query. It's not documented, but I think it's a big hole. I wonder if I could DoS your Solr by sleeping and hogging all the available query threads? http://grepcode.com/file/repo1.maven.org/maven2/org.apache.solr/solr-core/4.3.0/org/apache/solr/search/ValueSourceParser.java#114

On Mon, Sep 21, 2015 at 1:37 PM, Jürgen Wagner (DVT) <juergen.wag...@devoteam.com> wrote:
> Hi Bill,
> the classical way would be to have a reverse proxy in front of the
> application that catches such cases. A decent reverse proxy or even
> application firewall router will allow you to define limits on bandwidth
> and sessions per time unit. Some even recognize specific denial-of-service
> patterns.
> [...]

-- *Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections <http://opensourceconnections.com>, LLC | 240.476.9983 Author: Relevant Search <http://manning.com/turnbull>
Ideas
We have some Denial of service attacks on our web site. SOLR threads are going crazy. Basically someone is hitting start=15 + and rows=20. The start is crazy large. And then they jump around. start=15 then start=213030 etc. Any ideas for how to stop this besides blocking these IPs? Sometimes it is Google doing it even though these search results are set with No-index and No-Follow on these pages. Thoughts? Ideas? Thanks -- Bill Bell billnb...@gmail.com cell 720-256-8076
Re: Ideas
I have put a limit in the front end at a couple of sites. Nobody gets more than 50 pages of results. Show page 50 if they request beyond that. First got hit by this at Netflix, years ago. Solr 4 is much better about deep paging, but here at Chegg we got deep paging plus a stupid, long query. That was using too much CPU. Right now, block the IPs. Those are hostile. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Sep 21, 2015, at 10:31 AM, Paul Libbrecht <p...@hoplahup.net> wrote: > > Writing a query component would be pretty easy or? > It would throw an exception if crazy numbers are requested... > > I can provide a simple example of a maven project for a query component. > > Paul > > > William Bell wrote: >> We have some Denial of service attacks on our web site. SOLR threads are >> going crazy. >> >> Basically someone is hitting start=15 + and rows=20. The start is crazy >> large. >> >> And then they jump around. start=15 then start=213030 etc. >> >> Any ideas for how to stop this besides blocking these IPs? >> >> Sometimes it is Google doing it even though these search results are set >> with No-index and No-Follow on these pages. >> >> Thoughts? Ideas? >
Re: Ideas for debugging poor SolrCloud scalability
Hi again, all -

Since several people were kind enough to jump in to offer advice on this thread, I wanted to follow up in case anyone finds this useful in the future.

*tl;dr:* Routing updates to a random Solr node (and then letting it forward the docs to where they need to go) is very expensive, more than I expected. Using a smart router that uses the cluster config to route documents directly to their shard results in (near) linear scaling for us.

*Expository version:*

We use Go on our client side, for which (to my knowledge) there is no SolrCloud router implementation. So we started by just routing updates to a random Solr node and letting it forward the docs to where they need to go. My theory was that this would lead to a constant amount of additional work (and thus still linear scaling). This was based on the observation that if you send an update of K documents to a Solr node in an N-node cluster, in the worst case all K documents will need to be forwarded on to other nodes. Since Solr nodes have perfect knowledge of where docs belong, each doc would only take 1 additional hop to reach its replica. So random routing (in the limit) imposes 1 additional network hop for each document.

In practice, however, we find that (for small clusters, at least) per-node performance falls as you add shards. In fact, the client performance (in writes/sec) was essentially constant no matter how many shards we added. I do have a working theory as to why this might be (i.e., where the flaw is in my logic above), but as this is merely an unverified theory I don't want to lead anyone astray by writing it up here.

However, by writing a smart router that retrieves the clusterstate.json file from ZooKeeper and uses that to route documents directly to their proper shard, we were able to achieve much better scalability. Using a synthetic workload, we achieved 141.7 writes/sec to a cluster of size 1 and 2506 writes/sec to a cluster of size 20 (125 writes/sec/node).
So a dropoff of ~12%, which is not too bad. We are hoping to continue our tests with larger clusters to ensure that per-node write performance levels off and does not continue to drop as the cluster scales.

I will also note that we initially had several bugs in our smart-router implementation, so if you follow a similar path and see bad performance, look to your router implementation: you might not be routing correctly. We ended up writing a simple proxy that we ran in front of Solr to observe all requests, which helped immensely when verifying and debugging our router. Yes, tcpdump does something similar, but viewing HTTP-level traffic is much more convenient than TCP-level. Plus Go makes little proxies like this super easy to do.

Hope all that is useful to someone. Thanks again to the posters above for providing suggestions!

- Ian

On Sat, Nov 1, 2014 at 7:13 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> bq: but it should be more or less a constant factor no matter how many
> Solr nodes you are using, right?
>
> Not really. You've stated that you're not driving Solr very hard in
> your tests. Therefore you're waiting on I/O.
> [...]
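The smart router Ian describes maps a document id into a shard's hash range taken from clusterstate.json. A simplified Python sketch of the range-lookup mechanics; note that Solr's compositeId router actually uses MurmurHash3, so `zlib.crc32` below is only a stand-in, and the two-shard layout is invented:

```python
import zlib

# Invented two-shard layout in the style of clusterstate.json hash ranges.
SHARDS = {
    "shard1": (0x00000000, 0x7fffffff),
    "shard2": (0x80000000, 0xffffffff),
}

def route(doc_id: str, shards=SHARDS) -> str:
    """Pick the shard whose hash range contains this doc id.

    Stand-in hash: Solr's compositeId router uses MurmurHash3, not
    CRC32, so this sketch only shows the lookup, not Solr's exact
    document placement.
    """
    h = zlib.crc32(doc_id.encode("utf-8")) & 0xFFFFFFFF
    for name, (lo, hi) in shards.items():
        if lo <= h <= hi:
            return name
    raise RuntimeError("shard ranges do not cover the hash space")

shard = route("doc-42")
```

In a real router, the shard layout would be refreshed from ZooKeeper (and re-read on routing errors) rather than hard-coded, since shard ranges change on splits.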
Re: Ideas for debugging poor SolrCloud scalability
On 11/7/2014 7:17 AM, Ian Rose wrote:
> *tl;dr:* Routing updates to a random Solr node (and then letting it
> forward the docs to where they need to go) is very expensive, more than
> I expected. Using a smart router that uses the cluster config to route
> documents directly to their shard results in (near) linear scaling for us.

I will admit that I do not know everything that has to happen in order to bounce updates to the proper shard leader, but I would have expected the overhead involved to be relatively small. I have opened an issue so we can see whether this situation can be improved.

https://issues.apache.org/jira/browse/SOLR-6717

Thanks,
Shawn
Re: Ideas for debugging poor SolrCloud scalability
Ian:

Thanks much for the writeup! It's always good to have real-world documentation!

Best,
Erick

On Fri, Nov 7, 2014 at 8:26 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> [...]
> I have opened an issue so we can see whether this situation can be
> improved.
>
> https://issues.apache.org/jira/browse/SOLR-6717
Re: Ideas for debugging poor SolrCloud scalability
Erick,

Just to make sure I am thinking about this right: batching will certainly make a big difference in performance, but it should be more or less a constant factor no matter how many Solr nodes you are using, right? Right now in my load tests, I'm not actually that concerned about the absolute performance numbers; instead I'm just trying to figure out why relative performance (no matter how bad it is since I am not batching) does not go up with more Solr nodes. Once I get that part figured out and we are seeing more writes per sec when we add nodes, then I'll turn on batching in the client to see what kind of additional performance gain that gets us.

Cheers,
Ian

On Fri, Oct 31, 2014 at 3:43 PM, Peter Keegan <peterlkee...@gmail.com> wrote:

Yes, I was inadvertently sending them to a replica. When I sent them to the leader, the leader reported (1000 adds) and the replica reported only 1 add per document. So, it looks like the leader forwards the batched jobs individually to the replicas.

On Fri, Oct 31, 2014 at 3:26 PM, Erick Erickson <erickerick...@gmail.com> wrote:

Internally, the docs are batched up into smaller buckets (10, as I remember) and forwarded to the correct shard leader. I suspect that's what you're seeing.

Erick

On Fri, Oct 31, 2014 at 12:20 PM, Peter Keegan <peterlkee...@gmail.com> wrote:

Regarding batch indexing: when I send batches of 1000 docs to a standalone Solr server, the log file reports (1000 adds) in LogUpdateProcessor. But when I send them to the leader of a replicated index, the leader log file reports much smaller numbers, usually (12 adds). Why do the batches appear to be broken up?

Peter

On Fri, Oct 31, 2014 at 10:40 AM, Erick Erickson <erickerick...@gmail.com> wrote:

NP, just making sure. I suspect you'll get lots more bang for the buck, and results much more closely matching your expectations, if:

(1) you batch up a bunch of docs at once rather than sending them one at a time. That's probably the easiest thing to try. Sending docs one at a time is something of an anti-pattern. I usually start with batches of 1,000. And just to check: you're not issuing any commits from the client, right? Performance will be terrible if you issue commits after every doc; that's totally an anti-pattern. Doubly so for optimizes. Since you showed us your solrconfig autocommit settings I'm assuming not, but want to be sure.

(2) use a leader-aware client. I'm totally unfamiliar with Go, so I have no suggestions whatsoever to offer there. But you'll want to batch in this case too.

On Fri, Oct 31, 2014 at 5:51 AM, Ian Rose <ianr...@fullstory.com> wrote:

Hi Erick -

Thanks for the detailed response, and apologies for my confusing terminology. I should have said WPS (writes per second) instead of QPS, but I didn't want to introduce a weird new acronym since QPS is well known. Clearly a bad decision on my part. To clarify: I am doing *only* writes (document adds). Whenever I wrote QPS I was referring to writes.

It seems clear at this point that I should wrap up the code to do smart routing rather than choosing Solr nodes randomly, and then see if that changes things. I must admit that although I understand that random node selection will impose a performance hit, theoretically it seems to me that the system should still scale up as you add more nodes (albeit at a lower absolute level of performance than with a smart router). Nonetheless, I'm just theorycrafting here, so the better thing to do is to try it experimentally. I hope to have that working today - will report back on my findings.

Cheers,
- Ian

p.s. To clarify why we are rolling our own smart-router code: we use Go over here rather than Java. Although if we still get bad performance with our custom Go router, I may try a pure Java load client using CloudSolrServer to eliminate the possibility of bugs in our implementation.

On Fri, Oct 31, 2014 at 1:37 AM, Erick Erickson <erickerick...@gmail.com> wrote:

I'm really confused:

bq: I am not issuing any queries, only writes (document inserts)
bq: It's clear that once the load test client has ~40 simulated users
bq: A cluster of 3 shards over 3 Solr nodes *should* support a higher QPS than 2 shards over 2 Solr nodes, right?

QPS is usually used to mean Queries Per Second, which is different from the statement "I am not issuing any queries." And what do the number of users have to do with inserting documents?

You also state: "In many cases, CPU on the solr servers is quite low as well."

So let's talk about indexing first. Indexing should scale nearly linearly as long as (1) you are routing your docs to the correct leader, which
Re: Ideas for debugging poor SolrCloud scalability
bq: but it should be more or less a constant factor no matter how many Solr nodes you are using, right?

Not really. You've stated that you're not driving Solr very hard in your tests; therefore you're waiting on I/O, and therefore your tests just aren't going to scale linearly with the number of shards. This is a simplification, but...

Your network utilization is pretty much irrelevant. I send a packet somewhere. "Somewhere" does some stuff and sends me back an acknowledgement. While I'm waiting, the network is getting no traffic. If the network traffic were in the 90% range, that would be different, so it's a good thing to monitor.

Really, use a leader-aware client and rack enough clients together that you're driving Solr hard. Then double the number of shards. Then rack enough _more_ clients together to drive Solr at the same level. In this case I'll go out on a limb and predict near-2x throughput increases.

One additional note, though: when you add _replicas_ to shards, expect to see a drop in throughput that may be quite significant, 20-40% anecdotally...

Best,
Erick

On Sat, Nov 1, 2014 at 9:23 AM, Shawn Heisey apa...@elyograg.org wrote:

The basic problem I see with your methodology is that you are sending an update request and waiting for it to complete before sending another. No matter how big the batches are, this is an inefficient use of resources. If you send many such requests at the same time, they will be handled in parallel. Lucene (and by extension, Solr) has the thread synchronization required to keep multiple simultaneous update requests from stomping on each other and corrupting the index. If you have enough CPU cores, such handling will *truly* be in parallel; otherwise the operating system will just take turns giving each thread CPU time. This results in a pretty good facsimile of parallel operation, but because it splits the available CPU resources, it isn't as fast as true parallel operation.

Thanks,
Shawn
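Shawn's point (many batched requests in flight at once, rather than one request at a time) can be sketched in Python; the list posters use Go and Java, but the pattern is the same. Everything here is illustrative: `send_batch`, `BATCH_SIZE`, and `NUM_WORKERS` are made-up names, and a real client would POST each batch to Solr's /update handler instead of the stub below.

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 1000   # Erick's suggested starting point for batch size
NUM_WORKERS = 8     # tune by measurement; more workers keep Solr busy during round trips

def send_batch(batch):
    """Stub standing in for an HTTP POST of `batch` docs to Solr's /update
    handler. Returns the number of docs sent, as a real client might."""
    return len(batch)

def index_all(docs):
    """Split docs into batches and send several batches concurrently, so the
    client is never idle waiting on a single request/response round trip."""
    batches = [docs[i:i + BATCH_SIZE] for i in range(0, len(docs), BATCH_SIZE)]
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        sent = list(pool.map(send_batch, batches))
    return sum(sent)

print(index_all([{"id": str(n)} for n in range(2500)]))  # 2500 docs in 3 batches
```

With one worker this degenerates into exactly the serial behavior Shawn is warning against; the throughput gain comes from raising `NUM_WORKERS` until the Solr nodes' CPUs are actually busy.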
Re: Ideas for debugging poor SolrCloud scalability
Hi Erick -

Thanks for the detailed response, and apologies for my confusing terminology. I should have said WPS (writes per second) instead of QPS, but I didn't want to introduce a weird new acronym since QPS is well known. Clearly a bad decision on my part. To clarify: I am doing *only* writes (document adds). Whenever I wrote QPS I was referring to writes.

It seems clear at this point that I should wrap up the code to do smart routing rather than choose Solr nodes randomly, and then see if that changes things. I must admit that although I understand that random node selection will impose a performance hit, theoretically it seems to me that the system should still scale up as you add more nodes (albeit at a lower absolute level of performance than if you used a smart router). Nonetheless, I'm just theorycrafting here, so the better thing to do is just try it experimentally. I hope to have that working today - will report back on my findings.

Cheers,
- Ian

p.s. To clarify why we are rolling our own smart router code: we use Go over here rather than Java. Although if we still get bad performance with our custom Go router, I may try a pure Java load client using CloudSolrServer to eliminate the possibility of bugs in our implementation.

On Fri, Oct 31, 2014 at 1:37 AM, Erick Erickson erickerick...@gmail.com wrote:

I'm really confused:

bq: I am not issuing any queries, only writes (document inserts)
bq: It's clear that once the load test client has ~40 simulated users
bq: A cluster of 3 shards over 3 Solr nodes *should* support a higher QPS than 2 shards over 2 Solr nodes, right

QPS is usually used to mean Queries Per Second, which is at odds with the statement "I am not issuing any queries." And what do the number of users have to do with inserting documents? You also state: "In many cases, CPU on the solr servers is quite low as well."

So let's talk about indexing first. Indexing should scale nearly linearly as long as 1) you are routing your docs to the correct leader, which happens with SolrJ and CloudSolrServer automatically (rather than rolling your own, I strongly suggest you try this out), and 2) you have enough clients feeding the cluster to push CPU utilization on them all. Very often slow indexing, or in your case lack of scaling, is a result of document acquisition; that is, your doc generator is spending all its time waiting for the individual documents to get to Solr and come back.

bq: chooses a random solr server for each ADD request (with 1 doc per add request)

Probably your culprit right there. Each and every document has to cross the network (and then be forwarded to the correct leader). So given that you're not seeing high CPU utilization, I suspect that you're not sending docs to SolrCloud fast enough to see scaling. You need to batch up multiple docs; I generally send 1,000 docs at a time.

But even if you do solve this, the inter-node routing will prevent linear scaling. When a doc (or a batch of docs) goes to a random Solr node, here's what happens:

1. the docs are re-packaged into groups based on which shard they're destined for
2. the sub-packets are forwarded to the leader for each shard
3. the responses are gathered back and returned to the client

This set of operations will eventually degrade the scaling.

bq: A cluster of 3 shards over 3 Solr nodes *should* support a higher QPS than 2 shards over 2 Solr nodes, right? That's the whole idea behind sharding.

If we're talking search requests, the answer is no. Sharding is what you do when your collection no longer fits on a single node. If it _does_ fit on a single node, then you'll usually get better query performance by adding a bunch of replicas to a single shard. When the number of docs on each shard grows large enough that you no longer get good query performance, _then_ you shard. And take the query hit.

If we're talking about inserts, then see above. I suspect your problem is that you're _not_ saturating the SolrCloud cluster; you're sending docs to Solr very inefficiently and waiting on I/O. Batching docs and sending them to the right leader should scale pretty linearly until you start saturating your network.

Best,
Erick

On Thu, Oct 30, 2014 at 6:56 PM, Ian Rose ianr...@fullstory.com wrote:

Thanks for the suggestions so far, all.

1) We are not using SolrJ on the client (not using Java at all), but I am working on writing a smart router so that we can always send to the correct node. I am certainly curious to see how that changes things. Nonetheless, even with the overhead of extra routing hops, the observed behavior (no increase in performance with more nodes) doesn't make any sense to me.

2) Commits: we are using autoCommit with openSearcher=false (maxTime=6) and autoSoftCommit (maxTime=15000).

3) Suggestions to batch documents certainly make sense for production code but
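The re-packaging step Erick describes (a mixed batch split into one sub-batch per destination shard, each forwarded to that shard's leader) can be sketched as below. This is only an illustration: SolrCloud's default compositeId router actually maps a MurmurHash3 of the document id onto per-shard hash ranges, and the crc32-modulo stand-in here exists just to keep the sketch self-contained.

```python
import zlib

def shard_for(doc_id, num_shards):
    # Stand-in for Solr's real routing (MurmurHash3 of the id mapped onto
    # per-shard hash ranges); crc32 modulo gives a stable toy assignment.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def group_by_shard(docs, num_shards):
    """Step 1 of the forwarding Erick lists: re-package a mixed batch into
    one sub-batch per destination shard, ready to send to each leader."""
    groups = {shard: [] for shard in range(num_shards)}
    for doc in docs:
        groups[shard_for(doc["id"], num_shards)].append(doc)
    return groups

groups = group_by_shard([{"id": str(n)} for n in range(10)], 3)
print(sum(len(batch) for batch in groups.values()))  # 10: every doc lands in exactly one group
```

A leader-aware client does this grouping itself and sends each sub-batch straight to the right leader, which is exactly the extra network hop that random node selection pays for on every request.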
Re: Ideas for debugging poor SolrCloud scalability
NP, just making sure.

I suspect you'll get lots more bang for the buck, and results much more closely matching your expectations, if:

1) you batch up a bunch of docs at once rather than sending them one at a time. That's probably the easiest thing to try. Sending docs one at a time is something of an anti-pattern; I usually start with batches of 1,000. And just to check: you're not issuing any commits from the client, right? Performance will be terrible if you issue commits after every doc; that's totally an anti-pattern, doubly so for optimizes. Since you showed us your solrconfig autocommit settings I'm assuming not, but want to be sure.

2) you use a leader-aware client. I'm totally unfamiliar with Go, so I have no suggestions whatsoever to offer there. But you'll want to batch in this case too.
Re: Ideas for debugging poor SolrCloud scalability
Regarding batch indexing: when I send batches of 1000 docs to a standalone Solr server, the log file reports "(1000 adds)" in LogUpdateProcessor. But when I send them to the leader of a replicated index, the leader log file reports much smaller numbers, usually "(12 adds)". Why do the batches appear to be broken up?

Peter
Re: Ideas for debugging poor SolrCloud scalability
Internally, the docs are batched up into smaller buckets (10 as I remember) and forwarded to the correct shard leader. I suspect that's what you're seeing.

Erick
Re: Ideas for debugging poor SolrCloud scalability
Yes, I was inadvertently sending them to a replica. When I sent them to the leader, the leader reported "(1000 adds)" and the replica reported only 1 add per document. So it looks like the leader forwards the batched jobs individually to the replicas.
Ideas for debugging poor SolrCloud scalability
Howdy all -

The short version is: we are not seeing SolrCloud performance scale (even close to) linearly as we add nodes. Can anyone suggest good diagnostics for finding scaling bottlenecks? Are there known 'gotchas' that make SolrCloud fail to scale?

In detail: we have used Solr (in non-Cloud mode) for over a year and are now beginning a transition to SolrCloud. To this end I have been running some basic load tests to figure out what kind of capacity we should expect to provision. In short, I am seeing very poor scalability (increase in effective QPS) as I add Solr nodes. I'm hoping to get some ideas on where I should be looking to debug this. Apologies in advance for the length of this email; I'm trying to be comprehensive and provide all relevant information.

Our setup:

1 load generating client
- generates tiny, fake documents with unique IDs
- performs only writes (no queries at all)
- chooses a random solr server for each ADD request (with 1 doc per add request)

N collections spread over K solr servers
- every collection is sharded K times (so every solr instance has 1 shard from every collection)
- no replicas
- external zookeeper server (not using zkRun)
- autoCommit maxTime=6
- autoSoftCommit maxTime=15000

Everything is running within a single zone on Google Compute Engine, so high quality gigabit network links between all machines (ping times under 1ms).

My methodology is as follows:
1. Start up K solr servers.
2. Remove all existing collections.
3. Create N collections, with numShards=K for each.
4. Start load testing. Every minute, print the number of successful updates and the number of failed updates.
5. Keep increasing the offered load (via simulated users) until the qps flatlines.

In brief (more detailed results at the bottom of this email), I find that for any number of nodes between 2 and 5, the QPS always caps out at ~3000. Obviously something must be wrong here, as there should be a trend of the QPS scaling (roughly) linearly with the number of nodes. Or at the very least going up at all! So my question is: what else should I be looking at here?

* CPU on the loadtest client is well under 100%
* No other obvious bottlenecks on the loadtest client (running 2 clients leads to ~1/2 qps on each)
* In many cases, CPU on the solr servers is quite low as well (e.g. with 100 users hitting 5 solr nodes, all nodes are over 50% idle)
* Network bandwidth is a few MB/s, well under the gigabit capacity of our network
* Disk bandwidth (under 2 MB/s) and iops (under 20/s) are low

Any ideas? Thanks very much!
- Ian

p.s. Here is my raw data, broken out by number of nodes and number of simulated users (columns: num nodes, num users, QPS):

1   1    1020
1   5    3180
1   10   3825
1   15   3900
1   20   4050
1   40   4100
2   1    472
2   5    1790
2   10   2290
2   15   2850
2   20   2900
2   40   3210
2   60   3200
2   80   3210
2   100  3180
3   1    385
3   5    1580
3   10   2090
3   15   2560
3   20   2760
3   25   2890
3   80   3050
4   1    375
4   5    1560
4   10   2200
4   15   2500
4   20   2700
4   25   2800
4   30   2850
5   15   2450
5   20   2640
5   25   2790
5   30   2840
5   100  2900
5   200  2810
Re: Ideas for debugging poor SolrCloud scalability
On 10/30/2014 2:23 PM, Ian Rose wrote:

My methodology is as follows:
1. Start up K solr servers.
2. Remove all existing collections.
3. Create N collections, with numShards=K for each.
4. Start load testing. Every minute, print the number of successful updates and the number of failed updates.
5. Keep increasing the offered load (via simulated users) until the qps flatlines.

If you want to increase QPS, you should not be increasing numShards. You need to increase replicationFactor. When your numShards matches the number of servers, every single server will be doing part of the work for every query. If you increase replicationFactor instead, then each server can be doing a different query in parallel.

Sharding the index is what you do when you need to scale the size of the index, so each server does not get overwhelmed by dealing with every document for every query. Getting a high QPS with a big index requires increasing both numShards *AND* replicationFactor.

Thanks,
Shawn
Re: Ideas for debugging poor SolrCloud scalability
If you want to increase QPS, you should not be increasing numShards. You need to increase replicationFactor. When your numShards matches the number of servers, every single server will be doing part of the work for every query.

I think this is true only for actual queries, right? I am not issuing any queries, only writes (document inserts). In the case of writes, increasing the number of shards should increase my throughput (in ops/sec) more or less linearly, right?
Re: Ideas for debugging poor SolrCloud scalability
If you are issuing writes to shard non-leaders, then there is a large overhead for the eventual redirect to the leader. I noticed a 3-5 times performance increase by making my write client leader aware. On Oct 30, 2014, at 2:56 PM, Ian Rose ianr...@fullstory.com wrote: If you want to increase QPS, you should not be increasing numShards. You need to increase replicationFactor. When your numShards matches the number of servers, every single server will be doing part of the work for every query. I think this is true only for actual queries, right? I am not issuing any queries, only writes (document inserts). In the case of writes, increasing the number of shards should increase my throughput (in ops/sec) more or less linearly, right? On Thu, Oct 30, 2014 at 4:50 PM, Shawn Heisey apa...@elyograg.org wrote: On 10/30/2014 2:23 PM, Ian Rose wrote: My methodology is as follows. 1. Start up a K solr servers. 2. Remove all existing collections. 3. Create N collections, with numShards=K for each. 4. Start load testing. Every minute, print the number of successful updates and the number of failed updates. 5. Keep increasing the offered load (via simulated users) until the qps flatlines. If you want to increase QPS, you should not be increasing numShards. You need to increase replicationFactor. When your numShards matches the number of servers, every single server will be doing part of the work for every query. If you increase replicationFactor instead, then each server can be doing a different query in parallel. Sharding the index is what you need to do when you need to scale the size of the index, so each server does not get overwhelmed by dealing with every document for every query. Getting a high QPS with a big index requires increasing both numShards *AND* replicationFactor. Thanks, Shawn
Re: Ideas for debugging poor SolrCloud scalability
On 10/30/2014 2:56 PM, Ian Rose wrote: I think this is true only for actual queries, right? I am not issuing any queries, only writes (document inserts). In the case of writes, increasing the number of shards should increase my throughput (in ops/sec) more or less linearly, right? No, that won't affect indexing speed all that much. The way to increase indexing speed is to increase the number of processes or threads that are indexing at the same time. Instead of having one client sending update requests, try five of them. Also, index many documents with each update request. Sending one document at a time is very inefficient. You didn't say how you're doing commits, but those need to be as infrequent as you can manage. Ideally, you would use autoCommit with openSearcher=false on an interval of about five minutes, and send an explicit commit (with the default openSearcher=true) after all the indexing is done. You may have requirements regarding document visibility that this won't satisfy, but try to avoid doing commits with openSearcher=true (soft commits qualify for this) extremely frequently, like once a second. Once a minute is much more realistic. Opening a new searcher is an expensive operation, especially if you have cache warming configured. Thanks, Shawn
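Shawn's batching advice can be sketched in a few lines. This is an illustrative client-side chunker, not SolrJ: it splits a document stream into batches and serializes each batch as the JSON array body a client would POST to Solr's /update handler. The batch size of 1,000 and the field names are assumptions for illustration only.

```python
import json

def build_update_batches(docs, batch_size=1000):
    """Chunk documents into batches and serialize each batch as one
    JSON update body, as would be POSTed to /solr/<collection>/update.
    One request per 1,000 docs instead of 1,000 round trips."""
    batches = []
    for i in range(0, len(docs), batch_size):
        chunk = docs[i:i + batch_size]
        batches.append(json.dumps(chunk))
    return batches

# 2,500 docs -> 3 requests (1000 + 1000 + 500) instead of 2,500.
docs = [{"id": str(n), "title_s": "doc %d" % n} for n in range(2500)]
batches = build_update_batches(docs, batch_size=1000)
print(len(batches))  # 3
```

Each element of `batches` would then be sent with Content-Type application/json; the point is simply that the per-request overhead (network round trip, request parsing) is amortized over many documents.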
Re: Ideas for debugging poor SolrCloud scalability
Your indexing client, if written in SolrJ, should use CloudSolrServer which is, in Matt's terms, leader aware. It divides up the documents to be indexed into packets where each doc in the packet belongs on the same shard, and then sends the packet to the shard leader. This avoids a lot of re-routing and should scale essentially linearly. You may have to add more clients though, depending upon how hard the document-generator is working. Also, make sure that you send batches of documents as Shawn suggests; I use 1,000 as a starting point. Best, Erick On Thu, Oct 30, 2014 at 2:10 PM, Shawn Heisey apa...@elyograg.org wrote: On 10/30/2014 2:56 PM, Ian Rose wrote: I think this is true only for actual queries, right? I am not issuing any queries, only writes (document inserts). In the case of writes, increasing the number of shards should increase my throughput (in ops/sec) more or less linearly, right? No, that won't affect indexing speed all that much. The way to increase indexing speed is to increase the number of processes or threads that are indexing at the same time. Instead of having one client sending update requests, try five of them. Also, index many documents with each update request. Sending one document at a time is very inefficient. You didn't say how you're doing commits, but those need to be as infrequent as you can manage. Ideally, you would use autoCommit with openSearcher=false on an interval of about five minutes, and send an explicit commit (with the default openSearcher=true) after all the indexing is done. You may have requirements regarding document visibility that this won't satisfy, but try to avoid doing commits with openSearcher=true (soft commits qualify for this) extremely frequently, like once a second. Once a minute is much more realistic. Opening a new searcher is an expensive operation, especially if you have cache warming configured. Thanks, Shawn
Re: Ideas for debugging poor SolrCloud scalability
Thanks for the suggestions so far, all. 1) We are not using SolrJ on the client (not using Java at all) but I am working on writing a smart router so that we can always send to the correct node. I am certainly curious to see how that changes things. Nonetheless even with the overhead of extra routing hops, the observed behavior (no increase in performance with more nodes) doesn't make any sense to me. 2) Commits: we are using autoCommit with openSearcher=false (maxTime=6) and autoSoftCommit (maxTime=15000). 3) Suggestions to batch documents certainly make sense for production code but in this case I am not really concerned with absolute performance; I just want to see the *relative* performance as we use more Solr nodes. So I don't think batching or not really matters. 4) "No, that won't affect indexing speed all that much. The way to increase indexing speed is to increase the number of processes or threads that are indexing at the same time. Instead of having one client sending update requests, try five of them." Can you elaborate on this some? I'm worried I might be misunderstanding something fundamental. A cluster of 3 shards over 3 Solr nodes *should* support a higher QPS than 2 shards over 2 Solr nodes, right? That's the whole idea behind sharding. Regarding your comment of "increase the number of processes or threads", note that for each value of K (number of Solr nodes) I measured with several different numbers of simulated users so that I could find a saturation point. For example, take a look at my data for K=2:

Num Nodes  Num Users  QPS
2          1          472
2          5          1790
2          10         2290
2          15         2850
2          20         2900
2          40         3210
2          60         3200
2          80         3210
2          100        3180

It's clear that once the load test client has ~40 simulated users, the Solr cluster is saturated. Creating more users just increases the average request latency, such that the total QPS remained (nearly) constant. So I feel pretty confident that a cluster of size 2 *maxes out* at ~3200 qps.
The problem is that I am finding roughly this same max point, no matter how many simulated users the load test client created, for any value of K (> 1). Cheers, - Ian On Thu, Oct 30, 2014 at 8:01 PM, Erick Erickson erickerick...@gmail.com wrote: Your indexing client, if written in SolrJ, should use CloudSolrServer which is, in Matt's terms leader aware. It divides up the documents to be indexed into packets that where each doc in the packet belongs on the same shard, and then sends the packet to the shard leader. This avoids a lot of re-routing and should scale essentially linearly. You may have to add more clients though, depending upon who hard the document-generator is working. Also, make sure that you send batches of documents as Shawn suggests, I use 1,000 as a starting point. Best, Erick On Thu, Oct 30, 2014 at 2:10 PM, Shawn Heisey apa...@elyograg.org wrote: On 10/30/2014 2:56 PM, Ian Rose wrote: I think this is true only for actual queries, right? I am not issuing any queries, only writes (document inserts). In the case of writes, increasing the number of shards should increase my throughput (in ops/sec) more or less linearly, right? No, that won't affect indexing speed all that much. The way to increase indexing speed is to increase the number of processes or threads that are indexing at the same time. Instead of having one client sending update requests, try five of them. Also, index many documents with each update request. Sending one document at a time is very inefficient. You didn't say how you're doing commits, but those need to be as infrequent as you can manage. Ideally, you would use autoCommit with openSearcher=false on an interval of about five minutes, and send an explicit commit (with the default openSearcher=true) after all the indexing is done.
You may have requirements regarding document visibility that this won't satisfy, but try to avoid doing commits with openSearcher=true (soft commits qualify for this) extremely frequently, like once a second. Once a minute is much more realistic. Opening a new searcher is an expensive operation, especially if you have cache warming configured. Thanks, Shawn
Re: Ideas for debugging poor SolrCloud scalability
I'm really confused:
bq: I am not issuing any queries, only writes (document inserts)
bq: It's clear that once the load test client has ~40 simulated users
bq: A cluster of 3 shards over 3 Solr nodes *should* support a higher QPS than 2 shards over 2 Solr nodes, right
QPS is usually used to mean Queries Per Second, which is different from the statement that I am not issuing any queries. And what does the number of users have to do with inserting documents? You also state: In many cases, CPU on the Solr servers is quite low as well. So let's talk about indexing first. Indexing should scale nearly linearly as long as (1) you are routing your docs to the correct leader, which happens with SolrJ and CloudSolrServer automatically. Rather than rolling your own, I strongly suggest you try this out. And (2) you have enough clients feeding the cluster to push CPU utilization on them all. Very often slow indexing, or in your case lack of scaling, is a result of document acquisition: your doc generator is spending all its time waiting for the individual documents to get to Solr and come back.
bq: chooses a random solr server for each ADD request (with 1 doc per add request)
Probably your culprit right there. Each and every document requires that you cross the network (and forward that doc to the correct leader). So given that you're not seeing high CPU utilization, I suspect that you're not sending enough docs to SolrCloud fast enough to see scaling. You need to batch up multiple docs; I generally send 1,000 docs at a time. But even if you do solve this, the inter-node routing will prevent linear scaling. When a doc (or a batch of docs) goes to a random Solr node, here's what happens: (1) the docs are re-packaged into groups based on which shard they're destined for, (2) the sub-packets are forwarded to the leader for each shard, and (3) the responses are gathered back and returned to the client. This set of operations will eventually degrade the scaling.
bq: A cluster of 3 shards over 3 Solr nodes *should* support a higher QPS than 2 shards over 2 Solr nodes, right? That's the whole idea behind sharding. If we're talking search requests, the answer is no. Sharding is what you do when your collection no longer fits on a single node. If it _does_ fit on a single node, then you'll usually get better query performance by adding a bunch of replicas to a single shard. When the number of docs on each shard grows large enough that you no longer get good query performance, _then_ you shard. And take the query hit. If we're talking about inserts, then see above. I suspect your problem is that you're _not_ saturating the SolrCloud cluster, you're sending docs to Solr very inefficiently and waiting on I/O. Batching docs and sending them to the right leader should scale pretty linearly until you start saturating your network. Best, Erick On Thu, Oct 30, 2014 at 6:56 PM, Ian Rose ianr...@fullstory.com wrote: Thanks for the suggestions so for, all. 1) We are not using SolrJ on the client (not using Java at all) but I am working on writing a smart router so that we can always send to the correct node. I am certainly curious to see how that changes things. Nonetheless even with the overhead of extra routing hops, the observed behavior (no increase in performance with more nodes) doesn't make any sense to me. 2) Commits: we are using autoCommit with openSearcher=false (maxTime=6) and autoSoftCommit (maxTime=15000). 3) Suggestions to batch documents certainly make sense for production code but in this case I am not real concerned with absolute performance; I just want to see the *relative* performance as we use more Solr nodes. So I don't think batching or not really matters. 4) No, that won't affect indexing speed all that much. The way to increase indexing speed is to increase the number of processes or threads that are indexing at the same time. Instead of having one client sending update requests, try five of them. 
Can you elaborate on this some? I'm worried I might be misunderstanding something fundamental. A cluster of 3 shards over 3 Solr nodes *should* support a higher QPS than 2 shards over 2 Solr nodes, right? That's the whole idea behind sharding. Regarding your comment of increase the number of processes or threads, note that for each value of K (number of Solr nodes) I measured with several different numbers of simulated users so that I could find a saturation point. For example, take a look at my data for K=2:

Num Nodes  Num Users  QPS
2          1          472
2          5          1790
2          10         2290
2          15         2850
2          20         2900
2          40         3210
2          60         3200
2          80         3210
2          100        3180

It's clear that once the load test client has ~40 simulated users, the Solr cluster is saturated. Creating more users just increases the average request latency, such that the total QPS remained (nearly) constant. So I feel pretty confident that a cluster of size 2 *maxes out* at ~3200 qps. The problem is that I am finding roughly this same max
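Erick's steps (1)-(3) — regroup by destination shard, forward to each leader, gather responses — can be done client-side instead, which is what a "leader aware" client amounts to. A minimal sketch follows. Note the hedge: Solr's default compositeId router actually hashes the document id with MurmurHash3 and maps the hash into per-shard hash ranges; the crc32-modulo below is only a stand-in for that routing, and the 3-shard layout is an assumption for illustration.

```python
from collections import defaultdict
import zlib

NUM_SHARDS = 3  # assumed cluster layout, for illustration only

def shard_for(doc_id, num_shards=NUM_SHARDS):
    # Stand-in for Solr's MurmurHash3 hash-range routing; real clients
    # should read the cluster state and use the actual hash ranges.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def group_by_shard(docs):
    """Pre-group docs so each batch can be sent straight to its shard
    leader, skipping the re-package-and-forward hop Erick describes."""
    groups = defaultdict(list)
    for doc in docs:
        groups[shard_for(doc["id"])].append(doc)
    return dict(groups)

docs = [{"id": "doc-%d" % n} for n in range(30)]
groups = group_by_shard(docs)
print(sum(len(batch) for batch in groups.values()))  # 30
```

Each group would then be POSTed (in bulk) to the leader of its shard; combined with batching, this is why Matt saw a 3-5x improvement from a leader-aware writer.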
Spelling suggestions--any ideas?
Correctly spelled words are returning as not spelled correctly, with the original, correctly spelled word with a single oddball character appended as multiple suggestions... -- Ed Smiley, Senior Software Architect, eBooks ProQuest | 161 E Evelyn Ave | Mountain View, CA 94041 | USA | +1 650 475 8700 extension 3772 ed.smi...@proquest.com www.proquest.com | www.ebrary.com | www.eblib.com ebrary and EBL, ProQuest businesses.
Need ideas to perform historical search
I am trying to implement historical search using SOLR. Ex: If I search on address 800 5th Ave and provide a time range, it should list the name of the person who was living at the address during the time period. I am trying to figure out a way to store the data without redundancy. I can do a join in the database to return all the names of people who were living at a particular address during a particular time, but I know it's difficult to do that in SOLR and SOLR is not a database (it works best when the data is denormalized). Is there any other way / idea by which I can reduce the redundancy of creating multiple records for a particular person again and again? -- View this message in context: http://lucene.472066.n3.nabble.com/Need-ideas-to-perform-historical-search-tp4078980.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Need ideas to perform historical search
Why do you care about redundancy? That's the search engine's architectural tradeoff (as far as I understand). And, the tokens are all normalized under the covers, so it does not take as much space as you expect. Specifically regarding your issue, maybe you should store 'occupancy' as the record. That's similar to what they do at Gilt: http://www.slideshare.net/trenaman/personalized-search-on-the-largest-flash-sale-site-in-america(slide 36+) The other option is to use location as spans with some clever queries: http://wiki.apache.org/solr/SpatialForTimeDurations (follow the links). Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Thu, Jul 18, 2013 at 5:58 PM, SolrLover bbar...@gmail.com wrote: I am trying to implement Historical search using SOLR. Ex: If I search on address 800 5th Ave and provide a time range, it should list the name of the person who was living at the address during the time period. I am trying to figure out a way to store the data without redundancy. I can do a join in the database to return all the names who were living in a particular address during a particular time but I know it's difficult to do that in SOLR and SOLR is not a database (it works best when the data is denormalized).,.. Is there any other way / idea by which I can reduce the redundancy of creating multiple records for a particular person again and again? -- View this message in context: http://lucene.472066.n3.nabble.com/Need-ideas-to-perform-historical-search-tp4078980.html Sent from the Solr - User mailing list archive at Nabble.com.
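Alexandre's suggestion to make "occupancy" the record can be sketched as follows: one denormalized document per (person, address, time span), then an overlap filter at query time. The field names and dates are made up for illustration; in Solr the equivalent would be a filter query with date-range syntax over the two date fields.

```python
from datetime import date

# One denormalized "occupancy" doc per person/address/time-span.
# Illustrative data and field names, not a required schema.
occupancies = [
    {"person": "Alice", "address": "800 5th Ave",
     "moved_in": date(1948, 3, 1), "moved_out": date(1951, 6, 30)},
    {"person": "Bob", "address": "800 5th Ave",
     "moved_in": date(1951, 7, 1), "moved_out": date(1960, 1, 15)},
    {"person": "Carol", "address": "12 Elm St",
     "moved_in": date(1950, 1, 1), "moved_out": date(1955, 1, 1)},
]

def residents(address, start, end):
    """Who lived at `address` during [start, end]? Two spans overlap
    exactly when each one starts before the other ends."""
    return sorted(o["person"] for o in occupancies
                  if o["address"] == address
                  and o["moved_in"] <= end
                  and o["moved_out"] >= start)

print(residents("800 5th Ave", date(1950, 1, 1), date(1952, 12, 31)))
# ['Alice', 'Bob']
```

A person who lived at several addresses over the years yields several small occupancy docs, but each doc is tiny and the shared tokens are deduplicated in the inverted index, which is why the redundancy costs less than it appears.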
ScorerDocQueue.java's downHeap showing up as frequent hotspot in profiling - ideas why?
Greetings, In a recent batch of Solr 3.6.1 slow response time queries, the profiler highlighted downHeap (line 212) in ScorerDocQueue.java as averaging more than 60ms across the 16 calls I was looking at, and showing it spiking up over 100ms - which, after looking at the code (two int comparisons?!?), I am at a loss to explain. Here's the source: https://github.com/apache/lucene-solr/blob/6b8783bfa59351878c59e47deaa7739d95150a22/lucene/core/src/java/org/apache/lucene/util/ScorerDocQueue.java#L212 Here's the invocation trace of one of the many similar:
---snip---
Thread.run:722 (0ms self time, 416 ms total time)
QueuedThreadPool$3.run:526 (0ms self time, 416 ms total time)
QueuedThreadPool.runJob:595 (0ms self time, 416 ms total time)
ExecutorCallback$ExecutorCallbackInvoker.run:130 (0ms self time, 416 ms total time)
ExecutorCallback$ExecutorCallbackInvoker.call:124 (0ms self time, 416 ms total time)
AbstractConnection$1.onCompleted:63 (0ms self time, 416 ms total time)
AbstractConnection$1.onCompleted:71 (0ms self time, 416 ms total time)
HttpConnection.onFillable:253 (0ms self time, 416 ms total time)
HttpChannel.run:246 (0ms self time, 416 ms total time)
Server.handle:403 (0ms self time, 416 ms total time)
HandlerWrapper.handle:97 (0ms self time, 416 ms total time)
IPAccessHandler.handle:204 (0ms self time, 416 ms total time)
HandlerCollection.handle:110 (0ms self time, 416 ms total time)
ContextHandlerCollection.handle:258 (0ms self time, 416 ms total time)
ScopedHandler.handle:136 (0ms self time, 416 ms total time)
ContextHandler.doScope:973 (0ms self time, 416 ms total time)
SessionHandler.doScope:174 (0ms self time, 416 ms total time)
ServletHandler.doScope:358 (0ms self time, 416 ms total time)
ContextHandler.doHandle:1044 (0ms self time, 416 ms total time)
SessionHandler.doHandle:213 (0ms self time, 416 ms total time)
SecurityHandler.handle:540 (0ms self time, 416 ms total time)
ScopedHandler.handle:138 (0ms self time, 416 ms total time)
ServletHandler.doHandle:429 (0ms self time, 416 ms total time)
ServletHandler$CachedChain.doFilter:1274 (0ms self time, 416 ms total time)
SolrDispatchFilter.doFilter:260 (0ms self time, 416 ms total time)
SolrDispatchFilter.execute:365 (0ms self time, 416 ms total time)
SolrCore.execute:1376 (0ms self time, 416 ms total time)
RequestHandlerBase.handleRequest:129 (0ms self time, 416 ms total time)
SearchHandler.handleRequestBody:186 (0ms self time, 416 ms total time)
QueryComponent.process:394 (0ms self time, 416 ms total time)
SolrIndexSearcher.search:375 (0ms self time, 416 ms total time)
SolrIndexSearcher.getDocListC:1176 (0ms self time, 416 ms total time)
SolrIndexSearcher.getDocListNC:1296 (0ms self time, 416 ms total time)
IndexSearcher.search:364 (0ms self time, 416 ms total time)
IndexSearcher.search:581 (0ms self time, 416 ms total time)
FilteredQuery$2.score:169 (0ms self time, 416 ms total time)
BooleanScorer2.advance:320 (0ms self time, 416 ms total time)
ReqExclScorer.advance:112 (0ms self time, 416 ms total time)
DisjunctionSumScorer.advance:229 (52ms self time, 416 ms total time)
DisjunctionSumScorer.advanceAfterCurrent:171 (0ms self time, 308 ms total time)
ScorerDocQueue.topNextAndAdjustElsePop:120 (0ms self time, 308 ms total time)
ScorerDocQueue.checkAdjustElsePop:135 (0ms self time, 111 ms total time)
ScorerDocQueue.downHeap:212 (111ms self time, 111 ms total time)
---snip---
Any ideas on what is causing this seemingly inordinate amount of time in downHeap? Is this symptomatic of anything in particular? Thanks, as always! Aaron
Any ideas on Solr 4.0 Release.
Hi, Congratulations on the Alpha release. I am wondering: is there a ballpark date for the final 4.0 release? Is it expected in the August or September time frame, or is it further away? We badly need some features included in this release. These are around grouped facet counts. We have limited use for Solr in our current release. In our next release, we will add more features (full text searching, location based searches, etc.). I am wondering if the facet and group counts side of things is stable in Alpha or not? I have tested with the nightly builds before and it works fine for our scenarios. Thanks. Regards, Sohail
RE: Any ideas on Solr 4.0 Release.
Hi Sohail, Some of your questions are answered here: http://wiki.apache.org/solr/Solr4.0. See Chris Hostetter's blog post for more info, particularly on questions around stability: http://www.lucidimagination.com/blog/2012/07/03/4-0-alpha-whats-in-a-name/. Steve -Original Message- From: Sohail Aboobaker [mailto:sabooba...@gmail.com] Sent: Thursday, July 05, 2012 5:22 AM To: solr-user@lucene.apache.org Subject: Any ideas on Solr 4.0 Release. Hi, Congratulations on Alpha release. I am wondering is there a ball park on final release for 4.0? Is it expected in August or Sep time frame or is it further away? We badly need some features included in this release. These are around grouped facet counts. We have limited use for Solr in our current release. In next release, we will add more features (full text searching, location based searches etc.). I am wondering if the facet and group counts side of things is stable in Alpha or not? I have tested with the nightly builds before and it works fine for our scenarios. Thanks. Regards, Sohail
Re: Strange spikes in query response times...any ideas where else to look?
Otis, Thanks for the response. We'll check out that tool and see how it goes. Regarding JMeter...you are exactly correct in that I was assuming 1 thread = 1 query per second. I thought we had set up some sort of throttling mechanism to ensure that...and clearly I was mistaken. By the math we are getting A LOT more qps...and in a preliminary look those spikes look like they just might be correlated to high qps. We are pursuing this line and my gut tells me this *is* the problem. Thanks for the info on the tool (which we will look at) and for the heads-up on the qps. Peter Lee ProQuest Quoting Otis Gospodnetic otis_gospodne...@yahoo.com: Peter, These could be the JVM, or index reopening and warmup queries. Grab SPM for Solr - http://sematext.com/spm - in 24-48h we'll release an agent that tracks and graphs errors and timings of each Solr search component, which may reveal interesting stuff. In the meantime, look at the graph with IO as well as the graph with caches. That's where I'd first look for signs. Re the users/threads question - if I understand correctly, this is the problem: JMeter is set up to run 15 threads from a single test machine...but I noticed that the JMeter report is showing close to 47 queries per second. It sounds like you're equating # of threads to QPS, which isn't right. Imagine you had 10 threads and each query took 0.1 seconds (processed by a single CPU core) and the server had 10 CPU cores. That would mean that a single thread could run 10 queries per second utilizing just 1 CPU core. And 10 threads would utilize all 10 CPU cores and would give you 10x higher throughput - 10x10=100 QPS. So if you need to simulate just 2-5 QPS, just lower the number of threads. What that number should be depends on query complexity and hw resources (cores or IO).
Otis Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm From: s...@isshomefront.com s...@isshomefront.com To: solr-user@lucene.apache.org Sent: Thursday, June 28, 2012 9:20 PM Subject: RE: Strange spikes in query response times...any ideas where else to look? Michael, Thank you for responding...and for the excellent questions. 1) We have never seen this response time spike with a user-interactive search. However, in the span of about 40 minutes, which included about 82,000 queries, we only saw a handful of near-equally distributed spikes. We have tried sending queries from the admin tool while the test was running, but given those odds, I'm not surprised we've never hit on one of those few spikes we are seeing in the test results. 2) Good point and I should have mentioned this. We are using multiple methods to track these response times. a) Looking at the catalina.out file and plotting the response times recorded there (I think this is logging the QTime as seen by Solr). b) Looking at what JMeter is reporting as response times. In general, these are very close if not identical to what is being seen in the Catalina.out file. I have not run a line-by-line comparison, but putting the query response graphs next to each other shows them to be nearly (or possibly exactly) the same. Nothing looked out of the ordinary. 3) We are using multiple threads. Before your email I was looking at the results, doing some math, and double checking the reports from JMeter. I did notice that our throughput is much higher than we meant for it to be. JMeter is set up to run 15 threads from a single test machine...but I noticed that the JMeter report is showing close to 47 queries per second. We are only targeting TWO to FIVE queries per second. This is up next on our list of things to look at and how to control more effectively. 
We do have three separate machines set up for JMeter testing and we are investigating to see if perhaps all three of these machines are inadvertently being launched during the test at one time and overwhelming the server. This *might* be one facet of the problem. Agreed on that. Even as we investigate this last item regarding the number of users/threads, I wouldn't mind any other thoughts you or anyone else had to offer. We are checking on this user/threads issue and for the sake of anyone else who finds this discussion useful I'll note what we find. Thanks again. Peter S. Lee ProQuest Quoting Michael Ryan mr...@moreover.com: A few questions... 1) Do you only see these spikes when running JMeter? I.e., do you ever see a spike when you manually run a query? 2) How are you measuring the response time? In my experience there are three different ways to measure query speed. Usually all of them will be approximately equal, but in some situations they can be quite different, and this difference can be a clue as to where the bottleneck is: 1) The response time as seen by the end user (in this case
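Otis's threads-vs-QPS arithmetic is just Little's law: for a closed-loop load generator, throughput ≈ concurrency / average latency. A quick sanity check on the numbers in this thread (the 0.32 s latency is a back-of-envelope inference from 15 threads and ~47 QPS, not a measured figure):

```python
def expected_qps(threads, avg_latency_s):
    """Little's law for a closed-loop tester: each thread completes
    1/latency requests per second, so throughput scales with the
    thread count until some server resource saturates."""
    return threads / avg_latency_s

# Otis's example: 10 threads, 0.1 s per query, 10 cores -> 100 QPS.
print(expected_qps(10, 0.1))          # 100.0
# Peter's setup: 15 JMeter threads at an assumed ~0.32 s average
# latency yields roughly the ~47 QPS he observed, not the intended 2-5.
print(round(expected_qps(15, 0.32)))  # 47
```

To pin throughput to a target rate rather than let it float with latency, JMeter's Constant Throughput Timer is the usual mechanism, rather than the thread count alone.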
Strange spikes in query response times...any ideas where else to look?
Greetings all, We are working on building up a large Solr index for over 300 million records...and this is our first look at Solr. We are currently running a set of unique search queries against a single server (so no replication, no indexing going on at the same time, and no distributed search) with a set number of records (in our case, 10 million records in the index) for about 30 minutes, with nearly all of our searches being unique (I say nearly because our set of queries is unique, but I have not yet confirmed that JMeter is selecting these queries with no replacement). We are striving for a 2 second response time on the average, and indeed we are pretty darned close. In fact, if you look at the average response time, we are well under the 2 seconds per query. Unfortunately, we are seeing that about once every 6 minutes or so (and it is not a regular event...exactly six minutes apart...it is about six minutes but it fluctuates) we get a single query that returns in something like 15 to 20 seconds. We have been trying to identify what is causing this spike every so often and we are completely baffled. What we have done thus far:
1) Looked through the SAR logs and have not seen anything that correlates to this issue
2) Tracked the JVM statistics...especially the garbage collections...no correlations there either
3) Examined the queries...no pattern obvious there
4) Played with the JVM memory settings (heap settings, cache settings, and any other settings we could find)
5) Changed hardware: a brand new 4 processor, 8 gig RAM server with a fresh install of Redhat 5.7 enterprise, a large instance of AWS EC2, and a fresh instance of a VMWare based virtual machine from our own data center...and still nothing is giving us a clue as to what is causing these spikes
6) No correlation found between the number of hits returned and the spikes
Our data is very simple and so are the queries.
The schema consists of 40 fields, most of which are string fields, 2 of which are location fields, and a small handful of which are integer fields. All fields are indexed and all fields are stored. Our queries are also rather simple. Many of the queries are a simple one-field search. The most complex query we have is a 3-field search. Again, no correlation has been established between the query and these spikes. Also, about 60% of our queries return zero hits (on the assumption that we want to make solr search its entire index every so often. 60% is more than we intended and we will fix that soon...but that is what is currently happening. Again, no correlation found between spikes and 0-hit returned queries). For some time we were testing with 100 million records in the index and the aggregate data looked quite good. Most queries were returning in under 2 seconds. Unfortunately, it was when we looked at the individual data points that we found spikes every 6-8 minutes or so hitting sometimes as high as 150 seconds! We have been testing with 100 million records in the index, 50 million records in the index, 25 million, 20 million, 15 million, and 10 million records. As I indicated at the start, we are now at 10 million records with 15-20 second spikes. As we have decreased the number of records in the index, the size (but not the frequency) of the spikes has been dropping. My question is: Is this type of behavior normal for Solr when it is being overstressed? I've read of lots of people with far more complicated schemas running MORE than 10 million records in an index and never once complained about these spikes. Since I am new at this, I am not sure what Solr's failure mode looks like when it has too many records to search. I am hoping someone looking at this note can at least give me another direction to look.
10 million records searched in less than 2 seconds most of the time is great...but those 10- and 20-second spikes are not going to go over well with our customers...and I somehow think there is more we should be able to do here. Thanks. Peter S. Lee ProQuest
RE: Strange spikes in query response times...any ideas where else to look?
A few questions... 1) Do you only see these spikes when running JMeter? I.e., do you ever see a spike when you manually run a query? 2) How are you measuring the response time? In my experience there are three different ways to measure query speed. Usually all of them will be approximately equal, but in some situations they can be quite different, and this difference can be a clue as to where the bottleneck is: a) The response time as seen by the end user (in this case, JMeter) b) The response time as seen by the container (for example, in Jetty you can get this by enabling logLatency in jetty.xml) c) The QTime as returned in the Solr response 3) Are you running multiple queries concurrently, or are you just using a single thread in JMeter? -Michael -Original Message- From: s...@isshomefront.com [mailto:s...@isshomefront.com] Sent: Thursday, June 28, 2012 7:56 PM To: solr-user@lucene.apache.org Subject: Strange spikes in query response times...any ideas where else to look? Greetings all, We are working on building up a large Solr index for over 300 million records...and this is our first look at Solr. We are currently running a set of unique search queries against a single server (so no replication, no indexing going on at the same time, and no distributed search) with a set number of records (in our case, 10 million records in the index) for about 30 minutes, with nearly all of our searches being unique (I say nearly because our set of queries is unique, but I have not yet confirmed that JMeter is selecting these queries with no replacement). We are striving for a 2 second response time on the average, and indeed we are pretty darned close. In fact, if you look at the average responses time, we are well under the 2 seconds per query.
Unfortunately, we are seeing that about once every 6 minutes or so (it is not a regular event, exactly six minutes apart; it is about six minutes but it fluctuates) we get a single query that returns in something like 15 to 20 seconds. We have been trying to identify what is causing this spike every so often and we are completely baffled. What we have done thus far:

1) Looked through the SAR logs and have not seen anything that correlates to this issue
2) Tracked the JVM statistics...especially the garbage collections...no correlations there either
3) Examined the queries...no pattern obvious there
4) Played with the JVM memory settings (heap settings, cache settings, and any other settings we could find)
5) Changed hardware: a brand new 4 processor, 8 GB RAM server with a fresh install of Redhat 5.7 Enterprise; a large instance of AWS EC2; a fresh instance of a VMware-based virtual machine from our own data center... and still nothing is giving us a clue as to what is causing these spikes
6) No correlation found between the number of hits returned and the spikes

Our data is very simple and so are the queries. The schema consists of 40 fields, most of which are string fields, 2 of which are location fields, and a small handful of which are integer fields. All fields are indexed and all fields are stored. Our queries are also rather simple. Many of the queries are a simple one-field search. The most complex query we have is a 3-field search. Again, no correlation has been established between the query and these spikes. Also, about 60% of our queries return zero hits (on the assumption that we want to make Solr search its entire index every so often; 60% is more than we intended and we will fix that soon, but that is what is currently happening; again, no correlation found between spikes and 0-hit queries). For some time we were testing with 100 million records in the index and the aggregate data looked quite good.
Most queries were returning in under 2 seconds. Unfortunately, it was when we looked at the individual data points that we found spikes every 6-8 minutes or so hitting sometimes as high as 150 seconds! We have been testing with 100 million records in the index, 50 million, 25 million, 20 million, 15 million, and 10 million records. As I indicated at the start, we are now at 10 million records with 15-20 second spikes. As we have decreased the number of records in the index, the size (but not the frequency) of the spikes has been dropping. My question is: Is this type of behavior normal for Solr when it is being overstressed? I've read of lots of people with far more complicated schemas running MORE than 10 million records in an index and never once complaining about these spikes. Since I am new at this, I am not sure what Solr's failure mode looks like when it has too many records to search. I am hoping someone looking at this note can at least give me another direction to look. 10 million records searched in less than 2 seconds most of the time is great...but those 10 and 20 seconds
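The thread distinguishes client-observed latency from Solr's reported QTime. As a minimal sketch (Python, using a canned response and a hypothetical client-side timing; only the standard `responseHeader`/`QTime` shape of a Solr JSON response is assumed), comparing the two isolates overhead that lies outside Solr's search internals:

```python
import json

# A canned Solr JSON response; QTime is reported in milliseconds.
raw = '{"responseHeader": {"status": 0, "QTime": 42}, "response": {"numFound": 3, "docs": []}}'

# Hypothetical wall-clock time measured by the client (e.g. JMeter) around the HTTP call.
client_elapsed_ms = 55.0

qtime_ms = json.loads(raw)["responseHeader"]["QTime"]
overhead_ms = client_elapsed_ms - qtime_ms

# A large overhead points outside Solr's search internals:
# network, container queueing, or response serialization.
print(f"QTime={qtime_ms}ms, client={client_elapsed_ms}ms, overhead={overhead_ms}ms")
```

If the spikes show up in client latency but not in QTime, the container or network is the suspect; if QTime itself spikes, the cause is inside Solr.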
RE: Strange spikes in query response times...any ideas where else to look?
Michael,

Thank you for responding...and for the excellent questions.

1) We have never seen this response time spike with a user-interactive search. However, in the span of about 40 minutes, which included about 82,000 queries, we only saw a handful of near-equally distributed spikes. We have tried sending queries from the admin tool while the test was running, but given those odds, I'm not surprised we've never hit on one of those few spikes we are seeing in the test results.

2) Good point and I should have mentioned this. We are using multiple methods to track these response times. a) Looking at the catalina.out file and plotting the response times recorded there (I think this is logging the QTime as seen by Solr). b) Looking at what JMeter is reporting as response times. In general, these are very close if not identical to what is being seen in catalina.out. I have not run a line-by-line comparison, but putting the query response graphs next to each other shows them to be nearly (or possibly exactly) the same. Nothing looked out of the ordinary.

3) We are using multiple threads. Before your email I was looking at the results, doing some math, and double checking the reports from JMeter. I did notice that our throughput is much higher than we meant for it to be. JMeter is set up to run 15 threads from a single test machine...but I noticed that the JMeter report is showing close to 47 queries per second. We are only targeting TWO to FIVE queries per second. This is up next on our list of things to look at and how to control more effectively. We do have three separate machines set up for JMeter testing and we are investigating to see if perhaps all three of these machines are inadvertently being launched during the test at one time and overwhelming the server. This *might* be one facet of the problem. Agreed on that.
Even as we investigate this last item regarding the number of users/threads, I wouldn't mind any other thoughts you or anyone else had to offer. We are checking on this user/threads issue, and for the sake of anyone else who finds this discussion useful I'll note what we find.

Thanks again.

Peter S. Lee
ProQuest
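Peter's approach of plotting per-query response times to hunt for spikes can be automated. A small sketch, assuming hypothetical latency samples rather than a real catalina.out parse, flags outliers against the median so their timestamps can then be correlated with GC or merge events:

```python
import statistics

# Hypothetical per-query latencies in ms, as might be parsed from QTime
# entries in catalina.out or from a JMeter results file.
latencies = [180, 210, 195, 220, 15000, 205, 190, 18500, 200]

median = statistics.median(latencies)
threshold = 5 * median  # arbitrary cutoff: anything over 5x the median is a spike

# Collect (index, latency) pairs for the spiking queries.
spikes = [(i, ms) for i, ms in enumerate(latencies) if ms > threshold]
print(f"median={median}ms, threshold={threshold}ms, spikes={spikes}")
```

The median is more robust than the mean here, since a handful of 15-second outliers barely moves it while it would drag the mean upward.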
Re: Strange spikes in query response times...any ideas where else to look?
Peter,

These could be JVM pauses, or it could be index reopening and warmup queries, or... Grab SPM for Solr - http://sematext.com/spm - in 24-48h we'll release an agent that tracks and graphs errors and timings of each Solr search component, which may reveal interesting stuff. In the meantime, look at the graph with IO as well as the graph with caches. That's where I'd first look for signs.

Re the users/threads question - if I understand correctly, this is the problem: "JMeter is set up to run 15 threads from a single test machine...but I noticed that the JMeter report is showing close to 47 queries per second." It sounds like you're equating # of threads to QPS, which isn't right. Imagine each query took 0.1 seconds (processed by a single CPU core) and the server had 10 CPU cores. That would mean 1 thread could run 10 queries per second utilizing just 1 CPU core. And 10 threads would utilize all 10 CPU cores and give you 10x higher throughput - 10x10=100 QPS. So if you need to simulate just 2-5 QPS, just lower the number of threads. What that number should be depends on query complexity and hw resources (cores or IO).

Otis
Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm

From: s...@isshomefront.com
To: solr-user@lucene.apache.org
Sent: Thursday, June 28, 2012 9:20 PM
Subject: RE: Strange spikes in query response times...any ideas where else to look?
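Otis's thread/QPS relationship is Little's law: concurrency = throughput × latency. A quick sketch with hypothetical numbers shows why 15 closed-loop JMeter threads can produce roughly 47 QPS, and how to size the thread count for the 2-5 QPS target:

```python
def qps_from_threads(threads: int, avg_latency_s: float) -> float:
    # Each closed-loop thread completes 1/latency queries per second.
    return threads / avg_latency_s

def threads_for_target_qps(target_qps: float, avg_latency_s: float) -> float:
    # Little's law: concurrency = throughput * latency.
    return target_qps * avg_latency_s

# 15 threads at a hypothetical ~0.32 s average latency explains ~47 QPS:
print(qps_from_threads(15, 0.32))
# Holding ~5 QPS at that latency needs only about 2 threads
# (or 15 threads plus a throughput-limiting timer in JMeter):
print(threads_for_target_qps(5, 0.32))
```

The 0.32 s latency is an assumed figure chosen to match the observed throughput; with real measurements the same arithmetic tells you how many threads to run.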
RE: ideas for indexing large amount of pdf docs
Hi Jay, thanks. Great idea; in the next few days we'll try to do something like what you've described.

Best,
Rode.

---
Rode González
Libnova, SL
Paseo de la Castellana, 153 - Madrid
[t] 91 449 08 94 [f] 91 141 21 21
www.libnova.es

-----Original Message-----
From: Jaeger, Jay - DOT [mailto:jay.jae...@dot.wi.gov]
Sent: Monday, August 15, 2011 14:54
To: solr-user@lucene.apache.org
Subject: RE: ideas for indexing large amount of pdf docs
RE: ideas for indexing large amount of pdf docs
Note on i: Solr replication provides pretty good clustering support out-of-the-box, including replication of multiple cores. Read the Wiki on replication (Google +solr +replication if you don't know where it is).

In my experience, the problem with indexing PDFs is that it takes a lot of CPU on the document parsing side (client), not on the Solr server side. So make sure you do that part on the client and not the server.

Avoiding iii: Suggest that you write yourself a multi-threaded performance test so that you aren't guessing what your performance will be. We wrote one in Perl. It handles an individual thread (we were testing inquiry), and we wrote a little batch file / shell script to start up the desired number of threads. The main statement in our batch file (the rest just sets the variables); a shell script would be even easier:

for /L %%i in (1,1,%THREADS%) DO start /B perl solrtest.pl -h %SOLRHOST% -c %COUNT% -u %1 -p %2 -r %SOLRREALM% -f %SOLRLOC%\firstsynonyms.txt -l %SOLRLOC%\lastsynonyms.txt -z %FUZZ%

The Perl:

#!/usr/bin/perl
#
# Perl program to run a thread of solr testing
#
use Getopt::Std;           # For options processing
use POSIX;                 # For time formatting
use XML::Simple;           # For processing of XML config file
use Data::Dumper;          # For debugging XML config file
use HTTP::Request::Common; # For HTTP request to Solr
use HTTP::Response;
use LWP::UserAgent;        # For HTTP request to Solr

$host = "YOURHOST:8983";
$realm = "YOUR AUTHENTICATION REALM";
$firstlist = "firstsynonyms.txt";
$lastlist = "lastsynonyms.txt";
$fuzzy = "";
$me = $0;

sub usage() {
    print "perl $me -c iterations [-d] [-h host:port] [-u user [-p password]]\n";
    print "\t\t[-f firstnamefile] [-l lastnamefile] [-z fuzzy] [-r realm]\n";
    exit(8);
}

#
# Process the command line options.
#
getopts('dc:u:p:f:l:h:r:z:') || usage();
if (!$opt_c) { usage(); }
$count = $opt_c;
if ($opt_u) { $user = $opt_u; }
if ($opt_p) { $password = $opt_p; }
if ($opt_h) { $host = $opt_h; }
if ($opt_f) { $firstlist = $opt_f; }
if ($opt_l) { $lastlist = $opt_l; }
if ($opt_r) { $realm = $opt_r; }
if ($opt_z) { $fuzzy = "~" . $opt_z; }
$debug = $opt_d;

#
# If the host string does not include a ":", add ":80"
#
if ($host !~ /:/) { $host = $host . ":80"; }

#
# Read the lists of first and last names
#
open(SYNFILE, $firstlist) || die "Can't open first name list $firstlist\n";
while (<SYNFILE>) {
    @newwords = split /,/;
    for ($i = 0; $i <= $#newwords; ++$i) {
        $newwords[$i] =~ s/^\s+//;
        $newwords[$i] =~ s/\s+$//;
        $newwords[$i] = lc($newwords[$i]);
    }
    push @firstnames, @newwords;
}
close(SYNFILE);

open(SYNFILE, $lastlist) || die "Can't open last name list $lastlist\n";
while (<SYNFILE>) {
    @newwords = split /,/;
    for ($i = 0; $i <= $#newwords; ++$i) {
        $newwords[$i] =~ s/^\s+//;
        $newwords[$i] =~ s/\s+$//;
        $newwords[$i] = lc($newwords[$i]);
    }
    push @lastnames, @newwords;
}
close(SYNFILE);

print "$#firstnames First Names, $#lastnames Last Names\n";
print "User: $user\n";

my $userAgent = LWP::UserAgent->new(agent => 'solrtest.pl');
$userAgent->credentials($host, $realm, $user, $password);
$uri = "http://$host/solr/select";
$starttime = time();
for ($c = 0; $c < $count; ++$c) {
    $fname = $firstnames[rand $#firstnames];
    $lname = $lastnames[rand $#lastnames];
    $response = $userAgent->request(POST $uri,
        [ q => "lnamesyn:$lname AND fnamesyn:$fname$fuzzy", rows => 25 ]);
    if ($debug) {
        print "Query: lnamesyn:$lname AND fnamesyn:$fname$fuzzy";
        print $response->content();
    }
    print "POST for $fname $lname completed, HTTP status=" . $response->code . "\n";
}
$elapsed = time() - $starttime;
$average = $elapsed / $count;
print "Time: $elapsed s ($average/request)\n";

-----Original Message-----
From: Rode Gonzalez (libnova) [mailto:r...@libnova.es]
Sent: Saturday, August 13, 2011 3:50 AM
To: solr-user@lucene.apache.org
Subject: ideas for indexing large amount of pdf docs
ideas for indexing large amount of pdf docs
Hi all,

I want to ask about the best way to implement a solution for indexing a large amount of pdf documents, between 10-60 MB each one, with 100 to 1000 users connected simultaneously. I currently have 1 core of Solr 3.3.0 and it works fine for a small number of pdf docs, but I'm worried about the moment when we enter production. Some possibilities:

i. clustering. I have no experience in this, so it may be a bad idea to venture into it.

ii. multicore solution. Make some kind of hash to choose one core at each query (exact queries) and thus reduce the size of the individual indexes to consult, or consult all the cores at the same time (complex queries).

iii. do nothing more and wait for the catastrophe in the response times :P

Someone with experience can help a bit to decide? Thanks a lot in advance.
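Option ii above (hash a document to pick one core for exact queries) might look like the following sketch; the core names and document IDs are hypothetical, and a stable digest is used rather than Python's per-process-salted built-in hash():

```python
import hashlib

CORES = ["core0", "core1", "core2", "core3"]  # hypothetical core names

def core_for(doc_id: str) -> str:
    # Use a stable digest so the same document always routes to the
    # same core across processes and restarts.
    h = int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16)
    return CORES[h % len(CORES)]

# Exact (single-document) queries hit one small core; complex queries
# would have to fan out to every core and merge the results.
print(core_for("doc-12345"))
```

The trade-off is exactly as stated in the thread: exact lookups touch one small index, but any query that can't be keyed must fan out to all cores.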
Re: ideas for indexing large amount of pdf docs
Yeah, parsing PDF files can be pretty resource-intensive, so one solution is to offload it somewhere else. You can use the Tika libraries in SolrJ to parse the PDFs on as many clients as you want, just transmitting the results to Solr for indexing.

How are all these docs being submitted? Is this some kind of on-the-fly indexing/searching or what? I'm mostly curious what your projected max ingestion rate is...

Best
Erick
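Erick's suggestion, parsing on the clients and sending only extracted text to Solr, might be outlined like this; `extract_text` is a hypothetical stand-in for a real Tika call, and the document shape is illustrative:

```python
# The expensive parse runs on the client machine; only the extracted
# text is shipped to Solr. extract_text is a hypothetical stand-in for
# a real Tika (or similar) parser invocation.

def extract_text(pdf_path: str) -> str:
    # A real implementation would invoke Tika here.
    return f"extracted text of {pdf_path}"

def to_solr_doc(pdf_path: str) -> dict:
    # Illustrative document shape; only id and content fields assumed.
    return {"id": pdf_path, "content": extract_text(pdf_path)}

batch = [to_solr_doc(p) for p in ["a.pdf", "b.pdf"]]
# This batch would then be POSTed to Solr's update handler as JSON.
print(len(batch))
```

The point of the pattern is that the Solr server only ever sees small text documents, so parse CPU scales with the number of client workers, not with the indexing server.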
Re: ideas for indexing large amount of pdf docs
Hi Erick,

Our app inserts the PDFs from a back-office site and people can search/consult through a front-end site, both written in PHP. I've installed a Tomcat instance exclusively for Solr. The pdf docs are indexed and not stored, using the standard solr.extraction.ExtractingRequestHandler (solr-cell.jar and the other jars included in the contrib/extraction dir, you know) in an offline mode (summarizing: the internal users submit the docs; these docs are saved on the server; a task takes the docs and puts them into the indexer through a curl utility; when the task finishes, the doc is available to the frontend; once more, we use curl utilities to make queries to Solr).

The problem isn't the process of indexing. The max injection rate can be 1-60 docs at a time. The number of pdf docs can be 1000, 2000, 10,000... I don't know exactly, but a lot of them, as many books as in a library. But no problem about this; this part of the process runs offline: take a doc, index a doc; take another doc, index another doc, ...

The problem is the response time when the number of pdf's grows and grows... What is the best way, the fantastic idea, to minimize this time as much as possible when we enter production?

Best,
Rode.
Re: ideas for indexing large amount of pdf docs
You could send PDFs for processing using a queue solution like Amazon SQS. Kick off Amazon instances to process the queue. Once you've processed with Tika to text, just send the update to Solr.

Bill Bell
Sent from mobile
Re: ideas for indexing large amount of pdf docs
Ahhh, ok, my reply was irrelevant <G>...

Here's a good write-up on this problem: http://www.lucidimagination.com/content/scaling-lucene-and-solr

But Solr handles millions of documents on a single server in many cases, so waiting until the search app falls over is actually feasible. In general, if you can get an adequate query response time from a single machine, you just set up a master/slave architecture and add as many slaves as you need to handle your maximum load. So scaling wide is a very quick process. Don't go to sharding unless and until your machine can't give adequate response times at all... Mark's paper outlines this very well.

Best
Erick
Re: ideas for indexing large amount of pdf docs
Thanks Erick, Bill. Your answers tell me that we're on the right track ;) I will study the master/slave architecture for many slaves. In the future perhaps we will need it =)

Best regards,
Rode.
The number of pdf docs can be 1,000, 2,000, 10,000, ... I don't know exactly, but a lot of them, like the many books in a library. But no problem here: this part of the process runs offline. Take a doc, index it; take another doc, index it, ... The problem is the response time when the number of pdfs grows and grows... What is the best way to minimize this time as much as possible once we enter production? Best, Rode. -Original Message- From: Erick Erickson erickerick...@gmail.com To: solr-user@lucene.apache.org Date: Sat, 13 Aug 2011 12:13:27 -0400 Subject: Re: ideas for indexing large amount of pdf docs Yeah, parsing PDF files can be pretty resource-intensive, so one solution is to offload it somewhere else. You can use the Tika libraries in SolrJ to parse the PDFs on as many clients as you want, just transmitting the results to Solr for indexing. How are all these docs being submitted? Is this some kind of on-the-fly indexing/searching or what? I'm mostly curious what your projected max ingestion rate is... Best, Erick On Sat, Aug 13, 2011 at 4:49 AM, Rode Gonzalez (libnova) r...@libnova.es wrote: Hi all, I want to ask about the best way to implement a solution for indexing a large amount of pdf documents, between 10-60 MB each, with 100 to 1000 users connected simultaneously. I currently have 1 core of Solr 3.3.0 and it works fine for a small number of pdf docs, but I'm worried about the moment we enter production. Some possibilities: i. clustering. I have no experience with this, so it would be a bad idea to venture into it. ii. a multicore solution: use some kind of hash to choose one core for each query (exact queries) and thus reduce the size of the individual indexes to consult, or consult all the cores at the same time (complex queries). iii. do nothing more and wait for the catastrophe in the response times :P Can someone with experience help a bit with the decision? Thanks a lot in advance.
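The curl-based submission described in this thread goes through Solr's ExtractingRequestHandler (Solr Cell). As a rough sketch of what the offline task would issue per pdf, here is a small Python helper that builds the extract URL; the base URL, document id, and file name are illustrative, while `literal.id` and `commit` are standard Solr Cell request parameters:

```python
from urllib.parse import urlencode

def build_extract_url(base_url, doc_id, commit=False):
    """Build the /update/extract URL for submitting one pdf to Solr Cell.

    literal.id attaches a stored id to the extracted document; commit=true
    makes the doc searchable as soon as the request finishes.
    """
    params = {"literal.id": doc_id, "commit": "true" if commit else "false"}
    return f"{base_url}/update/extract?{urlencode(params)}"

# The offline task's curl call would then look roughly like:
#   curl "<url>" -F "myfile=@book.pdf"
url = build_extract_url("http://localhost:8983/solr", "book-0001", commit=True)
```

Submitting with commit=false and issuing a periodic commit is usually cheaper than committing per document, which matters at a sustained ingestion rate.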
ideas for versioning query?
A customer has an interesting problem: some documents will have multiple versions. In search results, only the most recent version of a given document should be shown. The trick is that each user has access to a different set of document versions, and each user should see only the most recent version of a document that they have access to. Is this something that can reasonably be solved with grouping? In 3.x? I haven't followed the grouping discussions closely: would someone point me in the right direction please? -- Michael Sokolov Engineering Director www.ifactory.com
Re: ideas for versioning query?
Hi Michael, I guess this could be solved using grouping, as you said. Documents inside a group can be sorted on a field (in your case, the version field; see the parameter group.sort), and you can show only the first one. It will be more complex to show facets (post-grouping faceting is work in progress but not yet committed to trunk). It would be easier from the Solr side if you could do something at index time, like indicating which document is the current one and which one is old (you would need to update the old document whenever a new version is indexed). Regards, Tomás On Mon, Aug 1, 2011 at 10:47 AM, Mike Sokolov soko...@ifactory.com wrote: A customer has an interesting problem: some documents will have multiple versions. In search results, only the most recent version of a given document should be shown. The trick is that each user has access to a different set of document versions, and each user should see only the most recent version of a document that they have access to. Is this something that can reasonably be solved with grouping? In 3.x? I haven't followed the grouping discussions closely: would someone point me in the right direction please? -- Michael Sokolov Engineering Director www.ifactory.com
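The group.sort suggestion above translates into a handful of request parameters. A sketch in Python of the query string one might send, where `docId` and `version` are hypothetical field names standing in for the base-document id and the version field:

```python
from urllib.parse import urlencode

def grouping_query(q, group_field, version_field):
    """Build grouping parameters that keep only the newest version per group."""
    return urlencode({
        "q": q,
        "group": "true",
        "group.field": group_field,              # one group per base document
        "group.sort": f"{version_field} desc",   # newest version sorts first
        "group.limit": "1",                      # return only that top doc
    })

qs = grouping_query("title:solr", "docId", "version")
```

Per-user access filtering would still be applied as a normal fq on top of this, which is what makes the "current" flag alone insufficient.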
Re: ideas for versioning query?
Thanks, Tomas. Yes, we are planning to keep a current flag in the most current document. But there are cases where, for a given user, the most current document is not that one, because they only have access to some older documents. I took a look at http://wiki.apache.org/solr/FieldCollapsing and it seems as if it will do what we need here. My one concern is that it might not be efficient at computing group.ngroups for a very large number of groups, which we would ideally want. Is that something I should be worried about? -Mike On 08/01/2011 10:08 AM, Tomás Fernández Löbbe wrote: Hi Michael, I guess this could be solved using grouping as you said. Documents inside a group can be sorted on a field (in your case, the version field, see parameter group.sort), and you can show only the first one. It will be more complex to show facets (post-grouping faceting is work in progress but not yet committed to trunk). It would be easier from the Solr side if you could do something at index time, like indicating which document is the current one and which one is an old one (you would need to update the old document whenever a new version is indexed). Regards, Tomás On Mon, Aug 1, 2011 at 10:47 AM, Mike Sokolov soko...@ifactory.com wrote: A customer has an interesting problem: some documents will have multiple versions. In search results, only the most recent version of a given document should be shown. The trick is that each user has access to a different set of document versions, and each user should see only the most recent version of a document that they have access to. Is this something that can reasonably be solved with grouping? In 3.x? I haven't followed the grouping discussions closely: would someone point me in the right direction please? -- Michael Sokolov Engineering Director www.ifactory.com
Re: ideas for versioning query?
Hi Mike, how many docs and groups do you have in your index? I think the group.sort option fits your requirements. If I remember correctly, group.ngroups=true adds something like 30% extra time on top of the search request with grouping, but that was on my local test dataset (~30M docs, ~8000 groups) and my machine. You might encounter different search times when setting group.ngroups=true. Martijn 2011/8/1 Mike Sokolov soko...@ifactory.com Thanks, Tomas. Yes, we are planning to keep a current flag in the most current document. But there are cases where, for a given user, the most current document is not that one, because they only have access to some older documents. I took a look at http://wiki.apache.org/solr/FieldCollapsing and it seems as if it will do what we need here. My one concern is that it might not be efficient at computing group.ngroups for a very large number of groups, which we would ideally want. Is that something I should be worried about? -Mike On 08/01/2011 10:08 AM, Tomás Fernández Löbbe wrote: Hi Michael, I guess this could be solved using grouping as you said. Documents inside a group can be sorted on a field (in your case, the version field, see parameter group.sort), and you can show only the first one. It will be more complex to show facets (post-grouping faceting is work in progress but not yet committed to trunk). It would be easier from the Solr side if you could do something at index time, like indicating which document is the current one and which one is an old one (you would need to update the old document whenever a new version is indexed). Regards, Tomás On Mon, Aug 1, 2011 at 10:47 AM, Mike Sokolov soko...@ifactory.com wrote: A customer has an interesting problem: some documents will have multiple versions. In search results, only the most recent version of a given document should be shown.
The trick is that each user has access to a different set of document versions, and each user should see only the most recent version of a document that they have access to. Is this something that can reasonably be solved with grouping? In 3.x? I haven't followed the grouping discussions closely: would someone point me in the right direction please? -- Michael Sokolov Engineering Director www.ifactory.com -- Kind regards, Martijn van Groningen
Re: ideas for versioning query?
I think a 30% increase is acceptable. Yes, I think we'll try it. Although our case is more like # groups ~ # documents / N, where N is a smallish number (~1-5?). We are planning for a variety of different index sizes, but aiming for a sweet spot around a few M docs. -Mike On 08/01/2011 11:00 AM, Martijn v Groningen wrote: Hi Mike, how many docs and groups do you have in your index? I think the group.sort option fits your requirements. If I remember correctly, group.ngroups=true adds something like 30% extra time on top of the search request with grouping, but that was on my local test dataset (~30M docs, ~8000 groups) and my machine. You might encounter different search times when setting group.ngroups=true. Martijn 2011/8/1 Mike Sokolov soko...@ifactory.com Thanks, Tomas. Yes, we are planning to keep a current flag in the most current document. But there are cases where, for a given user, the most current document is not that one, because they only have access to some older documents. I took a look at http://wiki.apache.org/solr/FieldCollapsing and it seems as if it will do what we need here. My one concern is that it might not be efficient at computing group.ngroups for a very large number of groups, which we would ideally want. Is that something I should be worried about? -Mike On 08/01/2011 10:08 AM, Tomás Fernández Löbbe wrote: Hi Michael, I guess this could be solved using grouping as you said. Documents inside a group can be sorted on a field (in your case, the version field, see parameter group.sort), and you can show only the first one. It will be more complex to show facets (post-grouping faceting is work in progress but not yet committed to trunk). It would be easier from the Solr side if you could do something at index time, like indicating which document is the current one and which one is an old one (you would need to update the old document whenever a new version is indexed).
Regards, Tomás On Mon, Aug 1, 2011 at 10:47 AM, Mike Sokolov soko...@ifactory.com wrote: A customer has an interesting problem: some documents will have multiple versions. In search results, only the most recent version of a given document should be shown. The trick is that each user has access to a different set of document versions, and each user should see only the most recent version of a document that they have access to. Is this something that can reasonably be solved with grouping? In 3.x? I haven't followed the grouping discussions closely: would someone point me in the right direction please? -- Michael Sokolov Engineering Director www.ifactory.com
Solr just 'hangs' under load test - ideas?
Hi, all. I'm hoping someone has some thoughts here. We're running Solr 3.1 (with the patch for SolrQueryParser.java to not do the getLuceneVersion() calls, but use luceneMatchVersion directly). We're running in a Tomcat instance, 64-bit Java. CATALINA_OPTS are: -Xmx7168m -Xms7168m -XX:MaxPermSize=256M We're running 2 Solr cores with the same schema. We use SolrJ to run our searches from a Java app running in JBoss. JBoss, Tomcat, and the Solr index folders are all on the same server. In case it's relevant, we're using JMeter as a load test harness. We're running on Solaris, a 16-processor box with 48GB physical memory. I've run a successful load test at a 100-user load (at that rate there are about 5-10 Solr searches / second), and Solr search responses were coming in under 100ms. When I tried to ramp up, as far as I can tell, Solr is just hanging. (We have some logging statements around the SolrJ calls - just before, we log how long our query construction takes, then we run the SolrJ query and log the search times. We're getting a number of the query construction logs, but no corresponding search time logs.) Symptoms: The Tomcat and JBoss processes show at well under 1% CPU, and they are still the top processes. CPU states show around 99% idle. RES usage for the two Java processes is around 3GB each. LWP under 120 for each. STATE just shows as sleep. JBoss is still 'alive', as I can get into a piece of software that talks to our JBoss app to get data. We set things up to use log4j logging for Solr - the log isn't showing any errors or exceptions. We're not indexing - just searching. Back in January, we did load testing on a prototype and had no problems (though that was Solr 1.4 at the time). It ramped up beautifully - bottlenecks were our apps, not Solr. What I'm benchmarking now is a descendant of that prototyping - a bit more complex on searches and more fields in the schema, but same basic search logic as far as SolrJ usage. Any ideas? What else to look at?
Ringing any bells? I can send more details if anyone wants specifics... Bob Sandiford | Lead Software Engineer | SirsiDynix P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com www.sirsidynix.com
Re: Solr just 'hangs' under load test - ideas?
Can you get a thread dump to see what is hanging? -Yonik http://www.lucidimagination.com On Wed, Jun 29, 2011 at 11:45 AM, Bob Sandiford bob.sandif...@sirsidynix.com wrote: Hi, all. I'm hoping someone has some thoughts here. We're running Solr 3.1 (with the patch for SolrQueryParser.java to not do the getLuceneVersion() calls, but use luceneMatchVersion directly). We're running in a Tomcat instance, 64-bit Java. CATALINA_OPTS are: -Xmx7168m -Xms7168m -XX:MaxPermSize=256M We're running 2 Solr cores with the same schema. We use SolrJ to run our searches from a Java app running in JBoss. JBoss, Tomcat, and the Solr index folders are all on the same server. In case it's relevant, we're using JMeter as a load test harness. We're running on Solaris, a 16-processor box with 48GB physical memory. I've run a successful load test at a 100-user load (at that rate there are about 5-10 Solr searches / second), and Solr search responses were coming in under 100ms. When I tried to ramp up, as far as I can tell, Solr is just hanging. (We have some logging statements around the SolrJ calls - just before, we log how long our query construction takes, then we run the SolrJ query and log the search times. We're getting a number of the query construction logs, but no corresponding search time logs.) Symptoms: The Tomcat and JBoss processes show at well under 1% CPU, and they are still the top processes. CPU states show around 99% idle. RES usage for the two Java processes is around 3GB each. LWP under 120 for each. STATE just shows as sleep. JBoss is still 'alive', as I can get into a piece of software that talks to our JBoss app to get data. We set things up to use log4j logging for Solr - the log isn't showing any errors or exceptions. We're not indexing - just searching. Back in January, we did load testing on a prototype and had no problems (though that was Solr 1.4 at the time). It ramped up beautifully - bottlenecks were our apps, not Solr.
What I'm benchmarking now is a descendant of that prototyping - a bit more complex on searches and more fields in the schema, but same basic search logic as far as SolrJ usage. Any ideas? What else to look at? Ringing any bells? I can send more details if anyone wants specifics... Bob Sandiford | Lead Software Engineer | SirsiDynix P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com www.sirsidynix.com
RE: Solr just 'hangs' under load test - ideas?
OK - I figured it out. It's not solr at all (and I'm not really surprised). In the prototype benchmarks, we used a different instance of tomcat than we're using for production load tests. Our prototype tomcat instance had no maxThreads value set, so was using the default value of 200. The production tomcat environment has a maxThreads value of 15 - we were just running out of threads and getting connection refused exceptions thrown when we ramped up the Solr hits past a certain level. Thanks for considering, Yonik (and any others waiting to see any reply I made)... (As others have said - this listserv is great!) Bob Sandiford | Lead Software Engineer | SirsiDynix P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com www.sirsidynix.com -Original Message- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: Wednesday, June 29, 2011 12:18 PM To: solr-user@lucene.apache.org Subject: Re: Solr just 'hangs' under load test - ideas? Can you get a thread dump to see what is hanging? -Yonik http://www.lucidimagination.com On Wed, Jun 29, 2011 at 11:45 AM, Bob Sandiford bob.sandif...@sirsidynix.com wrote: Hi, all. I'm hoping someone has some thoughts here. We're running Solr 3.1 (with the patch for SolrQueryParser.java to not do the getLuceneVersion() calls, but use luceneMatchVersion directly). We're running in a Tomcat instance, 64 bit Java. CATALINA_OPTS are: -Xmx7168m -Xms7168m -XX:MaxPermSize=256M We're running 2 Solr cores, with the same schema. We use SolrJ to run our searches from a Java app running in JBoss. JBoss, Tomcat, and the Solr Index folders are all on the same server. In case it's relevant, we're using JMeter as a load test harness. We're running on Solaris, a 16 processor box with 48GB physical memory. I've run a successful load test at a 100 user load (at that rate there are about 5-10 solr searches / second), and solr search responses were coming in under 100ms. 
When I tried to ramp up, as far as I can tell, Solr is just hanging. (We have some logging statements around the SolrJ calls - just before, we log how long our query construction takes, then we run the SolrJ query and log the search times. We're getting a number of the query construction logs, but no corresponding search time logs.) Symptoms: The Tomcat and JBoss processes show at well under 1% CPU, and they are still the top processes. CPU states show around 99% idle. RES usage for the two Java processes is around 3GB each. LWP under 120 for each. STATE just shows as sleep. JBoss is still 'alive', as I can get into a piece of software that talks to our JBoss app to get data. We set things up to use log4j logging for Solr - the log isn't showing any errors or exceptions. We're not indexing - just searching. Back in January, we did load testing on a prototype and had no problems (though that was Solr 1.4 at the time). It ramped up beautifully - bottlenecks were our apps, not Solr. What I'm benchmarking now is a descendant of that prototyping - a bit more complex on searches and more fields in the schema, but same basic search logic as far as SolrJ usage. Any ideas? What else to look at? Ringing any bells? I can send more details if anyone wants specifics... Bob Sandiford | Lead Software Engineer | SirsiDynix P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com www.sirsidynix.com
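For anyone hitting the same wall: the setting Bob describes lives on the Connector element in Tomcat's server.xml. A minimal sketch, with illustrative port and timeout values (200 is Tomcat's default maxThreads, which the prototype instance had been using implicitly):

```xml
<!-- conf/server.xml: the connector thread pool caps concurrent requests.
     A maxThreads of 15 refuses connections once 15 requests are in flight. -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           maxThreads="200" />
```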
Re: Ideas on how to implement sponsored results
Cuong, I think you will need some manipulation beyond Solr queries. You should separate the results by your site criteria after retrieving them. After that, you could cache the results in your application and randomize the lists every time you render a page. I don't know if Solr has collapsing capabilities, but if it has any beyond faceting, it would be a great boost to your work. 2008/6/3 climbingrose [EMAIL PROTECTED]: Hi Alexander, Thanks for your suggestion. I think my problem is a bit different from yours. We don't have any sponsored words; we have to retrieve sponsored results directly from the index. This is because a site can have 60,000 products, which makes it hard to insert/update keywords. I can live with that by issuing a separate query to fetch sponsored results. My problem is to equally distribute sponsored results between sites so that each site has an opportunity to show its sponsored results no matter how many products it has. For example, if site A has 60,000 products and site B has only 2,000, then sponsored products from site B will have a very small chance to be displayed. On Wed, Jun 4, 2008 at 2:56 AM, Alexander Ramos Jardim [EMAIL PROTECTED] wrote: Cuong, I have implemented sponsored words for a client. I don't know if my work can help you, but I will explain it and let you decide. I have an index containing product entries, in which I created a field called sponsored words. What I do is boost this field, so when these words are matched in the query those products appear first in my results. 2008/6/3 climbingrose [EMAIL PROTECTED]: Hi all, I'm trying to implement sponsored results in Solr search results, similar to Google's. We index products from various sites and would like to allow certain sites to promote their products. My approach is to query a slave instance to get sponsored results for user queries in addition to the normal search results. This part is easy.
However, since the number of products indexed for each site can be very different (100, 1,000, 10,000 or 60,000 products), we need a way to fairly distribute the sponsored results among sites. My initial thought is to utilise the field collapsing patch to collapse the search results on the siteId field. You can imagine that this will create a series of buckets of results, each bucket representing results from a site. After that, 2 or 3 buckets will be selected at random, from which I will randomly select one or two results. However, since I want these sponsored results to be relevant to user queries, I'd only want to use the first 30 results in each bucket. Obviously, it's desirable that if the user refreshes the page, new sponsored results will be displayed. On the other hand, I also want to have the advantages of the Solr cache. What would be the best way to implement this functionality? Thanks. Cheers, Cuong -- Alexander Ramos Jardim -- Regards, Cuong Hoang -- Alexander Ramos Jardim
Ideas on how to implement sponsored results
Hi all, I'm trying to implement sponsored results in Solr search results, similar to Google's. We index products from various sites and would like to allow certain sites to promote their products. My approach is to query a slave instance to get sponsored results for user queries in addition to the normal search results. This part is easy. However, since the number of products indexed for each site can be very different (100, 1,000, 10,000 or 60,000 products), we need a way to fairly distribute the sponsored results among sites. My initial thought is to utilise the field collapsing patch to collapse the search results on the siteId field. You can imagine that this will create a series of buckets of results, each bucket representing results from a site. After that, 2 or 3 buckets will be selected at random, from which I will randomly select one or two results. However, since I want these sponsored results to be relevant to user queries, I'd only want to use the first 30 results in each bucket. Obviously, it's desirable that if the user refreshes the page, new sponsored results will be displayed. On the other hand, I also want to have the advantages of the Solr cache. What would be the best way to implement this functionality? Thanks. Cheers, Cuong
Re: Ideas on how to implement sponsored results
Cuong, I have implemented sponsored words for a client. I don't know if my work can help you, but I will explain it and let you decide. I have an index containing product entries, in which I created a field called sponsored words. What I do is boost this field, so when these words are matched in the query those products appear first in my results. 2008/6/3 climbingrose [EMAIL PROTECTED]: Hi all, I'm trying to implement sponsored results in Solr search results, similar to Google's. We index products from various sites and would like to allow certain sites to promote their products. My approach is to query a slave instance to get sponsored results for user queries in addition to the normal search results. This part is easy. However, since the number of products indexed for each site can be very different (100, 1,000, 10,000 or 60,000 products), we need a way to fairly distribute the sponsored results among sites. My initial thought is to utilise the field collapsing patch to collapse the search results on the siteId field. You can imagine that this will create a series of buckets of results, each bucket representing results from a site. After that, 2 or 3 buckets will be selected at random, from which I will randomly select one or two results. However, since I want these sponsored results to be relevant to user queries, I'd only want to use the first 30 results in each bucket. Obviously, it's desirable that if the user refreshes the page, new sponsored results will be displayed. On the other hand, I also want to have the advantages of the Solr cache. What would be the best way to implement this functionality? Thanks. Cheers, Cuong -- Alexander Ramos Jardim
Re: Ideas on how to implement sponsored results
Hi Alexander, Thanks for your suggestion. I think my problem is a bit different from yours. We don't have any sponsored words; we have to retrieve sponsored results directly from the index. This is because a site can have 60,000 products, which makes it hard to insert/update keywords. I can live with that by issuing a separate query to fetch sponsored results. My problem is to equally distribute sponsored results between sites so that each site has an opportunity to show its sponsored results no matter how many products it has. For example, if site A has 60,000 products and site B has only 2,000, then sponsored products from site B will have a very small chance to be displayed. On Wed, Jun 4, 2008 at 2:56 AM, Alexander Ramos Jardim [EMAIL PROTECTED] wrote: Cuong, I have implemented sponsored words for a client. I don't know if my work can help you, but I will explain it and let you decide. I have an index containing product entries, in which I created a field called sponsored words. What I do is boost this field, so when these words are matched in the query those products appear first in my results. 2008/6/3 climbingrose [EMAIL PROTECTED]: Hi all, I'm trying to implement sponsored results in Solr search results, similar to Google's. We index products from various sites and would like to allow certain sites to promote their products. My approach is to query a slave instance to get sponsored results for user queries in addition to the normal search results. This part is easy. However, since the number of products indexed for each site can be very different (100, 1,000, 10,000 or 60,000 products), we need a way to fairly distribute the sponsored results among sites. My initial thought is to utilise the field collapsing patch to collapse the search results on the siteId field. You can imagine that this will create a series of buckets of results, each bucket representing results from a site.
After that, 2 or 3 buckets will be selected at random, from which I will randomly select one or two results. However, since I want these sponsored results to be relevant to user queries, I'd only want to use the first 30 results in each bucket. Obviously, it's desirable that if the user refreshes the page, new sponsored results will be displayed. On the other hand, I also want to have the advantages of the Solr cache. What would be the best way to implement this functionality? Thanks. Cheers, Cuong -- Alexander Ramos Jardim -- Regards, Cuong Hoang
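The bucket scheme discussed in this thread can be sketched outside Solr: collapse on siteId, keep the top 30 most relevant docs per site, then sample sites uniformly so small catalogs get the same exposure as large ones. A hypothetical Python illustration (all names are made up; a real setup would feed this from the field-collapsed Solr response):

```python
import random

def pick_sponsored(buckets, n_sites=2, per_site=1, top_k=30, rng=None):
    """Fairly sample sponsored results across sites.

    buckets maps siteId -> results sorted by relevance. Sampling the sites
    uniformly means a site with 2,000 products is shown as often as one
    with 60,000.
    """
    rng = rng or random.Random()
    chosen = rng.sample(list(buckets), min(n_sites, len(buckets)))
    picks = []
    for site in chosen:
        top = buckets[site][:top_k]  # relevance cutoff within each bucket
        picks.extend(rng.sample(top, min(per_site, len(top))))
    return picks

buckets = {"siteA": [f"a{i}" for i in range(2000)], "siteB": ["b0", "b1"]}
ads = pick_sponsored(buckets, rng=random.Random(1))
```

Caching the per-site buckets in the application, as suggested above, keeps the Solr cache useful while still letting each page refresh reshuffle the picks.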
JSON tokenizer? tagging ideas
I've been struggling with how to get various bits of structured data into Solr documents. In various projects I have tried various ideas, but none feel great. Take a simple example where I want a document field to be the list of linked data with name, ID, and path. I have tried things like:

<doc>
  <field name="id">ID</field>
  <field name="link">IDA nameA pathA</field>
  <field name="link">IDB nameB pathB</field>
  <field name="link">IDC nameC pathC</field>
</doc>

this is ok -- when spaces are a problem, i've tokenized on \n -- but this feels very brittle. I'm considering a general JSON tokenizer and want to know what you all think. Consider:

<doc>
  <field name="id">ID</field>
  <field name="link">{ id:10 name:nameA path:/... }</field>
  <field name="link">{ id:11 name:nameB path:/... }</field>
  <field name="link">{ id:12 name:nameB path:/... }</field>
</doc>

The tokenizer can make a token for each key:value pair, that is: id:10, name:nameA, path:/..., id:11, ... Perhaps this could be part of the general 'tag' design: http://wiki.apache.org/solr/UserTagDesign rather than having fixed prefixes (~erik#lucene), we could use json syntax: { user:erik, text:lucene, date:20071112 } Using noggit (http://svn.apache.org/repos/asf/labs/noggit/) the JSON parsing is super fast. The prefix queries are probably slower with a longer string, but I guess you could just use: { u:erik, t:lucene, d:20071112 } Thoughts? ryan
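To make the proposal concrete, here is a toy sketch of the token stream it would emit -- Python rather than a real Lucene Tokenizer, and using strict JSON with quotes and commas, whereas noggit also accepts the lax form shown above:

```python
import json

def json_pair_tokens(field_value):
    """Emit one key:value token per JSON pair, e.g. 'id:10', 'name:nameA'."""
    obj = json.loads(field_value)
    return [f"{key}:{val}" for key, val in obj.items()]

tokens = json_pair_tokens('{"id": 10, "name": "nameA", "path": "/a/b"}')
# tokens == ["id:10", "name:nameA", "path:/a/b"]
```

A real implementation would also need to decide how nested objects and arrays flatten into tokens; this sketch assumes flat objects only.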
Any clever ideas to inject into solr? Without http?
I inherited an existing (working) Solr indexing script that runs like this: a Python script queries the MySQL DB, then calls a bash script; the bash script performs a curl POST submit to Solr. We're injecting about 1000 records / minute (constantly), frequently pushing the edge of our CPU / RAM limitations. I'm in the process of building a Perl script that uses DBI and lwp::simple::post to perform all of this from a single script (instead of 3). Two specific questions: 1: Does anyone have a clever (or better) way to perform this process efficiently? 2: Is there a way to inject into Solr without using POST / curl / http? Admittedly, I'm no Solr expert - I'm starting from someone else's setup, trying to reverse-engineer my way out. Any input would be greatly appreciated.
Re: Any clever ideas to inject into solr? Without http?
Condensing the loader into a single executable sounds right if you have performance problems. ;-) You could also try adding multiple docs in a single post if you notice your problems are with tcp setup time, though if you're doing localhost connections that should be minimal. If you're already local to the solr server, you might check out the CSV slurper. http://wiki.apache.org/solr/UpdateCSV It's a little specialized. And then there's of course the question of are you doing full re-indexing or incremental indexing of changes? --cw On 8/9/07, Kevin Holmes [EMAIL PROTECTED] wrote: I inherited an existing (working) solr indexing script that runs like this: Python script queries the mysql DB then calls bash script Bash script performs a curl POST submit to solr We're injecting about 1000 records / minute (constantly), frequently pushing the edge of our CPU / RAM limitations. I'm in the process of building a Perl script to use DBI and lwp::simple::post that will perform this all from a single script (instead of 3). Two specific questions 1: Does anyone have a clever (or better) way to perform this process efficiently? 2: Is there a way to inject into solr without using POST / curl / http? Admittedly, I'm no solr expert - I'm starting from someone else's setup, trying to reverse-engineer my way out. Any input would be greatly appreciated.
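Clay's two suggestions -- batching multiple docs per request and the CSV handler -- combine naturally. A sketch of serializing a batch of records into one CSV body for the UpdateCSV handler linked above (the field names here are hypothetical):

```python
import csv
import io

def docs_to_csv(docs, fields):
    """Serialize many records into one CSV payload for Solr's CSV handler,
    replacing one curl POST per record with a single batched POST."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    writer.writerows(docs)
    return buf.getvalue()

body = docs_to_csv(
    [{"id": "1", "title": "first"}, {"id": "2", "title": "second"}],
    ["id", "title"],
)
```

The body would then be POSTed with a text/csv content type. Note this still goes over HTTP; short of building the index offline and swapping it in, HTTP is the supported ingestion path.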
RE: Any clever ideas to inject into solr? Without http?
What we're looking for is a way to inject *without* using curl, or wget, or any other http-based communication. We'd like for the HTTP daemon to only handle search requests, not indexing requests on top of them. Plus, I have to believe there's a faster way to get documents into solr/lucene than using curl _ david whalen senior applications developer eNR Services, Inc. [EMAIL PROTECTED] 203-849-7240 -Original Message- From: Clay Webster [mailto:[EMAIL PROTECTED] Sent: Thursday, August 09, 2007 11:43 AM To: solr-user@lucene.apache.org Subject: Re: Any clever ideas to inject into solr? Without http? Condensing the loader into a single executable sounds right if you have performance problems. ;-) You could also try adding multiple docs in a single post if you notice your problems are with tcp setup time, though if you're doing localhost connections that should be minimal. If you're already local to the solr server, you might check out the CSV slurper. http://wiki.apache.org/solr/UpdateCSV It's a little specialized. And then there's of course the question of are you doing full re-indexing or incremental indexing of changes? --cw On 8/9/07, Kevin Holmes [EMAIL PROTECTED] wrote: I inherited an existing (working) solr indexing script that runs like this: Python script queries the mysql DB then calls bash script Bash script performs a curl POST submit to solr We're injecting about 1000 records / minute (constantly), frequently pushing the edge of our CPU / RAM limitations. I'm in the process of building a Perl script to use DBI and lwp::simple::post that will perform this all from a single script (instead of 3). Two specific questions 1: Does anyone have a clever (or better) way to perform this process efficiently? 2: Is there a way to inject into solr without using POST / curl / http? Admittedly, I'm no solr expert - I'm starting from someone else's setup, trying to reverse-engineer my way out. Any input would be greatly appreciated.
Re: Any clever ideas to inject into solr? Without http?
(re)building the index separately (i.e. on a different computer) and then replacing the active index may be an option.

David Whalen wrote: What we're looking for is a way to inject *without* using curl, or wget, or any other http-based communication. [...]
Re: Any clever ideas to inject into solr? Without http?
On Aug 9, 2007, at 11:12 AM, Kevin Holmes wrote: 2: Is there a way to inject into solr without using POST / curl / http?

Check http://wiki.apache.org/solr/EmbeddedSolr - there are examples in Java and Cocoa that use the DirectSolrConnection class, querying and updating Solr without a web server. It uses JNI in the Cocoa case.

-b
Re: Any clever ideas to inject into solr? Without http?
If it's a contention between search and indexing, separate them via a query-slave and an index-master.

--cw

On 8/9/07, David Whalen [EMAIL PROTECTED] wrote: What we're looking for is a way to inject *without* using curl, or wget, or any other http-based communication. [...]
Re: Any clever ideas to inject into solr? Without http?
On 8/9/07, David Whalen [EMAIL PROTECTED] wrote: Plus, I have to believe there's a faster way to get documents into solr/lucene than using curl

One issue with HTTP is latency. You can get around that by adding multiple documents per request, or by using multiple threads concurrently. You can also bypass HTTP by using something like the CSV loader (very lightweight) and specifying a local file (via the stream.file parameter). http://wiki.apache.org/solr/UpdateCSV I doubt you will see much of a difference between reading locally vs. streaming over HTTP, but it might be interesting to see the exact overhead.

-Yonik
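Yonik's stream.file suggestion amounts to a request like the following sketch, where the document bodies never travel over HTTP because Solr reads the file itself. The base URL and file path here are made up for illustration.

```python
# Sketch: ask Solr's CSV loader to slurp a local file via the stream.file
# parameter. Base URL and path are hypothetical.
from urllib.parse import urlencode

def csv_update_url(csv_path, base="http://localhost:8983/solr/update/csv"):
    params = {"stream.file": csv_path, "commit": "true"}
    return base + "?" + urlencode(params)

url = csv_update_url("/data/export/docs.csv")
print(url)
# A request to this URL (e.g. urllib.request.urlopen(url)) triggers the load;
# only the tiny request itself crosses the wire.
```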
Re: Any clever ideas to inject into solr? Without http?
On 8/9/07, Siegfried Goeschl [EMAIL PROTECTED] wrote: +) my colleague just finished a database import service running within the servlet container to avoid writing out the data to the file system and transmitting it over HTTP.

Most people doing this read data out of the database and construct the XML in memory for sending to Solr... one definitely doesn't want to write intermediate stuff to the filesystem (unless perhaps it's a CSV dump).

+) I think there was some discussion regarding a generic database importer but nothing I'm aware of

Absolutely a needed feature... it's in the queue: https://issues.apache.org/jira/browse/SOLR-103 But there will always be more complex cases: pulling from multiple data sources, doing some merging and munging, etc. The easiest way to handle many of those would probably be via a scripting language that does the app-specific merging+munging and then uses a Solr client (which constructs in-memory CSV or XML and sends it to Solr).

-Yonik
RE: Any clever ideas to inject into solr? Without http?
Is this a native feature, or do we need to get creative with scp from one server to the other?

Clay Webster wrote: If it's a contention between search and indexing, separate them via a query-slave and an index-master. --cw
Re: Any clever ideas to inject into solr? Without http?
On 8/9/07, Kevin Holmes [EMAIL PROTECTED] wrote: Python script queries the mysql DB then calls bash script Bash script performs a curl POST submit to solr For the most up-to-date solr client for python, check out https://issues.apache.org/jira/browse/SOLR-216 -Yonik
RE: Any clever ideas to inject into solr? Without http?
Jython is a Python interpreter implemented in Java. (I have a lot of Python code.) Total throughput in the servlet is very sensitive to the total number of servlet sockets available vs. the number of CPUs. The different analyzers have very different performance. You might leave some data in the DB, instead of storing it all in the index.

Underlying this all, you have a sneaky network performance problem. Your successive posts do not reuse a TCP socket. Obvious: re-opening a new socket for each post takes time. Not obvious: your server has sockets building up in TIME_WAIT state. (This means the sockets are shutting down. Having both ends agree to close the connection is metaphysically difficult. The TCP/IP spec even has a bug in this area.) Sockets building up in TIME_WAIT can cause TCP resources to run low or run out entirely. Your kernel configuration may be weak in this area.

Lance

-----Original Message-----
From: Kevin Holmes [mailto:[EMAIL PROTECTED]]
Sent: Thursday, August 09, 2007 8:13 AM
To: solr-user@lucene.apache.org
Subject: Any clever ideas to inject into solr? Without http?

I inherited an existing (working) solr indexing script that runs like this: Python script queries the mysql DB then calls bash script. Bash script performs a curl POST submit to solr. [...]
Re: Any clever ideas to inject into solr? Without http?
On Thu, 9 Aug 2007 15:23:03 -0700 Lance Norskog [EMAIL PROTECTED] wrote: Underlying this all, you have a sneaky network performance problem. Your successive posts do not reuse a TCP socket. [...] Your kernel configuration may be weak in this area.

Good point. And putting my pedantic hat on here, it may not necessarily be 'kernel configuration' but the network stack - not sure what OS the OP is using.

B

{Beto|Norberto|Numard} Meijome
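The socket-reuse point Lance raises can be demonstrated with a persistent HTTP/1.1 connection. The self-contained sketch below stands up a throwaway local server (not Solr) and sends several POSTs over a single TCP socket, which is exactly what the per-post curl calls fail to do.

```python
# Sketch: several POSTs over ONE TCP connection via HTTP/1.1 keep-alive.
# A disposable local server stands in for Solr; nothing here is Solr API.
import http.client
import threading
from http.server import HTTPServer, BaseHTTPRequestHandler

class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # HTTP/1.1 keeps the connection alive by default

    def do_POST(self):
        # Drain the request body so the connection stays in sync.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# One HTTPConnection object == one TCP socket, reused for every request.
conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
results = []
for _ in range(3):
    conn.request("POST", "/update", body=b"<add/>",
                 headers={"Content-Type": "text/xml"})
    results.append(conn.getresponse().read())
conn.close()
server.shutdown()
print(results)
```

With one long-lived connection there is one socket in TIME_WAIT at the end instead of one per post, which is the kernel-resource problem Lance describes.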
Re: Ideas for a relevance score that could be considered stable across multiple searches with the same query structure?
On 4/11/07, Chris Hostetter [EMAIL PROTECTED] wrote:

: Not really. The explain scores aren't normalized and I also couldn't
: find a way to get the explain data as anything other than a whitespace
: formatted text blob from Solr.

The default way Solr dumps score explanations is just plain text, but the Explanation objects are actually fairly well structured, and easy to walk in a custom request handler -- this would let you make direct comparisons of the various pieces of the Explanations from doc 1 with doc 2 if you wanted.

Does anyone have any experience with examining Explanation objects in a custom request handler? I started this project using Solr on top of Lucene because I wanted the flexibility it provided: the ability to have dynamic field names so the user could configure which fields they wanted to index and how they wanted them indexed (using field type configurations suited to titles, person names, years, etc.). What I quickly found I could do without, though, was the HTTP overhead. I implemented the EmbeddedSolr class found on the Solr wiki, which lets me interact with the Solr engine directly. This is important since I'm doing thousands of queries in a batch. I need to find out about this custom request handler thing. If anyone has any example code, it would be greatly appreciated.

Daniel
Re: Ideas for a relevance score that could be considered stable across multiple searches with the same query structure?
Yes, for good (hopefully) or bad.

-Sean

Shridhar Venkatraman wrote on 5/7/2007, 12:37 AM: Interesting.. Surrogates can also bring the searcher's subjectivity (opinion and context) into it by the learning process? [...]
Re: Ideas for a relevance score that could be considered stable across multiple searches with the same query structure?
Interesting.. Surrogates can also bring the searcher's subjectivity (opinion and context) into it by the learning process?

shridhar

Sean Timm wrote: It may not be easy or even possible without major changes, but having global collection statistics would allow scores to be compared across searchers. [...]
Re: Ideas for a relevance score that could be considered stable across multiple searches with the same query structure?
On 4/11/07, Chris Hostetter [EMAIL PROTECTED] wrote: A custom Similarity class with simplified tf, idf, and queryNorm functions might also help you get scores from the Explain method that are more easily manageable, since you'll have predictable query structures hard coded into your application. ie: run the large query once, get the results back, and for each result look at the explanation and pull out the individual pieces of the explanation and compare them with those of the other matches to create your own normalization.

Chuck Williams mentioned a proposal he had for normalization of scores that would give a constant score range and allow comparison of scores. Chuck, did you ever write any code to that end, or was it just algorithmic discussion?

Here is the point I'm at now: I have my matching engine working. The fields to be indexed and the queries are defined by the user. Hoss, I'm not sure how that affects your idea of having a custom Similarity class, since you mentioned that having predictable query structures was important... The user kicks off an indexing run, then defines the queries they want to try matching with. Here is an example of the query fragments I'm working with right now:

year_str:"${Year}"^2 year_str:[${Year -1} TO ${Year +1}]
title_title_mv:"${Title}"^10 title_title_mv:${Title}^2 +(title_title_mv:"${Title}"~^5 title_title_mv:${Title}~)
director_name_mv:"${Director}"~2^10 director_name_mv:${Director}^5 director_name_mv:${Director}~.7

For each item in the source feed, the variables are interpolated (the query term is transformed into a grouped term if there are multiple values for a variable). That query is then made to find the overall best match. I then determine the relevance for each query fragment. I haven't written any plugins for Lucene yet, so my current method of determining the relevance is to run each query fragment by itself, then iterate through the results looking to see if the overall best match is in the result set. If it is, I record the rank and multiply that rank (e.g. 5 out of 10) by a configured fragment weight. Since the scores aren't normalized, I have no good way of distinguishing a poor overall match from a really high quality one. The overall item could be the first item returned in each of the query fragments. Any help here would be very appreciated. Ideally, I'm hoping that maybe Chuck has a patch or plugin that I could use to normalize my scores such that I could let the user do a matching run, look at the results, and determine what score threshold to set for subsequent runs.

Thanks, Daniel
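One plausible reading of the per-fragment rank-and-weight scheme described above can be sketched as follows. This is an interpretation, not Daniel's actual code: it uses reciprocal rank so that a better placement of the overall best match yields a higher score, and all IDs and weights are made up.

```python
# Sketch: combine per-fragment ranks of the overall best match into one
# score. Reciprocal rank is one interpretation of "rank times weight";
# IDs and weights below are illustrative only.

def fragment_score(result_ids, best_match_id, weight):
    """Score one fragment: the fragment weight scaled by where the
    overall best match ranks in that fragment's own result list."""
    if best_match_id not in result_ids:
        return 0.0
    rank = result_ids.index(best_match_id) + 1  # 1-based rank
    return weight / rank

def combined_score(fragments, best_match_id):
    """fragments: list of (result_ids, weight) pairs, one per query fragment."""
    return sum(fragment_score(ids, w_and_ids := w, weight=w) if False else
               fragment_score(ids, best_match_id, w)
               for ids, w in fragments)

score = combined_score(
    [(["m7", "m3", "m9"], 10.0),   # title fragment: best match m7 ranks 1st
     (["m2", "m7"], 5.0),          # director fragment: m7 ranks 2nd
     (["m4"], 2.0)],               # year fragment: m7 absent
    best_match_id="m7")
print(score)  # 10/1 + 5/2 + 0 = 12.5
```

Because each fragment contributes at most its own weight, the combined score has a known maximum (the sum of the weights), which gives a crude stand-in for the normalization Daniel is after when setting a threshold across runs.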
Re: Ideas for a relevance score that could be considered stable across multiple searches with the same query structure?
It may not be easy or even possible without major changes, but having global collection statistics would allow scores to be compared across searchers. To do this, the master indexes would need to be able to communicate with each other. Another approach to merging across searchers is described here:

Steven M. Beitzel, Eric C. Jensen, Abdur Chowdhury, Greg Pass, Ophir Frieder, "Surrogate Scoring for Improved Metasearch Precision", Proceedings of the 2005 ACM Conference on Research and Development in Information Retrieval (SIGIR-2005), Salvador, Brazil, August 2005.

-Sean

[EMAIL PROTECTED] wrote: On 4/11/07, Chris Hostetter [EMAIL PROTECTED] wrote: A custom Similarity class with simplified tf, idf, and queryNorm functions might also help you get scores from the Explain method that are more easily manageable... [...]
Any Parm Substituion Ideas...
I really like the flexibility of naming request handlers to append general constraints / filters. Has anyone spun thoughts around something like a solr.ParmSubstHandler, or any way to pass maybe a special ps=0:discussions; ps=1:images; ps=2:false?

<requestHandler name="partitioned" class="solr.ParmSubstHandler">
  <lst name="defaults">
    ...
  </lst>
  <lst name="appends">
    <str name="fq">category:[0]</str>
    <str name="fq">category:[1]</str>
    <str name="fq">isadmin:[2]</str>
  </lst>
  ...
</requestHandler>

This may be inappropriate for building into Solr; I'm not sure, but I'm looking at techniques to round out the appends to be even more flexible. If there is interest and it makes sense to a wider audience, maybe I should try my hand at it.

Thanks...Jim Dow.
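The substitution the ps= parameters imply might look like the following standalone sketch. solr.ParmSubstHandler does not exist; this only illustrates the token-replacement step, with all names hypothetical.

```python
# Sketch: replace numbered tokens like "[0]" in configured fq strings with
# values taken from ps parameters (ps=0:discussions etc.). Hypothetical
# handler behavior; names are illustrative.
import re

def parse_ps(ps_values):
    """Turn ['0:discussions', '1:images', '2:false'] into {'0': 'discussions', ...}."""
    return dict(v.split(":", 1) for v in ps_values)

def substitute(fq_templates, ps):
    """Expand [N] tokens; tokens with no matching ps value are left untouched."""
    return [re.sub(r"\[(\d+)\]", lambda m: ps.get(m.group(1), m.group(0)), fq)
            for fq in fq_templates]

filters = substitute(
    ["category:[0]", "category:[1]", "isadmin:[2]"],
    parse_ps(["0:discussions", "1:images", "2:false"]))
print(filters)  # ['category:discussions', 'category:images', 'isadmin:false']
```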
Re: Any Parm Substituion Ideas...
I'm not certain that I understand exactly what you are describing, but there was some discussion a while back that may be similar: http://issues.apache.org/jira/browse/SOLR-109 ...there's not a lot in the issue itself, but the linked discussion may be fruitful for you. If what you are describing is the same thing, then I certainly think it would be a handy addition to SolrQueryParser and the core request handlers.

: Has anyone spun thoughts around something like a solr.ParmSubstHandler or any way to pass maybe a special
: ps=0:discussions; ps=1:images; ps=2:false

-Hoss