IndexSchema is not mutable error Solr Cloud 7.7.1
Hi All, I made a change to the schema to add new fields in a collection; this was uploaded to Zookeeper via the below commands.

For the schema:
solr zk cp file:E:\SolrCloud\server\solr\configsets\COLLECTION\conf\schema.xml zk:/configs/COLLECTION/schema.xml -z SERVERNAME1.uleaf.site

For the solrconfig:
solr zk cp file:E:\SolrCloud\server\solr\configsets\COLLECTION\conf\solrconfig.xml zk:/configs/COLLECTION/solrconfig.xml -z SERVERNAME1.uleaf.site

Note: the solrconfig has defined.

When I then go to update a record with the new field, I get the following error:

org.apache.solr.common.SolrException: This IndexSchema is not mutable.
 at org.apache.solr.update.processor.AddSchemaFieldsUpdateProcessorFactory$AddSchemaFieldsUpdateProcessor.processAdd(AddSchemaFieldsUpdateProcessorFactory.java:376)
 at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
 at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
 at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
 at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
 at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
 at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
 at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
 at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
 at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
 at org.apache.solr.update.processor.FieldNameMutatingUpdateProcessorFactory$1.processAdd(FieldNameMutatingUpdateProcessorFactory.java:75)
 at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
 at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
 at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
 at org.apache.solr.update.processor.AbstractDefaultValueUpdateProcessorFactory$DefaultValueUpdateProcessor.processAdd(AbstractDefaultValueUpdateProcessorFactory.java:92)
 at org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:110)
 at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:327)
 at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readIterator(JavaBinUpdateRequestCodec.java:280)
 at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:333)
 at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:278)
 at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readNamedList(JavaBinUpdateRequestCodec.java:235)
 at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:298)
 at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:278)
 at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:191)
 at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:126)
 at org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:123)
 at org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:70)
 at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)
 at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
 at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:2551)
 at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:710)
 at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:516)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:395)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:341)
 at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1602)
 at ...
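For readers hitting the same error: the AddSchemaFieldsUpdateProcessorFactory in that trace is the "add unknown fields to the schema" (field guessing) processor, and it can only run against a managed, mutable schema. The sketch below shows the two solrconfig.xml pieces involved; the element names follow the stock Solr 7.x configsets, but the poster's actual chain may differ, so treat this as illustrative rather than their real config.

<!-- Illustrative sketch based on the stock 7.x configsets; not the poster's actual solrconfig.xml. -->

<!-- A classic, immutable schema read from schema.xml: -->
<schemaFactory class="ClassicIndexSchemaFactory"/>

<!-- The field-guessing processor and chain. When this chain runs against a
     classic schema it throws "This IndexSchema is not mutable". In the stock
     configs it is only the default chain when update.autoCreateFields is true,
     which is why setting update.autoCreateFields=false avoids the error. -->
<updateProcessor class="solr.AddSchemaFieldsUpdateProcessorFactory" name="add-schema-fields">
  <str name="defaultFieldType">text_general</str>
</updateProcessor>

<updateRequestProcessorChain name="add-unknown-fields-to-the-schema"
                             default="${update.autoCreateFields:true}"
                             processor="add-schema-fields">
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>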
RE: Query regarding Solr Cloud Setup
Hi Jörn/Erick/Shawn, thanks for your responses.

@Jörn - much appreciated for the heads up on Kerberos authentication; it's something we haven't really considered at the moment, though for production this may well be the case. With regards to the Solr nodes, 3 is something we are looking at as a minimum. When adding a new Solr node to the cluster, will settings/configuration be applied by Zookeeper on the new node, or is there manual intervention?

@Erick - With regards to core.properties, on standard Solr the update.autoCreateFields=false is within the core.properties file; however, for Cloud I have it added within solrconfig.xml which gets uploaded to Zookeeper. I appreciate standalone and cloud may work entirely differently; I just wanted to ensure it's the correct way of doing it.

@Shawn - Will try creating the lib directory in Solr home to see if it gets picked up, and having 5 Zookeepers would more than satisfy high availability.

Regards
Ian

-----Original Message-----
From: Jörn Franke

If you have a properly secured cluster, e.g. with Kerberos, then you should not update files in ZK directly. Use the corresponding Solr REST interfaces; then you are also less likely to mess something up. If you want to have HA you should have at least 3 Solr nodes and replicate the collection to all three of them (more is not needed from a HA point of view). This would also allow you to upgrade the cluster without downtime.

-----Original Message-----
From: erickerick...@gmail.com>

Having custom core.properties files is “fraught”. First of all, that file can be re-written. Second, the collections ADDREPLICA command will create a new core.properties file. Third, any mistakes you make when hand-editing the file can have grave consequences. What change exactly do you want to make to core.properties and why? Trying to reproduce “what a colleague has done on standalone” is not something I’d recommend; SolrCloud is a different beast. Reproducing the _behavior_ is another thing, so what is the behavior you want in SolrCloud that causes you to want to customize core.properties?

Best,
Erick

-----Original Message-----
From: Shawn Heisey

I cannot tell what you are asking here. The core.properties file lives on the disk, not in ZK. I was under the impression that .jar files could not be loaded into ZK and used in a core config. Documentation saying otherwise was recently pointed out to me on the list, but I remain skeptical that this actually works, and I have not tried to implement it myself. The best way to handle custom jar loading is to create a "lib" directory under the solr home, and place all jars there. Solr will automatically load them all before any cores are started, and no config commands of any kind will be needed to make it happen.

> Also from a high availability aspect, if I effectively lost 2 of the Solr
> servers due to an outage will the system still work as expected? Would I
> expect any data loss?

If all three Solr servers have a complete copy of all your indexes, then you should remain fully operational if two of those Solr servers go down. Note that if you have three ZK servers and you lose two, that means that you have lost zookeeper quorum, and in that situation SolrCloud will transition to read only -- you will not be able to change any index in the cloud. This is how ZK is designed and it cannot be changed. If you want a ZK deployment to survive the loss of two servers, you must have at least five total ZK servers, so that more than 50 percent of the total survives.

Thanks,
Shawn
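To make Shawn's lib-directory suggestion concrete, the layout would look roughly like this on each node (paths are illustrative, borrowing the E:\ install mentioned elsewhere in this list; the key point is that lib sits directly under the solr home, next to solr.xml):

E:\SolrCloud\server\solr\                    <- solr home (contains solr.xml)
E:\SolrCloud\server\solr\lib\                <- create this directory yourself
E:\SolrCloud\server\solr\lib\my-plugin.jar   <- custom jars; loaded before any cores start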
Query regarding Solr Cloud Setup
Hi, I am relatively new to Solr, especially Solr Cloud, and have been using it for a few days now. I think I have set up Solr Cloud correctly, however I would like some guidance to ensure I am doing it correctly. I ideally want to be able to process 40 million documents on production via Solr Cloud. The number of fields is undefined as the documents may differ, but could be around 20+.

The setup I have at present is as follows (note this is all on 1 machine for now): a 3 Zookeeper ensemble (all running on different ports), which works as expected, and 3 Solr nodes started on separate ports (note: directory path -> D:\solr-7.7.1\example\cloud\Node (1/2/3)). The production setup would be similar to the above except it's on my local machine; the below is the Graph status in Solr Cloud.

I have a few questions which I cannot seem to find the answer for on the web. We have a schema which I have managed to upload to Zookeeper along with the solrconfig; how do I get the system to recognise both a lib/.jar extension and a custom core.properties file? I bypassed the issue of the core.properties by amending update.autoCreateFields in the solrconfig.xml to false, however I would like to include it as a colleague has done on Solr Standalone.

Also from a high availability aspect, if I effectively lost 2 of the Solr servers due to an outage, will the system still work as expected? Would I expect any data loss?
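For reference, pushing an edited configset back to ZooKeeper and having a running collection pick it up usually looks something like the following; the names, paths and ZK connection string below are placeholders, not taken from this setup, and note that the ZooKeeper client port (typically 2181) must be part of the -z argument:

REM Illustrative only -- adjust names, paths and the ZK host:port list to your setup.
REM Upload the whole conf directory (schema.xml + solrconfig.xml) as one configset:
bin\solr.cmd zk upconfig -n mycollection -d D:\solr-7.7.1\server\solr\configsets\mycollection\conf -z localhost:2181,localhost:2182,localhost:2183

REM Then reload the collection so every replica re-reads the updated config:
REM http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection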
Re: Solr POST Tool Hidden Files
Agreed, but yes it skips them even when explicitly referenced by name. The line I linked to (530) will skip any file whose name begins with a dot. If there's a better workaround than what I've proposed then I'm certainly open to it.

Best,
Ian

On Fri, Jun 1, 2018 at 1:25 PM, Alexandre Rafalovitch wrote:
> Does it still skip them if they are provided directly by name? It is rather
> a narrow use case.
>
> Regards,
> Alex
>
> On Fri, Jun 1, 2018, 1:01 PM Ian Goldsmith-Rooney, <
> iangoldsmithroo...@gmail.com> wrote:
>
> > Hello,
> >
> > I was hoping to make a small change to allow the simple POST tool to accept
> > a command line arg (-Dhidden=yes) so that it will not ignore hidden files.
> > Currently there is no toggle; it always ignores hidden files
> > <https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/util/SimplePostTool.java#L530>.
> >
> > Having never contributed to Solr before, can somebody point me to the best
> > way of making this change if it is acceptable? The HowToContribute page
> > indicated that one can either go the route of GitHub fork -> PR or JIRA bug
> > -> patch. Additionally I wasn't sure which git branch would be best in this
> > case. Any guidance on best practices is much appreciated!
> >
> > Best,
> > --
> > Ian Goldsmith-Rooney

--
Ian Goldsmith-Rooney
Solr POST Tool Hidden Files
Hello, I was hoping to make a small change to allow the simple POST tool to accept a command line arg (-Dhidden=yes) so that it will not ignore hidden files. Currently there is no toggle; it always ignores hidden files <https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/util/SimplePostTool.java#L530>. Having never contributed to Solr before, can somebody point me to the best way of making this change if it is acceptable? The HowToContribute page indicated that one can either go the route of GitHub fork -> PR or JIRA bug -> patch. Additionally I wasn't sure which git branch would be best in this case. Any guidance on best practices is much appreciated! Best, -- Ian Goldsmith-Rooney
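To illustrate the shape of the change being proposed, here is a standalone sketch; it is not a patch against the real SimplePostTool, whose internals differ, and the -Dhidden=yes flag is simply the one suggested above:

import java.io.File;

// Standalone sketch of the proposed -Dhidden=yes toggle; not the actual
// SimplePostTool code. Today the tool unconditionally skips files whose
// names start with a dot; the idea is to make that skip conditional.
public class HiddenFileToggleSketch {
  public static void main(String[] args) {
    boolean includeHidden = "yes".equals(System.getProperty("hidden", "no"));
    File dir = new File(args.length > 0 ? args[0] : ".");
    File[] files = dir.listFiles();
    if (files == null) return;
    for (File file : files) {
      if (!includeHidden && file.getName().startsWith(".")) {
        continue; // current behavior: hidden files are always skipped
      }
      System.out.println("would post: " + file.getName()); // stand-in for the real upload call
    }
  }
}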
RE: Re:the number of docs in each group depends on rows
When I looked at this in Solr 5.5.3, the second phase of the query was only sent to the shards that returned documents in the first phase; the problem is that one shard may contain matching documents in a group but ranked outside the top N results.

Fatduo, this solution won't help you unless you are looking at changing some Solr code, but it supports Diego's point that maybe this could be fixed (and it gives a starting point to look at, as the code may have changed in 7.0). We changed the grouping code to search all shards on the second phase. (I think that this was all that was needed, but we changed grouping to be two level so there is lots of change in the grouping code.) In the 5.5.3 code base we changed the method constructRequest(ResponseBuilder rb) in TopGroupsShardRequestFactory to always call createRequestForAllShards(rb).

Ian
NLA

-----Original Message-----
From: Diego Ceccarelli (BLOOMBERG/ LONDON) <dceccarel...@bloomberg.net>
Sent: Friday, 4 May 2018 9:37 PM
To: solr-user@lucene.apache.org
Subject: Re: the number of docs in each group depends on rows

Hello, I'm not 100% sure, but I think that if you have multiple shards the number of docs matched in each group is *not* guaranteed to be exact. Increasing the rows will increase the amount of partial information that each shard sends to the federator and make the number more precise. For exact counts you might need one shard OR to make sure that all the documents in the same group are in the same shard by using document routing via composite keys [1]. Thinking about that, it should be possible to fix grouping to compute the exact numbers on request...

cheers,
Diego

[1] https://lucene.apache.org/solr/guide/6_6/shards-and-indexing-data-in-solrcloud.html#shards-and-indexing-data-in-solrcloud

From: solr-user@lucene.apache.org At: 05/04/18 07:53:41 To: solr-user@lucene.apache.org
Subject: the number of docs in each group depends on rows

Hi, We used Solr Cloud 7.1.0 (3 nodes, 3 shards with 2 replicas). When we used a group query, we found that the number of docs in each group depends on the rows number (group number). difference: <http://lucene.472066.n3.nabble.com/file/t494000/difference.jpeg> When rows is bigger than 5, the returned docs are correct and stable; for the rest, the number of docs is smaller than the actual result. Could you please explain why and give me some suggestion about how to decide the rows number?

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
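For anyone wanting to try the same experiment, the change Ian describes would sit roughly here; this is a paraphrased fragment, not the actual 5.5.3 source, with the class and method names taken from the email above:

// Paraphrased fragment in the style of the 5.5.3-era TopGroupsShardRequestFactory,
// not the real Solr source. The stock code only sends the second grouping phase to
// the shards that returned documents in the first phase; the change described above
// fans it out to every shard instead, so groups whose members ranked outside the
// top N on some shard still contribute their documents.
public ShardRequest[] constructRequest(ResponseBuilder rb) {
  // was (roughly): return createRequestForSpecificShards(rb);
  return createRequestForAllShards(rb);
}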
possible to dump "routing table" from a single Solr node?
Hi all, I'm having a situation where our SolrCloud cluster often gets into a bad state where our solr nodes frequently respond with "no servers hosting shard" even though the node that hosts that shard is clearly up. We suspect that this is a state bug where some servers are somehow ending up with an incorrect view of the network (e.g. which nodes are up/down, which shards are hosted on which nodes). Is it possible to somehow get a "dump" of the current "routing table" (i.e. documents with prefixes in this range in this collection are stored in this shard on this node)? That would help immensely when debugging. Thanks! - Ian
Using facets and stats with solr v4
Hi guys, I have a question about using facet queries but getting stats for each facet item; it seems this is possible on Solr v5+. Something like this: q=*:*=true={!stats=t1}servicename={!tag=t1}duration=0=true=json=true. It also seems this isn't available for lower versions (v4), so is there any way to achieve something similar, as we are stuck on v4? Any help / advice / pointers would be great! - thanks in advance, Ian
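The query string above appears to have lost its parameter names in the archive. On Solr 5+, the tagged stats-over-facets feature it seems to be describing is usually written along these lines (field names servicename/duration come from the question; the exact parameters are a best guess at the original, not a verbatim recovery):

q=*:*&rows=0&wt=json&indent=true
  &stats=true&stats.field={!tag=t1}duration
  &facet=true&facet.pivot={!stats=t1}servicename

On 4.x there is no tag-based linking of stats to facets; the older stats.facet parameter (e.g. stats=true&stats.field=duration&stats.facet=servicename), with its documented limitations, or one filtered stats query per facet value, are the usual workarounds.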
Re: Async deleteshard commands?
Done! https://issues.apache.org/jira/browse/SOLR-7481

On Tue, Apr 28, 2015 at 11:09 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

This is a bug. Can you please open a Jira issue?

On Tue, Apr 28, 2015 at 8:35 PM, Ian Rose ianr...@fullstory.com wrote:

Is it possible to run DELETESHARD commands in async mode? Google searches seem to indicate yes, but not definitively. My local experience indicates otherwise. If I start with an async SPLITSHARD like so:

http://localhost:8983/solr/admin/collections?action=splitshard&collection=2Gp&shard=shard1_0_0&async=12-foo-1

Then I get back the expected response format, with <str name="requestid">12-foo-1</str>, and I can later query for the result via REQUESTSTATUS. However if I try an async DELETESHARD like so:

http://localhost:8983/solr/admin/collections?action=deleteshard&collection=2Gp&shard=shard1_0_0&async=12-foo-4

The response includes the command result, indicating that the command was not run async:

<lst name="success">
  <lst name="192.168.1.106:8983_solr">
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">16</int>
    </lst>
  </lst>
</lst>

And in addition REQUESTSTATUS calls for that requestId fail with "Did not find taskid [12-foo-4] in any tasks queue."

Synchronous deletes are causing problems for me in production as they are timing out in some cases.

Thanks,
Ian

p.s. I'm on version 5.0.0

--
Regards,
Shalin Shekhar Mangar.
Re: Async deleteshard commands?
Hi Anshum, FWIW I find that page is not entirely accurate with regard to async params. For example, my testing shows that DELETEREPLICA *does* support the async param, although that is not listed here: https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api9

Cheers,
Ian

On Tue, Apr 28, 2015 at 12:47 PM, Anshum Gupta ans...@anshumgupta.net wrote:

Hi Ian, DELETESHARD doesn't support ASYNC calls officially. We could certainly do with a better response, but I believe with most of the Collections API calls at this time in Solr, you could send random params which would get ignored. Therefore, in this case, I believe that the async param gets ignored. The go-to reference point to check what's supported is the official reference guide: https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api7 This doesn't mention support for async DELETESHARD calls.

On Tue, Apr 28, 2015 at 8:05 AM, Ian Rose ianr...@fullstory.com wrote:

Is it possible to run DELETESHARD commands in async mode? Google searches seem to indicate yes, but not definitively. My local experience indicates otherwise. If I start with an async SPLITSHARD like so:

http://localhost:8983/solr/admin/collections?action=splitshard&collection=2Gp&shard=shard1_0_0&async=12-foo-1

Then I get back the expected response format, with <str name="requestid">12-foo-1</str>, and I can later query for the result via REQUESTSTATUS. However if I try an async DELETESHARD like so:

http://localhost:8983/solr/admin/collections?action=deleteshard&collection=2Gp&shard=shard1_0_0&async=12-foo-4

The response includes the command result, indicating that the command was not run async:

<lst name="success">
  <lst name="192.168.1.106:8983_solr">
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">16</int>
    </lst>
  </lst>
</lst>

And in addition REQUESTSTATUS calls for that requestId fail with "Did not find taskid [12-foo-4] in any tasks queue."

Synchronous deletes are causing problems for me in production as they are timing out in some cases.

Thanks,
Ian

p.s. I'm on version 5.0.0

--
Anshum Gupta
Re: Async deleteshard commands?
Sure. Here is an example of ADDREPLICA in synchronous mode:

http://localhost:8983/solr/admin/collections?action=addreplica&collection=293&shard=shard1_1

response:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1168</int>
  </lst>
  <lst name="success">
    <lst>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">1158</int>
      </lst>
      <str name="core">293_shard1_1_replica2</str>
    </lst>
  </lst>
</response>

And here is the same in asynchronous mode:

http://localhost:8983/solr/admin/collections?action=addreplica&collection=293&shard=shard1_1&async=foo99

response:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">2</int>
  </lst>
  <str name="requestid">foo99</str>
</response>

Note that the format of this response does NOT match the response format that I got from the attempt at an async DELETESHARD in my earlier email. Also note that I am now able to query for the status of this request:

http://localhost:8983/solr/admin/collections?action=requeststatus&requestid=foo99

response:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="status">
    <str name="state">completed</str>
    <str name="msg">found foo99 in completed tasks</str>
  </lst>
</response>

On Tue, Apr 28, 2015 at 2:06 PM, Anshum Gupta ans...@anshumgupta.net wrote:

Hi Ian, What do you mean by *my testing shows*? Can you elaborate on the steps and how you confirmed that the call was indeed *async*? I may be wrong but I think what you're seeing is a normal DELETEREPLICA call succeeding behind the scenes. It is not treated or processed as an async call. Also, that page is the official reference guide and might need fixing if it's out of sync.

On Tue, Apr 28, 2015 at 10:47 AM, Ian Rose ianr...@fullstory.com wrote:

Hi Anshum, FWIW I find that page is not entirely accurate with regard to async params. For example, my testing shows that DELETEREPLICA *does* support the async param, although that is not listed here: https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api9

Cheers,
Ian

On Tue, Apr 28, 2015 at 12:47 PM, Anshum Gupta ans...@anshumgupta.net wrote:

Hi Ian, DELETESHARD doesn't support ASYNC calls officially. We could certainly do with a better response, but I believe with most of the Collections API calls at this time in Solr, you could send random params which would get ignored. Therefore, in this case, I believe that the async param gets ignored. The go-to reference point to check what's supported is the official reference guide: https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api7 This doesn't mention support for async DELETESHARD calls.

On Tue, Apr 28, 2015 at 8:05 AM, Ian Rose ianr...@fullstory.com wrote:

Is it possible to run DELETESHARD commands in async mode? Google searches seem to indicate yes, but not definitively. My local experience indicates otherwise. If I start with an async SPLITSHARD like so:

http://localhost:8983/solr/admin/collections?action=splitshard&collection=2Gp&shard=shard1_0_0&async=12-foo-1

Then I get back the expected response format, with <str name="requestid">12-foo-1</str>, and I can later query for the result via REQUESTSTATUS.

However if I try an async DELETESHARD like so:

http://localhost:8983/solr/admin/collections?action=deleteshard&collection=2Gp&shard=shard1_0_0&async=12-foo-4

The response includes the command result, indicating that the command was not run async:

<lst name="success">
  <lst name="192.168.1.106:8983_solr">
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">16</int>
    </lst>
  </lst>
</lst>

And in addition REQUESTSTATUS calls for that requestId fail with "Did not find taskid [12-foo-4] in any tasks queue."

Synchronous deletes are causing problems for me in production as they are timing out in some cases.

Thanks,
Ian

p.s. I'm on version 5.0.0

--
Anshum Gupta

--
Anshum Gupta
Async deleteshard commands?
Is it possible to run DELETESHARD commands in async mode? Google searches seem to indicate yes, but not definitively. My local experience indicates otherwise. If I start with an async SPLITSHARD like so:

http://localhost:8983/solr/admin/collections?action=splitshard&collection=2Gp&shard=shard1_0_0&async=12-foo-1

Then I get back the expected response format, with <str name="requestid">12-foo-1</str>, and I can later query for the result via REQUESTSTATUS. However if I try an async DELETESHARD like so:

http://localhost:8983/solr/admin/collections?action=deleteshard&collection=2Gp&shard=shard1_0_0&async=12-foo-4

The response includes the command result, indicating that the command was not run async:

<lst name="success">
  <lst name="192.168.1.106:8983_solr">
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">16</int>
    </lst>
  </lst>
</lst>

And in addition REQUESTSTATUS calls for that requestId fail with "Did not find taskid [12-foo-4] in any tasks queue."

Synchronous deletes are causing problems for me in production as they are timing out in some cases.

Thanks,
Ian

p.s. I'm on version 5.0.0
proper routing (from non-Java client) in solr cloud 5.0.0
Hi all - I've just upgraded my dev install of Solr (cloud) from 4.10 to 5.0. Our client is written in Go, for which I am not aware of an existing client library, so we wrote our own. One tricky bit for this was the routing logic: if a document has routing prefix X and belongs to collection Y, we need to know which Solr node to connect to. Previously we accomplished this by watching the clusterstate.json file in zookeeper - at startup and whenever it changes, the client parses the file contents to build a routing table. However in 5.0 newly created collections do not show up in clusterstate.json but instead have their own state.json document. Are there any recommendations for how to handle this from the client? The obvious answer is to watch every collection's state.json document, but we run a lot of collections (~1000 currently, and growing) so I'm concerned about keeping that many watches open at the same time (should I be?). How does the SolrJ client handle this? Thanks! - Ian
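For anyone building a similar non-Java client, the pre-5.0 approach described above looks roughly like this with the plain ZooKeeper Java client (the Go client in the thread would do the equivalent). This is a minimal sketch: JSON parsing, error handling, and the per-collection /collections/<name>/state.json layout that 5.0 introduces are left out.

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Minimal sketch of the "watch clusterstate.json" approach: read the file,
// rebuild the routing table, and re-arm the watch on every change.
public class ClusterStateWatcher implements Watcher {
  private final ZooKeeper zk;

  public ClusterStateWatcher(String zkHost) throws Exception {
    this.zk = new ZooKeeper(zkHost, 15000, this);
    refresh();
  }

  private void refresh() throws Exception {
    // getData both returns the current bytes and re-registers this watcher.
    byte[] data = zk.getData("/clusterstate.json", this, null);
    rebuildRoutingTable(new String(data, "UTF-8"));
  }

  @Override
  public void process(WatchedEvent event) {
    try {
      refresh(); // any change triggers a re-read and re-arms the watch
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

  private void rebuildRoutingTable(String clusterStateJson) {
    // Hypothetical: parse shard hash ranges and leader URLs per collection here.
  }
}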
Re: proper routing (from non-Java client) in solr cloud 5.0.0
Hi Hrishikesh, Thanks for the pointers - I had not looked at SOLR-5474 https://issues.apache.org/jira/browse/SOLR-5474 previously. Interesting approach... I think we will stick with trying to keep zk watches open from all clients to all collections for now, but if that starts to be a bottleneck it's good to know the route that Solrj has chosen... cheers, Ian

On Tue, Apr 14, 2015 at 3:56 PM, Hrishikesh Gadre gadre.s...@gmail.com wrote: Hi Ian, As per my understanding, Solrj does not use Zookeeper watches but instead caches the information (along with a TTL). You can find more information here: https://issues.apache.org/jira/browse/SOLR-5473 https://issues.apache.org/jira/browse/SOLR-5474 Regards Hrishikesh

On Tue, Apr 14, 2015 at 8:49 AM, Ian Rose ianr...@fullstory.com wrote: Hi all - I've just upgraded my dev install of Solr (cloud) from 4.10 to 5.0. Our client is written in Go, for which I am not aware of an existing client library, so we wrote our own. One tricky bit for this was the routing logic: if a document has routing prefix X and belongs to collection Y, we need to know which Solr node to connect to. Previously we accomplished this by watching the clusterstate.json file in zookeeper - at startup and whenever it changes, the client parses the file contents to build a routing table. However in 5.0 newly created collections do not show up in clusterstate.json but instead have their own state.json document. Are there any recommendations for how to handle this from the client? The obvious answer is to watch every collection's state.json document, but we run a lot of collections (~1000 currently, and growing) so I'm concerned about keeping that many watches open at the same time (should I be?). How does the SolrJ client handle this? Thanks! - Ian
Re: Help understanding addreplica error message re: maxShardsPerNode
Wups - sorry folks, I sent this prematurely. After typing this out I think I have it figured out - although SPLITSHARD ignores maxShardsPerNode, ADDREPLICA does not. So ADDREPLICA fails because I already have too many shards on a single node.

On Wed, Apr 8, 2015 at 11:18 PM, Ian Rose ianr...@fullstory.com wrote: On my local machine I have the following test setup:
* 2 nodes (JVMs)
* 1 collection named testdrive, that was originally created with numShards=1 and maxShardsPerNode=1.
* After a series of SPLITSHARD commands, I now have 4 shards, as follows:

testdrive_shard1_0_0_replica1 (L) Active 115
testdrive_shard1_0_1_replica1 (L) Active 0
testdrive_shard1_1_0_replica1 (L) Active 5
testdrive_shard1_1_1_replica1 (L) Active 88

The number in the last column is the number of documents. The 4 shards are all on the same node; the second node holds nothing for this collection. Already, this situation is a little strange because I have 4 shards on one node, despite the fact that maxShardsPerNode is 1. My guess is that SPLITSHARD ignores the maxShardsPerNode value - is that right? Now, if I issue an ADDREPLICA command with collection=testdrive&shard=shard1_0_0, I get the following error: "Cannot create shards testdrive. Value of maxShardsPerNode is 1, and the number of live nodes is 2. This allows a maximum of 2 to be created. Value of numShards is 4 and value of replicationFactor is 1. This requires 4 shards to be created (higher than the allowed number)" I don't totally understand this.
Re: change maxShardsPerNode for existing collection?
Thanks, I figured that might be the case (hand-editing clusterstate.json). - Ian

On Wed, Apr 8, 2015 at 11:46 PM, ralph tice ralph.t...@gmail.com wrote: It looks like there's a patch available: https://issues.apache.org/jira/browse/SOLR-5132 Currently the only way without that patch is to hand-edit clusterstate.json, which is very ill advised. If you absolutely must, it's best to stop all your Solr nodes, back up the current clusterstate in ZK, modify it, and then start your nodes.

On Wed, Apr 8, 2015 at 10:21 PM, Ian Rose ianr...@fullstory.com wrote: I previously created several collections with maxShardsPerNode=1 but I would now like to change that (to unlimited if that is an option). Is changing this value possible? Cheers, - Ian
Help understanding addreplica error message re: maxShardsPerNode
On my local machine I have the following test setup:
* 2 nodes (JVMs)
* 1 collection named testdrive, that was originally created with numShards=1 and maxShardsPerNode=1.
* After a series of SPLITSHARD commands, I now have 4 shards, as follows:

testdrive_shard1_0_0_replica1 (L) Active 115
testdrive_shard1_0_1_replica1 (L) Active 0
testdrive_shard1_1_0_replica1 (L) Active 5
testdrive_shard1_1_1_replica1 (L) Active 88

The number in the last column is the number of documents. The 4 shards are all on the same node; the second node holds nothing for this collection. Already, this situation is a little strange because I have 4 shards on one node, despite the fact that maxShardsPerNode is 1. My guess is that SPLITSHARD ignores the maxShardsPerNode value - is that right? Now, if I issue an ADDREPLICA command with collection=testdrive&shard=shard1_0_0, I get the following error: "Cannot create shards testdrive. Value of maxShardsPerNode is 1, and the number of live nodes is 2. This allows a maximum of 2 to be created. Value of numShards is 4 and value of replicationFactor is 1. This requires 4 shards to be created (higher than the allowed number)" I don't totally understand this.
change maxShardsPerNode for existing collection?
I previously created several collections with maxShardsPerNode=1 but I would now like to change that (to unlimited if that is an option). Is changing this value possible? Cheers, - Ian
Re: rough maximum cores (shards) per machine?
Per - Wow, 1 trillion documents stored is pretty impressive. One clarification: when you say that you have 2 replicas per collection on each machine, what exactly does that mean? Do you mean that each collection is sharded into 50 shards, divided evenly over all 25 machines (thus 2 shards per machine)? Or are some of these slave replicas (e.g. 25x sharding with 1 replica per shard)? Thanks!

On Wed, Mar 25, 2015 at 5:13 AM, Per Steffensen st...@designware.dk wrote: In one of our production environments we use 32GB, 4-core, 3T RAID0 spinning disk Dell servers (do not remember the exact model). We have about 25 collections with 2 replica (shard-instances) per collection on each machine - 25 machines. Total of 25 coll * 2 replica/coll/machine * 25 machines = 1250 replica. Each replica contains about 800 million pretty small documents - that's about 1000 billion (a trillion) documents all in all. We index about 1.5 billion new documents every day (mainly into one of the collections = 50 replica across 25 machines) and keep a history of 2 years on the data, shifting the indexing into a new collection every month. We can fairly easily keep up with the indexing load. We have almost none of the data on the heap, but of course a small fraction of the data in the files will at any time be in the OS file-cache. Compared to our indexing frequency we do not do a lot of searches. We have about 10 users searching the system from time to time - anything from major extracts to small quick searches. Depending on the nature of the search we have response-times between 1 sec and 5 min. But of course that is very dependent on clever choices for each field wrt index, store, doc-value etc. BUT we are not using out-of-box Apache Solr. We have made quite a lot of performance tweaks ourselves. Please note that, even though you disable all Solr caches, each replica will use heap-memory linearly dependent on the number of documents (and their size) in that replica. But not much, so you can get pretty far with relatively little RAM. Our version of Solr is based on Apache Solr 4.4.0, but I expect/hope it did not get worse in newer releases. Just to give you some idea of what can at least be achieved - in the high end of #replica and #docs, I guess.

Regards, Per Steffensen

On 24/03/15 14:02, Ian Rose wrote: Hi all - I'm sure this topic has been covered before but I was unable to find any clear references online or in the mailing list. Are there any rules of thumb for how many cores (aka shards, since I am using SolrCloud) is too many for one machine? I realize there is no one answer (depends on size of the machine, etc.) so I'm just looking for a rough idea. Something like the following would be very useful: * People commonly run up to X cores/shards on a mid-sized (4 or 8 core) server without any problems. * I have never heard of anyone successfully running X cores/shards on a single machine, even if you throw a lot of hardware at it. Thanks! - Ian
Re: TooManyBasicQueries?
Hi Erik - Sorry, I totally missed your reply. To the best of my knowledge, we are not using any surround queries (have to admit I had never heard of them until now). We use solr.SearchHandler for all of our queries. Does that answer the question? Cheers, Ian On Fri, Mar 13, 2015 at 10:08 AM, Erik Hatcher erik.hatc...@gmail.com wrote: It results from a surround query with too many terms. Says the javadoc: * Exception thrown when {@link BasicQueryFactory} would exceed the limit * of query clauses. I’m curious, are you issuing a large {!surround} query or is it expanding to hit that limit? — Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com On Mar 13, 2015, at 9:44 AM, Ian Rose ianr...@fullstory.com wrote: I sometimes see the following in my logs: ERROR org.apache.solr.core.SolrCore – org.apache.lucene.queryparser.surround.query.TooManyBasicQueries: Exceeded maximum of 1000 basic queries. What does this mean? Does this mean that we have issued a query with too many terms? Or that the number of concurrent queries running on the server is too high? Also, is this a builtin limit or something set in a config file? Thanks! - Ian
rough maximum cores (shards) per machine?
Hi all - I'm sure this topic has been covered before but I was unable to find any clear references online or in the mailing list. Are there any rules of thumb for how many cores (aka shards, since I am using SolrCloud) is too many for one machine? I realize there is no one answer (depends on size of the machine, etc.) so I'm just looking for a rough idea. Something like the following would be very useful: * People commonly run up to X cores/shards on a mid-sized (4 or 8 core) server without any problems. * I have never heard of anyone successfully running X cores/shards on a single machine, even if you throw a lot of hardware at it. Thanks! - Ian
Re: TooManyBasicQueries?
Ah yes, right you are. I had thought that `surround` required a different endpoint, but I see now that someone is using a surround query. Many thanks! On Tue, Mar 24, 2015 at 10:02 AM, Erik Hatcher erik.hatc...@gmail.com wrote: Somehow a surround query is being constructed along the way. Search your logs for “surround” and see if someone is maybe sneaking a q={!surround}… in there. If you’re passing input directly through from your application to Solr’s q parameter without any sanitizing or filtering, it’s possible a surround query parser could be asked for. — Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com http://www.lucidworks.com/ On Mar 24, 2015, at 8:55 AM, Ian Rose ianr...@fullstory.com wrote: Hi Erik - Sorry, I totally missed your reply. To the best of my knowledge, we are not using any surround queries (have to admit I had never heard of them until now). We use solr.SearchHandler for all of our queries. Does that answer the question? Cheers, Ian On Fri, Mar 13, 2015 at 10:08 AM, Erik Hatcher erik.hatc...@gmail.com wrote: It results from a surround query with too many terms. Says the javadoc: * Exception thrown when {@link BasicQueryFactory} would exceed the limit * of query clauses. I’m curious, are you issuing a large {!surround} query or is it expanding to hit that limit? — Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com On Mar 13, 2015, at 9:44 AM, Ian Rose ianr...@fullstory.com wrote: I sometimes see the following in my logs: ERROR org.apache.solr.core.SolrCore – org.apache.lucene.queryparser.surround.query.TooManyBasicQueries: Exceeded maximum of 1000 basic queries. What does this mean? Does this mean that we have issued a query with too many terms? Or that the number of concurrent queries running on the server is too high? Also, is this a builtin limit or something set in a config file? Thanks! - Ian
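For anyone else tracking down the same error: the 1000 limit is the surround parser's own cap on how many basic queries a single surround query may expand into (wildcard/prefix terms inside the query are what usually blow past it). If memory serves, it can be raised per request with the parser's maxBasicQueries local parameter, along these lines (the query itself is made up, and the field searched is whatever your default field is):

q={!surround maxBasicQueries=10000}4W(solr, cloud)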
Re: rough maximum cores (shards) per machine?
First off thanks everyone for the very useful replies thus far. Shawn - thanks for the list of items to check. #1 and #2 should be fine for us and I'll check our ulimit for #3. To add a bit of clarification, we are indeed using SolrCloud. Our current setup is to create a new collection for each customer. For now we allow SolrCloud to decide for itself where to locate the initial shard(s) but in time we expect to refine this such that our system will automatically choose the least loaded nodes according to some metric(s). Having more than one business entity controlling the configuration of a single (Solr) server is a recipe for disaster. Solr works well if there is an architect for the system. Jack, can you explain a bit what you mean here? It looks like Toke caught your meaning but I'm afraid it missed me. What do you mean by business entity? Is your concern that with automatic creation of collections they will be distributed willy-nilly across the cluster, leading to uneven load across nodes? If it is relevant, the schema and solrconfig are controlled entirely by me and is the same for all collections. Thus theoretically we could actually just use one single collection for all of our customers (adding a 'customer:whatever' type fq to all queries) but since we never need to query across customers it seemed more performant (as well as safer - less chance of accidentally leaking data across customers) to use separate collections. Better to give each tenant a separate Solr instance that you spin up and spin down based on demand. Regarding this, if by tenant you mean customer, this is not viable for us from a cost perspective. As I mentioned initially, many of our customers are very small so dedicating an entire machine to each of them would not be economical (or efficient). Or perhaps I am not understanding what your definition of tenant is? Cheers, Ian On Tue, Mar 24, 2015 at 4:51 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote: Jack Krupansky [jack.krupan...@gmail.com] wrote: I'm sure that I am quite unqualified to describe his hypothetical setup. I mean, he's the one using the term multi-tenancy, so it's for him to be clear. It was my understanding that Ian used them interchangeably, but of course Ian it the only one that knows. For me, it's a question of who has control over the config and schema and collection creation. Having more than one business entity controlling the configuration of a single (Solr) server is a recipe for disaster. Thank you. Now your post makes a lot more sense. I will not argue against that. - Toke Eskildsen
Re: rough maximum cores (shards) per machine?
Let me give a bit of background. Our Solr cluster is multi-tenant, where we use one collection for each of our customers. In many cases, these customers are very tiny, so their collection consists of just a single shard on a single Solr node. In fact, a non-trivial number of them are totally empty (e.g. trial customers that never did anything with their trial account). However there are also some customers that are larger, requiring their collection to be sharded. Our strategy is to try to keep the total documents in any one shard under 20 million (honestly not sure where my coworker got that number from - I am open to alternatives but I realize this is heavily app-specific). So my original question is not related to indexing or query traffic, but just the sheer number of cores. For example, if I have 10 active cores on a machine and everything is working fine, should I expect that everything will still work fine if I add 10 nearly-idle cores to that machine? What about 100? 1000? I figure the overhead of each core is probably fairly low but at some point starts to matter. Does that make sense? - Ian On Tue, Mar 24, 2015 at 11:12 AM, Jack Krupansky jack.krupan...@gmail.com wrote: Shards per collection, or across all collections on the node? It will all depend on: 1. Your ingestion/indexing rate. High, medium or low? 2. Your query access pattern. Note that a typical query fans out to all shards, so having more shards than CPU cores means less parallelism. 3. How many collections you will have per node. In short, it depends on what you want to achieve, not some limit of Solr per se. Why are you even sharding the node anyway? Why not just run with a single shard per node, and do sharding by having separate nodes, to maximize parallel processing and availability? Also be careful to be clear about using the Solr term shard (a slice, across all replica nodes) as distinct from the Elasticsearch term shard (a single slice of an index for a single replica, analogous to a Solr core.) -- Jack Krupansky On Tue, Mar 24, 2015 at 9:02 AM, Ian Rose ianr...@fullstory.com wrote: Hi all - I'm sure this topic has been covered before but I was unable to find any clear references online or in the mailing list. Are there any rules of thumb for how many cores (aka shards, since I am using SolrCloud) is too many for one machine? I realize there is no one answer (depends on size of the machine, etc.) so I'm just looking for a rough idea. Something like the following would be very useful: * People commonly run up to X cores/shards on a mid-sized (4 or 8 core) server without any problems. * I have never heard of anyone successfully running X cores/shards on a single machine, even if you throw a lot of hardware at it. Thanks! - Ian
TooManyBasicQueries?
I sometimes see the following in my logs: ERROR org.apache.solr.core.SolrCore – org.apache.lucene.queryparser.surround.query.TooManyBasicQueries: Exceeded maximum of 1000 basic queries. What does this mean? Does this mean that we have issued a query with too many terms? Or that the number of concurrent queries running on the server is too high? Also, is this a builtin limit or something set in a config file? Thanks! - Ian
Re: Using Zookeeper with REST URL
I don't think zookeeper has a REST api. You'll need to use a Zookeeper client library in your language (or roll one yourself).

On Wed, Nov 19, 2014 at 9:48 AM, nabil Kouici koui...@yahoo.fr wrote: Hi All, I'm connecting to solr using the REST API (no library like SolrJ). As my solr configuration is in cloud using a Zookeeper ensemble, I don't know how to get an available Solr server from ZooKeeper to be used in my URL call. With SolrJ I can do:

String zkHostString = "10.0.1.8:2181";
CloudSolrServer solr = new CloudSolrServer(zkHostString);
solr.connect();
SolrQuery solrQuery = new SolrQuery("*:*");
solrQuery.setRows(10);
QueryResponse resp = solr.query(solrQuery);

Any help. Regards, Nabil
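If it helps, SolrCloud registers each running node under /live_nodes in ZooKeeper, so a client can discover servers without a REST call by listing that znode's children with any ZooKeeper client library. A rough Java sketch (the ZK address is illustrative):

import java.util.List;
import org.apache.zookeeper.ZooKeeper;

// Rough sketch: discover running Solr nodes by listing ZooKeeper's /live_nodes
// children. Entries typically look like "10.0.1.8:8983_solr", where the "_"
// encodes the context path, so the node maps to http://10.0.1.8:8983/solr.
public class LiveNodes {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("10.0.1.8:2181", 15000, event -> { });
    List<String> nodes = zk.getChildren("/live_nodes", false);
    for (String node : nodes) {
      System.out.println("http://" + node.replace("_", "/"));
    }
    zk.close();
  }
}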
Re: Ideas for debugging poor SolrCloud scalability
Hi again, all - Since several people were kind enough to jump in to offer advice on this thread, I wanted to follow up in case anyone finds this useful in the future. *tl;dr: *Routing updates to a random Solr node (and then letting it forward the docs to where they need to go) is very expensive, more than I expected. Using a smart router that uses the cluster config to route documents directly to their shard results in (near) linear scaling for us. *Expository version:* We use Go on our client, for which (to my knowledge) there is no SolrCloud router implementation. So we started by just routing updates to a random Solr node and letting it forward the docs to where they need to go. My theory was that this would lead to a constant amount of additional work (and thus still linear scaling). This was based on the observation that if you send an update of K documents to a Solr node in a N node cluster, in the worst case scenario, all K documents will need to be forwarded on to other nodes. Since Solr nodes have perfect knowledge of where docs belong, each doc would only take 1 additional hop to get to its replica. So random routing (in the limit) imposes 1 additional network hop for each document. In practice, however, we find that (for small networks, at least) per-node performance falls as you add shards. In fact, the client performance (in writes/sec) was essentially constant no matter how many shards we added. I do have a working theory as to why this might be (i.e. where the flaw is in my logic above) but as this is merely an unverified theory I don't want to lead anyone astray by writing it up here. However, by writing a smart router that retrieves the clusterstate.json file from Zookeeper and uses that to perfectly route documents to their proper shard, we were able to achieve much better scalability. Using a synthetic workload, we were able to achieve 141.7 writes/sec to a cluster of size 1 and 2506 writes/sec to a cluster of size 20 (125 writes/sec/node). So a dropoff of ~12% which is not too bad. We are hoping to continue our tests with larger clusters to ensure that the per-node write performance levels off and does not continue to drop as the cluster scales. I will also note that we initially had several bugs in our smart router implementation so if you follow a similar path and see bad performance look to your router implementation as you might not be routing correctly. We ended up writing a simple proxy that we ran in front of Solr to observe all requests which helped immensely when verifying and debugging our router. Yes tcpdump does something similar but viewing HTTP-level traffic is way more convenient than TCP-level. Plus Go makes little proxies like this super easy to do. Hope all that is useful to someone. Thanks again to the posters above for providing suggestions! - Ian On Sat, Nov 1, 2014 at 7:13 PM, Erick Erickson erickerick...@gmail.com wrote: bq: but it should be more or less a constant factor no matter how many Solr nodes you are using, right? Not really. You've stated that you're not driving Solr very hard in your tests. Therefore you're waiting on I/O. Therefore your tests just aren't going to scale linearly with the number of shards. This is a simplification, but Your network utilization is pretty much irrelevant. I send a packet somewhere. somewhere does some stuff and sends me back an acknowledgement. While I'm waiting, the network is getting no traffic, so. If the network traffic was in the 90% range that would be different, so it's a good thing to monitor. 
Really, use a leader aware client and rack enough clients together that you're driving Solr hard. Then double the number of shards. Then rack enough _more_ clients to drive Solr at the same level. In this case I'll go out on a limb and predict near 2x throughput increases. One additional note, though. When you add _replicas_ to shards expect to see a drop in throughput that may be quite significant, 20-40% anecdotally... Best, Erick On Sat, Nov 1, 2014 at 9:23 AM, Shawn Heisey apa...@elyograg.org wrote: On 11/1/2014 9:52 AM, Ian Rose wrote: Just to make sure I am thinking about this right: batching will certainly make a big difference in performance, but it should be more or less a constant factor no matter how many Solr nodes you are using, right? Right now in my load tests, I'm not actually that concerned about the absolute performance numbers; instead I'm just trying to figure out why relative performance (no matter how bad it is since I am not batching) does not go up with more Solr nodes. Once I get that part figured out and we are seeing more writes per sec when we add nodes, then I'll turn on batching in the client to see what kind of additional performance gain that gets us. The basic problem I see with your methodology is that you are sending an update request and waiting for it to complete before sending another
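For anyone implementing a similar smart router, the per-document routing step Ian describes boils down to hashing the id the same way Solr's compositeId router does and matching it against the shard hash ranges read from cluster state. A rough Java sketch follows, assuming plain ids with no "!" routing prefix; the ShardRange type is made up for illustration, and the range values come from the collection state (e.g. shard1 "80000000-ffffffff", shard2 "0-7fffffff"):

import org.apache.solr.common.util.Hash;

// Rough sketch of client-side routing against the default compositeId router.
public class ShardRouter {
  // One entry per shard, parsed out of clusterstate.json / state.json.
  static class ShardRange {
    final String shardName, leaderUrl;
    final int min, max; // signed 32-bit bounds of the shard's hash range
    ShardRange(String shardName, String leaderUrl, int min, int max) {
      this.shardName = shardName; this.leaderUrl = leaderUrl;
      this.min = min; this.max = max;
    }
  }

  static ShardRange route(String docId, ShardRange[] shards) {
    // Same hash Solr uses for plain ids: MurmurHash3 x86 32-bit, seed 0.
    int hash = Hash.murmurhash3_x86_32(docId, 0, docId.length(), 0);
    for (ShardRange s : shards) {
      if (hash >= s.min && hash <= s.max) {
        return s; // send the update straight to this shard's leader
      }
    }
    throw new IllegalStateException("no shard covers hash " + Integer.toHexString(hash));
  }
}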
Migrating shards
Howdy - What is the current best practice for migrating shards to another machine? I have heard suggestions that it is add replica on new machine, wait for it to catch up, delete original replica on old machine. But I wanted to check to make sure... And if that is the best method, two follow-up questions: 1. Is there a best practice for knowing when the new replica has caught up or do you just do a *:* query on both, compare counts, and call it a day when they are the same (or nearly so, since the slave replica might lag a little bit)? 2. When deleting the original (old) replica, since that one could be the leader, is the replica deletion done in a safe manner such that no documents will be lost (e.g. ones that were recently received by the leader and not yet synced over to the slave replica before the leader is deleted)? Thanks as always, Ian
Re: Migrating shards
Sounds great - thanks all. On Fri, Nov 7, 2014 at 2:06 PM, Erick Erickson erickerick...@gmail.com wrote: bq: I think ADD/DELETE replica APIs are best for within a SolrCloud I second this, if for no other reason than I'd expect this to get more attention than the underlying core admin API. That said, I believe ADD/DELETE replica just makes use of the core admin API under the covers, in which case you'd get all the goodness baked into the core admin API plus whatever extra is written into the collections api processing. Best, Erick On Fri, Nov 7, 2014 at 8:28 AM, ralph tice ralph.t...@gmail.com wrote: I think ADD/DELETE replica APIs are best for within a SolrCloud, however if you need to move data across SolrClouds you will have to resort to older APIs, which I didn't find good documentation of but many references to. So I wrote up the instructions to do so here: https://gist.github.com/ralph-tice/887414a7f8082a0cb828 I haven't had much time to think about how to translate this to more generic documentation for inclusion in the community wiki but I would love to hear some feedback if anyone else has a similar use case for moving Solr indexes across SolrClouds. On Fri, Nov 7, 2014 at 10:18 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote: 1. The new replica will not begin serving data until it's all there and caught up. You can watch the replica status on the Cloud screen to see it catch up; when it's green, you're done. If you're trying to automate this, you're going to look for the replica that says recovering in clusterstate.json and wait until it's active. 2. I believe this to be the case, but I'll wait for someone else to chime in who knows better. Also, I wonder if there's a difference between DELETEREPLICA and unloading the core directly. Michael On 11/7/14 10:24, Ian Rose wrote: Howdy - What is the current best practice for migrating shards to another machine? I have heard suggestions that it is add replica on new machine, wait for it to catch up, delete original replica on old machine. But I wanted to check to make sure... And if that is the best method, two follow-up questions: 1. Is there a best practice for knowing when the new replica has caught up or do you just do a *:* query on both, compare counts, and call it a day when they are the same (or nearly so, since the slave replica might lag a little bit)? 2. When deleting the original (old) replica, since that one could be the leader, is the replica deletion done in a safe manner such that no documents will be lost (e.g. ones that were recently received by the leader and not yet synced over to the slave replica before the leader is deleted)? Thanks as always, Ian
Re: any difference between using collection vs. shard in URL?
Awesome, thanks. That's what I was hoping. Cheers, Ian On Wed, Nov 5, 2014 at 10:33 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: There's no difference between the two. Even if you send updates to a shard url, it will still be forwarded to the right shard leader according to the hash of the id (assuming you're using the default compositeId router). Of course, if you happen to hit the right shard leader then it is just an internal forward and not an extra network hop. The advantage with using the collection name is that you can hit any SolrCloud node (even the ones not hosting this collection) and it will still work. So for a non Java client, a load balancer can be setup in front of the entire cluster and things will just work. On Wed, Nov 5, 2014 at 8:50 PM, Ian Rose ianr...@fullstory.com wrote: If I add some documents to a SolrCloud shard in a collection alpha, I can post them to /solr/alpha/update. However I notice that you can also post them using the shard name, e.g. /solr/alpha_shard4_replica1/update - in fact this is what Solr seems to do internally (like if you send documents to the wrong node so Solr needs to forward them over to the leader of the correct shard). Assuming you *do* always post your documents to the correct shard, is there any difference between these two, performance or otherwise? Thanks! - Ian -- Regards, Shalin Shekhar Mangar.
Re: Ideas for debugging poor SolrCloud scalability
Erick, Just to make sure I am thinking about this right: batching will certainly make a big difference in performance, but it should be more or less a constant factor no matter how many Solr nodes you are using, right? Right now in my load tests, I'm not actually that concerned about the absolute performance numbers; instead I'm just trying to figure out why relative performance (no matter how bad it is since I am not batching) does not go up with more Solr nodes. Once I get that part figured out and we are seeing more writes per sec when we add nodes, then I'll turn on batching in the client to see what kind of additional performance gain that gets us. Cheers, Ian On Fri, Oct 31, 2014 at 3:43 PM, Peter Keegan peterlkee...@gmail.com wrote: Yes, I was inadvertently sending them to a replica. When I sent them to the leader, the leader reported (1000 adds) and the replica reported only 1 add per document. So, it looks like the leader forwards the batched jobs individually to the replicas. On Fri, Oct 31, 2014 at 3:26 PM, Erick Erickson erickerick...@gmail.com wrote: Internally, the docs are batched up into smaller buckets (10 as I remember) and forwarded to the correct shard leader. I suspect that's what you're seeing. Erick On Fri, Oct 31, 2014 at 12:20 PM, Peter Keegan peterlkee...@gmail.com wrote: Regarding batch indexing: When I send batches of 1000 docs to a standalone Solr server, the log file reports (1000 adds) in LogUpdateProcessor. But when I send them to the leader of a replicated index, the leader log file reports much smaller numbers, usually (12 adds). Why do the batches appear to be broken up? Peter On Fri, Oct 31, 2014 at 10:40 AM, Erick Erickson erickerick...@gmail.com wrote: NP, just making sure. I suspect you'll get lots more bang for the buck, and results much more closely matching your expectations if 1 you batch up a bunch of docs at once rather than sending them one at a time. That's probably the easiest thing to try. Sending docs one at a time is something of an anti-pattern. I usually start with batches of 1,000. And just to check.. You're not issuing any commits from the client, right? Performance will be terrible if you issue commits after every doc, that's totally an anti-pattern. Doubly so for optimizes Since you showed us your solrconfig autocommit settings I'm assuming not but want to be sure. 2 use a leader-aware client. I'm totally unfamiliar with Go, so I have no suggestions whatsoever to offer there But you'll want to batch in this case too. On Fri, Oct 31, 2014 at 5:51 AM, Ian Rose ianr...@fullstory.com wrote: Hi Erick - Thanks for the detailed response and apologies for my confusing terminology. I should have said WPS (writes per second) instead of QPS but I didn't want to introduce a weird new acronym since QPS is well known. Clearly a bad decision on my part. To clarify: I am doing *only* writes (document adds). Whenever I wrote QPS I was referring to writes. It seems clear at this point that I should wrap up the code to do smart routing rather than choose Solr nodes randomly. And then see if that changes things. I must admit that although I understand that random node selection will impose a performance hit, theoretically it seems to me that the system should still scale up as you add more nodes (albeit at lower absolute level of performance than if you used a smart router). Nonetheless, I'm just theorycrafting here so the better thing to do is just try it experimentally. I hope to have that working today - will report back on my findings. 
Cheers, - Ian p.s. To clarify why we are rolling our own smart router code, we use Go over here rather than Java. Although if we still get bad performance with our custom Go router I may try a pure Java load client using CloudSolrServer to eliminate the possibility of bugs in our implementation. On Fri, Oct 31, 2014 at 1:37 AM, Erick Erickson erickerick...@gmail.com wrote: I'm really confused: bq: I am not issuing any queries, only writes (document inserts) bq: It's clear that once the load test client has ~40 simulated users bq: A cluster of 3 shards over 3 Solr nodes *should* support a higher QPS than 2 shards over 2 Solr nodes, right QPS is usually used to mean Queries Per Second, which is different from the statement that I am not issuing any queries. And what do the number of users have to do with inserting documents? You also state: In many cases, CPU on the solr servers is quite low as well So let's talk about indexing first. Indexing should scale nearly linearly as long as 1 you are routing your docs to the correct leader, which
Re: Ideas for debugging poor SolrCloud scalability
Hi Erick - Thanks for the detailed response and apologies for my confusing terminology. I should have said WPS (writes per second) instead of QPS but I didn't want to introduce a weird new acronym since QPS is well known. Clearly a bad decision on my part. To clarify: I am doing *only* writes (document adds). Whenever I wrote QPS I was referring to writes. It seems clear at this point that I should wrap up the code to do smart routing rather than choose Solr nodes randomly. And then see if that changes things. I must admit that although I understand that random node selection will impose a performance hit, theoretically it seems to me that the system should still scale up as you add more nodes (albeit at lower absolute level of performance than if you used a smart router). Nonetheless, I'm just theorycrafting here so the better thing to do is just try it experimentally. I hope to have that working today - will report back on my findings. Cheers, - Ian p.s. To clarify why we are rolling our own smart router code, we use Go over here rather than Java. Although if we still get bad performance with our custom Go router I may try a pure Java load client using CloudSolrServer to eliminate the possibility of bugs in our implementation. On Fri, Oct 31, 2014 at 1:37 AM, Erick Erickson erickerick...@gmail.com wrote: I'm really confused: bq: I am not issuing any queries, only writes (document inserts) bq: It's clear that once the load test client has ~40 simulated users bq: A cluster of 3 shards over 3 Solr nodes *should* support a higher QPS than 2 shards over 2 Solr nodes, right QPS is usually used to mean Queries Per Second, which is different from the statement that I am not issuing any queries. And what do the number of users have to do with inserting documents? You also state: In many cases, CPU on the solr servers is quite low as well So let's talk about indexing first. Indexing should scale nearly linearly as long as 1 you are routing your docs to the correct leader, which happens with SolrJ and the CloudSolrSever automatically. Rather than rolling your own, I strongly suggest you try this out. 2 you have enough clients feeding the cluster to push CPU utilization on them all. Very often slow indexing, or in your case lack of scaling is a result of document acquisition or, in your case, your doc generator is spending all it's time waiting for the individual documents to get to Solr and come back. bq: chooses a random solr server for each ADD request (with 1 doc per add request) Probably your culprit right there. Each and every document requires that you have to cross the network (and forward that doc to the correct leader). So given that you're not seeing high CPU utilization, I suspect that you're not sending enough docs to SolrCloud fast enough to see scaling. You need to batch up multiple docs, I generally send 1,000 docs at a time. But even if you do solve this, the inter-node routing will prevent linear scaling. When a doc (or a batch of docs) goes to a random Solr node, here's what happens: 1 the docs are re-packaged into groups based on which shard they're destined for 2 the sub-packets are forwarded to the leader for each shard 3 the responses are gathered back and returned to the client. This set of operations will eventually degrade the scaling. bq: A cluster of 3 shards over 3 Solr nodes *should* support a higher QPS than 2 shards over 2 Solr nodes, right? That's the whole idea behind sharding. If we're talking search requests, the answer is no. 
Sharding is what you do when your collection no longer fits on a single node. If it _does_ fit on a single node, then you'll usually get better query performance by adding a bunch of replicas to a single shard. When the number of docs on each shard grows large enough that you no longer get good query performance, _then_ you shard. And take the query hit. If we're talking about inserts, then see above. I suspect your problem is that you're _not_ saturating the SolrCloud cluster, you're sending docs to Solr very inefficiently and waiting on I/O. Batching docs and sending them to the right leader should scale pretty linearly until you start saturating your network. Best, Erick On Thu, Oct 30, 2014 at 6:56 PM, Ian Rose ianr...@fullstory.com wrote: Thanks for the suggestions so for, all. 1) We are not using SolrJ on the client (not using Java at all) but I am working on writing a smart router so that we can always send to the correct node. I am certainly curious to see how that changes things. Nonetheless even with the overhead of extra routing hops, the observed behavior (no increase in performance with more nodes) doesn't make any sense to me. 2) Commits: we are using autoCommit with openSearcher=false (maxTime=6) and autoSoftCommit (maxTime=15000). 3) Suggestions to batch documents certainly make sense for production code
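For illustration, a minimal SolrJ sketch of the batching plus leader-aware routing Erick describes above. The ZooKeeper address, collection name, and field names are placeholders rather than values from the thread; the batch size of 1,000 is Erick's suggested starting point.

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        // CloudSolrServer reads cluster state from ZooKeeper and sends each batch
        // to the correct shard leader, avoiding the extra hop a random node incurs.
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("testcollection");

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 100000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", UUID.randomUUID().toString());
            doc.addField("body_t", "tiny fake document " + i);
            batch.add(doc);
            if (batch.size() == 1000) {
                server.add(batch);   // one update request per 1,000 docs, no commit here
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        server.shutdown();           // visibility is left to autoCommit/autoSoftCommit
    }
}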
Ideas for debugging poor SolrCloud scalability
Howdy all - The short version is: We are not seeing Solr Cloud performance scale (even close to) linearly as we add nodes. Can anyone suggest good diagnostics for finding scaling bottlenecks? Are there known 'gotchas' that make Solr Cloud fail to scale? In detail: We have used Solr (in non-Cloud mode) for over a year and are now beginning a transition to SolrCloud. To this end I have been running some basic load tests to figure out what kind of capacity we should expect to provision. In short, I am seeing very poor scalability (increase in effective QPS) as I add Solr nodes. I'm hoping to get some ideas on where I should be looking to debug this. Apologies in advance for the length of this email; I'm trying to be comprehensive and provide all relevant information.
Our setup:
1 load generating client
- generates tiny, fake documents with unique IDs
- performs only writes (no queries at all)
- chooses a random solr server for each ADD request (with 1 doc per add request)
N collections spread over K solr servers
- every collection is sharded K times (so every solr instance has 1 shard from every collection)
- no replicas
- external zookeeper server (not using zkRun)
- autoCommit maxTime=6
- autoSoftCommit maxTime=15000
Everything is running within a single zone on Google Compute Engine, so high quality gigabit network links between all machines (ping times < 1ms).
My methodology is as follows.
1. Start up K solr servers.
2. Remove all existing collections.
3. Create N collections, with numShards=K for each.
4. Start load testing. Every minute, print the number of successful updates and the number of failed updates.
5. Keep increasing the offered load (via simulated users) until the qps flatlines.
In brief (more detailed results at the bottom of email), I find that for any number of nodes between 2 and 5, the QPS always caps out at ~3000. Obviously something must be wrong here, as there should be a trend of the QPS scaling (roughly) linearly with the number of nodes. Or at the very least going up at all! So my question is what else should I be looking at here?
* CPU on the loadtest client is well under 100%
* No other obvious bottlenecks on loadtest client (running 2 clients leads to ~1/2 qps on each)
* In many cases, CPU on the solr servers is quite low as well (e.g. with 100 users hitting 5 solr nodes, all nodes are 50% idle)
* Network bandwidth is a few MB/s, well under the gigabit capacity of our network
* Disk bandwidth (< 2 MB/s) and iops (< 20/s) are low.
Any ideas? Thanks very much! - Ian
p.s. Here is my raw data broken out by number of nodes and number of simulated users:
Num Nodes  Num Users  QPS
1          1          1020
1          5          3180
1          10         3825
1          15         3900
1          20         4050
1          40         4100
2          1          472
2          5          1790
2          10         2290
2          15         2850
2          20         2900
2          40         3210
2          60         3200
2          80         3210
2          100        3180
3          1          385
3          5          1580
3          10         2090
3          15         2560
3          20         2760
3          25         2890
3          80         3050
4          1          375
4          5          1560
4          10         2200
4          15         2500
4          20         2700
4          25         2800
4          30         2850
5          15         2450
5          20         2640
5          25         2790
5          30         2840
5          100        2900
5          200        2810
Re: Ideas for debugging poor SolrCloud scalability
If you want to increase QPS, you should not be increasing numShards. You need to increase replicationFactor. When your numShards matches the number of servers, every single server will be doing part of the work for every query. I think this is true only for actual queries, right? I am not issuing any queries, only writes (document inserts). In the case of writes, increasing the number of shards should increase my throughput (in ops/sec) more or less linearly, right? On Thu, Oct 30, 2014 at 4:50 PM, Shawn Heisey apa...@elyograg.org wrote: On 10/30/2014 2:23 PM, Ian Rose wrote: My methodology is as follows. 1. Start up a K solr servers. 2. Remove all existing collections. 3. Create N collections, with numShards=K for each. 4. Start load testing. Every minute, print the number of successful updates and the number of failed updates. 5. Keep increasing the offered load (via simulated users) until the qps flatlines. If you want to increase QPS, you should not be increasing numShards. You need to increase replicationFactor. When your numShards matches the number of servers, every single server will be doing part of the work for every query. If you increase replicationFactor instead, then each server can be doing a different query in parallel. Sharding the index is what you need to do when you need to scale the size of the index, so each server does not get overwhelmed by dealing with every document for every query. Getting a high QPS with a big index requires increasing both numShards *AND* replicationFactor. Thanks, Shawn
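To make Shawn's distinction concrete, here is a rough SolrJ 4.x sketch of creating one collection sharded for index size and another replicated for query throughput. The collection and config names are made up, and the exact CollectionAdminRequest setter names may differ slightly between 4.x releases.

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateCollections {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zk1:2181");

        // Sharded for a big index: 4 shards, one copy of each document.
        CollectionAdminRequest.Create sharded = new CollectionAdminRequest.Create();
        sharded.setCollectionName("big_index");
        sharded.setConfigName("myconf");
        sharded.setNumShards(4);
        sharded.setReplicationFactor(1);
        sharded.process(server);

        // Replicated for query throughput: 1 shard, 4 replicas answering queries in parallel.
        CollectionAdminRequest.Create replicated = new CollectionAdminRequest.Create();
        replicated.setCollectionName("high_qps");
        replicated.setConfigName("myconf");
        replicated.setNumShards(1);
        replicated.setReplicationFactor(4);
        replicated.process(server);

        server.shutdown();
    }
}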
Re: Ideas for debugging poor SolrCloud scalability
Thanks for the suggestions so far, all.
1) We are not using SolrJ on the client (not using Java at all) but I am working on writing a smart router so that we can always send to the correct node. I am certainly curious to see how that changes things. Nonetheless even with the overhead of extra routing hops, the observed behavior (no increase in performance with more nodes) doesn't make any sense to me.
2) Commits: we are using autoCommit with openSearcher=false (maxTime=6) and autoSoftCommit (maxTime=15000).
3) Suggestions to batch documents certainly make sense for production code but in this case I am not really concerned with absolute performance; I just want to see the *relative* performance as we use more Solr nodes. So I don't think batching or not really matters.
4) "No, that won't affect indexing speed all that much. The way to increase indexing speed is to increase the number of processes or threads that are indexing at the same time. Instead of having one client sending update requests, try five of them." Can you elaborate on this some? I'm worried I might be misunderstanding something fundamental. A cluster of 3 shards over 3 Solr nodes *should* support a higher QPS than 2 shards over 2 Solr nodes, right? That's the whole idea behind sharding. Regarding your comment of "increase the number of processes or threads", note that for each value of K (number of Solr nodes) I measured with several different numbers of simulated users so that I could find a saturation point. For example, take a look at my data for K=2:
Num Nodes  Num Users  QPS
2          1          472
2          5          1790
2          10         2290
2          15         2850
2          20         2900
2          40         3210
2          60         3200
2          80         3210
2          100        3180
It's clear that once the load test client has ~40 simulated users, the Solr cluster is saturated. Creating more users just increases the average request latency, such that the total QPS remained (nearly) constant. So I feel pretty confident that a cluster of size 2 *maxes out* at ~3200 qps. The problem is that I am finding roughly this same max point, no matter how many simulated users the load test client created, for any value of K (> 1). Cheers, - Ian
On Thu, Oct 30, 2014 at 8:01 PM, Erick Erickson erickerick...@gmail.com wrote: Your indexing client, if written in SolrJ, should use CloudSolrServer which is, in Matt's terms, "leader aware". It divides up the documents to be indexed into packets where each doc in the packet belongs on the same shard, and then sends the packet to the shard leader. This avoids a lot of re-routing and should scale essentially linearly. You may have to add more clients though, depending upon how hard the document-generator is working. Also, make sure that you send batches of documents as Shawn suggests, I use 1,000 as a starting point. Best, Erick
On Thu, Oct 30, 2014 at 2:10 PM, Shawn Heisey apa...@elyograg.org wrote: On 10/30/2014 2:56 PM, Ian Rose wrote: I think this is true only for actual queries, right? I am not issuing any queries, only writes (document inserts). In the case of writes, increasing the number of shards should increase my throughput (in ops/sec) more or less linearly, right? No, that won't affect indexing speed all that much. The way to increase indexing speed is to increase the number of processes or threads that are indexing at the same time. Instead of having one client sending update requests, try five of them. Also, index many documents with each update request. Sending one document at a time is very inefficient. You didn't say how you're doing commits, but those need to be as infrequent as you can manage.
Ideally, you would use autoCommit with openSearcher=false on an interval of about five minutes, and send an explicit commit (with the default openSearcher=true) after all the indexing is done. You may have requirements regarding document visibility that this won't satisfy, but try to avoid doing commits with openSearcher=true (soft commits qualify for this) extremely frequently, like once a second. Once a minute is much more realistic. Opening a new searcher is an expensive operation, especially if you have cache warming configured. Thanks, Shawn
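A small SolrJ illustration of Shawn's commit advice: no commits per batch during the load (autoCommit with openSearcher=false handles durability), then one explicit commit at the end to open a searcher and make everything visible. The server setup and batch source here are assumed.

import java.util.List;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class LoadThenCommit {
    public static void load(CloudSolrServer server, List<List<SolrInputDocument>> batches)
            throws Exception {
        for (List<SolrInputDocument> batch : batches) {
            server.add(batch);  // no commit here; autoCommit (openSearcher=false) persists segments
        }
        // One explicit commit when the load is done; openSearcher defaults to true,
        // so this is the point where the new documents become searchable.
        server.commit();
    }
}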
Re: Slow inserts when using Solr Cloud
Very interested in what you find out with your benchmarking, and whether it bears out what I've experienced. Does anyone know when 4.10 is likely to be released? I'm benchmarking this right now so I'll share some numbers soon. -- View this message in context: http://lucene.472066.n3.nabble.com/Slow-inserts-when-using-Solr-Cloud-tp4146087p4150963.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Slow inserts when using Solr Cloud
I've built and installed the latest snapshot of Solr 4.10 using the same SolrCloud configuration and that gave me a tenfold increase in throughput, so it certainly looks like SOLR-6136 was the issue that was causing my slow insert rate/high latency with shard routing and replicas. Thanks for your help. Timothy Potter wrote Hi Ian, What's the CPU doing on the leader? Have you tried attaching a profiler to the leader while running and then seeing if there are any hotspots showing. Not sure if this is related but we recently fixed an issue in the area of leader forwarding to replica that used too many CPU cycles inefficiently - see SOLR-6136. Tim -- View this message in context: http://lucene.472066.n3.nabble.com/Slow-inserts-when-using-Solr-Cloud-tp4146087p4149219.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Slow inserts when using Solr Cloud
Hi Tim Thanks for the info about the bug. I've just looked at the CPU usage for the leader using JConsole, while my bulk load process was running, inserting documents into my Solr cloud. Is that what you meant by profiling and looking for hotspots? I find the CPU usage goes up quite a lot when the replica is enabled, compared to when it is disabled: http://lucene.472066.n3.nabble.com/file/n4147645/solr-cpu-usage.jpg In the above chart, the dip in CPU usage in the middle was while the replica (which lives on a different VM) was disabled. Thanks Ian Timothy Potter wrote Hi Ian, What's the CPU doing on the leader? Have you tried attaching a profiler to the leader while running and then seeing if there are any hotspots showing. Not sure if this is related but we recently fixed an issue in the area of leader forwarding to replica that used too many CPU cycles inefficiently - see SOLR-6136. Tim -- View this message in context: http://lucene.472066.n3.nabble.com/Slow-inserts-when-using-Solr-Cloud-tp4146087p4147645.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Slow inserts when using Solr Cloud
That's useful to know, thanks very much. I'll look into using CloudSolrServer, although I'm using solrnet at present. That would reduce some of the overhead - but not the extra 200ms I'm getting for forwarding to the replica when the replica is switched on. It does seem a very high overhead. When I consider that it takes 20ms to insert a new document to Solr with replicas disabled (if I route to the correct shard), you might expect it to take two to three times longer if it has to forward to one replica and then wait for a response, but an increase of 200ms seems really high doesn't it? Is there a forum where I should raise that? Thanks again for your help Ian Shalin Shekhar Mangar wrote You can use CloudSolrServer (if you're using Java) which will route documents correctly to the leader of the appropriate shard. On Tue, Jul 15, 2014 at 3:04 PM, ian lt; Ian.Williams@.nhs gt; wrote: Hi Mark Thanks for replying to my post. Would you know whether my findings are consistent with what other people see when using SolrCloud? One thing I want to investigate is whether I can route my updates to the correct shard in the first place, by having my client using the same hashing logic as Solr, and working out in advance which shard my inserts should be sent to. Do you know whether that's an approach that others have used? Thanks again Ian -- View this message in context: http://lucene.472066.n3.nabble.com/Slow-inserts-when-using-Solr-Cloud-tp4146087p4147183.html Sent from the Solr - User mailing list archive at Nabble.com. -- Regards, Shalin Shekhar Mangar. -- View this message in context: http://lucene.472066.n3.nabble.com/Slow-inserts-when-using-Solr-Cloud-tp4146087p4147481.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Slow inserts when using Solr Cloud
Hi Mark Thanks for replying to my post. Would you know whether my findings are consistent with what other people see when using SolrCloud? One thing I want to investigate is whether I can route my updates to the correct shard in the first place, by having my client using the same hashing logic as Solr, and working out in advance which shard my inserts should be sent to. Do you know whether that's an approach that others have used? Thanks again Ian -- View this message in context: http://lucene.472066.n3.nabble.com/Slow-inserts-when-using-Solr-Cloud-tp4146087p4147183.html Sent from the Solr - User mailing list archive at Nabble.com.
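On the idea of computing the target shard client-side: SolrCloud's default router hashes the document id with MurmurHash3 and matches the result against each shard's hash range from clusterstate.json. Below is a simplified sketch for plain ids (no '!' composite prefix), using the Hash utility bundled with SolrJ 4.x; the ranges shown are the defaults for a four-shard collection and should really be read from ZooKeeper rather than hard-coded.

import org.apache.solr.common.util.Hash;

public class ShardPicker {
    // Default hash ranges for a 4-shard collection; the authoritative values are
    // in clusterstate.json under each shard's "range" entry.
    private static final int[][] RANGES = {
            {0x80000000, 0xbfffffff},   // shard1
            {0xc0000000, 0xffffffff},   // shard2
            {0x00000000, 0x3fffffff},   // shard3
            {0x40000000, 0x7fffffff},   // shard4
    };

    public static int shardFor(String docId) {
        // Same MurmurHash3 call Solr's router uses for ids without a routing prefix.
        int hash = Hash.murmurhash3_x86_32(docId, 0, docId.length(), 0);
        for (int i = 0; i < RANGES.length; i++) {
            if (hash >= RANGES[i][0] && hash <= RANGES[i][1]) {
                return i + 1;   // 1-based shard number
            }
        }
        return -1;  // unreachable when the ranges cover the whole 32-bit space
    }
}

For composite ids like "prefix!id", Solr combines the hash of the prefix (upper bits) with the hash of the remainder (lower bits), so the safest way to match it exactly is to reuse Solr's own CompositeIdRouter class, which ships in the SolrJ jar.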
Slow inserts when using Solr Cloud
Hi I'm encountering a surprisingly high increase in response times when I insert new documents into a SolrCloud, compared with a standalone Solr instance. I have a SolrCloud set up for test and evaluation purposes. I have four shards, each with a leader and a replica, distributed over four Windows virtual servers. I have zookeeper running on three of the four servers. There are not many documents in my SolrCloud (just a few hundred). I am using composite id routing, specifying a prefix to my document ids which is then used by Solr to determine which shard the document should be stored on. I determine in advance which shard a document with a given id prefix will end up in, by trying it out in advance. I then try the following scenarios, using inserts without commits. E.g. I use:
curl "http://servername:port/solr/update" -H "Content-Type: text/xml" --data-binary @test.txt
1. Insert a document, sending it to the server hosting the correct shard, with replicas turned off (response time < 20ms). I find that if I 'switch off' the replicas for my shard (by shutting down Solr for the replicas), and then I send the new document to the server hosting the leader for the correct shard, then I get a very fast response, i.e. under 10ms, which is similar to the performance I get when not using SolrCloud. This is expected, as I've removed any overhead to do with replicas or routing to the correct shard.
2. Insert a document, sending it to the server hosting the correct shard, but with replicas turned on (response time approx 250ms). If I switch on the replica for that shard, then my average response time for an insert increases from 10ms to around 250ms. Now I expect an overhead, because the leader has to find out where the replica is (from Zookeeper?) and then forward the request to that replica, then wait for a reply - but an increase from 20ms to 250ms seems very high?
3. Insert a document, sending it to a server hosting the incorrect shard, with replicas turned on (response time approx 500ms). If I do the same thing again but this time send to the server hosting a different shard to the shard my document will end up in, the average response times increase again to around 500ms. Again, I'd expect an increase because of the extra step of needing to forward to the correct shard, but the increase seems very high?
Should I expect this much of an overhead for shard routing and replicas, or might this indicate a problem in my configuration? Many thanks Ian
Re: Faceting on a date field multiple times
Thanks Marc. On May 4, 2012, at 8:52 PM, Marc Sturlese wrote: http://lucene.472066.n3.nabble.com/Multiple-Facet-Dates-td495480.html -- View this message in context: http://lucene.472066.n3.nabble.com/Faceting-on-a-date-field-multiple-times-tp3961282p3961865.html Sent from the Solr - User mailing list archive at Nabble.com.
Faceting on a date field multiple times
Hi. I would like to be able to do a facet on a date field, but with different ranges (in a single query). For example, I would like to show: #documents by day for the last week, #documents by week for the last couple of months, and #documents by year for the last several years. Is there a way to do this without hitting solr 3 times? thanks Ian
Intermittent connection timeouts to Solr server using SolrNet
Hi - I have also posted this question on SO: http://stackoverflow.com/questions/8741080/intermittent-connection-timeouts-to-solr-server-using-solrnet I have a production webserver hosting a search webpage, which uses SolrNet to connect to another machine which hosts the Solr search server (on a subnet which is in the same room, so no network problems). All is fine 90% of the time, but I consistently get a small number of "The operation has timed out" errors. I've increased the timeout in the SolrNet init to *30* seconds (!):
SolrNet.Startup.Init<SolrDataObject>(
    new SolrNet.Impl.SolrConnection(
        System.Configuration.ConfigurationManager.AppSettings["URL"]
    ) { Timeout = 30000 }  // milliseconds
);
...but all that happened is I started getting this message instead of "Unable to connect to the remote server" which I was seeing before. It seems to have made no difference to the amount of timeout errors. I can see *nothing* in *any* log (believe me I've looked!) and clearly my configuration is correct because it works most of the time. Anyone any ideas how I can find more information on this problem? Thanks! -- Ian i...@isfluent.com a...@endissolutions.com +44 (0)1223 257903
Re: Unable to index documents using DataImportHandler with MSSQL
Right. This is REALLY weird - I've now started from scratch on another machine (this time Windows 7), and got _exactly_ the same problem !? On Mon, Nov 28, 2011 at 7:37 AM, Husain, Yavar yhus...@firstam.com wrote: Hi Ian I am having exactly the same problem what you are having on Win 7 and 2008 Server http://lucene.472066.n3.nabble.com/DIH-Strange-Problem-tc3530370.html I still have not received any replies which could solve my problem till now. Please do let me know if you have arrived at some solution for your problem. Thanks. Regards, Yavar -Original Message- From: Ian Grainger [mailto:i...@isfluent.com] Sent: Friday, November 25, 2011 10:59 PM To: solr-user@lucene.apache.org Subject: Re: Unable to index documents using DataImportHandler with MSSQL Update on this: I've established: * It's not a problem in the DB (I can index from this DB into a Solr instance on another server) * It's not Tomcat (I get the same problem in Jetty) * It's not the schema (I have simplified it to one field) That leaves SolrConfig.xml and data-config. Only thing changed in SolrConfig.xml is adding: lib dir=D:/Software/Solr/example/solr/dist/ regex=apache-solr-cell-\d.*\.jar / lib dir=D:/Software/Solr/example/solr/dist/ regex=apache-solr-clustering-\d.*\.jar / lib dir=D:/Software/Solr/example/solr/dist/ regex=apache-solr-dataimporthandler-\d.*\.jar / requestHandler name=/dataimport class=org.apache.solr.handler.dataimport.DataImportHandler lst name=defaults str name=configD:/Software/Solr/example/solr/conf/data-config.xml/str /lst /requestHandler And data-config.xml is pretty much as attached - except simpler. Any help or any advice on how to diagnose would be appreciated! On Fri, Nov 25, 2011 at 12:29 PM, Ian Grainger i...@isfluent.com wrote: Hi I have copied my Solr config from a working Windows server to a new one, and it can't seem to run an import. They're both using win server 2008 and SQL 2008R2. This is the data importer config dataConfig dataSource type=JdbcDataSource name=ds1 driver=com.microsoft.sqlserver.jdbc.SQLServerDriver url=jdbc:sqlserver://localhost;databaseName=DB user=Solr password=pwd/ document name=datas entity name=data dataSource=ds1 pk=key query=EXEC SOLR_COMPANY_SEARCH_DATA deltaImportQuery=SELECT * FROM Company_Search_Data WHERE [key]='${dataimporter.delta.key}' deltaQuery=SELECT [key] FROM Company_Search_Data WHERE modify_dt '${dataimporter.last_index_time}' field column=WorkDesc_Comments name=WorkDesc_Comments_Split / field column=WorkDesc_Comments name=WorkDesc_Comments_Edge / /entity /document /dataConfig I can use MS SQL Profiler to watch the Solr user log in successfully, but then nothing. It doesn't seem to even try and execute the stored procedure. Any ideas why this would be working one server and not on another? FTR the only thing in the tomcat catalina log is: org.apache.solr.handler.dataimport.JdbcDataSource$1 call INFO: Creating a connection for entity data with URL: jdbc:sqlserver://localhost;databaseName=CATLive -- Ian i...@isfluent.com +44 (0)1223 257903 -- Ian i...@isfluent.com +44 (0)1223 257903 ** This message may contain confidential or proprietary information intended only for the use of the addressee(s) named above or may contain information that is legally privileged. If you are not the intended addressee, or the person responsible for delivering it to the intended addressee, you are hereby notified that reading, disseminating, distributing or copying this message is strictly prohibited. 
If you have received this message by mistake, please immediately notify us by replying to the message and delete the original message and any copies immediately thereafter. Thank you.- ** FAFLD -- Ian i...@isfluent.com +44 (0)1223 257903
Re: Unable to index documents using DataImportHandler with MSSQL
Hah, I've just come on here to suggest you do the same thing! Thanks for getting back to me - and interesting we both came up with the same solution! Now I have the problem that running a delta update updates the 'dataimport.properties' file - but then just re-fetches all the data regardless! Weird! On Mon, Nov 28, 2011 at 11:59 AM, Husain, Yavar yhus...@firstam.com wrote: Hi Ian I downloaded and build latest Solr (3.4) from sources and finally hit following line of code in Solr (where I put my debug statement) : if(url != null){ LOG.info(Yavar: getting handle to driver manager:); c = DriverManager.getConnection(url, initProps); LOG.info(Yavar: got handle to driver manager:); } The call to Driver Manager was not returning. Here was the error!! The Driver we were using was Microsoft Type 4 JDBC driver for SQL Server. I downloaded another driver called jTDS jDBC driver and installed that. Problem got fixed!!! So please follow the following steps: 1. Download jTDS jDBC driver from http://jtds.sourceforge.net/ 2. Put the driver jar file into your Solr/lib directory where you had put Microsoft JDBC driver. 3. In the data-config.xml use this statement: driver=net.sourceforge.jtds.jdbc.Driver 4. Also in data-config.xml mention url like this: url=jdbc:jTDS:sqlserver://localhost:1433;databaseName=XXX 5. Now run your indexing. It should solve the problem. Regards, Yavar -Original Message- From: Ian Grainger [mailto:i...@isfluent.com] Sent: Monday, November 28, 2011 4:11 PM To: Husain, Yavar Cc: solr-user@lucene.apache.org Subject: Re: Unable to index documents using DataImportHandler with MSSQL Right. This is REALLY weird - I've now started from scratch on another machine (this time Windows 7), and got _exactly_ the same problem !? On Mon, Nov 28, 2011 at 7:37 AM, Husain, Yavar yhus...@firstam.com wrote: Hi Ian I am having exactly the same problem what you are having on Win 7 and 2008 Server http://lucene.472066.n3.nabble.com/DIH-Strange-Problem-tc3530370.html I still have not received any replies which could solve my problem till now. Please do let me know if you have arrived at some solution for your problem. Thanks. Regards, Yavar -Original Message- From: Ian Grainger [mailto:i...@isfluent.com] Sent: Friday, November 25, 2011 10:59 PM To: solr-user@lucene.apache.org Subject: Re: Unable to index documents using DataImportHandler with MSSQL Update on this: I've established: * It's not a problem in the DB (I can index from this DB into a Solr instance on another server) * It's not Tomcat (I get the same problem in Jetty) * It's not the schema (I have simplified it to one field) That leaves SolrConfig.xml and data-config. Only thing changed in SolrConfig.xml is adding: lib dir=D:/Software/Solr/example/solr/dist/ regex=apache-solr-cell-\d.*\.jar / lib dir=D:/Software/Solr/example/solr/dist/ regex=apache-solr-clustering-\d.*\.jar / lib dir=D:/Software/Solr/example/solr/dist/ regex=apache-solr-dataimporthandler-\d.*\.jar / requestHandler name=/dataimport class=org.apache.solr.handler.dataimport.DataImportHandler lst name=defaults str name=configD:/Software/Solr/example/solr/conf/data-config.xml/str /lst /requestHandler And data-config.xml is pretty much as attached - except simpler. Any help or any advice on how to diagnose would be appreciated! On Fri, Nov 25, 2011 at 12:29 PM, Ian Grainger i...@isfluent.com wrote: Hi I have copied my Solr config from a working Windows server to a new one, and it can't seem to run an import. They're both using win server 2008 and SQL 2008R2. 
This is the data importer config dataConfig dataSource type=JdbcDataSource name=ds1 driver=com.microsoft.sqlserver.jdbc.SQLServerDriver url=jdbc:sqlserver://localhost;databaseName=DB user=Solr password=pwd/ document name=datas entity name=data dataSource=ds1 pk=key query=EXEC SOLR_COMPANY_SEARCH_DATA deltaImportQuery=SELECT * FROM Company_Search_Data WHERE [key]='${dataimporter.delta.key}' deltaQuery=SELECT [key] FROM Company_Search_Data WHERE modify_dt '${dataimporter.last_index_time}' field column=WorkDesc_Comments name=WorkDesc_Comments_Split / field column=WorkDesc_Comments name=WorkDesc_Comments_Edge / /entity /document /dataConfig I can use MS SQL Profiler to watch the Solr user log in successfully, but then nothing. It doesn't seem to even try and execute the stored procedure. Any ideas why this would be working one server and not on another? FTR the only thing in the tomcat catalina log is: org.apache.solr.handler.dataimport.JdbcDataSource$1 call INFO: Creating a connection for entity data with URL: jdbc:sqlserver://localhost;databaseName=CATLive -- Ian i...@isfluent.com +44 (0)1223 257903
Re: DIH Strange Problem
Aha! That sounds like it might be it! On Mon, Nov 28, 2011 at 4:16 PM, Husain, Yavar yhus...@firstam.com wrote: Thanks Kai for sharing this. Ian encountered the same problem so marking him in the mail too. From: Kai Gülzau [kguel...@novomind.com] Sent: Monday, November 28, 2011 6:55 PM To: solr-user@lucene.apache.org Subject: RE: DIH Strange Problem Do you use Java 6 update 29? There is a known issue with the latest mssql driver: http://blogs.msdn.com/b/jdbcteam/archive/2011/11/07/supported-java-versions-november-2011.aspx In addition, there are known connection failure issues with Java 6 update 29, and the developer preview (non production) versions of Java 6 update 30 and Java 6 update 30 build 12. We are in contact with Java on these issues and we will update this blog once we have more information. Should work with update 28. Kai -Original Message- From: Husain, Yavar [mailto:yhus...@firstam.com] Sent: Monday, November 28, 2011 1:02 PM To: solr-user@lucene.apache.org; Shawn Heisey Subject: RE: DIH Strange Problem I figured out the solution and Microsoft and not Solr is the problem here :): I downloaded and build latest Solr (3.4) from sources and finally hit following line of code in Solr (where I put my debug statement) : if(url != null){ LOG.info(Yavar: getting handle to driver manager:); c = DriverManager.getConnection(url, initProps); LOG.info(Yavar: got handle to driver manager:); } The call to Driver Manager was not returning. Here was the error!! The Driver we were using was Microsoft Type 4 JDBC driver for SQL Server. I downloaded another driver called jTDS jDBC driver and installed that. Problem got fixed!!! So please follow the following steps: 1. Download jTDS jDBC driver from http://jtds.sourceforge.net/ 2. Put the driver jar file into your Solr/lib directory where you had put Microsoft JDBC driver. 3. In the data-config.xml use this statement: driver=net.sourceforge.jtds.jdbc.Driver 4. Also in data-config.xml mention url like this: url=jdbc:jTDS:sqlserver://localhost:1433;databaseName=XXX 5. Now run your indexing. It should solve the problem. -Original Message- From: Husain, Yavar Sent: Thursday, November 24, 2011 12:38 PM To: solr-user@lucene.apache.org; Shawn Heisey Subject: RE: DIH Strange Problem Hi Thanks for your replies. I carried out these 2 steps (it did not solve my problem): 1. I tried setting responseBuffering to adaptive. Did not work. 2. For checking Database connection I wrote a simple java program to connect to database and fetch some results with the same driver that I use for solr. It worked. So it does not seem to be a problem with the connection. Now I am stuck where Tomcat log says: Creating a connection for entity . and does nothing, I mean after this log we usually get the getConnection() took x millisecond however I dont get that ,I can just see the time moving with no records getting fetched. Original Problem listed again: I am using Solr 1.4.1 on Windows/MS SQL Server and am using DIH for importing data. Indexing and all was working perfectly fine. However today when I started full indexing again, Solr halts/stucks at the line Creating a connection for entity. There are no further messages after that. I can see that DIH is busy and on the DIH console I can see A command is still running, I can also see total rows fetched = 0 and total request made to datasource = 1 and time is increasing however it is not doing anything. This is the exact configuration that worked for me. I am not really able to understand the problem here. 
Also in the index directory where I am storing the index there are just 3 files: 2 segment files + 1 lucene*-write.lock file. ... data-config.xml: dataSource type=JdbcDataSource driver=com.microsoft.sqlserver.jdbc.SQLServerDriver url=jdbc:sqlserver://127.0.0.1:1433;databaseName=SampleOrders user=testUser password=password/ document . . Logs: INFO: Server startup in 2016 ms Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.DataImporter doFullImport INFO: Starting Full Import Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrCore execute INFO: [] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=11 Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties INFO: Read dataimport.properties Nov 23, 2011 4:11:27 PM org.apache.solr.update.DirectUpdateHandler2 deleteAll INFO: [] REMOVING ALL DOCUMENTS FROM INDEX Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrDeletionPolicy onInit INFO: SolrDeletionPolicy.onInit: commits:num=1 commit{dir=C:\solrindexes\index,segFN=segments_6,version=1322041133719,generation=6,filenames=[segments_6] Nov 23, 2011 4:11:27 PM
Unable to index documents using DataImportHandler with MSSQL
Hi I have copied my Solr config from a working Windows server to a new one, and it can't seem to run an import. They're both using win server 2008 and SQL 2008R2. This is the data importer config:
<dataConfig>
  <dataSource type="JdbcDataSource" name="ds1"
              driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://localhost;databaseName=DB"
              user="Solr" password="pwd"/>
  <document name="datas">
    <entity name="data" dataSource="ds1" pk="key"
            query="EXEC SOLR_COMPANY_SEARCH_DATA"
            deltaImportQuery="SELECT * FROM Company_Search_Data WHERE [key]='${dataimporter.delta.key}'"
            deltaQuery="SELECT [key] FROM Company_Search_Data WHERE modify_dt > '${dataimporter.last_index_time}'">
      <field column="WorkDesc_Comments" name="WorkDesc_Comments_Split" />
      <field column="WorkDesc_Comments" name="WorkDesc_Comments_Edge" />
    </entity>
  </document>
</dataConfig>
I can use MS SQL Profiler to watch the Solr user log in successfully, but then nothing. It doesn't seem to even try and execute the stored procedure. Any ideas why this would be working on one server and not on another? FTR the only thing in the tomcat catalina log is: org.apache.solr.handler.dataimport.JdbcDataSource$1 call INFO: Creating a connection for entity data with URL: jdbc:sqlserver://localhost;databaseName=CATLive -- Ian i...@isfluent.com +44 (0)1223 257903
Re: Unable to index documents using DataImportHandler with MSSQL
Update on this: I've established:
* It's not a problem in the DB (I can index from this DB into a Solr instance on another server)
* It's not Tomcat (I get the same problem in Jetty)
* It's not the schema (I have simplified it to one field)
That leaves SolrConfig.xml and data-config. Only thing changed in SolrConfig.xml is adding:
<lib dir="D:/Software/Solr/example/solr/dist/" regex="apache-solr-cell-\d.*\.jar" />
<lib dir="D:/Software/Solr/example/solr/dist/" regex="apache-solr-clustering-\d.*\.jar" />
<lib dir="D:/Software/Solr/example/solr/dist/" regex="apache-solr-dataimporthandler-\d.*\.jar" />
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">D:/Software/Solr/example/solr/conf/data-config.xml</str>
  </lst>
</requestHandler>
And data-config.xml is pretty much as attached - except simpler. Any help or any advice on how to diagnose would be appreciated!
On Fri, Nov 25, 2011 at 12:29 PM, Ian Grainger i...@isfluent.com wrote: Hi I have copied my Solr config from a working Windows server to a new one, and it can't seem to run an import. They're both using win server 2008 and SQL 2008R2. This is the data importer config:
<dataConfig>
  <dataSource type="JdbcDataSource" name="ds1"
              driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://localhost;databaseName=DB"
              user="Solr" password="pwd"/>
  <document name="datas">
    <entity name="data" dataSource="ds1" pk="key"
            query="EXEC SOLR_COMPANY_SEARCH_DATA"
            deltaImportQuery="SELECT * FROM Company_Search_Data WHERE [key]='${dataimporter.delta.key}'"
            deltaQuery="SELECT [key] FROM Company_Search_Data WHERE modify_dt > '${dataimporter.last_index_time}'">
      <field column="WorkDesc_Comments" name="WorkDesc_Comments_Split" />
      <field column="WorkDesc_Comments" name="WorkDesc_Comments_Edge" />
    </entity>
  </document>
</dataConfig>
I can use MS SQL Profiler to watch the Solr user log in successfully, but then nothing. It doesn't seem to even try and execute the stored procedure. Any ideas why this would be working one server and not on another? FTR the only thing in the tomcat catalina log is: org.apache.solr.handler.dataimport.JdbcDataSource$1 call INFO: Creating a connection for entity data with URL: jdbc:sqlserver://localhost;databaseName=CATLive -- Ian i...@isfluent.com +44 (0)1223 257903 -- Ian i...@isfluent.com +44 (0)1223 257903
Solr 3.4 group.truncate does not work with facet queries
Hi, I'm using grouping with group.truncate=true. The following simple facet query: facet.query=Monitor_id:[38 TO 40] doesn't give the same number as the ngroups result (with group.ngroups=true) for the equivalent filter query: fq=Monitor_id:[38 TO 40] I thought they should be the same - from the Wiki page: 'group.truncate: If true, facet counts are based on the most relevant document of each group matching the query.' What am I doing wrong? If I turn off group.truncate then the counts are the same, as I'd expect - but unfortunately I'm only interested in the grouped results. I have also asked this question on StackOverflow, here: http://stackoverflow.com/questions/7905756/solr-3-4-group-truncate-does-not-work-with-facet-queries Thanks! -- Ian i...@isfluent.com a...@endissolutions.com +44 (0)1223 257903
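For reference, the two counts being compared can be reproduced with a SolrJ sketch along these lines; the field being grouped on is not named in the post, so it is passed in as a parameter here.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TruncateVsNgroups {
    // Count A: facet.query evaluated while group.truncate=true.
    public static QueryResponse facetCount(SolrServer solr, String groupField) throws Exception {
        SolrQuery q = new SolrQuery("*:*");
        q.set("group", true);
        q.set("group.field", groupField);
        q.set("group.truncate", true);
        q.setFacet(true);
        q.addFacetQuery("Monitor_id:[38 TO 40]");
        return solr.query(q);
    }

    // Count B: the same range as a filter query, reading ngroups from the grouped result.
    public static QueryResponse ngroupsCount(SolrServer solr, String groupField) throws Exception {
        SolrQuery q = new SolrQuery("*:*");
        q.set("group", true);
        q.set("group.field", groupField);
        q.set("group.ngroups", true);
        q.addFilterQuery("Monitor_id:[38 TO 40]");
        return solr.query(q);
    }
}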
Re: Solr 3.4 group.truncate does not work with facet queries
Thanks, Marijn. I have logged the bug here: https://issues.apache.org/jira/browse/SOLR-2863 Is there any chance of a workaround for this issue before the bug is fixed? If you want to answer the question on StackOverflow: http://stackoverflow.com/questions/7905756/solr-3-4-group-truncate-does-not-work-with-facet-queries I'll accept your answer. On Fri, Oct 28, 2011 at 12:14 PM, Martijn v Groningen martijn.v.gronin...@gmail.com wrote: Hi Ian, I think this is a bug. After looking into the code the facet.query feature doesn't take into account the group.truncate option. This needs to be fixed. You can open a new issue in Jira if you want to. Martijn On 28 October 2011 12:09, Ian Grainger i...@isfluent.com wrote: Hi, I'm using Grouping with group.truncate=true, The following simple facet query: facet.query=Monitor_id:[38 TO 40] Doesn't give the same number as the nGroups result (with grouping.ngroups=true) for the equivalent filter query: fq=Monitor_id:[38 TO 40] I thought they should be the same - from the Wiki page: 'group.truncate: If true, facet counts are based on the most relevant document of each group matching the query.' What am I doing wrong? If I turn off group.truncate then the counts are the same, as I'd expect - but unfortunately I'm only interested in the grouped results. - I have also asked this question on StackOverflow, here: http://stackoverflow.com/questions/7905756/solr-3-4-group-truncate-does-not-work-with-facet-queries Thanks! -- Ian i...@isfluent.com a...@endissolutions.com +44 (0)1223 257903 -- Met vriendelijke groet, Martijn van Groningen -- Ian i...@isfluent.com a...@endissolutions.com +44 (0)1223 257903
Re: Index directories on slaves
This turned out to be a missing SolrDeletionPolicy in the configuration. Once the slaves had a SolrDeletionPolicy, they stopped growing out of control. Ian. On Wed, Aug 17, 2011 at 8:46 AM, Ian Connor ian.con...@gmail.com wrote: Hi, We have noticed that many index.* directories are appearing on slaves (some more than others). e.g. ls shows index/index.20110101021510/ index.20110105030400/ index.20110106040701/ index.20110130031416/ index.20101222081713/ index.20110101034500/ index.20110105075100/ index.20110107085605/ index.20110812153349/ index.20101231011754/ index.20110105022600/ index.20110106024902/ index.20110108014100/ index.20110814204200/ Are this harmful, should I clean them out. I see a command for backup cleanup but am not sure the best way to clean these up (apart from removing all index* and getting a fresh replica). We have also seen on the latest 3.4 build that replicas are getting 1000s of files even though the masters have less than a 100 each. It seems as though they are not deleting after some replications and not sure if this is also related. We are trying to monitor this to see if we can find out how to reproduce it or at least the conditions that tend to reproduce it. -- Regards, Ian Connor 1 Leighton St #723 Cambridge, MA 02141 Call Center Phone: +1 (714) 239 3875 (24 hrs) Fax: +1(770) 818 5697 Skype: ian.connor
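The fix itself isn't shown in the thread; for anyone hitting the same thing, a SolrDeletionPolicy block in solrconfig.xml looks roughly like the following (the values are illustrative, and in the 3.x example configs this element sits inside the <mainIndex> section):

<deletionPolicy class="solr.SolrDeletionPolicy">
  <!-- keep only the most recent commit point, and none extra for optimized commits -->
  <str name="maxCommitsToKeep">1</str>
  <str name="maxOptimizedCommitsToKeep">0</str>
</deletionPolicy>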
Index directories on slaves
Hi, We have noticed that many index.* directories are appearing on slaves (some more than others). e.g. ls shows index/index.20110101021510/ index.20110105030400/ index.20110106040701/ index.20110130031416/ index.20101222081713/ index.20110101034500/ index.20110105075100/ index.20110107085605/ index.20110812153349/ index.20101231011754/ index.20110105022600/ index.20110106024902/ index.20110108014100/ index.20110814204200/ Are this harmful, should I clean them out. I see a command for backup cleanup but am not sure the best way to clean these up (apart from removing all index* and getting a fresh replica). We have also seen on the latest 3.4 build that replicas are getting 1000s of files even though the masters have less than a 100 each. It seems as though they are not deleting after some replications and not sure if this is also related. We are trying to monitor this to see if we can find out how to reproduce it or at least the conditions that tend to reproduce it. -- Regards, Ian Connor 1 Leighton St #723 Cambridge, MA 02141 Call Center Phone: +1 (714) 239 3875 (24 hrs) Fax: +1(770) 818 5697 Skype: ian.connor
Re: solr-ruby: Error undefined method `closed?' for nil:NilClass
That is a good suggestion. At the very least I can catch this error and create a new connection when I see this - thanks. On Sun, Aug 14, 2011 at 3:46 PM, Erik Hatcher erik.hatc...@gmail.comwrote: Does instantiating a Solr::Connection for each request make things better? Erik On Aug 14, 2011, at 11:34 , Ian Connor wrote: It is nothing special - just like this: conn = Solr::Connection.new(http://#{LOCAL_SHARD};, {:timeout = 1000, :autocommit = :on}) options[:shards] = HA_SHARDS response = conn.query(query, options) Where LOCAL_SHARD points to a haproxy of a single shard and HA_SHARDS is an array of 18 shards (via haproxy). Ian. On Mon, Aug 8, 2011 at 12:50 PM, Erik Hatcher erik.hatc...@gmail.com wrote: Ian - What does your solr-ruby using code look like? Solr::Connection is light-weight, so you could just construct a new one of those for each request. Are you keeping an instance around? Erik On Aug 8, 2011, at 12:03 , Ian Connor wrote: Hi, I have seen some of these errors come through from time to time. It looks like: /usr/lib/ruby/1.8/net/http.rb:1060:in `request'\n/usr/lib/ruby/1.8/net/http.rb:845:in `post' /usr/lib/ruby/gems/1.8/gems/solr-ruby-0.0.8/lib/solr/connection.rb:158:in `post' /usr/lib/ruby/gems/1.8/gems/solr-ruby-0.0.8/lib/solr/connection.rb:151:in `send' /usr/lib/ruby/gems/1.8/gems/solr-ruby-0.0.8/lib/solr/connection.rb:174:in `create_and_send_query' /usr/lib/ruby/gems/1.8/gems/solr-ruby-0.0.8/lib/solr/connection.rb:92:in `query' It is as if the http object has gone away. Would it be good to create a new one inside of the connection or is something more serious going on? ubuntu 10.04 passenger 3.0.8 rails 2.3.11 -- Regards, Ian Connor -- Regards, Ian Connor 1 Leighton St #723 Cambridge, MA 02141 Call Center Phone: +1 (714) 239 3875 (24 hrs) Fax: +1(770) 818 5697 Skype: ian.connor -- Regards, Ian Connor 1 Leighton St #723 Cambridge, MA 02141 Call Center Phone: +1 (714) 239 3875 (24 hrs) Fax: +1(770) 818 5697 Skype: ian.connor
Re: solr-ruby: Error undefined method `closed?' for nil:NilClass
It is nothing special - just like this:
conn = Solr::Connection.new("http://#{LOCAL_SHARD}", {:timeout => 1000, :autocommit => :on})
options[:shards] = HA_SHARDS
response = conn.query(query, options)
Where LOCAL_SHARD points to a haproxy of a single shard and HA_SHARDS is an array of 18 shards (via haproxy). Ian.
On Mon, Aug 8, 2011 at 12:50 PM, Erik Hatcher erik.hatc...@gmail.com wrote: Ian - What does your solr-ruby using code look like? Solr::Connection is light-weight, so you could just construct a new one of those for each request. Are you keeping an instance around? Erik On Aug 8, 2011, at 12:03 , Ian Connor wrote: Hi, I have seen some of these errors come through from time to time. It looks like: /usr/lib/ruby/1.8/net/http.rb:1060:in `request'\n/usr/lib/ruby/1.8/net/http.rb:845:in `post' /usr/lib/ruby/gems/1.8/gems/solr-ruby-0.0.8/lib/solr/connection.rb:158:in `post' /usr/lib/ruby/gems/1.8/gems/solr-ruby-0.0.8/lib/solr/connection.rb:151:in `send' /usr/lib/ruby/gems/1.8/gems/solr-ruby-0.0.8/lib/solr/connection.rb:174:in `create_and_send_query' /usr/lib/ruby/gems/1.8/gems/solr-ruby-0.0.8/lib/solr/connection.rb:92:in `query' It is as if the http object has gone away. Would it be good to create a new one inside of the connection or is something more serious going on? ubuntu 10.04 passenger 3.0.8 rails 2.3.11 -- Regards, Ian Connor -- Regards, Ian Connor 1 Leighton St #723 Cambridge, MA 02141 Call Center Phone: +1 (714) 239 3875 (24 hrs) Fax: +1(770) 818 5697 Skype: ian.connor
solr-ruby: Error undefined method `closed?' for nil:NilClass
Hi, I have seen some of these errors come through from time to time. It looks like: /usr/lib/ruby/1.8/net/http.rb:1060:in `request'\n/usr/lib/ruby/1.8/net/http.rb:845:in `post' /usr/lib/ruby/gems/1.8/gems/solr-ruby-0.0.8/lib/solr/connection.rb:158:in `post' /usr/lib/ruby/gems/1.8/gems/solr-ruby-0.0.8/lib/solr/connection.rb:151:in `send' /usr/lib/ruby/gems/1.8/gems/solr-ruby-0.0.8/lib/solr/connection.rb:174:in `create_and_send_query' /usr/lib/ruby/gems/1.8/gems/solr-ruby-0.0.8/lib/solr/connection.rb:92:in `query' It is as if the http object has gone away. Would it be good to create a new one inside of the connection or is something more serious going on? ubuntu 10.04 passenger 3.0.8 rails 2.3.11 -- Regards, Ian Connor
how does Solr/Lucene index multi-value fields
Hi. I want to store a list of documents (say each being 30-60k of text) into a single SolrDocument. (to speed up post-retrieval querying) In order to do this, I need to know if lucene calculates the TF/IDF score over the entire field or does it treat each value in the list as a unique field? If I can't store it as a multi-value, I could create a schema where I put each document into a unique field, but I'm not sure how to create the query to search all the fields. Regards Ian
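For context, "a list of documents in a single SolrDocument" here means a multiValued field populated several times. A minimal SolrJ 3.x sketch, with the field and type names assumed rather than taken from the post:

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class MultiValuedAdd {
    public static void main(String[] args) throws Exception {
        // schema.xml would declare something like:
        //   <field name="story_text" type="text" indexed="true" stored="true" multiValued="true"/>
        CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "story-group-1");
        doc.addField("story_text", "text of related story one ...");
        doc.addField("story_text", "text of related story two ...");   // repeating the field makes it multi-valued
        doc.addField("story_text", "text of related story three ...");
        solr.add(doc);
        solr.commit();
    }
}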
Re: how does Solr/Lucene index multi-value fields
On May 31, 2011, at 12:11 PM, Erick Erickson wrote: Can you explain the use-case a bit more here? Especially the post-query processing and how you expect the multiple documents to help here. we have a collection of related stories. when a user searches for something, we might not want to display the story that is most-relevant (according to SOLR), but according to other home-grown rules. by combing all the possibilities in one SolrDocument, we can avoid a DB-hit to get related stories. But TF/IDF is calculated over all the values in the field. There's really no difference between a multi-valued field and storing all the data in a single field as far as relevance calculations are concerned. so.. it will suck regardless.. I thought we had per-field relevance in the current trunk. :-( Best Erick On Tue, May 31, 2011 at 11:02 AM, Ian Holsman had...@holsman.net wrote: Hi. I want to store a list of documents (say each being 30-60k of text) into a single SolrDocument. (to speed up post-retrieval querying) In order to do this, I need to know if lucene calculates the TF/IDF score over the entire field or does it treat each value in the list as a unique field? If I can't store it as a multi-value, I could create a schema where I put each document into a unique field, but I'm not sure how to create the query to search all the fields. Regards Ian
Re: how does Solr/Lucene index multi-value fields
Thanks Erick. sadly in my use-case I don't that wouldn't work. I'll go back to storing them at the story level, and hitting a DB to get related stories I think. --I On May 31, 2011, at 12:27 PM, Erick Erickson wrote: Hmmm, I may have mis-lead you. Re-reading my text it wasn't very well written TF/IDF calculations are, indeed, per-field. I was trying to say that there was no difference between storing all the data for an individual field as a single long string of text in a single-valued field or as several shorter strings in a multi-valued field. Best Erick On Tue, May 31, 2011 at 12:16 PM, Ian Holsman had...@holsman.net wrote: On May 31, 2011, at 12:11 PM, Erick Erickson wrote: Can you explain the use-case a bit more here? Especially the post-query processing and how you expect the multiple documents to help here. we have a collection of related stories. when a user searches for something, we might not want to display the story that is most-relevant (according to SOLR), but according to other home-grown rules. by combing all the possibilities in one SolrDocument, we can avoid a DB-hit to get related stories. But TF/IDF is calculated over all the values in the field. There's really no difference between a multi-valued field and storing all the data in a single field as far as relevance calculations are concerned. so.. it will suck regardless.. I thought we had per-field relevance in the current trunk. :-( Best Erick On Tue, May 31, 2011 at 11:02 AM, Ian Holsman had...@holsman.net wrote: Hi. I want to store a list of documents (say each being 30-60k of text) into a single SolrDocument. (to speed up post-retrieval querying) In order to do this, I need to know if lucene calculates the TF/IDF score over the entire field or does it treat each value in the list as a unique field? If I can't store it as a multi-value, I could create a schema where I put each document into a unique field, but I'm not sure how to create the query to search all the fields. Regards Ian
Boosting score by distance
I have a bunch of documents representing points of interest indexed in Solr. I'm trying to boost the score of documents based on distance from an origin point, and having some difficulty. I'm currently using the standard query parser and sending in this query: (name:sushi OR tags:sushi OR classifiers:sushi) AND deleted:False AND owner:simplegeo I'm also using the spatial search to limit results to ones found within 25km of my origin point. The issue I'm having is that I need the score to be a blend of the FT match _and_ distance from the origin point; If i sort by distance, lots of low quality matches clog up the results for simple searches, but if I sort by score, more distant results overwhelm nearby, though less relevant (according to Solr) results. I think what I want to do is boost the score of documents based on the distance from the origin search point. Alternately, if there was some way to treat a match on any of the three fields as having equal weight, I believe that would get me much closer to what I want. The examples I've seen for doing this kind of thing use dismax and its boost function (`bf') parameter. I don't know if my queries are translatable to dismax syntax as they are now, and it looks like the boost functions don't work with the standard query parser — at least, I have been completely unable to change the score when using it. Is there some way to boost by the inverse of the distance using the standard query parser, or alternately, to filter my results by different fields with the dismax parser?
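One common way to blend text relevance with distance is an edismax query whose bf adds a boost that decays with geodist(). Below is a hedged SolrJ sketch: the location field name, origin point, and recip() constants are assumptions to be tuned, not values from the post, and the equal-weight qf covers the "treat a match on any of the three fields as having equal weight" idea.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DistanceBoostedSearch {
    public static QueryResponse search(SolrServer solr) throws Exception {
        SolrQuery q = new SolrQuery("sushi");
        q.set("defType", "edismax");
        q.set("qf", "name tags classifiers");          // equal weight across the three fields
        q.addFilterQuery("deleted:False", "owner:simplegeo");
        q.addFilterQuery("{!geofilt}");                // keep the 25 km radius restriction
        q.set("sfield", "location");                   // assumed LatLonType field name
        q.set("pt", "37.7749,-122.4194");              // assumed origin point
        q.set("d", "25");
        q.set("bf", "recip(geodist(),2,200,20)");      // additive boost that falls off with distance
        return solr.query(q);
    }
}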
Re: resetting stats
Has there been any progress on this or tools people might use to capture the average or 90% time for the last hour? That would allow us to better match up slowness with other metrics like CPU/IO/Memory to find bottlenecks in the system. Thanks, Ian. On Wed, Mar 31, 2010 at 9:13 PM, Chris Hostetter hossman_luc...@fucit.orgwrote: : Say I have 3 Cores names core0, core1, and core2, where only core1 and core2 : have documents and caches. If all my searches hit core0, and core0 shards : out to core1 and core2, then the stats from core0 would be accurate for : errors, timeouts, totalTime, avgTimePerRequest, avgRequestsPerSecond, etc. Ahhh yes. (i see what you mean by aggregating core now ... i thought you ment a core just for aggregatign stats) *If* you are using distributed search, then you can gather stats from the core you use for collating/aggregating from the other shards, and reloading that core should be cheap. but if you aren't already using distributed searching, it would be a bad idea from a performance standpoint to add it just to take advantage of being able to reload the coordinator core (the overhead of searching one distributed shard vs doing the same query directly is usually very measurable, even on if the shard is the same Solr instance as your coordinator) -Hoss -- Regards, Ian Connor 1 Leighton St #723 Cambridge, MA 02141 Call Center Phone: +1 (714) 239 3875 (24 hrs) Fax: +1(770) 818 5697 Skype: ian.connor
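A low-tech way to get hourly figures, until something better exists, is to snapshot the statistics page from cron and diff the cumulative counters between snapshots. A sketch (the URL assumes the example port and a core named core0):

# hourly snapshot of request handler stats
curl -s "http://localhost:8983/solr/core0/admin/stats.jsp" > stats-$(date +%Y%m%d%H).xml

totalTime and the request counts are cumulative since the last core (re)load, so subtracting successive snapshots gives per-hour averages; percentiles such as the 90% time are not exposed, so those still have to come from your own request logging.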
Solr rate limiting / DoS attacks
Hi, I'm curious as to what approaches one would take to defend against users attacking a Solr service, especially if exposed to the internet as opposed to an intranet. I'm fairly new to Solr, is there anything built in? Is there anything in place to prevent the search engine from getting overwhelmed by a particular user or group of users, submitting loads of time-consuming queries as some form of a DoS attack? Additionally, is there a way of rate-limiting it so that only a certain number of queries per user/per hour can be submitted, etc? (for example, to prevent programmatic access to the search engine as opposed to a human user) Thanks, Ian
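Nothing like this is built into Solr itself; the usual advice is to keep Solr off the public internet and put a reverse proxy in front that enforces the limits. As a sketch only (this happens to use nginx's limit_req module; the zone name, rate and backend address are illustrative):

# nginx.conf, http block: roughly 5 queries/sec per client IP with a small burst
limit_req_zone $binary_remote_addr zone=solrlimit:10m rate=5r/s;

server {
    listen 80;
    location /solr/select {
        limit_req zone=solrlimit burst=10;
        proxy_pass http://127.0.0.1:8983;
    }
    # keep /solr/update and /solr/admin unreachable from outside
    location /solr/ { deny all; }
}

Per-user quotas (N queries per hour) are better handled in the application layer that sits between users and Solr, since Solr has no notion of a user.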
Re: getting a list of top page-ranked webpages
On Fri, 17 Sep 2010 04:46:44 -0700 (PDT), kenf_nc ken.fos...@realestate.com wrote: A slightly different route to take, but one that should help test/refine a semantic parser, is Wikipedia. They make available their entire corpus, or any subset you define. The whole thing is like 14 terabytes, but you can get smaller sets. Actually, I do heavy analysis of the entire Wikipedia corpus, plus the 1M top webpages from Alexa, and all of the dmoz URLs, in order to build the semantic engine in the first place. However, an outside corpus is required to test its quality outside of this space. Cheers, Ian
Re: getting a list of top page-ranked webpages
On Thu, 16 Sep 2010 15:31:02 -0700, you wrote: The public terabyte dataset project would be a good match for what you need. http://bixolabs.com/datasets/public-terabyte-dataset-project/ Of course, that means we have to actually finish the crawl and finalize the Avro format we use for the data :) There are other free collections of data around, though none that I know of which target top-ranked pages. -- Ken Hi Ken, this looks exactly like what I need. There is the ClueWeb dataset, http://boston.lti.cs.cmu.edu/Data/clueweb09/ However, one must buy it from them, the crawl was done in '09, and it includes a number of hard drives which are shipped to you. Any crawl that would be available as an Amazon Public Dataset would be totally perfect. Ian
getting a list of top page-ranked webpages
Hi, this question is a little off topic, but I thought since so many people on this are probably experts in this field, someone may know. I'm experimenting with my own semantic-based search engine, but I want to test it with a large corpus of web pages. Ideally I would like to have a list of the top 10M or top 100M page-ranked URL's in the world. Short of using Nutch to crawl the entire web and build this page-rank, is there any other ways? What other ways or resources might be available for me to get this (smaller) corpus of top webpages? Thanks, Ian
Re: How to find first document for the ALL search
Hi, The good news is that: /solr/select?q=*%3A*&fq=&start=1&rows=1&fl=id did work (kind of odd really) so I am reading all the documents from the bad one onwards into a new Solr with the same configuration, using Ruby (complete rebuild). So far so good - it has gone through 500k out of 1.7M and seems to be the best I could think of. Running the Luke tool and trying to check the index on a copy ended up destroying the index and leaving only about 5k documents. Reading them out via Ruby seemed better in this case (and less work than restoring from backup and re-running a few days of transactions to catch it up). Ian. On Wed, Jul 14, 2010 at 9:22 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : I have found that this search crashes: : : /solr/select?q=*%3A*&fq=&start=0&rows=1&fl=id Ouch .. that exception is kind of hairy. It suggests that your index may have been corrupted in some way -- do you have any idea what happened? Have you tried using the CheckIndex tool to see what it says? (I'd hate to help you work around this but get bit by a timebomb of some other bad docs later) : It looks like just that first document is bad. I am happy to delete it - but : not sure how to get to it. Does anyone know how to find it? CheckIndex might help ... if it doesn't, the next thing you might try is asking for a legitimate field name that you know no document has (ie: if you have a dynamicField with the pattern str_* because you have fields like str_foo and str_bar but you never have fields named strBOGUS, then use fl=strBOGUS) and then add debugQuery=true to the URL -- the debug info should contain the id. I'll be honest though: I'm guessing that if your example query doesn't work, my suggestion won't either -- because if you get that error just trying to access the id field, the same thing will probably happen when the debugComponent tries to look it up as well. -Hoss -- Regards, Ian Connor 1 Leighton St #723 Cambridge, MA 02141 Call Center Phone: +1 (714) 239 3875 (24 hrs) Fax: +1(770) 818 5697 Skype: ian.connor
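For anyone hitting the same kind of corruption, the fl trick Hoss describes looks like this in practice (strBOGUS stands for any field name that no document actually has):

/solr/select?q=*%3A*&start=0&rows=1&fl=strBOGUS&debugQuery=true

If the request survives, the offending document's id appears in the debug output; if it still throws the IndexOutOfBoundsException, reading the documents out around it (as above) or running CheckIndex are the remaining options.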
How to find first document for the ALL search
I have found that this search crashes: /solr/select?q=*%3A*&fq=&start=0&rows=1&fl=id SEVERE: java.lang.IndexOutOfBoundsException: Index: 114, Size: 90 at java.util.ArrayList.RangeCheck(ArrayList.java:547) at java.util.ArrayList.get(ArrayList.java:322) at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:288) at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:217) at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:948) at org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:506) at org.apache.solr.search.SolrIndexReader.document(SolrIndexReader.java:259) but this one works: /solr/select?q=*%3A*&fq=&start=1&rows=1&fl=id It looks like just that first document is bad. I am happy to delete it - but not sure how to get to it. Does anyone know how to find it? - Ian
Generating a sitemap
Been testing nutch to crawl for solr and I was wondering if anyone had already worked on a system for getting the urls out of solr and generating an XML sitemap for Google.
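One workable approach is to page the url field out of Solr and wrap it in sitemap markup; a sketch (the field name url and the page size are assumptions about your schema):

/solr/select?q=*%3A*&fl=url&start=0&rows=1000
/solr/select?q=*%3A*&fl=url&start=1000&rows=1000

with each returned value becoming one entry in sitemap.xml:

<url>
  <loc>http://www.example.com/some/page.html</loc>
</url>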
RE: Handling and sorting email addresses
Thanks Mitch, using the analysis page has been a real eye-opener and given me a better insight into how Solr was applying the filters (and more importantly in which order). I've ironically ended up with a charFilter mapping file as this seemed the only route to replacing characters before the tokenizer kicked in, unfortunately Solr just refused to allow sorting on anything tokenized with characters other than whitespace. Cheers, Ian. -Original Message- From: MitchK [mailto:mitc...@web.de] Sent: 07 March 2010 22:44 To: solr-user@lucene.apache.org Subject: Re: Handling and sorting email addresses Ian, did you have a look at Solr's admin analysis.jsp? When everything on the analysis's page is fine, you have missunderstood Solr's schema.xml-file. You've set two attributes in your schema.xml: stored = true indexed = true What you get as a response is the stored field value. The stored field value is the original field value, without any modifications. However, Solr is using the indexed field value to query your data. Kind regards - Mitch Ian Battersby wrote: Forgive what might seem like a newbie question but am struggling desperately with this. We have a dynamic field that holds email address and we'd like to be able to sort by it, obviously when trying to do this we get an error as it thinks the email address is a tokenized field. We've tried a custom field type using PatternReplaceFilterFactory to specify that @ and . should be replaced with AT and DOT but we just can't seem to get it to work, all the field still contain the unparsed email. We used an example found on the mailing-list for the field type: fieldType name=email class=solr.TextField positionIncrementGap=100 analyzer tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.PatternReplaceFilterFactory pattern=\. replacement= DOT replace=all / filter class=solr.PatternReplaceFilterFactory pattern=@ replacement= AT replace=all / filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=0 catenateNumbers=0 catenateAll=0 splitOnCaseChange=0/ /analyzer /fieldType .. our dynamic field looks like .. dynamicField name=dynamicemail_* type=email indexed=true stored=true multiValued=true / When writing a document to Solr it still seems to write the original email address (e.g. this.u...@somewhere.com) opposed to its parsed version (e.g. this DOT user AT somewhere DOT com). Can anyone help? We are running version 1.4 but have even tried the nightly build in an attempt to solve this problem. Thanks. -- View this message in context: http://old.nabble.com/Handling-and-sorting-email-addresses-tp27813111p278152 39.html Sent from the Solr - User mailing list archive at Nabble.com.
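For anyone finding this thread later, the charFilter route Ian settled on ends up looking roughly like this (type, field and mapping-file names are illustrative). The two key points are that a MappingCharFilter runs before the tokenizer, and that a sortable field must produce a single token, hence the KeywordTokenizer:

<fieldType name="email_sort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-email.txt"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

where mapping-email.txt contains:

"@" => " AT "
"." => " DOT "

A single-valued field of this type sorts cleanly; a separate copyField can keep a tokenized version for searching.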
Handling and sorting email addresses
Forgive what might seem like a newbie question but am struggling desperately with this. We have a dynamic field that holds email address and we'd like to be able to sort by it, obviously when trying to do this we get an error as it thinks the email address is a tokenized field. We've tried a custom field type using PatternReplaceFilterFactory to specify that @ and . should be replaced with AT and DOT but we just can't seem to get it to work, all the field still contain the unparsed email. We used an example found on the mailing-list for the field type: fieldType name=email class=solr.TextField positionIncrementGap=100 analyzer tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.PatternReplaceFilterFactory pattern=\. replacement= DOT replace=all / filter class=solr.PatternReplaceFilterFactory pattern=@ replacement= AT replace=all / filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=0 catenateNumbers=0 catenateAll=0 splitOnCaseChange=0/ /analyzer /fieldType .. our dynamic field looks like .. dynamicField name=dynamicemail_* type=email indexed=true stored=true multiValued=true / When writing a document to Solr it still seems to write the original email address (e.g. this.u...@somewhere.com) opposed to its parsed version (e.g. this DOT user AT somewhere DOT com). Can anyone help? We are running version 1.4 but have even tried the nightly build in an attempt to solve this problem. Thanks.
[ANN] Zoie Solr Plugin - Zoie Solr Plugin enables real-time update functionality for Apache Solr 1.4+
I just saw this on twitter, and thought you guys would be interested.. I haven't tried it, but it looks interesting. http://snaprojects.jira.com/wiki/display/ZOIE/Zoie+Solr+Plugin Thanks for the RT Shalin!
Re: If you could have one feature in Solr...
On 2/24/10 8:42 AM, Grant Ingersoll wrote: What would it be? most of this will be coming in 1.5, but for me it's - sharding.. it still seems a bit clunky secondly.. this one isn't in 1.5. I'd like to be able to find interesting terms that appear in my result set that don't appear in the global corpus. it's kind of like doing a facet count on *:* and then on the search term and discount the terms that appear heavily on the global one. (sorry.. there is a textbook definition of this.. XX distance.. but I haven't got the books in front of me).
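The *:* versus query facet comparison described above can at least be prototyped today with two requests and a diff on the client side (the field name content is illustrative, and faceting on a large tokenized field is expensive):

/solr/select?q=*%3A*&rows=0&facet=true&facet.field=content&facet.limit=200
/solr/select?q=your+search+term&rows=0&facet=true&facet.field=content&facet.limit=200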
Re: HTTP ERROR: 404 missing core name in path after integrating nutch
Just wanted to give an update on my efforts. I installed the Feb. 26 update this morning. Was able to access /solr/admin. Copied over the nutch schema.xml. restarted solr and was able to access /solr/admin Edited solrconfig.xml to add the nutch requesthandler snippet from lucidimagination. Restarted solr and got the 404 missing core name in path error. What in the requesthandler snippet (see below) could be causing this error? from http://bit.ly/1mOb

<requestHandler name="/nutch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">content^0.5 anchor^1.0 title^1.2</str>
    <str name="pf">content^0.5 anchor^1.5 title^1.2 site^1.5</str>
    <str name="fl">url</str>
    <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
    <int name="ps">100</int>
    <bool hl="true"/>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">title url content</str>
    <str name="f.title.hl.fragsize">0</str>
    <str name="f.title.hl.alternateField">title</str>
    <str name="f.url.hl.fragsize">0</str>
    <str name="f.url.hl.alternateField">url</str>
    <str name="f.content.hl.fragmenter">regex</str>
  </lst>
</requestHandler>

Have a great weekend.
HTTP ERROR: 404 missing core name in path after integrating nutch
Hi everyone, Last night I was able to get solr up and running. Ran and was able to access: http://localhost:8983/solr/admin This morning, I started on the nutch crawling instructions over at: http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ After adding the following to /solr/conf/solrconfig.xml:

<requestHandler name="/nutch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">content^0.5 anchor^1.0 title^1.2</str>
    <str name="pf">content^0.5 anchor^1.5 title^1.2 site^1.5</str>
    <str name="fl">url</str>
    <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
    <int name="ps">100</int>
    <bool hl="true"/>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">title url content</str>
    <str name="f.title.hl.fragsize">0</str>
    <str name="f.title.hl.alternateField">title</str>
    <str name="f.url.hl.fragsize">0</str>
    <str name="f.url.hl.alternateField">url</str>
    <str name="f.content.hl.fragmenter">regex</str>
  </lst>
</requestHandler>

going to http://localhost:8983/solr/admin suddenly throws a HTTP ERROR: 404 missing core name in path Why would adding the above snippet suddenly throw that error? Thanks.
Re: Distributed search and haproxy and connection build up
Not yet - but thanks for the link. I think that the OS also has a timeout that keeps it around even after this event and with heavy traffic I have seen this build up. Having said all this, the performance impact after testing was negligible for us but I thought I would post that haproxy can cause large numbers of connections on a busy site. Going directly to shards does cut the number of connections down a lot if someone else finds this to be a problem. I am looking forward to distribution under 1.5 where the | option allows redundancy in the request. This will solve the persistence problem while still allowing failover for the shard requests. Even after 1.5, I would then still advocate haproxy between ruby (or your http stack) and solr. It is just when Solr is sharding the request it can keep its connections open and save some resources here. Ian. On Thu, Feb 11, 2010 at 11:49 AM, Tim Underwood timunderw...@gmail.comwrote: Have you played around with the option httpclose or the option forceclose configuration options in HAProxy (both documented here: http://haproxy.1wt.eu/download/1.3/doc/configuration.txt)? -Tim On Wed, Feb 10, 2010 at 10:05 AM, Ian Connor ian.con...@gmail.com wrote: Thanks, I bypassed haproxy as a test and it did reduce the number of connections - but it did not seem as those these connections were hurting anything. Ian. On Tue, Feb 9, 2010 at 11:01 PM, Lance Norskog goks...@gmail.com wrote: This goes through the Apache Commons HTTP client library: http://hc.apache.org/httpclient-3.x/ We used 'balance' at another project and did not have any problems. On Tue, Feb 9, 2010 at 5:54 AM, Ian Connor ian.con...@gmail.com wrote: I have been using distributed search with haproxy but noticed that I am suffering a little from tcp connections building up waiting for the OS level closing/time out: netstat -a ... tcp6 1 0 10.0.16.170%34654:53789 10.0.16.181%363574:8893 CLOSE_WAIT tcp6 1 0 10.0.16.170%34654:43932 10.0.16.181%363574:8890 CLOSE_WAIT tcp6 1 0 10.0.16.170%34654:43190 10.0.16.181%363574:8895 CLOSE_WAIT tcp6 0 0 10.0.16.170%346547:8984 10.0.16.181%36357:53770 TIME_WAIT tcp6 1 0 10.0.16.170%34654:41782 10.0.16.181%363574: CLOSE_WAIT tcp6 1 0 10.0.16.170%34654:52169 10.0.16.181%363574:8890 CLOSE_WAIT tcp6 1 0 10.0.16.170%34654:55947 10.0.16.181%363574:8887 CLOSE_WAIT tcp6 0 0 10.0.16.170%346547:8984 10.0.16.181%36357:54040 TIME_WAIT tcp6 1 0 10.0.16.170%34654:40030 10.0.16.160%363574:8984 CLOSE_WAIT ... Digging a little into the haproxy documentation, it seems that they do not support persistent connections. Does solr normally persist the connections between shards (would this problem happen even without haproxy)? Ian. -- Lance Norskog goks...@gmail.com -- Regards, Ian Connor
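For anyone else tuning this, the HAProxy options Tim mentions go in the defaults (or a specific listen) section; a sketch:

defaults
    mode http
    option httpclose     # actively close the connection after each response
    # option forceclose  # harsher variant: also closes the server-side channel immediately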
Has anyone done request logging with Solr-Ruby for use in Rails?
The idea is that the Rails log currently looks like: Completed in 1290ms (View: 152, DB: 75) | 200 OK [ http://localhost:3000/search?q=nik+gene+cluster&view=2] I want to extend it to also track the Solr query times and time spent in solr-ruby, like: Completed in 1290ms (View: 152, DB: 75, Solr: 334) | 200 OK [ http://localhost:3000/search?q=nik+gene+cluster&view=2] Has anyone done such a plug-in or extension already? -- Regards, Ian Connor
Re: Has anyone done request logging with Solr-Ruby for use in Rails?
This seems to allow you to log each query - which is a good start. I was thinking of something that would add all the ms together and report it in the completed at line so you can get a higher level view of which requests take the time and where. Ian. On Thu, Feb 11, 2010 at 1:13 PM, Mat Brown m...@patch.com wrote: On Thu, Feb 11, 2010 at 13:07, Ian Connor ian.con...@gmail.com wrote: The idea is that in the log is currently like: Completed in 1290ms (View: 152, DB: 75) | 200 OK [ http://localhost:3000/search?q=nik+gene+clusterview=2] I want to extend it to also track the Solr query times and time spent in solr-ruby like: Completed in 1290ms (View: 152, DB: 75, Solr: 334) | 200 OK [ http://localhost:3000/search?q=nik+gene+clusterview=2] Has anyone done such a plug-in or extension already? -- Regards, Ian Connor Here's a module in Sunspot::Rails that does that. It's written against RSolr, which is an alternative to solr-ruby, but the concept is the same: http://github.com/outoftime/sunspot/blob/master/sunspot_rails/lib/sunspot/rails/solr_logging.rb -- Regards, Ian Connor 1 Leighton St #723 Cambridge, MA 02141 Call Center Phone: +1 (714) 239 3875 (24 hrs) Fax: +1(770) 818 5697 Skype: ian.connor
Re: Has anyone done request logging with Solr-Ruby for use in Rails?
...and probably break stuff - that might be why it hasn't been done. On Thu, Feb 11, 2010 at 1:28 PM, Mat Brown m...@patch.com wrote: Oh - indeed - sorry, didn't read your email closely enough : ) Yeah that would probably involve some pretty crufty monkey patching / use of globals... On Thu, Feb 11, 2010 at 13:22, Ian Connor ian.con...@gmail.com wrote: This seems to allow you to log each query - which is a good start. I was thinking of something that would add all the ms together and report it in the completed at line so you can get a higher level view of which requests take the time and where. Ian. On Thu, Feb 11, 2010 at 1:13 PM, Mat Brown m...@patch.com wrote: On Thu, Feb 11, 2010 at 13:07, Ian Connor ian.con...@gmail.com wrote: The idea is that in the log is currently like: Completed in 1290ms (View: 152, DB: 75) | 200 OK [ http://localhost:3000/search?q=nik+gene+clusterview=2] I want to extend it to also track the Solr query times and time spent in solr-ruby like: Completed in 1290ms (View: 152, DB: 75, Solr: 334) | 200 OK [ http://localhost:3000/search?q=nik+gene+clusterview=2] Has anyone done such a plug-in or extension already? -- Regards, Ian Connor Here's a module in Sunspot::Rails that does that. It's written against RSolr, which is an alternative to solr-ruby, but the concept is the same: http://github.com/outoftime/sunspot/blob/master/sunspot_rails/lib/sunspot/rails/solr_logging.rb -- Regards, Ian Connor 1 Leighton St #723 Cambridge, MA 02141 Call Center Phone: +1 (714) 239 3875 (24 hrs) Fax: +1(770) 818 5697 Skype: ian.connor -- Regards, Ian Connor 1 Leighton St #723 Cambridge, MA 02141 Call Center Phone: +1 (714) 239 3875 (24 hrs) Fax: +1(770) 818 5697 Skype: ian.connor
Re: Distributed search and haproxy and connection build up
Thanks, I bypassed haproxy as a test and it did reduce the number of connections - but it did not seem as those these connections were hurting anything. Ian. On Tue, Feb 9, 2010 at 11:01 PM, Lance Norskog goks...@gmail.com wrote: This goes through the Apache Commons HTTP client library: http://hc.apache.org/httpclient-3.x/ We used 'balance' at another project and did not have any problems. On Tue, Feb 9, 2010 at 5:54 AM, Ian Connor ian.con...@gmail.com wrote: I have been using distributed search with haproxy but noticed that I am suffering a little from tcp connections building up waiting for the OS level closing/time out: netstat -a ... tcp6 1 0 10.0.16.170%34654:53789 10.0.16.181%363574:8893 CLOSE_WAIT tcp6 1 0 10.0.16.170%34654:43932 10.0.16.181%363574:8890 CLOSE_WAIT tcp6 1 0 10.0.16.170%34654:43190 10.0.16.181%363574:8895 CLOSE_WAIT tcp6 0 0 10.0.16.170%346547:8984 10.0.16.181%36357:53770 TIME_WAIT tcp6 1 0 10.0.16.170%34654:41782 10.0.16.181%363574: CLOSE_WAIT tcp6 1 0 10.0.16.170%34654:52169 10.0.16.181%363574:8890 CLOSE_WAIT tcp6 1 0 10.0.16.170%34654:55947 10.0.16.181%363574:8887 CLOSE_WAIT tcp6 0 0 10.0.16.170%346547:8984 10.0.16.181%36357:54040 TIME_WAIT tcp6 1 0 10.0.16.170%34654:40030 10.0.16.160%363574:8984 CLOSE_WAIT ... Digging a little into the haproxy documentation, it seems that they do not support persistent connections. Does solr normally persist the connections between shards (would this problem happen even without haproxy)? Ian. -- Lance Norskog goks...@gmail.com -- Regards, Ian Connor
Distributed search and haproxy and connection build up
I have been using distributed search with haproxy but noticed that I am suffering a little from tcp connections building up waiting for the OS level closing/time out: netstat -a ... tcp6 1 0 10.0.16.170%34654:53789 10.0.16.181%363574:8893 CLOSE_WAIT tcp6 1 0 10.0.16.170%34654:43932 10.0.16.181%363574:8890 CLOSE_WAIT tcp6 1 0 10.0.16.170%34654:43190 10.0.16.181%363574:8895 CLOSE_WAIT tcp6 0 0 10.0.16.170%346547:8984 10.0.16.181%36357:53770 TIME_WAIT tcp6 1 0 10.0.16.170%34654:41782 10.0.16.181%363574: CLOSE_WAIT tcp6 1 0 10.0.16.170%34654:52169 10.0.16.181%363574:8890 CLOSE_WAIT tcp6 1 0 10.0.16.170%34654:55947 10.0.16.181%363574:8887 CLOSE_WAIT tcp6 0 0 10.0.16.170%346547:8984 10.0.16.181%36357:54040 TIME_WAIT tcp6 1 0 10.0.16.170%34654:40030 10.0.16.160%363574:8984 CLOSE_WAIT ... Digging a little into the haproxy documentation, it seems that they do not support persistent connections. Does solr normally persist the connections between shards (would this problem happen even without haproxy)? Ian.
Re: distributed search and failed core
My only suggestion is to put haproxy in front of two replicas and then have haproxy do the failover. If a shard fails, the whole search will fail unless you do something like this. On Fri, Jan 29, 2010 at 3:31 PM, Joe Calderon calderon@gmail.comwrote: hello *, in distributed search when a shard goes down, an error is returned and the search fails, is there a way to avoid the error and return the results from the shards that are still up? thx much --joe -- Regards, Ian Connor
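To make that concrete, each shard address handed to Solr points at an HAProxy listener, and each listener balances over that shard's replicas; a sketch (hosts, ports and the ping URL are illustrative):

listen shard1 0.0.0.0:8893
    mode http
    balance roundrobin
    option httpchk GET /solr/admin/ping
    server shard1_a 10.0.16.181:8983 check
    server shard1_b 10.0.16.182:8983 check

The query side then uses shards=proxyhost:8893/solr,proxyhost:8894/solr so that a dead replica is taken out of rotation instead of failing the whole search.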
Re: Lock problems: Lock obtain timed out
Can anyone think of a reason why these locks would hang around for more than 2 hours? I have been monitoring them and they look like they are very short lived. On Tue, Jan 26, 2010 at 10:15 AM, Ian Connor ian.con...@gmail.com wrote: We traced one of the lock files, and it had been around for 3 hours. A restart removed it - but is 3 hours normal for one of these locks? Ian. On Mon, Jan 25, 2010 at 4:14 PM, mike anderson saidthero...@gmail.comwrote: I am getting this exception as well, but disk space is not my problem. What else can I do to debug this? The solr log doesn't appear to lend any other clues.. Jan 25, 2010 4:02:22 PM org.apache.solr.core.SolrCore execute INFO: [] webapp=/solr path=/update params={} status=500 QTime=1990 Jan 25, 2010 4:02:22 PM org.apache.solr.common.SolrException log SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@ /solr8984/index/lucene-98c1cb272eb9e828b1357f68112231e0-write.lock at org.apache.lucene.store.Lock.obtain(Lock.java:85) at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1545) at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1402) at org.apache.solr.update.SolrIndexWriter.init(SolrIndexWriter.java:190) at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:98) at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:173) at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:220) at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61) at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139) at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405) at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211) at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139) at org.mortbay.jetty.Server.handle(Server.java:285) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502) at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378) at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226) at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442) Should I consider changing the lock timeout settings (currently set to defaults)? 
If so, I'm not sure what to base these values on. Thanks in advance, mike On Wed, Nov 4, 2009 at 8:27 PM, Lance Norskog goks...@gmail.com wrote: This will not ever work reliably. You should have 2x total disk space for the index. Optimize, for one, requires this. On Wed, Nov 4, 2009 at 6:37 AM, Jérôme Etévé jerome.et...@gmail.com wrote: Hi, It seems this situation is caused by some No space left on device exeptions: SEVERE: java.io.IOException: No space left on device at java.io.RandomAccessFile.writeBytes(Native Method) at java.io.RandomAccessFile.write(RandomAccessFile.java:466) at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexOutput.flushBuffer(SimpleFSDirectory.java:192) at org.apache.lucene.store.BufferedIndexOutput.flushBuffer(BufferedIndexOutput.java:96) I'd better try to set my maxMergeDocs and mergeFactor to more adequates values for my app (I'm indexing ~15 Gb of data on 20Gb device, so I guess there's problem when solr tries to merge the index bits being build. At the moment, they are set to mergeFactor100/mergeFactor and maxMergeDocs2147483647/maxMergeDocs Jerome. -- Jerome Eteve. http://www.eteve.net jer...@eteve.net -- Lance Norskog goks...@gmail.com -- Regards, Ian Connor 1 Leighton St
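For reference, the lock settings Mike asks about live in solrconfig.xml; in the stock 1.4 example configuration they look roughly like this (the values shown are the shipped defaults):

<indexDefaults>
  <writeLockTimeout>1000</writeLockTimeout>
</indexDefaults>
<mainIndex>
  <unlockOnStartup>false</unlockOnStartup>
</mainIndex>

Raising writeLockTimeout only papers over whatever is holding the lock; unlockOnStartup forcibly removes a stale lock at startup and should be used with care.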
Re: Lock problems: Lock obtain timed out
We traced one of the lock files, and it had been around for 3 hours. A restart removed it - but is 3 hours normal for one of these locks? Ian. On Mon, Jan 25, 2010 at 4:14 PM, mike anderson saidthero...@gmail.comwrote: I am getting this exception as well, but disk space is not my problem. What else can I do to debug this? The solr log doesn't appear to lend any other clues.. Jan 25, 2010 4:02:22 PM org.apache.solr.core.SolrCore execute INFO: [] webapp=/solr path=/update params={} status=500 QTime=1990 Jan 25, 2010 4:02:22 PM org.apache.solr.common.SolrException log SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@ /solr8984/index/lucene-98c1cb272eb9e828b1357f68112231e0-write.lock at org.apache.lucene.store.Lock.obtain(Lock.java:85) at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1545) at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1402) at org.apache.solr.update.SolrIndexWriter.init(SolrIndexWriter.java:190) at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:98) at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:173) at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:220) at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61) at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139) at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405) at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211) at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139) at org.mortbay.jetty.Server.handle(Server.java:285) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502) at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378) at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226) at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442) Should I consider changing the lock timeout settings (currently set to defaults)? If so, I'm not sure what to base these values on. Thanks in advance, mike On Wed, Nov 4, 2009 at 8:27 PM, Lance Norskog goks...@gmail.com wrote: This will not ever work reliably. You should have 2x total disk space for the index. Optimize, for one, requires this. 
On Wed, Nov 4, 2009 at 6:37 AM, Jérôme Etévé jerome.et...@gmail.com wrote: Hi, It seems this situation is caused by some No space left on device exeptions: SEVERE: java.io.IOException: No space left on device at java.io.RandomAccessFile.writeBytes(Native Method) at java.io.RandomAccessFile.write(RandomAccessFile.java:466) at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexOutput.flushBuffer(SimpleFSDirectory.java:192) at org.apache.lucene.store.BufferedIndexOutput.flushBuffer(BufferedIndexOutput.java:96) I'd better try to set my maxMergeDocs and mergeFactor to more adequates values for my app (I'm indexing ~15 Gb of data on 20Gb device, so I guess there's problem when solr tries to merge the index bits being build. At the moment, they are set to mergeFactor100/mergeFactor and maxMergeDocs2147483647/maxMergeDocs Jerome. -- Jerome Eteve. http://www.eteve.net jer...@eteve.net -- Lance Norskog goks...@gmail.com
Re: checkindex
When I needed to use it, I couldn't find docs for it either but it's straight forward. Here's what I did: un-jar the solr war file to find the lucene jar that solr was using and run CheckIndex like this java -cp lucene-core-2.9-dev.jar org.apache.lucene.index.CheckIndex /path/to/solr/data/index/ to actually *fix* the index, add the -fix argument java -cp lucene-core-2.9-dev.jar org.apache.lucene.index.CheckIndex -fix /path/to/solr/data/index/ hope that helps, -Ian On 1/8/10 2:09 PM, Giovanni Fernandez-Kincade wrote: I've seen many mentions of the Lucene CheckIndex tool, but where can I find it? Is there any documentation on how to use it? I noticed Luke has it built-in, but I can't get Luke to open my index with the Don't open IndexReader(when opening corrupted index) option check. Opening even an index I know is valid doesn't work using this option: -- Ian Kallen blog: http://www.arachna.com/roller/spidaman tweetz: http://twitter.com/spidaman vox: 925.385.8426
Re: Improvising solr queries
On 1/5/10 12:46 AM, Shalin Shekhar Mangar wrote: (sitename:XYZ OR sitename:"All Sites") AND (localeid:1237400589415) AND ((assettype:Gallery)) AND (rbcategory:"ABC XYZ") AND (startdate:[* TO 2009-12-07T23:59:00Z] AND enddate:[2009-12-07T00:00:00Z TO *])&rows=9&start=63&sort=date desc&facet=true&facet.field=assettype&facet.mincount=1 Similar to this query we have several more complex queries supporting all the major landing pages of our application. Just want to confirm whether anyone can identify any major flaws or issues in the sample query? I'm not the expert Shalin is, but I seem to remember sorting by date was pretty rough on CPU. (this could have been resolved since I last looked at it) The other thing I'd question is the facet. It looks like you're only retrieving a single assettype (Gallery), so you will only get a single facet value back. If that's the case, wouldn't the rows returned (which is part of the response) give you the same answer? Most of those AND conditions can be separate filter queries. Filter queries can be cached separately and can therefore be re-used. See http://wiki.apache.org/solr/FilterQueryGuidance
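To illustrate the filter-query suggestion, the same request can be split so that the clauses that repeat across pages become fq parameters (each cached independently in the filter cache) and only the part that varies stays in q; one possible split, keeping the original values:

q=rbcategory:"ABC XYZ"
fq=sitename:XYZ OR sitename:"All Sites"
fq=localeid:1237400589415
fq=assettype:Gallery
fq=startdate:[* TO 2009-12-07T23:59:00Z] AND enddate:[2009-12-07T00:00:00Z TO *]
rows=9&start=63&sort=date desc
facet=true&facet.field=assettype&facet.mincount=1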
Re: Adaptive search?
On 12/18/09 2:46 AM, Siddhant Goel wrote: Let say we have a search engine (a simple front end - web app kind of a thing - responsible for querying Solr and then displaying the results in a human readable form) based on Solr. If a user searches for something, gets quite a few search results, and then clicks on one such result - is there any mechanism by which we can notify Solr to boost the score/relevance of that particular result in future searches? If not, then any pointers on how to go about doing that would be very helpful. Hi Siddhant. Solr can't do this out of the box. you would need to use a external field and a custom scoring function to do something like this. regards Ian Thanks, On Thu, Dec 17, 2009 at 7:50 PM, Paul Libbrechtp...@activemath.org wrote: What can it mean to adapt to user clicks ? Quite many things in my head. Do you have maybe a citation that inspires you here? paul Le 17-déc.-09 à 13:52, Siddhant Goel a écrit : Does Solr provide adaptive searching? Can it adapt to user clicks within the search results it provides? Or that has to be done externally?
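To expand on the external-field idea: Solr's ExternalFileField keeps a per-document value (for example a boost derived from click logs) in a flat file alongside the index, so it can be refreshed without reindexing and folded into the score with a function query. A sketch, with illustrative names:

<!-- schema.xml -->
<fieldType name="fileFloat" class="solr.ExternalFileField" keyField="id" defVal="1"/>
<field name="click_boost" type="fileFloat" indexed="false" stored="false"/>

The values live in a file named external_click_boost in Solr's data directory, one id=value pair per line:

doc123=1.4
doc456=2.0

A query such as q={!boost b=click_boost}your query then multiplies the normal relevance score by that value.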
RE: Selection of returned fields - dynamic fields?
OK thanks for the reply, fortunately we have now found an approach which avoids storing the field. It would be nice to be able to search for dynamic fields in a way which is consistent with their definition, although I suppose there probably isn't demand for this. Regards, Ian. -Original Message- From: Chris Hostetter [mailto:hossman_luc...@fucit.org] Sent: 09 December 2009 19:36 To: solr-user@lucene.apache.org Cc: Gary Ratcliffe Subject: Re: Selection of returned fields - dynamic fields? : Unfortunately this does not seem to work for dynamic fields - you can definiltely ask for a field that exists because of a dynamicField by name, but you can't use wildcard style patterns in the fl param. : fl=PREFIX* does not return anything, and neither does fl=*POSTFIX. : What seems to be missing from Solr is a removeField(FIELDNAME) method in : SolrJ, or a fl=-FIELDNAME query parameter to remove the fixed field. : : Is such a feature planned, or is there a workaround that I have missed? There's been a lot of discussion about it over the years, the crux of the problem is that it's hard to come up with a good way of dealing with field names using meta characters that doesn't make it hard for people to actaully use those metacharacters in their field names... http://wiki.apache.org/solr/FieldAliasesAndGlobsInParams -Hoss
Selection of returned fields - dynamic fields?
Hi Guys, We need to eliminate one of our stored fields from the Solr response to reduce traffic as it is very bulky and not used externally. I have been experimenting both with fl=FIELDNAME and addField(FIELDNAME) from SolrJ and have found it is possible to achieve this effect for fixed fields by starting with an empty list and adding the field names explicitly in the request. Unfortunately this does not seem to work for dynamic fields - fl=PREFIX* does not return anything, and neither does fl=*POSTFIX. What seems to be missing from Solr is a removeField(FIELDNAME) method in SolrJ, or a fl=-FIELDNAME query parameter to remove the fixed field. Is such a feature planned, or is there a workaround that I have missed? Regards, Ian.
RE: schema-based Index-time field boosting
Aaaargh! OK, I would like a document with (eg.) a title containing a term to score higher than one on (eg.) a summary containing the same term, all other things being equal. You seem to be arguing against field boosting in general, and I don't understand why! May as well let this drop since we don't seem to be talking about the same thing . . . but thanks anyway, Ian. -Original Message- From: Chris Hostetter [mailto:hossman_luc...@fucit.org] Sent: 30 November 2009 23:05 To: solr-user@lucene.apache.org Subject: RE: schema-based Index-time field boosting : I am talking about field boosting rather than document boosting, ie. I : would like some fields (say eg. title) to be louder than others, : across ALL documents. I believe you are at least partially talking : about document boosting, which clearly applies on a per-document basis. index time boosts are all the same -- it doesn't matter if htey are field boosts or document boosts -- a document boost is just a field boost for every field in the document. : If it helps, consider a schema version of the following, from : org.apache.solr.common.SolrInputDocument: : : /** :* Adds a field with the given name, value and boost. If a field with : the name already exists, then it is updated to :* the new value and boost. :* :* @param name Name of the field to add :* @param value Value of the field :* @param boost Boost value for the field :*/ : public void addField(String name, Object value, float boost ) ... : Where a constant boost value is applied consistently to a given field. : That is what I was mistakenly hoping to achieve in the schema. I still : think it would be a good idea BTW. Regards, But now we're right back to what i was trying to explain before: index time boost values like these are only used as a multiplier in the fieldNorm. when included as part of the document data, you can specify a fieldBoost for fieldX of docA that's greater then the boost for fieldX of docB and that will make docA score higher then docB when fieldX contains the same number of matches and is hte same length -- but if you apply a constant boost of B to fieldX for every doc (which is what a feature to hardcode boosts in schema.xml might give you) then the net effect would be zero when scoring docA and docB, because the fieldNorm's for fieldX in both docs would include the exact same multiplier. -Hoss
Re: dismax query syntax to replace standard query
I believe you need to use the fq parameter with dismax (not to be confused with qf) to add a filter query in addition to the q parameter. So your text search value goes in q parameter (which searches on the fields you configure) and the rest of the query goes in the fq. Would that work? On Thu, Dec 3, 2009 at 7:28 PM, javaxmlsoapdev vika...@yahoo.com wrote: I have configured dismax handler to search against both title description fields now I have some other attributes on the page e.g. status, name etc. On the search page I have three fields for user to input search values 1)Free text search field (which searchs against both title description) 2)Status (multi select dropdown) 3)name(single select dropdown) I want to form query like textField1:value AND status:(Male OR Female) AND name:abc. I know first (textField1:value searchs against both title description as that's how I have configured dixmax in the configuration) but not sure how I can AND other attributes (in my case status name) note; standadquery looks like following (w/o using dixmax handler) title:testdescription:testname:JoestatusName:(Male OR Female) -- View this message in context: http://old.nabble.com/dismax-query-syntax-to-replace-standard-query-tp26631725p26631725.html Sent from the Solr - User mailing list archive at Nabble.com.
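In other words, a request along these lines (exact parameter spelling depends on how the dismax handler is registered; the field values are taken from the example above):

q=test
defType=dismax
fq=statusName:(Male OR Female)
fq=name:Joe

q is run through the dismax qf fields (title and description), while each fq is a normal Lucene-syntax filter that restricts, but does not score, the results.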
RE: schema-based Index-time field boosting
Hi Chris, thanks for replying! OK, now I'm going to take the bait ;) I am talking about field boosting rather than document boosting, ie. I would like some fields (say eg. title) to be louder than others, across ALL documents. I believe you are at least partially talking about document boosting, which clearly applies on a per-document basis. If it helps, consider a schema version of the following, from org.apache.solr.common.SolrInputDocument: /** * Adds a field with the given name, value and boost. If a field with the name already exists, then it is updated to * the new value and boost. * * @param name Name of the field to add * @param value Value of the field * @param boost Boost value for the field */ public void addField(String name, Object value, float boost ) { SolrInputField field = _fields.get( name ); if( field == null || field.value == null ) { setField(name, value, boost); } else { field.addValue( value, boost ); } } Where a constant boost value is applied consistently to a given field. That is what I was mistakenly hoping to achieve in the schema. I still think it would be a good idea BTW. Regards, Ian. -Original Message- From: Chris Hostetter [mailto:hossman_luc...@fucit.org] Sent: 23 November 2009 18:34 To: solr-user@lucene.apache.org Subject: RE: schema-based Index-time field boosting : Yeah, like I said, I was mistaken about setting field boost in : schema.xml - doesn't mean it's a bad idea though. At any rate, from : your penultimate sentence I reckon at least one of us is still confused : about field boosting, feel free to reply if you think it's me ;) Yeah ... i think it's you. like i said... : field boosting only makes sense if it's only applied to some of the : documents in the index, if every document has an index time boost on : fieldX, then that boost is meaningless. ...if there was a way to oost fields at index time that was configured in the schema.xml, then every doc would get that boost on it's instances of those fields but the only purpose of index time boosting is to indicate that one document is more significant then another doc -- if every doc gets the same boost, it becomes a No-OP. (think about the math -- field boosts become multipliers in the fieldNorm -- if every doc gets the same multiplier, then there is no net effect) -Hoss Web design and intelligent Content Management. www.twitter.com/gossinteractive Registered Office: c/o Bishop Fleming, Cobourg House, Mayflower Street, Plymouth, PL1 1LG. Company Registration No: 3553908 This email contains proprietary information, some or all of which may be legally privileged. It is for the intended recipient only. If an addressing or transmission error has misdirected this email, please notify the author by replying to this email. If you are not the intended recipient you may not use, disclose, distribute, copy, print or rely on this email. Email transmission cannot be guaranteed to be secure or error free, as information may be intercepted, corrupted, lost, destroyed, arrive late or incomplete or contain viruses. This email and any files attached to it have been checked with virus detection software before transmission. You should nonetheless carry out your own virus check before opening any attachment. GOSS Interactive Ltd accepts no liability for any loss or damage that may be caused by software viruses.
Re: Question about lat/long data type in localsolr
Heya... I think you need to use the newer types in your schema.xml, IE field name=lat type=tdouble indexed=true stored=true/ field name=lng type=tdouble indexed=true stored=true/ field name=geo_distance type=tdouble/ as doubles are no longer index-compatible (AFAIK) To use the above, make sure you have the tdouble types declared with fieldType name=tdouble class=solr.TrieDoubleField precisionStep=8 omitNorms=true positionIncrementGap=0/ in your types section. HTH Ian. 2009/11/21 Bertie Shen bertie.s...@gmail.com: Hey everyone, I used localsolr and locallucene to do local search. But I could not make longitude and latitude successfully indexed. During DataImport process, there is an exception. Do you have some ideas about it? I copy solrconfig.xml and schema.xml from your http://www.gissearch.com/localsolr. Only change I made is to replace names lat and lng by latitude and longtitude respectively, which are field name in my index. Should str in str name=latFieldlat/str should be replaced by double according to the following exception? Thanks. Solr log about Exception. exception messagejava.lang.ClassCastException: java.lang.Double cannot be cast to java.lan\ g.String/message frame classcom.pjaol.search.solr.update.LocalUpdaterProcessor/class methodprocessAdd/method line136/line /frame frame classorg.apache.solr.handler.dataimport.SolrWriter/class methodupload/method line75/line /frame frame classorg.apache.solr.handler.dataimport.DataImportHandler$1/class methodupload/method line292/line /frame frame classorg.apache.solr.handler.dataimport.DocBuilder/class methodbuildDocument/method line392/line /frame frame classorg.apache.solr.handler.dataimport.DocBuilder/class methoddoFullDump/method line242/line /frame frame classorg.apache.solr.handler.dataimport.DocBuilder/class methodexecute/method line180/line /frame frame classorg.apache.solr.handler.dataimport.DataImporter/class methoddoFullImport/method line331/line /frame frame classorg.apache.solr.handler.dataimport.DataImporter/class methodrunCmd/method line389/line /frame frame classorg.apache.solr.handler.dataimport.DataImporter$1/class methodrun/method line370/line /frame /exception How do I set up local indexing Here is what I have done to set up local indexing. 1) Download localsolr. I download it from http://developer.k-int.com/m2snapshots/localsolr/localsolr/1.5/ and put jar file (in my case, localsolr-1.5.jar) in your application's WEB_INF/lib directory of application server. 2) Download locallucene. I download it from http://sourceforge.net/projects/locallucene/ and put jar file (in my case, locallucene.jar in locallucene_r2.0/dist/ diectory) in your application's WEB_INF/lib directory of application server. I also need to copy gt2-referencing-2.3.1.jar, geoapi-nogenerics-2.1-M2.jar, and jsr108-0.01.jar under locallucene_r2.0/lib/ directory to WEB_INF/lib. Do not copy lucene-spatial-2.9.1.jar under Lucene codebase. The namespace has been changed from com.pjaol.blah.blah.blah to org.apache.blah blah. 3) Update your solrconfig.xml and schema.xml. I copy it from http://www.gissearch.com/localsolr.
RE: schema-based Index-time field boosting
Hi David, thanks for replying, The field boost attribute was put there by me back in the 1.3 days, when I somehow gained the mistaken impression that it was supposed to work! Of course, despite a lot of searching I haven't been able to find anything to back up my position ;) Unfortunately our code (intentionally) has no idea what index it is writing to so only a schema-based approach is really going to work for us. Of course, by now I am convinced that this might be a really good feature - I might get the chance to look into it in the near future - can anyone think of reasons why this might not work in practice? Regards, Ian. -Original Message- From: Smiley, David W. [mailto:dsmi...@mitre.org] Sent: 19 November 2009 19:29 To: solr-user@lucene.apache.org Subject: Re: Index-time field boosting not working? Hi Ian. Thanks for buying my book. The boost attribute goes on the field for the XML message you're sending to Solr. In your example you mistakenly placed it in the schema. FYI I use index time boosting as well as query time boosting. Although index time boosting isn't something I can change on a whim, I've found it to be far easier to control the scoring than say function queries which would be the query time substitute if the boost is a function of particular field values. ~ David Smiley Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/ On Nov 18, 2009, at 6:40 AM, Ian Smith wrote: I have the following field configured in schema.xml: field name=title type=text indexed=true stored=true omitNorms=false boost=3.0 / Where text is the type which came with the Solr distribution. I have not been able to get this configuration to alter any document scores, and if I look at the indexes in Luke there is no change in the norms (compared to an un-boosted equivalent). I have confirmed that document boosting works (via SolrJ), but our field boosting needs to be done in the schema. Am I doing something wrong (BTW I have tried using 3.0f as well, no difference)? Also, I have seen no debug output during startup which would indicate that fild boosting is being configured - should there be any? I have found no usage examples of this in the Solr 1.4 book, except a vague discouragement - is this a deprecated feature? TIA, Ian Web design and intelligent Content Management. www.twitter.com/gossinteractive Registered Office: c/o Bishop Fleming, Cobourg House, Mayflower Street, Plymouth, PL1 1LG. Company Registration No: 3553908 This email contains proprietary information, some or all of which may be legally privileged. It is for the intended recipient only. If an addressing or transmission error has misdirected this email, please notify the author by replying to this email. If you are not the intended recipient you may not use, disclose, distribute, copy, print or rely on this email. Email transmission cannot be guaranteed to be secure or error free, as information may be intercepted, corrupted, lost, destroyed, arrive late or incomplete or contain viruses. This email and any files attached to it have been checked with virus detection software before transmission. You should nonetheless carry out your own virus check before opening any attachment. GOSS Interactive Ltd accepts no liability for any loss or damage that may be caused by software viruses.
Solr Cell text extraction
Hi Guys, I am trying to use Solr Cell to extract body content from documents, and also to pass along some literal field values. Trouble is, some of the literal fields contain spaces, colons etc. which cause a bad request exception in the server. However, if I URL encode these fields the encoding is not stripped away, so it is still present in search responses. Is there a way to pass literal values containing non-URL safe characters to Solr Cell? Regards, Ian.
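For reference, literal values are normally passed as literal.<fieldname> request parameters; they are URL-encoded on the wire but arrive decoded at the handler, so the stored values should not contain %-escapes. A sketch with curl (handler path and field names assume the stock extraction setup):

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&literal.title=Quarterly%20Report%3A%20Q3&commit=true" -F "myfile=@report.pdf"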
RE: Solr Cell text extraction - non-issue
Sorry guys, the bad request seemed to be caused elsewhere, no need to URL encode now. Ian. -Original Message- From: Ian Smith [mailto:ian.sm...@gossinteractive.com] Sent: 20 November 2009 15:26 To: solr-user@lucene.apache.org Subject: Solr Cell text extraction Hi Guys, I am trying to use Solr Cell to extract body content from documents, and also to pass along some literal field values. Trouble is, some of the literal fields contain spaces, colons etc. which cause a bad request exception in the server. However, if I URL encode these fields the encoding is not stripped away, so it is still present in search responses. Is there a way to pass literal values containing non-URL safe characters to Solr Cell? Regards, Ian.
RE: Index-time field boosting not working?
Hi Otis, thanks for replying, Well I'm pretty sure it was there (and documented) in the 1.3 era. Strangely, it is still accepted in the Eclipse HTML editor, even for attribute completion (if you can, try it). But if it is truly deprecated, we will have to reassess part of our system design :( If you or anyone else here has any historical perspective on this, I'd be interested to hear about it. Regards, Ian, -Original Message- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: 18 November 2009 22:55 To: solr-user@lucene.apache.org Subject: Re: Index-time field boosting not working? Can boost attribute really be specified for a field in the schema? I wasn't aware of that, and I don't see it on http://wiki.apache.org/solr/SchemaXml . Maybe you are mixing it with http://wiki.apache.org/solr/UpdateXmlMessages#Optional_attributes_for_.2 2field.22 ? Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR - Original Message From: Ian Smith ian.sm...@gossinteractive.com To: solr-user@lucene.apache.org Sent: Wed, November 18, 2009 6:40:11 AM Subject: Index-time field boosting not working? I have the following field configured in schema.xml: omitNorms=false boost=3.0 / Where text is the type which came with the Solr distribution. I have not been able to get this configuration to alter any document scores, and if I look at the indexes in Luke there is no change in the norms (compared to an un-boosted equivalent). I have confirmed that document boosting works (via SolrJ), but our field boosting needs to be done in the schema. Am I doing something wrong (BTW I have tried using 3.0f as well, no difference)? Also, I have seen no debug output during startup which would indicate that fild boosting is being configured - should there be any? I have found no usage examples of this in the Solr 1.4 book, except a vague discouragement - is this a deprecated feature? TIA, Ian Web design and intelligent Content Management. www.twitter.com/gossinteractive Registered Office: c/o Bishop Fleming, Cobourg House, Mayflower Street, Plymouth, PL1 1LG. Company Registration No: 3553908 This email contains proprietary information, some or all of which may be legally privileged. It is for the intended recipient only. If an addressing or transmission error has misdirected this email, please notify the author by replying to this email. If you are not the intended recipient you may not use, disclose, distribute, copy, print or rely on this email. Email transmission cannot be guaranteed to be secure or error free, as information may be intercepted, corrupted, lost, destroyed, arrive late or incomplete or contain viruses. This email and any files attached to it have been checked with virus detection software before transmission. You should nonetheless carry out your own virus check before opening any attachment. GOSS Interactive Ltd accepts no liability for any loss or damage that may be caused by software viruses.
Index-time field boosting not working?
I have the following field configured in schema.xml:

  <field name="title" type="text" indexed="true" stored="true" omitNorms="false" boost="3.0" />

where "text" is the type which came with the Solr distribution. I have not been able to get this configuration to alter any document scores, and if I look at the indexes in Luke there is no change in the norms (compared to an un-boosted equivalent). I have confirmed that document boosting works (via SolrJ), but our field boosting needs to be done in the schema. Am I doing something wrong (BTW I have tried using 3.0f as well, no difference)? Also, I have seen no debug output during startup which would indicate that field boosting is being configured - should there be any? I have found no usage examples of this in the Solr 1.4 book, except a vague discouragement - is this a deprecated feature? TIA, Ian
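As the wiki pages referenced in the reply above suggest, the boost attribute is documented on the update message rather than on the schema field. A minimal sketch of supplying the same 3.0 boost at index time with 1.4-era SolrJ (the three-argument addField was removed in much later releases); the Solr URL and the id field are placeholders, and the field still needs omitNorms="false" in the schema for the boost to show up in the norms:

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class BoostAtIndexTime {
    public static void main(String[] args) throws Exception {
      SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-1");
      // Field-level boost supplied with the update rather than in schema.xml;
      // equivalent to <field name="title" boost="3.0">...</field> inside an
      // <add> XML message.
      doc.addField("title", "Index-time boosting example", 3.0f);

      solr.add(doc);
      solr.commit();
    }
  }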
Re: Problems downloading lucene 2.9.1
Heya Ryan... For me the big problem with adding http://people.apache.org/~mikemccand/staging-area/rc3_lucene2.9.1/maven/ to my build config is that the artifact names of the interim release are the same as the final artifacts will be, thus once they are copied to a local repo Maven won't bother to go looking for more recent versions, even if you blow away that temporary repo. Would it be possible to publish tagged rc-N releases to a public and more permanent repository where people can reference them and upgrade to the final release when it's available? Just a thought, cheers for all your hard work. Ian.

2009/11/2 Ryan McKinley ryan...@gmail.com

On Nov 2, 2009, at 8:29 AM, Grant Ingersoll wrote:

On Nov 2, 2009, at 12:12 AM, Licinio Fernández Maurelo wrote:
Hi folks, as we are using a snapshot dependency on solr 1.4, today we are getting problems when Maven tries to download lucene 2.9.1 (there isn't any 2.9.1 there). Which repository can I use to download it?

They won't be there until 2.9.1 is officially released. We are trying to speed up the Solr release by piggybacking on the Lucene release, but this little bit is the one downside. Until then, you can add a repo to: http://people.apache.org/~mikemccand/staging-area/rc3_lucene2.9.1/maven/
LocalSolr, Maven, build files and release candidates (Just for info) and spatial radius (A question)
Hallo All. I've been trying to prepare a project using localsolr for the impending (I hope) arrival of Solr 1.4 and Lucene 2.9.1. Here are some notes in case anyone else is suffering similarly. Obviously everything here may change by next week.

First problem has been the lack of any stable Maven-based Lucene and Solr artifacts to wire into my poms. Because of that, and as an interim-only measure, I've built the latest branches of the Lucene 2.9.1 and Solr 1.4 trees and made them into a *temporary* Maven repository at http://developer.k-int.com/m2snapshots/. In there you can find all the jar artifacts tagged as xxx-ki-rc1 (for Solr) and xxx-ki-rc3 (for Lucene) and, finally, a localsolr.localsolr build tagged as 1.5.2-rc1. Sorry for the naming, but I don't want these artifacts to clash with the real ones when they come along. This is really just for my own use, but I've seen messages and spoken to people who are really struggling to get their Maven deps right; if this helps anyone, please feel free to use these until the real Apache artifacts appear. I can't take any responsibility for their quality. All the poms have been altered to look for the correct dependent artifacts in the same repository, so adding the stanza

  <!-- Emergency repository for storing interim builds of lucene and solr whilst they sort their act out -->
  <repositories>
    <repository>
      <id>k-int-m2-snapshots</id>
      <name>K-int M2 Snapshots</name>
      <url>http://developer.k-int.com/m2snapshots/</url>
      <releases>
        <enabled>true</enabled>
      </releases>
    </repository>
  </repositories>

to your pom will let you use these deps temporarily until we see an official build. If you're a Maven developer and I've gone way around the houses with this, please tell me of an easier solution :) This repo *will* go away when the real builds turn up. The localsolr in this repo also contains the patches I've submitted (a good while ago) to the localsolr project to make it build with the Lucene 2.9.1 rc3, as the downloadable dist is currently built against an older 2.9 release that had a different API (i.e. it won't work with the new Lucene and Solr). All this means that there is a working localsolr build.

Second up, I've also seen emails (and seen the exception myself) asking about the following when trying to get all these revisions working together:

  java.lang.NumberFormatException: Invalid shift value in prefixCoded string (is encoded value really a LONG?)

There are some threads out there telling you that the Lucene indexes are not binary compatible between versions, but if you're using localsolr, what you really need to know is:

1) Make sure that your schema.xml contains at least the following fieldType def:

  <fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>

2) Convert your old Solr sdouble fields to tdoubles:

  <field name="lat" type="tdouble" indexed="true" stored="true"/>
  <field name="lng" type="tdouble" indexed="true" stored="true"/>
  <dynamicField name="_local*" type="tdouble" indexed="true" stored="true"/>

Pretty sure you would need to rebuild your indexes (a minimal SolrJ sketch of that step follows after this message).

Ok, with those changes I managed to get a working spatial search. My only problem now is that the radius param on the command line seems to need to be way bigger than it should be in order to find anything. Specifically, if I search with a radius of 220 I get a record back which marks its geo_distance as 83.76888211666025. Shuffling the radius around, a radius of 205 returns that doc; at 204 it's filtered out.
I'm going to dig into this now, but if anyone knows about it I'd really appreciate any help. Cheers all, hope this is of use to someone out there; corrections and comments very welcome. Best, Ian.
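A minimal sketch of the reindexing step mentioned above, i.e. re-adding documents so the lat/lng values are written with the new tdouble type. This assumes 1.4-era SolrJ; the Solr URL, the id field and the coordinate values are placeholders, and in practice every document would be re-fed from the source system:

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class ReindexWithTrieFields {
    public static void main(String[] args) throws Exception {
      SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

      // After switching lat/lng from sdouble to tdouble the index must be
      // rebuilt, so each document is re-added with the new field types.
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "place-1");   // placeholder id
      doc.addField("lat", 50.3715);    // tdouble fields from the schema above
      doc.addField("lng", -4.1427);    // sample coordinates only

      solr.add(doc);
      solr.commit();
    }
  }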
Re: Solr via ruby
Hi, Thanks for the discussion. We use the distributed option so I am not sure embedded is possible. As you also guessed, we use haproxy for load balancing and failover between replicas of the shards, so giving this up for a minor performance boost is probably not wise. So essentially we have:

  User -> HTTP Load Balancer -> Mongrel Cluster -> Haproxy -> N x Solr Shards

and it looks like that is the standard setup for performance from what you suggest here; most of the performance tweaks I thought of are already in use. Ian.

On Fri, Sep 18, 2009 at 3:09 AM, Erik Hatcher erik.hatc...@gmail.com wrote:

On Sep 18, 2009, at 1:09 AM, rajan chandi wrote:
We are planning to use the external Solr on Tomcat for scalability reasons. We thought that EmbeddedSolrServer uses HTTP too to talk with Ruby and vice versa, as in the acts_as_solr ruby plugin.

EmbeddedSolrServer is a way to run Solr as an API (like Lucene) rather than with any web container involved at all. In other words, only Java can use EmbeddedSolrServer (which means JRuby works great). The acts_as_solr plugin uses the solr-ruby library to communicate with Solr. Under solr-ruby, it's HTTP with ruby (wt=ruby) formatted responses for searches, and documents being indexed get converted to Solr's XML format and POSTed to the Solr URL used to open the Solr::Connection. Erik

If Ruby is not using HTTP to talk to EmbeddedSolrServer, what is it using? Thanks and Regards Rajan Chandi

On Thu, Sep 17, 2009 at 9:44 PM, Erik Hatcher erik.hatc...@gmail.com wrote:

On Sep 17, 2009, at 11:40 AM, Ian Connor wrote:
Is there any support for connection pooling or a more optimized data exchange format?

The solr-ruby library (as do other Solr + Ruby libraries) uses the ruby response format and evals it. solr-ruby supports keeping the HTTP connection alive too.

We are looking at any further ways to optimize the solr queries so we can possibly make more of them in the one request. The JSON-like format seems pretty tight, but I understand when the distributed search takes place it uses a binary protocol instead of text. I wanted to know if that was available or could be available via the ruby library. Is it possible to host a local shard and skip HTTP between ruby and solr?

If you use JRuby you can do some fancy stuff, like use the javabin update and response formats so no XML is involved, and you could also use Solr's EmbeddedSolrServer to avoid HTTP. However, in practice HTTP is rarely the bottleneck and actually offers a lot of advantages, such as easy commodity load balancing and caching. But JRuby + Solr is a very beautiful way to go! If you're using MRI Ruby, though, you don't really have any options other than to go over HTTP. You could use json or ruby formatted responses - I'd be curious to see some performance numbers comparing those two. Erik

--
Regards, Ian Connor
1 Leighton St #723
Cambridge, MA 02141
Call Center Phone: +1 (714) 239 3875 (24 hrs)
Fax: +1 (770) 818 5697
Skype: ian.connor
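To make the EmbeddedSolrServer point above concrete: it runs Solr in-process as a plain Java API, with no servlet container and no HTTP, which is why it is only reachable from inside a JVM (hence JRuby, not MRI Ruby). A minimal 1.4-era sketch; the solr home path is a placeholder, and the CoreContainer initialization API changed in later Solr versions:

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.core.CoreContainer;

  public class EmbeddedExample {
    public static void main(String[] args) throws Exception {
      // Placeholder solr home; no Tomcat/Jetty and no HTTP are involved.
      System.setProperty("solr.solr.home", "/path/to/solr/home");
      CoreContainer.Initializer initializer = new CoreContainer.Initializer();
      CoreContainer container = initializer.initialize();

      // Empty core name selects the default core in a single-core setup.
      SolrServer solr = new EmbeddedSolrServer(container, "");

      QueryResponse rsp = solr.query(new SolrQuery("*:*"));
      System.out.println("hits: " + rsp.getResults().getNumFound());

      container.shutdown();
    }
  }

For the distributed, haproxy-balanced setup described above, staying on HTTP is the natural fit; embedded only makes sense when the index lives in the same JVM as the application.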