Re: 8 Shards of Cloud with 4.10.3.
On 2/25/2015 5:50 AM, Benson Margulies wrote: So, found the following line in the guide: java -DzkRun -DnumShards=2 -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -jar start.jar using a completely clean, new, solr_home. In my own bootstrap dir, I have my own solrconfig.xml and schema.xml, and I modified to have: -DnumShards=8 -DmaxShardsPerNode=8 When I went to start loading data into this, I failed: Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: No registered leader was found after waiting for 4000ms , collection: rni slice: shard4 at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:554) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206) at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124) at org.apache.solr.client.solrj.SolrServer.deleteByQuery(SolrServer.java:285) at org.apache.solr.client.solrj.SolrServer.deleteByQuery(SolrServer.java:271) at com.basistech.rni.index.internal.SolrCloudEvaluationNameIndex.init(SolrCloudEvaluationNameIndex.java:53) with corresponding log traffic in the solr log. The cloud page in the Solr admin app shows the IP address in green. It's a bit hard to read in general, it's all squished up to the top. The way I would do it would be to start Solr *only* with the zkHost parameter. If you're going to use embedded zookeeper, I guess you would use zkRun instead. Once I had Solr running in cloud mode, I would upload the config to zookeeper using zkcli, and create the collection using the Collections API, including things like numShards and maxShardsPerNode on that CREATE call, not as startup properties. Then I would completely reindex my data into the new collection. It's a whole lot cleaner than trying to convert non-cloud to cloud and split shards. Thanks, Shawn
Re: Problem with queries that includes NOT
As a general proposition, your first stop with any query interpretation questions should be to add the debugQuery=true parameter and look at the parsed_query in the query response, which shows how the query is really interpreted. -- Jack Krupansky On Wed, Feb 25, 2015 at 8:21 AM, david.dav...@correo.aeat.es wrote: Hi Shawn, thank you for your quick response. I will read your links and make some tests. Regards, David Dávila DIT - 915828763 De: Shawn Heisey apa...@elyograg.org Para: solr-user@lucene.apache.org, Fecha: 25/02/2015 13:23 Asunto: Re: Problem with queries that includes NOT On 2/25/2015 4:04 AM, david.dav...@correo.aeat.es wrote: We have problems with some queries. All of them include the tag NOT, and in my opinion, the results don´t make any sense. First problem: This query NOT Proc:ID01returns 95806 results, however this one NOT Proc:ID01 OR FileType:PDF_TEXT returns 11484 results. But it's impossible that adding a tag OR the query has less number of results. Second problem. Here the problem is because of the brackets and the NOT tag: This query: (NOT Proc:ID01 AND NOT FileType:PDF_TEXT) AND sys_FileType:PROTOTIPE returns 0 documents. But this query: (NOT Proc:ID01 AND NOT FileType:PDF_TEXT AND sys_FileType:PROTOTIPE) returns 53 documents, which is correct. So, the problem is the position of the bracket. I have checked the same query without NOTs, and it works fine returning the same number of results in both cases. So, I think the problem is the combination of the bracket positions and the NOT tag. For the first query, there is a difference between NOT condition1 OR condition2 and NOT (condition1 OR condition2) ... I can imagine the first one increasing the document count compared to just NOT condition1 ... the second one wouldn't increase it. Boolean queries in Solr (and very likely Lucene as well) do not always do what people expect. http://robotlibrarian.billdueber.com/2011/12/solr-and-boolean-operators/ https://lucidworks.com/blog/why-not-and-or-and-not/ As mentioned in the second link above, you'll get better results if you use the prefix operators with explicit parentheses. One word of warning, though -- the prefix operators do not work correctly if you change the default operator to AND. Thanks, Shawn
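For reference, a minimal sketch of such a debug request (host, collection and the remaining parameters are placeholders, not taken from the thread):

curl "http://localhost:8983/solr/collection1/select?q=NOT+Proc:ID01+OR+FileType:PDF_TEXT&rows=0&debugQuery=true&wt=json"

The parsedquery and parsedquery_toString entries in the debug section show which clauses ended up as MUST, SHOULD or MUST_NOT, which usually explains surprising hit counts like the ones above.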
Re: Stop solr query
Moshe, if you take a thread dump while a particular query is stuck (via jstack or in the SolrAdmin tab), it may explain where exactly it's stalled; just check the longest stack trace. FWIW, in 4.x timeAllowed is checked only while documents are collected; in 5.0 it's also checked during query expansion (see http://lucidworks.com/blog/solr-5-0/ and https://issues.apache.org/jira/browse/SOLR-5986, which cuts off requests during the query-expansion stage as well). However, I'm not sure long query expansion is what's happening with hon-synonyms. On Wed, Feb 25, 2015 at 3:21 PM, Moshe Recanati mos...@kmslh.com wrote: Hi Shawn, We checked this option and it didn't solve our problem. We're using https://github.com/healthonnet/hon-lucene-synonyms for query based synonyms. While running query with high number of words that have high number of synonyms the query got stuck and solr memory is exhausted. We tried to use this parameter suggested by you however it didn't stop the query and solve the issue. Please let me know if there is other option to tackle it. Today it might be high number of words that cause the issue and tomorrow it might be other something wrong. We can't rely only on user input check. Thank you in advance. Regards, Moshe Recanati SVP Engineering Office + 972-73-2617564 Mobile + 972-52-6194481 Skype: recanati More at: www.kmslh.com | LinkedIn | FB -Original Message- From: Shawn Heisey [mailto:apa...@elyograg.org] Sent: Monday, February 23, 2015 5:49 PM To: solr-user@lucene.apache.org Subject: Re: Stop solr query On 2/23/2015 7:23 AM, Moshe Recanati wrote: Recently there were some scenarios in which queries that user sent to solr got stuck and increased our solr heap. Is there any option to kill or timeout query that wasn't returned from solr by external command? The best thing you can do is examine all user input and stop such queries before they execute, especially if they are the kind of query that will cause your heap to grow out of control. The timeAllowed parameter can abort a query that takes too long in certain phases of the query. In recent months, Solr has been modified so that timeAllowed will take effect during more query phases. It is not a perfect solution, but it can be better than nothing. http://wiki.apache.org/solr/CommonQueryParameters#timeAllowed Be aware that sometimes legitimate queries will be slow, and using timeAllowed may cause those queries to fail. Thanks, Shawn -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
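A minimal sketch of grabbing such a thread dump from the shell (the pgrep pattern assumes Solr was started with start.jar, as earlier in this thread; adjust it to match your process):

SOLR_PID=$(pgrep -f start.jar)
jstack "$SOLR_PID" > solr-threads.txt

The same information is available in the admin UI under the Thread Dump screen; the longest stack trace is the one to look at first.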
Connect Solr with ODBC to Excel
Hi there, I'm looking for a library to connect Solr through ODBC to Excel in order to do some reporting on my Solr data. Does anybody know of a library for that? Thanks. -- Cordialement, Best regards, Hakim Benoudjit
Re: 8 Shards of Cloud with 4.10.3.
On Wed, Feb 25, 2015 at 8:04 AM, Shawn Heisey apa...@elyograg.org wrote: On 2/25/2015 5:50 AM, Benson Margulies wrote: So, found the following line in the guide: java -DzkRun -DnumShards=2 -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -jar start.jar using a completely clean, new, solr_home. In my own bootstrap dir, I have my own solrconfig.xml and schema.xml, and I modified to have: -DnumShards=8 -DmaxShardsPerNode=8 When I went to start loading data into this, I failed: Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: No registered leader was found after waiting for 4000ms , collection: rni slice: shard4 at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:554) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206) at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124) at org.apache.solr.client.solrj.SolrServer.deleteByQuery(SolrServer.java:285) at org.apache.solr.client.solrj.SolrServer.deleteByQuery(SolrServer.java:271) at com.basistech.rni.index.internal.SolrCloudEvaluationNameIndex.init(SolrCloudEvaluationNameIndex.java:53) with corresponding log traffic in the solr log. The cloud page in the Solr admin app shows the IP address in green. It's a bit hard to read in general, it's all squished up to the top. The way I would do it would be to start Solr *only* with the zkHost parameter. If you're going to use embedded zookeeper, I guess you would use zkRun instead. Once I had Solr running in cloud mode, I would upload the config to zookeeper using zkcli, and create the collection using the Collections API, including things like numShards and maxShardsPerNode on that CREATE call, not as startup properties. Then I would completely reindex my data into the new collection. It's a whole lot cleaner than trying to convert non-cloud to cloud and split shards. Shawn, I _am_ starting from clean. However, I didn't find a recipe for what you suggest as a process, and (following Hoss' suggestion) I found the recipe above with the boostrap_confdir scheme. I am mostly confused as to how I supply my solrconfig.xml and schema.xml when I follow the process you are suggesting. I know I'm verging on vampirism here, but if you could possibly find the time to turn your paragraph into either a pointer to a recipe or the command lines in a bit more detail, I'd be exceedingly grateful. Thanks, benson Thanks, Shawn
Re: Stop solr query
On 2/25/2015 5:21 AM, Moshe Recanati wrote: We checked this option and it didn't solve our problem. We're using https://github.com/healthonnet/hon-lucene-synonyms for query based synonyms. While running query with high number of words that have high number of synonyms the query got stuck and solr memory is exhausted. We tried to use this parameter suggested by you however it didn't stop the query and solve the issue. Please let me know if there is other option to tackle it. Today it might be high number of words that cause the issue and tomorrow it might be other something wrong. We can't rely only on user input check. If legitimate queries use a lot of memory, you'll either need to increase the java heap so it can deal with the increased memory requirements, or you'll have to take steps to decrease memory usage. Those steps might include changes to your application code to detect problematic queries before they happen, and/or educating your users about how to properly use the search. Lucene and Solr are constantly making advances in memory efficiency, so making sure you're always on the latest version goes a long way towards keeping Solr efficient. Thanks, Shawn
Re: Facet on TopDocs
Hi, The facet component works with the whole result set, so you can't get the facets for your topN documents. A naive way to fulfill your requirement is to implement it in two steps: - Request your data and recover the doc ids. - Create a new query with the selected ids (id:id1 OR id:id2 OR ... OR id:id100) and facet over the result. Regards. On Wed, Feb 25, 2015 at 10:34 AM, kakes junkkak...@gmail.com wrote: We are trying to limit the number of facets returned only to the top 100 docs and not the complete result set.. Is there a way of accessing topDocs in the custom Faceting component? or Can the scores of the docID's in the resultset be accessed in the Facet Component? -- View this message in context: http://lucene.472066.n3.nabble.com/Facet-on-TopDocs-tp4188767.html Sent from the Solr - User mailing list archive at Nabble.com.
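A rough sketch of the second step, assuming the first query returned the ids 1, 2 and 3 and that the facets are wanted on a hypothetical category field:

curl "http://localhost:8983/solr/collection1/select?q=id:(1+OR+2+OR+3)&rows=0&facet=true&facet.field=category&wt=json"

With the top 100 ids the URL gets long, so it may be safer to send the id clause as a POST body or as a filter query rather than on the query string.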
Re: Problem with queries that includes NOT
Hi Shawn, thank you for your quick response. I will read your links and make some tests. Regards, David Dávila DIT - 915828763 De: Shawn Heisey apa...@elyograg.org Para: solr-user@lucene.apache.org, Fecha: 25/02/2015 13:23 Asunto: Re: Problem with queries that includes NOT On 2/25/2015 4:04 AM, david.dav...@correo.aeat.es wrote: We have problems with some queries. All of them include the tag NOT, and in my opinion, the results don´t make any sense. First problem: This query NOT Proc:ID01returns 95806 results, however this one NOT Proc:ID01 OR FileType:PDF_TEXT returns 11484 results. But it's impossible that adding a tag OR the query has less number of results. Second problem. Here the problem is because of the brackets and the NOT tag: This query: (NOT Proc:ID01 AND NOT FileType:PDF_TEXT) AND sys_FileType:PROTOTIPE returns 0 documents. But this query: (NOT Proc:ID01 AND NOT FileType:PDF_TEXT AND sys_FileType:PROTOTIPE) returns 53 documents, which is correct. So, the problem is the position of the bracket. I have checked the same query without NOTs, and it works fine returning the same number of results in both cases. So, I think the problem is the combination of the bracket positions and the NOT tag. For the first query, there is a difference between NOT condition1 OR condition2 and NOT (condition1 OR condition2) ... I can imagine the first one increasing the document count compared to just NOT condition1 ... the second one wouldn't increase it. Boolean queries in Solr (and very likely Lucene as well) do not always do what people expect. http://robotlibrarian.billdueber.com/2011/12/solr-and-boolean-operators/ https://lucidworks.com/blog/why-not-and-or-and-not/ As mentioned in the second link above, you'll get better results if you use the prefix operators with explicit parentheses. One word of warning, though -- the prefix operators do not work correctly if you change the default operator to AND. Thanks, Shawn
Drop obsolete KEYS files from dist site
Hi, folks. Currently the KEYS file is present in: - www.apache.org/dist/lucene/solr/version/KEYS - www.apache.org/dist/lucene/solr/KEYS - www.apache.org/dist/lucene/KEYS The last two KEYS files are obsolete (both last modified in Feb 2014). Some of the keys actually used for the release process aren't present in them. I think it would be good to drop them to avoid their being used for release artifact verification. -- Best regards, Konstantin Gribov
RE: Stop solr query
Hi Shawn, We checked this option and it didn't solve our problem. We're using https://github.com/healthonnet/hon-lucene-synonyms for query based synonyms. While running query with high number of words that have high number of synonyms the query got stuck and solr memory is exhausted. We tried to use this parameter suggested by you however it didn't stop the query and solve the issue. Please let me know if there is other option to tackle it. Today it might be high number of words that cause the issue and tomorrow it might be other something wrong. We can't rely only on user input check. Thank you in advance. Regards, Moshe Recanati SVP Engineering Office + 972-73-2617564 Mobile + 972-52-6194481 Skype : recanati More at: www.kmslh.com | LinkedIn | FB -Original Message- From: Shawn Heisey [mailto:apa...@elyograg.org] Sent: Monday, February 23, 2015 5:49 PM To: solr-user@lucene.apache.org Subject: Re: Stop solr query On 2/23/2015 7:23 AM, Moshe Recanati wrote: Recently there were some scenarios in which queries that user sent to solr got stuck and increased our solr heap. Is there any option to kill or timeout query that wasn't returned from solr by external command? The best thing you can do is examine all user input and stop such queries before they execute, especially if they are the kind of query that will cause your heap to grow out of control. The timeAllowed parameter can abort a query that takes too long in certain phases of the query. In recent months, Solr has been modified so that timeAllowed will take effect during more query phases. It is not a perfect solution, but it can be better than nothing. http://wiki.apache.org/solr/CommonQueryParameters#timeAllowed Be aware that sometimes legitimate queries will be slow, and using timeAllowed may cause those queries to fail. Thanks, Shawn
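For reference, a sketch of what passing timeAllowed looks like on a request (host, collection, field and the 5000 ms budget are all made-up values):

curl "http://localhost:8983/solr/collection1/select?q=subject:report&timeAllowed=5000&wt=json"

When the budget is exceeded in a phase that honours it, the response header carries partialResults=true, so the client can at least detect that the result is incomplete.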
solr index time boosting.
Hi all, We are trying to deboost some documents while indexing, depending on their text, with something like this:

<doc boost="0.03">
  <field name="pns"><![CDATA[Testing product - Water Bottle. Testing product - Water Bottle. Testing product - Water Bottle. Testing product - Water Bottle. Testing product - Water Bottle. Testing product - Water Bottle. Testing product - Water Bottle. Testing product - Water Bottle. Testing product - Water Bottle. Testing product - Water Bottle. Testing product - Water Bottle. Testing product - Water Bottle. Testing product - Water Bottle. Testing product - Water Bottle. Testing product - Water Bottle. ]]></field>
</doc>

My questions: 1. Are we going right or not? 2. In order to get the difference, how can we deboost this kind of documents? (Any other way to deboost?) Thanks in advance. -- ckreddybh. chaitu...@gmail.com
Re: Problem with queries that includes NOT
On 2/25/2015 4:04 AM, david.dav...@correo.aeat.es wrote: We have problems with some queries. All of them include the tag NOT, and in my opinion, the results don´t make any sense. First problem: This query NOT Proc:ID01returns 95806 results, however this one NOT Proc:ID01 OR FileType:PDF_TEXT returns 11484 results. But it's impossible that adding a tag OR the query has less number of results. Second problem. Here the problem is because of the brackets and the NOT tag: This query: (NOT Proc:ID01 AND NOT FileType:PDF_TEXT) AND sys_FileType:PROTOTIPE returns 0 documents. But this query: (NOT Proc:ID01 AND NOT FileType:PDF_TEXT AND sys_FileType:PROTOTIPE) returns 53 documents, which is correct. So, the problem is the position of the bracket. I have checked the same query without NOTs, and it works fine returning the same number of results in both cases. So, I think the problem is the combination of the bracket positions and the NOT tag. For the first query, there is a difference between NOT condition1 OR condition2 and NOT (condition1 OR condition2) ... I can imagine the first one increasing the document count compared to just NOT condition1 ... the second one wouldn't increase it. Boolean queries in Solr (and very likely Lucene as well) do not always do what people expect. http://robotlibrarian.billdueber.com/2011/12/solr-and-boolean-operators/ https://lucidworks.com/blog/why-not-and-or-and-not/ As mentioned in the second link above, you'll get better results if you use the prefix operators with explicit parentheses. One word of warning, though -- the prefix operators do not work correctly if you change the default operator to AND. Thanks, Shawn
Re: 8 Shards of Cloud with 4.10.3.
So, found the following line in the guide: java -DzkRun -DnumShards=2 -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -jar start.jar using a completely clean, new, solr_home. In my own bootstrap dir, I have my own solrconfig.xml and schema.xml, and I modified to have: -DnumShards=8 -DmaxShardsPerNode=8 When I went to start loading data into this, I failed: Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: No registered leader was found after waiting for 4000ms , collection: rni slice: shard4 at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:554) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206) at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124) at org.apache.solr.client.solrj.SolrServer.deleteByQuery(SolrServer.java:285) at org.apache.solr.client.solrj.SolrServer.deleteByQuery(SolrServer.java:271) at com.basistech.rni.index.internal.SolrCloudEvaluationNameIndex.init(SolrCloudEvaluationNameIndex.java:53) with corresponding log traffic in the solr log. The cloud page in the Solr admin app shows the IP address in green. It's a bit hard to read in general, it's all squished up to the top. On Tue, Feb 24, 2015 at 4:33 PM, Benson Margulies bimargul...@gmail.com wrote: On Tue, Feb 24, 2015 at 4:27 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : Unfortunately, this is all 5.1 and instructs me to run the 'start from : scratch' process. a) checkout the left nav of any ref guide page webpage which has a link to Older Versions of this Guide (PDF) b) i'm not entirely sure i understand what you're asking, but i'm guessing you mean... * you have a fully functional individual instance of Solr, with a single core * you only want to run that one single instance of the Solr process * you want tha single solr process to be a SolrCould of one node, but replace your single core with a collection that is divided into 8 shards. * presumably: you don't care about replication since you are only trying to run one node. what you want to look into (in the 4.10 ref guide) is how to bootstrap a SolrCloud instance from a non-SolrCloud node -- ie: startup zk, tell solr to take the configs from your single core and uploda them to zk as a configset, and register that single core as a collection. That should give you a single instance of solrcloud, with a single collection, consisting of one shard (your original core) Then you should be able to use the SPLITSHARD command to split your single shard into 2 shards, and then split them again, etc... (i don't think you can split directly to 8-sub shards with a single command) FWIW: unless you no longer have access to the original data, it would almost certainly be a lot easier to just start with a clean install of Solr in cloud mode, then create a collection with 8 shards, then re-index your data. OK, now I'm good to go. Thanks. -Hoss http://www.lucidworks.com/
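If the split route is taken, the SPLITSHARD call Hoss mentions looks roughly like this (host, port and names are placeholders):

curl "http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1"

Each call splits one shard into two sub-shards, so going from 1 shard to 8 takes three rounds of splitting, and the inactive parent shards have to be cleaned up afterwards.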
Re: 8 Shards of Cloud with 4.10.3.
A little more data. Note that the cloud status shows the black bubble for a leader. See http://i.imgur.com/k2MhGPM.png. org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: rni slice: shard4 at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:568) at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:551) at org.apache.solr.update.processor.DistributedUpdateProcessor.doDeleteByQuery(DistributedUpdateProcessor.java:1358) at org.apache.solr.update.processor.DistributedUpdateProcessor.processDelete(DistributedUpdateProcessor.java:1226) at org.apache.solr.update.processor.UpdateRequestProcessor.processDelete(UpdateRequestProcessor.java:55) at org.apache.solr.update.processor.LogUpdateProcessor.processDelete(LogUpdateProcessorFactory.java:121) at org.apache.solr.update.processor.UpdateRequestProcessor.processDelete(UpdateRequestProcessor.java:55) On Wed, Feb 25, 2015 at 9:44 AM, Benson Margulies bimargul...@gmail.com wrote: On Wed, Feb 25, 2015 at 8:04 AM, Shawn Heisey apa...@elyograg.org wrote: On 2/25/2015 5:50 AM, Benson Margulies wrote: So, found the following line in the guide: java -DzkRun -DnumShards=2 -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -jar start.jar using a completely clean, new, solr_home. In my own bootstrap dir, I have my own solrconfig.xml and schema.xml, and I modified to have: -DnumShards=8 -DmaxShardsPerNode=8 When I went to start loading data into this, I failed: Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: No registered leader was found after waiting for 4000ms , collection: rni slice: shard4 at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:554) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206) at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124) at org.apache.solr.client.solrj.SolrServer.deleteByQuery(SolrServer.java:285) at org.apache.solr.client.solrj.SolrServer.deleteByQuery(SolrServer.java:271) at com.basistech.rni.index.internal.SolrCloudEvaluationNameIndex.init(SolrCloudEvaluationNameIndex.java:53) with corresponding log traffic in the solr log. The cloud page in the Solr admin app shows the IP address in green. It's a bit hard to read in general, it's all squished up to the top. The way I would do it would be to start Solr *only* with the zkHost parameter. If you're going to use embedded zookeeper, I guess you would use zkRun instead. Once I had Solr running in cloud mode, I would upload the config to zookeeper using zkcli, and create the collection using the Collections API, including things like numShards and maxShardsPerNode on that CREATE call, not as startup properties. Then I would completely reindex my data into the new collection. It's a whole lot cleaner than trying to convert non-cloud to cloud and split shards. Shawn, I _am_ starting from clean. However, I didn't find a recipe for what you suggest as a process, and (following Hoss' suggestion) I found the recipe above with the boostrap_confdir scheme. I am mostly confused as to how I supply my solrconfig.xml and schema.xml when I follow the process you are suggesting. 
I know I'm verging on vampirism here, but if you could possibly find the time to turn your paragraph into either a pointer to a recipe or the command lines in a bit more detail, I'd be exceedingly grateful. Thanks, benson Thanks, Shawn
Re: 8 Shards of Cloud with 4.10.3.
It's the zkcli options on my mind. zkcli's usage shows me 'bootstrap', 'upconfig', and uploading a solr.xml. When I use upconfig, it might work, but it sure is noise: benson@ip-10-111-1-103:/data/solr+rni$ 554331 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:9983] WARN org.apache.zookeeper.server.NIOServerCnxn – caught end of stream exception EndOfStreamException: Unable to read additional data from client sessionid 0x14bc16c5e660003, likely client has closed socket at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228) at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208) at java.lang.Thread.run(Thread.java:745) On Wed, Feb 25, 2015 at 10:52 AM, Shawn Heisey apa...@elyograg.org wrote: On 2/25/2015 8:35 AM, Benson Margulies wrote: Do I need a zkcli bootstrap or do I start with upconfig? What port does zkRun put zookeeper on? I personally would not use bootstrap options. They are only meant to be used once, when converting from non-cloud, but many people who use them do NOT use them only once -- they include them in their startup scripts and use them on every startup. The whole thing becomes extremely confusing. I would just use zkcli and the Collections API, so nothing ever happens that you don't explicitly request. I believe that the port for embedded zookeeper (zkRun) is the jetty listen port plus 1000, so 9983 if jetty.port is 8983 or not set. Thanks, Shawn
Re: apache solr - dovecot - some search fields works some dont
Hi Alex, I get 1 error on start up Is the error below serious:- 2/25/2015, 11:32:30 PM ERROR SolrCore org.apache.solr.common.SolrException: undefined field text org.apache.solr.common.SolrException: undefined field text at org.apache.solr.schema.IndexSchema.getDynamicFieldType(IndexSchema.java:1269) at org.apache.solr.schema.IndexSchema$SolrQueryAnalyzer.getWrappedAnalyzer(IndexSchema.java:434) at org.apache.lucene.analysis.DelegatingAnalyzerWrapper$DelegatingReuseStrategy.getReusableComponents(DelegatingAnalyzerWrapper.java:74) at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:175) at org.apache.lucene.util.QueryBuilder.createFieldQuery(QueryBuilder.java:207) at org.apache.solr.parser.SolrQueryParserBase.newFieldQuery(SolrQueryParserBase.java:374) at org.apache.solr.parser.SolrQueryParserBase.getFieldQuery(SolrQueryParserBase.java:742) at org.apache.solr.parser.SolrQueryParserBase.handleBareTokenQuery(SolrQueryParserBase.java:541) at org.apache.solr.parser.QueryParser.Term(QueryParser.java:299) at org.apache.solr.parser.QueryParser.Clause(QueryParser.java:185) at org.apache.solr.parser.QueryParser.Query(QueryParser.java:107) at org.apache.solr.parser.QueryParser.TopLevelQuery(QueryParser.java:96) at org.apache.solr.parser.SolrQueryParserBase.parse(SolrQueryParserBase.java:151) at org.apache.solr.search.LuceneQParser.parse(LuceneQParser.java:50) at org.apache.solr.search.QParser.getQuery(QParser.java:141) at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:148) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:197) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967) at org.apache.solr.core.QuerySenderListener.newSearcher(QuerySenderListener.java:64) at org.apache.solr.core.SolrCore$5.call(SolrCore.java:1739) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) On Wed, Feb 25, 2015 at 3:08 AM, Alexandre Rafalovitch arafa...@gmail.com wrote: The field definition looks fine. It's not storing any content (stored=false) but is indexing, so you should find the records but not see the body in them. Not seeing a log entry is more of a worry. Are you sure the request even made it to Solr? Can you see anything in Dovecot's logs? Or in Solr's access.logs (Actually Jetty/Tomcat's access logs that may need to be enabled first). At this point, you don't have enough information to fix anything. You need to understand what's different between request against subject vs. the request against body. I would break the communication in three stages: 1) What Dovecote sent 2) What Solr received 3) What Solr sent back I don't know your skill levels or your system setup to advise specifically, but Network tracer (e.g. Wireshark) is good for 1. Logs are good for 2. Using the query from 1) and manually running it against Solr is good for 3). Hope this helps, Alex. On 24 February 2015 at 12:35, Kevin Laurie superinterstel...@gmail.com wrote: field name=body type=text indexed=true stored=false / Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/
AW: performance issues with geofilt
Hello David, thanks for your answer. In the meantime I found the memory hint too in http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4#Sorting_and_Relevancy So maybe we switch to LatLonType for this kind of search. But the RPT is also needed as we want to support search by arbitrary polygons. I'm also able to use sort=geodist() asc. This works well when I modify the parameters to: q=*:*&fq=typ:strasse&fq={!geofilt}&sfield=geometry&pt=51.370570625523,12.369290471603&d=1.0&sort=geodist() asc Kind regards, Dirk Tue, 24 Feb 2015 19:42:03 GMT, david.w.smi...@gmail.com wrote: Hi Dirk, The RPT field type can be used for distance sorting/boosting but it's a memory pig when used as-such so don't do it unless you have to. You only have to if you have a multi-valued point field. If you have single-valued, use LatLonType specifically for distance sorting. Your sample query doesn't parse correctly for multiple reasons. You can't put a query into the sort parameter as you have done it. You have to do sort=query($sortQuery) asc&sortQuery=... or a slightly different equivalent variation. Let's say you do that... still, I don't recommend this syntax when you simply want distance sort - just use geodist(), as in: sort=geodist() asc. If you want to use this syntax such as to sort by recipDistance, then it would look like this (note the filter=false hint to the spatial query parser, which otherwise is unaware it shouldn't bother actually search/filter): sort=query($sortQuery) desc&sortQuery={!geofilt score=recipDistance filter=false sfield=geometry pt=51.3,12.3 d=1.0} If you are able to use geodist() and still find it slow, there are alternatives involving using projected data and then with simply euclidean calculations, sqedist(): https://wiki.apache.org/solr/FunctionQuery#sqedist_-_Squared_Euclidean_Distance ~ David Smiley Freelance Apache Lucene/Solr Search Consultant/Developer http://www.linkedin.com/in/davidwsmiley On Tue, Feb 24, 2015 at 6:12 AM, dirk.thalh...@bkg.bund.de wrote: Hello, we are using solr 4.10.1. There are two cores for different use cases with around 20 million documents (location descriptions) per core. Each document has a geometry field which stores a point and a bbox field which stores a bounding box. Both fields are defined with: <fieldType name="t_geometry" class="solr.SpatialRecursivePrefixTreeFieldType" spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory" geo="true" distErrPct="0.025" maxDistErr="0.9" units="degrees" /> I'm currently trying to add a location search (find all documents around a point). My intention is to add this as filter query, so that the user is able to do an additional keyword search. These are the query parameters so far: q=*:*&fq=typ:strasse&fq={!geofilt sfield=geometry pt=51.370570625523,12.369290471603 d=1.0} To sort the documents by their distance to the requested point, I added the following sort parameter: sort={!geofilt sort=distance sfield: geometry pt=51.370570625523,12.369290471603 d=1.0} asc Unfortunately I'm experiencing here some major performance/memory problems. The first distance query on a core takes over 10 seconds. In my first setup the same request to the second core completely blocked the server and caused an OutOfMemoryError. I had to increase the memory to 16 GB and now it seems to work for the geometry field. Anyhow the first request after a server restart takes some time and when I try it with the bbox field after a request on the geometry field in both cores, the server blocks again. 
Can anyone explain why the distance needs so much memory? Can this be optimized? Kind regards, Dirk
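For comparison, a sketch of the plain geodist() variant from the reply above written as a single request (the field name, point and distance are the ones quoted in the thread; the host, core name and the curl -g flag for the braces are assumptions):

curl -g "http://localhost:8983/solr/core1/select?q=*:*&fq=typ:strasse&fq={!geofilt}&sfield=geometry&pt=51.370570625523,12.369290471603&d=1.0&sort=geodist()+asc"

Whether geodist() stays cheap still depends on the field type: with a single-valued LatLonType field it avoids the memory overhead David describes for RPT-based sorting.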
Re: Solr Document expiration with TTL
Reading https://lucidworks.com/blog/document-expiration/, it seems that your delete-check interval granularity is 30 seconds, but your TTL is 10 seconds. Have you tried setting autoDeletePeriodSeconds to something like 2 seconds and seeing if the problem goes away due to more frequent checking of items to delete? Also, even with the current setup, you should be observing the record being deleted, if not 10 seconds after then within 30 seconds. Are you seeing it not deleted at all? Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 25 February 2015 at 01:51, Makailol Charls 4extrama...@gmail.com wrote: Hello, We are trying to add documents in solr with ttl defined (document expiration feature), which are expected to expire at the specified time, but they do not. Following are the settings we have defined in solrconfig.xml and managed-schema. solr version: 5.0.0

solrconfig.xml
---
<updateRequestProcessorChain default="true">
  <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
    <int name="autoDeletePeriodSeconds">30</int>
    <str name="ttlFieldName">time_to_live_s</str>
    <str name="expirationFieldName">expire_at_dt</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

managed-schema
---
<field name="id" type="string" indexed="true" stored="true" multiValued="false" />
<field name="time_to_live_s" type="string" stored="true" multiValued="false" />
<field name="expire_at_dt" type="date" stored="true" multiValued="false" />

solr query: The following query posts a document and sets expire_at_dt explicitly. That is working perfectly OK and the document expires at the defined time.

curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/collection1/update?commit=true' -d '[{"id":"10seconds","expire_at_dt":"NOW+10SECONDS"}]'

But when trying to post with TTL (following query), the document does not expire after the given time.

curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/collection1/update?commit=true' -d '[{"id":"10seconds","time_to_live_s":"+10SECONDS"}]'

Any help would be appreciated. Thanks, Makailol
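One quick way to check whether the TTL was parsed at all is to fetch the stored fields of the test document right after indexing (host and core name as in the quoted commands):

curl "http://localhost:8983/solr/collection1/select?q=id:10seconds&fl=id,time_to_live_s,expire_at_dt&wt=json"

If expire_at_dt comes back empty, the update processor never computed an expiration and the sweep interval is not the problem; if it is populated, lowering autoDeletePeriodSeconds as suggested above should make the delete visible much sooner.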
Re: New leader/replica solution for HDFS
bq: And the data sync between leader/replica is always a problem Not quite sure what you mean by this. There shouldn't need to be any synching in the sense that the index gets replicated, the incoming documents should be sent to each node (and indexed to HDFS) as they come in. bq: There is duplicate index computing on Replilca side. Yes, that's the design of SolrCloud, explicitly to provide data safety. If you instead rely on the leader to index and somehow pull that indexed form to the replica, then you will lose data if the leader goes down before sending the indexed form. bq: My thought is that the leader and the replica all bind to the same data index directory. This is unsafe. They would both then try to _write_ to the same index, which can easily corrupt indexes and/or all but the first one to access the index would be locked out. All that said, the HDFS triple-redundancy compounded with the Solr leaders/replicas redundancy means a bunch of extra storage. You can turn the HDFS replication down to 1, but that has other implications. Best, Erick On Tue, Feb 24, 2015 at 11:12 PM, longsan longsan...@sina.com wrote: We used HDFS as our Solr index storage and we really have a heavy update load. We had met much problems with current leader/replica solution. There is duplicate index computing on Replilca side. And the data sync between leader/replica is always a problem. As HDFS already provides data replication on data layer, could Solr provide just service layer replication? My thought is that the leader and the replica all bind to the same data index directory. And the leader will build up index for new request, the replica will just keep update the index version with the leader(such as a soft commit periodically? ). If the leader lost then the replica will take the duty immediately. Thanks for any suggestion of this idea. -- View this message in context: http://lucene.472066.n3.nabble.com/New-leader-replica-solution-for-HDFS-tp4188735.html Sent from the Solr - User mailing list archive at Nabble.com.
Can't index all docs in a local folder with DIH in Solr 5.0.0
I can't get the FileListEntityProcessor and TikaEntityProcessor to correctly add a Solr document for each epub file in my local directory. I've just downloaded Solr 5.0.0, on a Windows 7 PC. I ran "solr start" and then "solr create -c hn2" to create a new core. I want to index a load of epub files that I've got in a directory. So I created a data-import.xml (in solr\hn2\conf):

<dataConfig>
  <dataSource type="BinFileDataSource" name="bin" />
  <document>
    <entity name="files" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="c:/Users/gt/Documents/epub" fileName=".*epub"
            onError="skip" recursive="true">
      <field column="fileAbsolutePath" name="id" />
      <field column="fileSize" name="size" />
      <field column="fileLastModified" name="lastModified" />
      <entity name="documentImport" processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}" format="text"
              dataSource="bin" onError="skip">
        <field column="file" name="fileName"/>
        <field column="Author" name="author" meta="true"/>
        <field column="title" name="title" meta="true"/>
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>

In my solrconfig.xml, I added a requestHandler entry to reference my data-import.xml:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-import.xml</str>
  </lst>
</requestHandler>

I renamed managed-schema to schema.xml, and ensured the following doc fields were set up:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="fileName" type="string" indexed="true" stored="true" />
<field name="author" type="string" indexed="true" stored="true" />
<field name="title" type="string" indexed="true" stored="true" />
<field name="size" type="long" indexed="true" stored="true" />
<field name="lastModified" type="date" indexed="true" stored="true" />
<field name="content" type="text_en" indexed="false" stored="true" multiValued="false"/>
<field name="text" type="text_en" indexed="true" stored="false" multiValued="true"/>
<copyField source="content" dest="text"/>

I copied all the jars from dist and contrib\* into server\solr\lib. Stopping and restarting solr then creates a new managed-schema file and renames schema.xml to schema.xml.back. All good so far. Now I go to the web admin for dataimport (http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try and execute a full import. But, the results show Requests: 0, Fetched: 58, Skipped: 0, Processed: 1 - i.e. it only adds one document (the very first one) even though it's iterated over 58! No errors are reported in the logs. I can search on the contents of that first epub document, so it's extracting OK in Tika, but there's a problem somewhere in my config that's causing only 1 document to be indexed in Solr. Thanks for any assistance / pointers. Regards, Gary -- Gary Taylor | www.inovem.com | www.kahootz.com INOVEM Ltd is registered in England and Wales No 4228932 Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE kahootz.com is a trading name of INOVEM Ltd.
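For what it's worth, the same full-import can also be triggered and watched from the command line instead of the admin screen (core name hn2 as above; clean and commit are optional flags):

curl "http://localhost:8983/solr/hn2/dataimport?command=full-import&clean=true&commit=true"
curl "http://localhost:8983/solr/hn2/dataimport?command=status"

The status response repeats the fetched/processed counters and any per-entity messages, which can make it easier to see at which entity the 58-fetched / 1-processed mismatch happens.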
Re: Basic Multilingual search capability
Hi Trey, Thanks for the detailed response and the link to the talk, it was very informative. Yes looking at the current system requirements ICUTokenizer might be the best bet for our use case. MultiTextField mentioned in the jira SOLR-6492 has some cool features and definitely looking forward to trying out once its integrated to main. Thanks, Rishi. -Original Message- From: Trey Grainger solrt...@gmail.com To: solr-user solr-user@lucene.apache.org Sent: Tue, Feb 24, 2015 1:40 am Subject: Re: Basic Multilingual search capability Hi Rishi, I don't generally recommend a language-insensitive approach except for really simple multilingual use cases (for most of the reasons Walter mentioned), but the ICUTokenizer is probably the best bet you're going to have if you really want to go that route and only need exact-match on the tokens that are parsed. It won't work that well for all languages (CJK languages, for example), but it will work fine for many. It is also possible to handle multi-lingual content in a more intelligent (i.e. per-language configuration) way in your search index, of course. There are three primary strategies (i.e. ways that actually work in the real world) to do this: 1) create a separate field for each language and search across all of them at query time 2) create a separate core per language-combination and search across all of them at query time 3) invoke multiple language-specific analyzers within a single field's analyzer and index/query using one or more of those language's analyzers for each document/query. These are listed in ascending order of complexity, and each can be valid based upon your use case. For at least the first and third cases, you can use index-time language detection to map to the appropriate fields/analyzers if you are otherwise unaware of the languages of the content from your application layer. The third option requires custom code (included in the large Multilingual Search chapter of Solr in Action http://solrinaction.com and soon to be contributed back to Solr via SOLR-6492 https://issues.apache.org/jira/browse/SOLR-6492), but it enables you to index an arbitrarily large number of languages into the same field if needed, while preserving language-specific analysis for each language. I presented in detail on the above strategies at Lucene/Solr Revolution last November, so you may consider checking out the presentation and/or slides to asses if one of these strategies will work for your use case: http://www.treygrainger.com/posts/presentations/semantic-multilingual-strategies-in-lucenesolr/ For the record, I'd highly recommend going with the first strategy (a separate field per language) if you can, as it is certainly the simplest of the approaches (albeit the one that scales the least well after you add more than a few languages to your queries). If you want to stay simple and stick with the ICUTokenizer then it will work to a point, but some of the problems Walter mentioned may eventually bite you if you are supporting certain groups of languages. All the best, Trey Grainger Co-author, Solr in Action Director of Engineering, Search Recommendations @ CareerBuilder On Mon, Feb 23, 2015 at 11:14 PM, Walter Underwood wun...@wunderwood.org wrote: It isn’t just complicated, it can be impossible. Do you have content in Chinese or Japanese? Those languages (and some others) do not separate words with spaces. You cannot even do word search without a language-specific, dictionary-based parser. 
German is space separated, except many noun compounds are not space-separated. Do you have Finnish content? Entire prepositional phrases turn into word endings. Do you have Arabic content? That is even harder. If all your content is in space-separated languages that are not heavily inflected, you can kind of do OK with a language-insensitive approach. But it hits the wall pretty fast. One thing that does work pretty well is trademarked names (LaserJet, Coke, etc). Those are spelled the same in all languages and usually not inflected. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On Feb 23, 2015, at 8:00 PM, Rishi Easwaran rishi.easwa...@aol.com wrote: Hi Alex, There is no specific language list. For example: the documents that needs to be indexed are emails or any messages for a global customer base. The messages back and forth could be in any language or mix of languages. I understand relevancy, stemming etc becomes extremely complicated with multilingual support, but our first goal is to be able to tokenize and provide basic search capability for any language. Ex: When the document contains hello or здравствуйте, the analyzer creates tokens and provides exact match search results. Now it would be great if it had capability to tokenize email addresses (ex:he...@aol.com- i think standardTokenizer already does this), filenames (здравствуйте.pdf), but
Re: 8 Shards of Cloud with 4.10.3.
Do I need a zkcli bootstrap or do I start with upconfig? What port does zkRun put zookeeper on? On Feb 25, 2015 10:15 AM, Shawn Heisey apa...@elyograg.org wrote: On 2/25/2015 7:44 AM, Benson Margulies wrote: Shawn, I _am_ starting from clean. However, I didn't find a recipe for what you suggest as a process, and (following Hoss' suggestion) I found the recipe above with the boostrap_confdir scheme. I am mostly confused as to how I supply my solrconfig.xml and schema.xml when I follow the process you are suggesting. I know I'm verging on vampirism here, but if you could possibly find the time to turn your paragraph into either a pointer to a recipe or the command lines in a bit more detail, I'd be exceedingly grateful. I'm willing to help in any way that I can. Normally in the conf directory for a non-cloud core you have solrconfig.xml and schema.xml, plus any other configs referenced by those files, like synomyms.txt, dih-config.xml, etc. In cloud terms, the directory containing these files is a confdir. It's best to keep the on-disk copy of your configs completely outside of the solr home so there's no confusion about what configurations are active. On-disk cores for solrcloud do not need or use a conf directory. The cloud-scripts/zkcli.sh (or zkcli.bat) script has an upconfig command with -confdir and -confname options. When doing upconfig, the zkHost value goes on the -z option to zkcli, and you only need to list one of your zookeeper hosts, although it is perfectly happy if you list them all. You would point -confdir at a directory containing the config files mentioned earlier, and -confname is the name that the config has in zookeeper, which you would then use on the collection.configName parameter for the Collections API call. Once the config is uploaded, here's an example call to that API for creating a collection: http://server:port /solr/admin/collections?action=CREATEname=testnumShards=8replicationFactor=1collection.configName=testcfgmaxShardsPerNode=8 If this is not enough detail, please let me know which part you need help with. Thanks, Shawn
RE: Stop solr query
HI Mikhail, We're using 4.7.1. This means I can't stop the search. I think this is mandatory feature. Regards, Moshe Recanati SVP Engineering Office + 972-73-2617564 Mobile + 972-52-6194481 Skype : recanati More at: www.kmslh.com | LinkedIn | FB -Original Message- From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com] Sent: Wednesday, February 25, 2015 3:42 PM To: solr-user Subject: Re: Stop solr query Moshe, if you take a thread dump while a particular query stuck (via jstack of in SolrAdmin tab), it may explain where exactly it's stalled, just check the longest stack trace. FWIW, in 4.x timeallowed is checked only while documents are collected, and in 5 it's also checked during query expansion (see http://lucidworks.com/blog/solr-5-0/ now cut-offs requests https://issues.apache.org/jira/browse/SOLR-5986 during the query-expansion stage as well ). however I'm not sure it has place (long query expansion) with hon-synonyms. On Wed, Feb 25, 2015 at 3:21 PM, Moshe Recanati mos...@kmslh.com wrote: Hi Shawn, We checked this option and it didn't solve our problem. We're using https://github.com/healthonnet/hon-lucene-synonyms for query based synonyms. While running query with high number of words that have high number of synonyms the query got stuck and solr memory is exhausted. We tried to use this parameter suggested by you however it didn't stop the query and solve the issue. Please let me know if there is other option to tackle it. Today it might be high number of words that cause the issue and tomorrow it might be other something wrong. We can't rely only on user input check. Thank you in advance. Regards, Moshe Recanati SVP Engineering Office + 972-73-2617564 Mobile + 972-52-6194481 Skype: recanati More at: www.kmslh.com | LinkedIn | FB -Original Message- From: Shawn Heisey [mailto:apa...@elyograg.org] Sent: Monday, February 23, 2015 5:49 PM To: solr-user@lucene.apache.org Subject: Re: Stop solr query On 2/23/2015 7:23 AM, Moshe Recanati wrote: Recently there were some scenarios in which queries that user sent to solr got stuck and increased our solr heap. Is there any option to kill or timeout query that wasn't returned from solr by external command? The best thing you can do is examine all user input and stop such queries before they execute, especially if they are the kind of query that will cause your heap to grow out of control. The timeAllowed parameter can abort a query that takes too long in certain phases of the query. In recent months, Solr has been modified so that timeAllowed will take effect during more query phases. It is not a perfect solution, but it can be better than nothing. http://wiki.apache.org/solr/CommonQueryParameters#timeAllowed Be aware that sometimes legitimate queries will be slow, and using timeAllowed may cause those queries to fail. Thanks, Shawn -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: 8 Shards of Cloud with 4.10.3.
On 2/25/2015 7:44 AM, Benson Margulies wrote: Shawn, I _am_ starting from clean. However, I didn't find a recipe for what you suggest as a process, and (following Hoss' suggestion) I found the recipe above with the boostrap_confdir scheme. I am mostly confused as to how I supply my solrconfig.xml and schema.xml when I follow the process you are suggesting. I know I'm verging on vampirism here, but if you could possibly find the time to turn your paragraph into either a pointer to a recipe or the command lines in a bit more detail, I'd be exceedingly grateful. I'm willing to help in any way that I can. Normally in the conf directory for a non-cloud core you have solrconfig.xml and schema.xml, plus any other configs referenced by those files, like synonyms.txt, dih-config.xml, etc. In cloud terms, the directory containing these files is a confdir. It's best to keep the on-disk copy of your configs completely outside of the solr home so there's no confusion about what configurations are active. On-disk cores for solrcloud do not need or use a conf directory. The cloud-scripts/zkcli.sh (or zkcli.bat) script has an upconfig command with -confdir and -confname options. When doing upconfig, the zkHost value goes on the -z option to zkcli, and you only need to list one of your zookeeper hosts, although it is perfectly happy if you list them all. You would point -confdir at a directory containing the config files mentioned earlier, and -confname is the name that the config has in zookeeper, which you would then use on the collection.configName parameter for the Collections API call. Once the config is uploaded, here's an example call to that API for creating a collection: http://server:port/solr/admin/collections?action=CREATE&name=test&numShards=8&replicationFactor=1&collection.configName=testcfg&maxShardsPerNode=8 If this is not enough detail, please let me know which part you need help with. Thanks, Shawn
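Spelled out as commands, that recipe looks roughly like this (the ZooKeeper address, paths and names are placeholders; use zkcli.bat on Windows):

cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd upconfig -confdir /path/to/myconf -confname testcfg

curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=test&numShards=8&replicationFactor=1&collection.configName=testcfg&maxShardsPerNode=8"

The -confdir directory is the one holding solrconfig.xml, schema.xml and friends, and the -confname value is what the CREATE call references through collection.configName.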
Re: Problem with queries that includes NOT
Hi, The edismax parser should be able to manage the query you want to ask. I've made a test and both of the following queries give me the right result (note the parentheses): - {!edismax}(NOT id:7 AND NOT id:8 AND id:9) (gives 1 hit, the id:9) - {!edismax}((NOT id:7 AND NOT id:8) AND id:9) (gives 1 hit, the id:9) In general, the issue appears when using the lucene query parser and mixing different boolean clauses (including NOT). Thus, as you commented, the following queries give different results: - NOT id:7 AND NOT id:8 AND id:9 (gives 1 hit, the id:9) - (NOT id:7 AND NOT id:8) AND id:9 (gives 0 hits when expecting 1) Since I read the chapter "Limitations of prohibited clauses in sub-queries" from the Apache Solr 3 Enterprise Search Server book many years ago, I always add the all-documents clause *:* to the negative clauses to avoid the problem you mentioned. Thus I would recommend rewriting the query you showed us as: - (*:* AND NOT Proc:ID01 AND NOT FileType:PDF_TEXT) AND sys_FileType:PROTOTIPE - (NOT id:7 AND NOT id:8 AND *:*) AND id:9 (gives 1 hit as expected) The above query can then be read as "give me all the documents except those having ID01 and PDF_TEXT, and having PROTOTIPE". Regards. On Wed, Feb 25, 2015 at 1:23 PM, Shawn Heisey apa...@elyograg.org wrote: On 2/25/2015 4:04 AM, david.dav...@correo.aeat.es wrote: We have problems with some queries. All of them include the tag NOT, and in my opinion, the results don´t make any sense. First problem: This query NOT Proc:ID01returns 95806 results, however this one NOT Proc:ID01 OR FileType:PDF_TEXT returns 11484 results. But it's impossible that adding a tag OR the query has less number of results. Second problem. Here the problem is because of the brackets and the NOT tag: This query: (NOT Proc:ID01 AND NOT FileType:PDF_TEXT) AND sys_FileType:PROTOTIPE returns 0 documents. But this query: (NOT Proc:ID01 AND NOT FileType:PDF_TEXT AND sys_FileType:PROTOTIPE) returns 53 documents, which is correct. So, the problem is the position of the bracket. I have checked the same query without NOTs, and it works fine returning the same number of results in both cases. So, I think the problem is the combination of the bracket positions and the NOT tag. For the first query, there is a difference between NOT condition1 OR condition2 and NOT (condition1 OR condition2) ... I can imagine the first one increasing the document count compared to just NOT condition1 ... the second one wouldn't increase it. Boolean queries in Solr (and very likely Lucene as well) do not always do what people expect. http://robotlibrarian.billdueber.com/2011/12/solr-and-boolean-operators/ https://lucidworks.com/blog/why-not-and-or-and-not/ As mentioned in the second link above, you'll get better results if you use the prefix operators with explicit parentheses. One word of warning, though -- the prefix operators do not work correctly if you change the default operator to AND. Thanks, Shawn
Re: apache solr - dovecot - some search fields works some dont
Hi Alex, Below shows that Solr is not getting anything from the text search. I will try to search from / to and see hows the performance. select BAD Error in IMAP command INBOX: Unknown command. . select inbox * FLAGS (\Answered \Flagged \Deleted \Seen \Draft $Forwarded) * OK [PERMANENTFLAGS (\Answered \Flagged \Deleted \Seen \Draft $Forwarded \*)] Flags permitted. * 49983 EXISTS * 0 RECENT * OK [UNSEEN 46791] First unseen. * OK [UIDVALIDITY 1414214135] UIDs valid * OK [UIDNEXT 107218] Predicted next UID * OK [NOMODSEQ] No permanent modsequences . OK [READ-WRITE] Select completed (0.002 secs). search text dave search BAD Error in IMAP command TEXT: Unknown command. . search text dave * OK Searched 6% of the mailbox, ETA 2:24 * OK Searched 13% of the mailbox, ETA 2:10 * OK Searched 20% of the mailbox, ETA 1:54 * OK Searched 27% of the mailbox, ETA 1:46 * OK Searched 34% of the mailbox, ETA 1:36 * OK Searched 41% of the mailbox, ETA 1:26 * OK Searched 49% of the mailbox, ETA 1:11 * OK Searched 56% of the mailbox, ETA 1:02 * OK Searched 63% of the mailbox, ETA 0:52 * OK Searched 69% of the mailbox, ETA 0:44 * OK Searched 77% of the mailbox, ETA 0:31 * OK Searched 85% of the mailbox, ETA 0:20 * OK Searched 92% of the mailbox, ETA 0:10 * OK Searched 98% of the mailbox, ETA 0:02 On Wed, Feb 25, 2015 at 11:39 PM, Kevin Laurie superinterstel...@gmail.com wrote: Hi Alex, I get 1 error on start up Is the error below serious:- 2/25/2015, 11:32:30 PM ERROR SolrCore org.apache.solr.common.SolrException: undefined field text org.apache.solr.common.SolrException: undefined field text at org.apache.solr.schema.IndexSchema.getDynamicFieldType(IndexSchema.java:1269) at org.apache.solr.schema.IndexSchema$SolrQueryAnalyzer.getWrappedAnalyzer(IndexSchema.java:434) at org.apache.lucene.analysis.DelegatingAnalyzerWrapper$DelegatingReuseStrategy.getReusableComponents(DelegatingAnalyzerWrapper.java:74) at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:175) at org.apache.lucene.util.QueryBuilder.createFieldQuery(QueryBuilder.java:207) at org.apache.solr.parser.SolrQueryParserBase.newFieldQuery(SolrQueryParserBase.java:374) at org.apache.solr.parser.SolrQueryParserBase.getFieldQuery(SolrQueryParserBase.java:742) at org.apache.solr.parser.SolrQueryParserBase.handleBareTokenQuery(SolrQueryParserBase.java:541) at org.apache.solr.parser.QueryParser.Term(QueryParser.java:299) at org.apache.solr.parser.QueryParser.Clause(QueryParser.java:185) at org.apache.solr.parser.QueryParser.Query(QueryParser.java:107) at org.apache.solr.parser.QueryParser.TopLevelQuery(QueryParser.java:96) at org.apache.solr.parser.SolrQueryParserBase.parse(SolrQueryParserBase.java:151) at org.apache.solr.search.LuceneQParser.parse(LuceneQParser.java:50) at org.apache.solr.search.QParser.getQuery(QParser.java:141) at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:148) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:197) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967) at org.apache.solr.core.QuerySenderListener.newSearcher(QuerySenderListener.java:64) at org.apache.solr.core.SolrCore$5.call(SolrCore.java:1739) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at 
java.lang.Thread.run(Thread.java:745) On Wed, Feb 25, 2015 at 3:08 AM, Alexandre Rafalovitch arafa...@gmail.com wrote: The field definition looks fine. It's not storing any content (stored=false) but is indexing, so you should find the records but not see the body in them. Not seeing a log entry is more of a worry. Are you sure the request even made it to Solr? Can you see anything in Dovecot's logs? Or in Solr's access logs (actually Jetty/Tomcat's access logs, which may need to be enabled first)? At this point, you don't have enough information to fix anything. You need to understand what's different between the request against subject vs. the request against body. I would break the communication into three stages: 1) What Dovecot sent 2) What Solr received 3) What Solr sent back I don't know your skill levels or your system setup to advise specifically, but a network tracer (e.g. Wireshark) is good for 1). Logs are good for 2). Using the query from 1) and manually running it against Solr is good for 3). Hope this helps, Alex. On 24 February 2015 at 12:35, Kevin Laurie superinterstel...@gmail.com wrote: <field name="body" type="text" indexed="true" stored="false" /> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/
Re: Connect Solr with ODBC to Excel
Which direction? You want import data from Solr into Excel? One off or repeatedly? For one off Solr - Excel, you could probably use Excel's Open from Web and load data directly from Solr using CSV output format. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 25 February 2015 at 08:52, Hakim Benoudjit h.benoud...@gmail.com wrote: Hi there, I'm looking for a library to connect Solr throught ODBC to Excel in order to do some reporting on my Solr data? Anybody knows a library for that? Thanks. -- Cordialement, Best regards, Hakim Benoudjit
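For reference, a hedged sketch of that one-off route (the collection and field names are placeholders, not from this thread): Solr's CSV response writer is selected with wt=csv, so a URL such as

http://localhost:8983/solr/collection1/select?q=*:*&rows=10000&fl=id,name,price&wt=csv

can be pasted into Excel's Open from Web dialog to pull the matching rows straight into a worksheet.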
Re: Connect Solr with ODBC to Excel
Thanks for your answer. For a one-off it seems like a nice way to import my data. For an ODBC connection, the only solution I found is to replicate my Solr data in Apache Hive (or Cassandra...), and then connect to that database through ODBC. 2015-02-25 15:49 GMT+01:00 Alexandre Rafalovitch arafa...@gmail.com: Which direction? You want import data from Solr into Excel? One off or repeatedly? For one off Solr - Excel, you could probably use Excel's Open from Web and load data directly from Solr using CSV output format. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 25 February 2015 at 08:52, Hakim Benoudjit h.benoud...@gmail.com wrote: Hi there, I'm looking for a library to connect Solr throught ODBC to Excel in order to do some reporting on my Solr data? Anybody knows a library for that? Thanks. -- Cordialement, Best regards, Hakim Benoudjit -- Cordialement, Best regards, Hakim Benoudjit
Re: Basic Multilingual search capability
Hi Alex, Thanks for the suggestions. These steps will definitely help out with our use case. Thanks for the idea about the lengthFilter to protect our system. Thanks, Rishi. -Original Message- From: Alexandre Rafalovitch arafa...@gmail.com To: solr-user solr-user@lucene.apache.org Sent: Tue, Feb 24, 2015 8:50 am Subject: Re: Basic Multilingual search capability Given the limited needs, I would probably do something like this: 1) Put a language identifier in the UpdateRequestProcessor chain during indexing and route out at least known problematic languages, such as Chinese, Japanese, Arabic into individual fields 2) Put everything else together into one field with ICUTokenizer, maybe also ICUFoldingFilter 3) At the very end of that joint filter, stick in LengthFilter with some high number, e.g. 25 characters max. This will ensure that super-long words from non-space languages and edge conditions do not break the rest of your system. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 23 February 2015 at 23:14, Walter Underwood wun...@wunderwood.org wrote: I understand relevancy, stemming etc becomes extremely complicated with multilingual support, but our first goal is to be able to tokenize and provide basic search capability for any language. Ex: When the document contains hello or здравствуйте, the analyzer creates tokens and provides exact match search results.
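For what it's worth, a minimal sketch of the joint field type from steps 2) and 3) might look like this (the type name and the 25-character cap are illustrative assumptions, and the ICU factories need the analysis-extras contrib on the classpath):

<fieldType name="text_multilingual" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.LengthFilterFactory" min="1" max="25"/>
  </analyzer>
</fieldType>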
Re: 8 Shards of Cloud with 4.10.3.
On 2/25/2015 8:35 AM, Benson Margulies wrote: Do I need a zkcli bootstrap or do I start with upconfig? What port does zkRun put zookeeper on? I personally would not use bootstrap options. They are only meant to be used once, when converting from non-cloud, but many people who use them do NOT use them only once -- they include them in their startup scripts and use them on every startup. The whole thing becomes extremely confusing. I would just use zkcli and the Collections API, so nothing ever happens that you don't explicitly request. I believe that the port for embedded zookeeper (zkRun) is the jetty listen port plus 1000, so 9983 if jetty.port is 8983 or not set. Thanks, Shawn
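As a rough sketch of that zkcli-plus-Collections-API workflow (the zkcli path, ZooKeeper address, and the collection/config names below are assumptions, not taken from this thread):

cd example/scripts/cloud-scripts
./zkcli.sh -zkhost localhost:9983 -cmd upconfig -confdir /path/to/myconf -confname myconf
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=8&replicationFactor=1&maxShardsPerNode=8&collection.configName=myconf'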
Re: performance issues with geofilt
Okay. Just to re-emphasize something I said but which may not have been clear, it isn’t an either-or for filter sort. Filter with the spatial field type that makes sense for filtering, sort (or boost) with the spatial field type that makes sense for sorting. RPT sucks for distance sorting, LatLonType is good for it. ~ David Smiley Freelance Apache Lucene/Solr Search Consultant/Developer http://www.linkedin.com/in/davidwsmiley On Wed, Feb 25, 2015 at 10:40 AM, dirk.thalh...@bkg.bund.de wrote: Hello David, thanks for your answer. In the meantime I found the memory hint too in http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4#Sorting_and_RelevancySo Maybe we switch to LatLonType for this kind of searches. But the RPT is also needed as we want to support search by arbitrary polygons. I'm also able to use the sort=geodist() asc. This works well when I modify the parameters to: q=*:*fq=typ:strassefq={!geofilt}sfield=geometrypt=51.370570625523,12.369290471603d=1.0sort=geofilt() asc Kind regards, Dirk Tue, 24 Feb 2015 19:42:03 GMT, david.w.smi...@gmail.com wrote: Hi Dirk, The RPT field type can be used for distance sorting/boosting but it's a memory pig when used as-such so don't do it unless you have to. You only have to if you have a multi-valued point field. If you have single-valued, use LatLonType specifically for distance sorting. Your sample query doesn't parse correctly for multiple reasons. You can't put a query into the sort parameter as you have done it. You have to do sort=query($sortQuery) ascsortQuery=... or a slightly different equivalent variation. Lets say you do that... still, I don't recommend this syntax when you simply want distance sort - just use geodist(), as in: sort=geodist() asc. If you want to use this syntax such as to sort by recipDistance, then it would look like this (note the filter=false hint to the spatial query parser, which otherwise is unaware it shouldn't bother actually search/filter): sort=query($sortQuery) descsortQuery={!geofilt score=recipDistance filter=false sfield=geometry pt=51.3,12.3 d=1.0} If you are able to use geodist() and still find it slow, there are alternatives involving using projected data and then with simply euclidean calculations, sqedist(): https://wiki.apache.org/solr/FunctionQuery#sqedist_-_Squared_Euclidean_Distance ~ David Smiley Freelance Apache Lucene/Solr Search Consultant/Developer http://www.linkedin.com/in/davidwsmiley On Tue, Feb 24, 2015 at 6:12 AM, dirk.thalh...@bkg.bund.de wrote: Hello, we are using solr 4.10.1. There are two cores for different use cases with around 20 million documents (location descriptions) per core. Each document has a geometry field which stores a point and a bbox field which stores a bounding box. Both fields are defined with: fieldType name=t_geometry class=solr.SpatialRecursivePrefixTreeFieldType spatialContextFactory=com.spatial4j.core.context.jts.JtsSpatialContextFactory geo=true distErrPct=0.025 maxDistErr=0.9 units=degrees / I'm currently trying to add a location search (find all documents around a point). My intention is to add this as filter query, so that the user is able to do an additional keyword search. 
These are the query parameters so far: q=*:*&fq=typ:strasse&fq={!geofilt sfield=geometry pt=51.370570625523,12.369290471603 d=1.0} To sort the documents by their distance to the requested point, I added the following sort parameter: sort={!geofilt sort=distance sfield: geometry pt=51.370570625523,12.369290471603 d=1.0} asc Unfortunately I'm experiencing some major performance/memory problems here. The first distance query on a core takes over 10 seconds. In my first setup the same request to the second core completely blocked the server and caused an OutOfMemoryError. I had to increase the memory to 16 GB and now it seems to work for the geometry field. Anyhow, the first request after a server restart takes some time, and when I try it with the bbox field after a request on the geometry field in both cores, the server blocks again. Can anyone explain why the distance needs so much memory? Can this be optimized? Kind regards, Dirk
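To make the filter-with-RPT, sort-with-LatLonType split concrete, here is a hedged sketch (geometry_ll is a hypothetical single-valued LatLonType copy of the point, not a field from Dirk's schema):

q=*:*&fq=typ:strasse&fq={!geofilt sfield=geometry pt=51.370570625523,12.369290471603 d=1.0}&sort=geodist(geometry_ll,51.370570625523,12.369290471603) asc

The RPT field (geometry) does the filtering, while geodist() computes the sort distance against the lightweight LatLonType field.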
Re: 8 Shards of Cloud with 4.10.3.
Bingo! Here's the recipe for the record: gcopts has a ton of GC options. First, set up shop:

DIR=$PWD
cd ../solr-4.10.3/example
java -Xmx200g $gcopts -DSTOP.PORT=7983 -DSTOP.KEY=solrrocks -Djetty.port=8983 -Dsolr.solr.home=/data/solr+rni/cloud_solr_home -Dsolr.install.dir=/data/solr-4.10.3 -Duser.timezone=UTC -Djava.net.preferIPv4Stack=true -DzkRun -jar start.jar

and then:

curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=rni&numShards=8&replicationFactor=1&collection.configName=rni&maxShardsPerNode=8'

On Wed, Feb 25, 2015 at 11:03 AM, Benson Margulies bimargul...@gmail.com wrote: It's the zkcli options on my mind. zkcli's usage shows me 'bootstrap', 'upconfig', and uploading a solr.xml. When I use upconfig, it might work, but it sure is noise: benson@ip-10-111-1-103:/data/solr+rni$ 554331 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:9983] WARN org.apache.zookeeper.server.NIOServerCnxn – caught end of stream exception EndOfStreamException: Unable to read additional data from client sessionid 0x14bc16c5e660003, likely client has closed socket at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228) at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208) at java.lang.Thread.run(Thread.java:745) On Wed, Feb 25, 2015 at 10:52 AM, Shawn Heisey apa...@elyograg.org wrote: On 2/25/2015 8:35 AM, Benson Margulies wrote: Do I need a zkcli bootstrap or do I start with upconfig? What port does zkRun put zookeeper on? I personally would not use bootstrap options. They are only meant to be used once, when converting from non-cloud, but many people who use them do NOT use them only once -- they include them in their startup scripts and use them on every startup. The whole thing becomes extremely confusing. I would just use zkcli and the Collections API, so nothing ever happens that you don't explicitly request. I believe that the port for embedded zookeeper (zkRun) is the jetty listen port plus 1000, so 9983 if jetty.port is 8983 or not set. Thanks, Shawn
Re: apache solr - dovecot - some search fields works some dont
This is very serious. You are missing a field called text. You have a field _type_ called text, maybe that's where the confusion came from. Is that something you configured in Dovecot? Was it supposed to be body or a catch-all field with copyFields into it? I don't know Dovecot, but it is a clear mismatch between expectations and reality. So, you need to check which one it is. One way would be to query Solr directly and see if you have anything in your body field. It's not stored, but you can check the indexed tokens in the Web Admin UI under Schema Definition (or some such) and ask it to load token values for that field. If you have content in the body field then your indexing works, and either you need to search also against that field or have copyField instructions (which should have come with the Dovecot install). Fix this first. Regards, Alex. On 25 February 2015 at 10:39, Kevin Laurie superinterstel...@gmail.com wrote: Hi Alex, I get 1 error on start up Is the error below serious:- 2/25/2015, 11:32:30 PM ERROR SolrCore org.apache.solr.common.SolrException: undefined field text org.apache.solr.common.SolrException: undefined field text at org.apache.solr.schema.IndexSchema.getDynamicFieldType(IndexSchema.java:1269) at org.apache.solr.schema.IndexSchema$SolrQueryAnalyzer.getWrappedAnalyzer(IndexSchema.java:434) Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/
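If it turns out the queries really do expect a catch-all field named text, one possible shape for it is sketched below (this is an assumption about what Dovecot wants, not taken from its documentation; adjust the copyField sources to whatever searchable fields the schema defines):

<field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
<copyField source="body" dest="text"/>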
Re: Can't index all docs in a local folder with DIH in Solr 5.0.0
Alex, Thanks for the suggestions. It always just indexes 1 doc, regardless of the first epub file it sees. Debug / verbose don't show anything obvious to me. I can include the output here if you think it would help. I tried using the SimplePostTool first ( *java -Dtype=application/epub+zip -Durl=http://localhost:8983/solr/hn1/update/extract -jar post.jar \Users\gt\Documents\epub\*.epub) to index the docs and check the Tika parsing and that works OK so I don't think it's the e*pubs. I was trying to use DIH so that I could more easily specify the schema fields and store content in the index in preparation for trying out the search highlighting. Couldn't work out how to do that with post.jar Thanks, Gary On 25/02/2015 17:09, Alexandre Rafalovitch wrote: Try removing that first epub from the directory and rerunning. If you now index 0 documents, then there is something unexpected about them and DIH skips. If it indexes 1 document again but a different one, then it is definitely something about the repeat logic. Also, try running with debug and verbose modes and see if something specific shows up. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 25 February 2015 at 11:14, Gary Taylor g...@inovem.com wrote: I can't get the FileListEntityProcessor and TikeEntityProcessor to correctly add a Solr document for each epub file in my local directory. I've just downloaded Solr 5.0.0, on a Windows 7 PC. I ran solr start and then solr create -c hn2 to create a new core. I want to index a load of epub files that I've got in a directory. So I created a data-import.xml (in solr\hn2\conf): dataConfig dataSource type=BinFileDataSource name=bin / document entity name=files dataSource=null rootEntity=false processor=FileListEntityProcessor baseDir=c:/Users/gt/Documents/epub fileName=.*epub onError=skip recursive=true field column=fileAbsolutePath name=id / field column=fileSize name=size / field column=fileLastModified name=lastModified / entity name=documentImport processor=TikaEntityProcessor url=${files.fileAbsolutePath} format=text dataSource=bin onError=skip field column=file name=fileName/ field column=Author name=author meta=true/ field column=title name=title meta=true/ field column=text name=content/ /entity /entity /document /dataConfig In my solrconfig.xml, I added a requestHandler entry to reference my data-import.xml: requestHandler name=/dataimport class=org.apache.solr.handler.dataimport.DataImportHandler lst name=defaults str name=configdata-import.xml/str /lst /requestHandler I renamed managed-schema to schema.xml, and ensured the following doc fields were setup: field name=id type=string indexed=true stored=true required=true multiValued=false / field name=fileName type=string indexed=true stored=true / field name=author type=string indexed=true stored=true / field name=title type=string indexed=true stored=true / field name=size type=long indexed=true stored=true / field name=lastModified type=date indexed=true stored=true / field name=content type=text_en indexed=false stored=true multiValued=false/ field name=text type=text_en indexed=true stored=false multiValued=true/ copyField source=content dest=text/ I copied all the jars from dist and contrib\* into server\solr\lib. Stopping and restarting solr then creates a new managed-schema file and renames schema.xml to schema.xml.back All good so far. Now I go to the web admin for dataimport (http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try and execute a full import. 
But, the results show Requests: 0, Fetched: 58, Skipped: 0, Processed:1 - ie. it only adds one document (the very first one) even though it's iterated over 58! No errors are reported in the logs. I can search on the contents of that first epub document, so it's extracting OK in Tika, but there's a problem somewhere in my config that's causing only 1 document to be indexed in Solr. Thanks for any assistance / pointers. Regards, Gary -- Gary Taylor | www.inovem.com | www.kahootz.com INOVEM Ltd is registered in England and Wales No 4228932 Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE kahootz.com is a trading name of INOVEM Ltd. -- Gary Taylor | www.inovem.com | www.kahootz.com INOVEM Ltd is registered in England and Wales No 4228932 Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE kahootz.com is a trading name of INOVEM Ltd.
Facet By Distance
Hello, I'm trying to get Facet By Distance working on an index with LatLonType fields. The schema is as follows:

<fields>
  ...
  <field name="trip_duration" type="int" indexed="true" stored="true"/>
  <field name="start_station" type="location" indexed="true" stored="true" />
  <field name="end_station" type="location" indexed="true" stored="true" />
  <field name="birth_year" type="int" stored="true"/>
  <field name="gender" type="int" stored="true" />
  ...
</fields>

And the query I'm running is:

q=*:*&sfield=start_station&pt=40.71754834,-74.01322069&facet.query={!frange l=0.0 u=0.1}geodist()&facet.query={!frange l=0.10001 u=0.2}geodist()

But it returns all the documents in the index so it seems something is missing. I'm using Solr 4.9.0. -- A. Adel
Re: Do Multiprocessing on Solr to search?
On 2/25/2015 9:31 AM, Nitin Solanki wrote: I want to search lakhs of queries/terms concurrently. Is there any technique to do multiprocessing on Solr? Is Solr is capable to handle this situation? I wrote a code in python that do multiprocessing and search lakhs of queries and do hit on Solr simultaneously/ parallely at once but it seems that Solr doesn't able to handle queries at once. Any help Please? Solr is fully multi-threaded and capable of handling multiple requests simultaneously. Any of the common servlet containers that are typically used to run Solr are *also* fully multi-threaded, but may require configuration adjustment to allow more threads. The jetty install that comes with the Solr example server is tuned to allow 10000 threads. Even if you have a very well-tuned Solr install on exceptionally robust hardware, I would not expect a single index on a single server to be able to handle more than a few hundred requests per second. If you need hundreds of thousands of simultaneous queries, you're going to need a lot of replicas on a lot of servers. With that volume you would want a load balancer to direct requests to those replicas. You may also run into problems related to TCP port exhaustion. Thanks, Shawn
Re: Solr Document expiration with TTL
: Following query posts a document and sets expire_at_dt explicitly. That : is working perfectly ok and ducument expires at defined time. so the delete trigge logic is working correctly... : But when trying to post with TTL (following query), document does not : expire after given time. ...which suggests that the TTL-expire_at logic is not being applied properly. which is weird. since your time_to_live_s and expire_at_dt fields are both stored, can you confirm that a expire_at_dt field is getting popularted by the update processor by doing as simple query for your doc (ie q=id:10seconds) (either way: i can't explain why it's not getting deleted, but it would help narrow down where the problem is) -Hoss http://www.lucidworks.com/
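For reference, the usual shape of the expiration chain (a sketch reusing the field names from this thread; the chain name and the delete period are assumptions) is:

<updateRequestProcessorChain name="expire-ttl" default="true">
  <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
    <int name="autoDeletePeriodSeconds">30</int>
    <str name="ttlFieldName">time_to_live_s</str>
    <str name="expirationFieldName">expire_at_dt</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

One common gotcha: the chain has to actually be applied to the update request (either as the default chain or selected with update.chain), otherwise the TTL-to-expire_at conversion never runs and only explicitly supplied expire_at_dt values take effect.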
RE: Do Multiprocessing on Solr to search?
Nitin Solanki [nitinml...@gmail.com] wrote: I want to search lakhs of queries/terms concurrently. Is there any technique to do multiprocessing on Solr? Each concurrent search in Solr runs in its own thread, so the answer is yes, it does so out of the box with concurrent searches. Is Solr is capable to handle this situation? Yes and no. There is a limit to the number of concurrent connections and as far as I remember, it is 10.000 out of the box. If you are using SolrCloud, deadlocks might happen if you exceed the limit. Anyway, I would not recommend running 10.000 concurrent searches as it leads to congestion. You will probably get a higher throughput by queueing your requests and processing them with 100 concurrent searches or so. Do test. - Toke Eskildsen
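As an illustration of that queue-and-bound approach (a sketch only, not from the thread: the Solr URL, core name, field name and terms are made up; it uses the SolrJ 4.x API that ships with Solr):

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class BoundedQueryRunner {
    public static void main(String[] args) throws InterruptedException {
        // One shared client; HttpSolrServer is thread-safe in SolrJ 4.x.
        final HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        // Cap the number of in-flight searches at ~100 instead of firing everything at once.
        ExecutorService pool = Executors.newFixedThreadPool(100);
        List<String> terms = Arrays.asList("hello", "world"); // in practice: the lakhs of terms
        for (final String term : terms) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        SolrQuery q = new SolrQuery("text:" + term);
                        q.setRows(10);
                        solr.query(q); // inspect the QueryResponse here as needed
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        solr.shutdown();
    }
}

The same idea applies to a Python client: keep a fixed-size worker pool fed from a queue rather than launching one process per query.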
Do Multiprocessing on Solr to search?
Hello, I want to search lakhs (hundreds of thousands) of queries/terms concurrently. Is there any technique for doing multiprocessing with Solr? Is Solr capable of handling this situation? I wrote code in Python that does multiprocessing and fires lakhs of queries at Solr simultaneously/in parallel, but it seems that Solr isn't able to handle all the queries at once. Any help, please?
Re: Can't index all docs in a local folder with DIH in Solr 5.0.0
What about recursive=true? Do you have subdirectories that could make a difference. Your SimplePostTool would not look at subdirectories (great comparison, BTW). However, you do have lots of mapping options as well with /update/extract handler, look at the example and documentations. There is lots of mapping there. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 25 February 2015 at 12:24, Gary Taylor g...@inovem.com wrote: Alex, Thanks for the suggestions. It always just indexes 1 doc, regardless of the first epub file it sees. Debug / verbose don't show anything obvious to me. I can include the output here if you think it would help. I tried using the SimplePostTool first ( *java -Dtype=application/epub+zip -Durl=http://localhost:8983/solr/hn1/update/extract -jar post.jar \Users\gt\Documents\epub\*.epub) to index the docs and check the Tika parsing and that works OK so I don't think it's the e*pubs. I was trying to use DIH so that I could more easily specify the schema fields and store content in the index in preparation for trying out the search highlighting. Couldn't work out how to do that with post.jar Thanks, Gary On 25/02/2015 17:09, Alexandre Rafalovitch wrote: Try removing that first epub from the directory and rerunning. If you now index 0 documents, then there is something unexpected about them and DIH skips. If it indexes 1 document again but a different one, then it is definitely something about the repeat logic. Also, try running with debug and verbose modes and see if something specific shows up. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 25 February 2015 at 11:14, Gary Taylor g...@inovem.com wrote: I can't get the FileListEntityProcessor and TikeEntityProcessor to correctly add a Solr document for each epub file in my local directory. I've just downloaded Solr 5.0.0, on a Windows 7 PC. I ran solr start and then solr create -c hn2 to create a new core. I want to index a load of epub files that I've got in a directory. 
So I created a data-import.xml (in solr\hn2\conf): dataConfig dataSource type=BinFileDataSource name=bin / document entity name=files dataSource=null rootEntity=false processor=FileListEntityProcessor baseDir=c:/Users/gt/Documents/epub fileName=.*epub onError=skip recursive=true field column=fileAbsolutePath name=id / field column=fileSize name=size / field column=fileLastModified name=lastModified / entity name=documentImport processor=TikaEntityProcessor url=${files.fileAbsolutePath} format=text dataSource=bin onError=skip field column=file name=fileName/ field column=Author name=author meta=true/ field column=title name=title meta=true/ field column=text name=content/ /entity /entity /document /dataConfig In my solrconfig.xml, I added a requestHandler entry to reference my data-import.xml: requestHandler name=/dataimport class=org.apache.solr.handler.dataimport.DataImportHandler lst name=defaults str name=configdata-import.xml/str /lst /requestHandler I renamed managed-schema to schema.xml, and ensured the following doc fields were setup: field name=id type=string indexed=true stored=true required=true multiValued=false / field name=fileName type=string indexed=true stored=true / field name=author type=string indexed=true stored=true / field name=title type=string indexed=true stored=true / field name=size type=long indexed=true stored=true / field name=lastModified type=date indexed=true stored=true / field name=content type=text_en indexed=false stored=true multiValued=false/ field name=text type=text_en indexed=true stored=false multiValued=true/ copyField source=content dest=text/ I copied all the jars from dist and contrib\* into server\solr\lib. Stopping and restarting solr then creates a new managed-schema file and renames schema.xml to schema.xml.back All good so far. Now I go to the web admin for dataimport (http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try and execute a full import. But, the results show Requests: 0, Fetched: 58, Skipped: 0, Processed:1 - ie. it only adds one document (the very first one) even though it's iterated over 58! No errors are reported in the logs. I can search on the contents of that first epub document, so it's extracting OK in Tika, but there's a problem somewhere in my config that's causing only 1 document to be indexed in Solr. Thanks for any assistance / pointers. Regards, Gary -- Gary Taylor |
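To make the mapping options mentioned for /update/extract concrete, a hedged sketch of posting one epub directly to the extract handler with explicit field mapping (the core name, id value, file path and uprefix are assumptions):

curl "http://localhost:8983/solr/hn2/update/extract?literal.id=epub-001&fmap.content=content&uprefix=ignored_&commit=true" -F "myfile=@C:/Users/gt/Documents/epub/book.epub"

Here literal.* pins a field value for the document, fmap.* renames a field extracted by Tika (so the body text lands in the stored content field and the copyField takes it on to text), and uprefix pushes any unmapped metadata under a prefix that an ignored_* dynamic field can absorb.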
Re: Connect Solr with ODBC to Excel
Some time ago I encounter https://github.com/kawasima/solr-jdbc never tried it.Anyway, it doesn't help to connect from odbc. On top of my head, is https://cwiki.apache.org/confluence/display/solr/Exporting+Result+Sets but it returns only JSON, not csv. That's I wonder why. Seems like a dead end so far. On Wed, Feb 25, 2015 at 6:15 PM, Hakim Benoudjit h.benoud...@gmail.com wrote: Thanks for your answer. For a one-off it seems like a nice way to import my data. For an ODBC connection, the only solution I found is to replicate my Solr data in Apache Hive (or Cassandra...), and then connect to that database through ODBC. 2015-02-25 15:49 GMT+01:00 Alexandre Rafalovitch arafa...@gmail.com: Which direction? You want import data from Solr into Excel? One off or repeatedly? For one off Solr - Excel, you could probably use Excel's Open from Web and load data directly from Solr using CSV output format. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 25 February 2015 at 08:52, Hakim Benoudjit h.benoud...@gmail.com wrote: Hi there, I'm looking for a library to connect Solr throught ODBC to Excel in order to do some reporting on my Solr data? Anybody knows a library for that? Thanks. -- Cordialement, Best regards, Hakim Benoudjit -- Cordialement, Best regards, Hakim Benoudjit -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: 8 Shards of Cloud with 4.10.3.
On 2/25/2015 9:03 AM, Benson Margulies wrote: It's the zkcli options on my mind. zkcli's usage shows me 'bootstrap', 'upconfig', and uploading a solr.xml. When I use upconfig, it might work, but it sure is noise: benson@ip-10-111-1-103:/data/solr+rni$ 554331 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:9983] WARN org.apache.zookeeper.server.NIOServerCnxn – caught end of stream exception EndOfStreamException: Unable to read additional data from client sessionid 0x14bc16c5e660003, likely client has closed socket at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228) at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208) at java.lang.Thread.run(Thread.java:745) The upconfig command is VERY noisy. A LOT of data is printed whether it's successful or not, and exceptions on a successful upload would actually not surprise me. An issue to reduce the zkcli output to short informational/error messages rather than the full zookeeper client logging is something I'll do soon if someone else doesn't get to it. I had never noticed the bootstrap option to zkcli before ... based on the options shown, I think it's meant to convert an entire non-cloud (and probably non-redundant) Solr installation (all cores currently present in the solr home) to SolrCloud. It's a conversion that would work, but I think it would be very ugly. There's also a bootstrap option for Solr that does this. Thanks, Shawn
Re: Can't index all docs in a local folder with DIH in Solr 5.0.0
Try removing that first epub from the directory and rerunning. If you now index 0 documents, then there is something unexpected about them and DIH skips. If it indexes 1 document again but a different one, then it is definitely something about the repeat logic. Also, try running with debug and verbose modes and see if something specific shows up. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 25 February 2015 at 11:14, Gary Taylor g...@inovem.com wrote: I can't get the FileListEntityProcessor and TikeEntityProcessor to correctly add a Solr document for each epub file in my local directory. I've just downloaded Solr 5.0.0, on a Windows 7 PC. I ran solr start and then solr create -c hn2 to create a new core. I want to index a load of epub files that I've got in a directory. So I created a data-import.xml (in solr\hn2\conf): dataConfig dataSource type=BinFileDataSource name=bin / document entity name=files dataSource=null rootEntity=false processor=FileListEntityProcessor baseDir=c:/Users/gt/Documents/epub fileName=.*epub onError=skip recursive=true field column=fileAbsolutePath name=id / field column=fileSize name=size / field column=fileLastModified name=lastModified / entity name=documentImport processor=TikaEntityProcessor url=${files.fileAbsolutePath} format=text dataSource=bin onError=skip field column=file name=fileName/ field column=Author name=author meta=true/ field column=title name=title meta=true/ field column=text name=content/ /entity /entity /document /dataConfig In my solrconfig.xml, I added a requestHandler entry to reference my data-import.xml: requestHandler name=/dataimport class=org.apache.solr.handler.dataimport.DataImportHandler lst name=defaults str name=configdata-import.xml/str /lst /requestHandler I renamed managed-schema to schema.xml, and ensured the following doc fields were setup: field name=id type=string indexed=true stored=true required=true multiValued=false / field name=fileName type=string indexed=true stored=true / field name=author type=string indexed=true stored=true / field name=title type=string indexed=true stored=true / field name=size type=long indexed=true stored=true / field name=lastModified type=date indexed=true stored=true / field name=content type=text_en indexed=false stored=true multiValued=false/ field name=text type=text_en indexed=true stored=false multiValued=true/ copyField source=content dest=text/ I copied all the jars from dist and contrib\* into server\solr\lib. Stopping and restarting solr then creates a new managed-schema file and renames schema.xml to schema.xml.back All good so far. Now I go to the web admin for dataimport (http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try and execute a full import. But, the results show Requests: 0, Fetched: 58, Skipped: 0, Processed:1 - ie. it only adds one document (the very first one) even though it's iterated over 58! No errors are reported in the logs. I can search on the contents of that first epub document, so it's extracting OK in Tika, but there's a problem somewhere in my config that's causing only 1 document to be indexed in Solr. Thanks for any assistance / pointers. Regards, Gary -- Gary Taylor | www.inovem.com | www.kahootz.com INOVEM Ltd is registered in England and Wales No 4228932 Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE kahootz.com is a trading name of INOVEM Ltd.
Re: how to debug solr performance degradation
rebecca, you probably need to dig into your queries, but if you want to force/preload the index into memory you could try doing something like cat `find /path/to/solr/index` /dev/null if you haven't already reviewed the following, you might take a look here https://wiki.apache.org/solr/SolrPerformanceProblems perhaps going back to a very vanilla/default solr configuration and building back up from that baseline to better isolate what might specific setting be impacting your environment From: Tang, Rebecca rebecca.t...@ucsf.edu Sent: Wednesday, February 25, 2015 11:44 To: solr-user@lucene.apache.org Subject: RE: how to debug solr performance degradation Sorry, I should have been more specific. I was referring to the solr admin UI page. Today we started up an AWS instance with 240 G of memory to see if we fit all of our index (183G) in the memory and have enough for the JMV, could it improve the performance. I attached the admin UI screen shot with the email. The top bar is ³Physical Memory² and we have 240.24 GB, but only 4% 9.52 GB is used. The next bar is Swap Space and it¹s at 0.00 MB. The bottom bar is JVM Memory which is at 2.67 GB and the max is 26G. My understanding is that when Solr starts up, it reserves some memory for the JVM, and then it tries to use up as much of the remaining physical memory as possible. And I used to see the physical memory at anywhere between 70% to 90+%. Is this understanding correct? And now, even with 240G of memory, our index is performing at 10 - 20 seconds for a query. Granted that our queries have fq¹s and highlighting and faceting, I think with a machine this powerful I should be able to get the queries executed under 5 seconds. This is what we send to Solr: q=(phillip%20morris) wt=json start=0 rows=50 facet=true facet.mincount=0 facet.pivot=industry,collection_facet facet.pivot=availability_facet,availabilitystatus_facet facet.field=dddate fq%3DNOT(pg%3A1%20AND%20(dt%3A%22blank%20document%22%20OR%20dt%3A%22blank% 20page%22%20OR%20dt%3A%22file%20folder%22%20OR%20dt%3A%22file%20folder%20be gin%22%20OR%20dt%3A%22file%20folder%20cover%22%20OR%20dt%3A%22file%20folder %20end%22%20OR%20dt%3A%22file%20folder%20label%22%20OR%20dt%3A%22file%20she et%22%20OR%20dt%3A%22file%20sheet%20beginning%22%20OR%20dt%3A%22tab%20page% 22%20OR%20dt%3A%22tab%20sheet%22)) facet.field=dt_facet facet.field=brd_facet facet.field=dg_facet hl=true hl.simple.pre=%3Ch1%3E hl.simple.post=%3C%2Fh1%3E hl.requireFieldMatch=false hl.preserveMulti=true hl.fl=ot,ti f.ot.hl.fragsize=300 f.ot.hl.alternateField=ot f.ot.hl.maxAlternateFieldLength=300 f.ti.hl.fragsize=300 f.ti.hl.alternateField=ti f.ti.hl.maxAlternateFieldLength=300 fq={!collapse%20field=signature} expand=true sort=score+desc,availability_facet+asc My guess is that it¹s performing so badly because it¹s only using 4% of the memory? And searches require disk access. Rebecca From: Shawn Heisey [apa...@elyograg.org] Sent: Tuesday, February 24, 2015 5:23 PM To: solr-user@lucene.apache.org Subject: Re: how to debug solr performance degradation On 2/24/2015 5:45 PM, Tang, Rebecca wrote: We gave the machine 180G mem to see if it improves performance. However, after we increased the memory, Solr started using only 5% of the physical memory. It has always used 90-something%. What could be causing solr to not grab all the physical memory (grabbing so little of the physical memory)? I would like to know what memory numbers in which program you are looking at, and why you believe those numbers are a problem. 
The JVM has a very different view of memory than the operating system. Numbers in top mean different things than numbers on the dashboard of the admin UI, or the numbers in jconsole. If you're on Windows, then replace top with task manager, process explorer, resource monitor, etc. Please provide as many details as you can about the things you are looking at. Thanks, Shawn
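For the record, a variant of that preload trick with an explicit redirect (the index path is an assumption; point it at the actual data directory):

find /path/to/solr/index -type f -exec cat {} + > /dev/null

This just reads every index file once so the OS page cache gets warmed; it changes nothing on disk and has to be repeated if the cache is evicted.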
RE: how to debug solr performance degradation
Unfortunately (or luckily, depending on view), attachments does not work with this mailing list. You'll have to upload it somewhere and provide an URL. It is quite hard _not_ to get your whole index into disk cache, so my guess is that it will get there eventually. Just to check: If you re-issue your queries, does the response time change? If not, then disk caching is not the problem. Anyway, with your new information, I would say that pivot faceting is the culprit. Does the timing tests in https://issues.apache.org/jira/browse/SOLR-6803 line up with the cardinalities of your fields? My next step would be to disable parts of the query (highlight, faceting and collapsing one at a time) to check which part is the heaviest. - Toke Eskildsen From: Tang, Rebecca [rebecca.t...@ucsf.edu] Sent: 25 February 2015 20:44 To: solr-user@lucene.apache.org Subject: RE: how to debug solr performance degradation Sorry, I should have been more specific. I was referring to the solr admin UI page. Today we started up an AWS instance with 240 G of memory to see if we fit all of our index (183G) in the memory and have enough for the JMV, could it improve the performance. I attached the admin UI screen shot with the email. The top bar is ³Physical Memory² and we have 240.24 GB, but only 4% 9.52 GB is used. The next bar is Swap Space and it¹s at 0.00 MB. The bottom bar is JVM Memory which is at 2.67 GB and the max is 26G. My understanding is that when Solr starts up, it reserves some memory for the JVM, and then it tries to use up as much of the remaining physical memory as possible. And I used to see the physical memory at anywhere between 70% to 90+%. Is this understanding correct? And now, even with 240G of memory, our index is performing at 10 - 20 seconds for a query. Granted that our queries have fq¹s and highlighting and faceting, I think with a machine this powerful I should be able to get the queries executed under 5 seconds. This is what we send to Solr: q=(phillip%20morris) wt=json start=0 rows=50 facet=true facet.mincount=0 facet.pivot=industry,collection_facet facet.pivot=availability_facet,availabilitystatus_facet facet.field=dddate fq%3DNOT(pg%3A1%20AND%20(dt%3A%22blank%20document%22%20OR%20dt%3A%22blank% 20page%22%20OR%20dt%3A%22file%20folder%22%20OR%20dt%3A%22file%20folder%20be gin%22%20OR%20dt%3A%22file%20folder%20cover%22%20OR%20dt%3A%22file%20folder %20end%22%20OR%20dt%3A%22file%20folder%20label%22%20OR%20dt%3A%22file%20she et%22%20OR%20dt%3A%22file%20sheet%20beginning%22%20OR%20dt%3A%22tab%20page% 22%20OR%20dt%3A%22tab%20sheet%22)) facet.field=dt_facet facet.field=brd_facet facet.field=dg_facet hl=true hl.simple.pre=%3Ch1%3E hl.simple.post=%3C%2Fh1%3E hl.requireFieldMatch=false hl.preserveMulti=true hl.fl=ot,ti f.ot.hl.fragsize=300 f.ot.hl.alternateField=ot f.ot.hl.maxAlternateFieldLength=300 f.ti.hl.fragsize=300 f.ti.hl.alternateField=ti f.ti.hl.maxAlternateFieldLength=300 fq={!collapse%20field=signature} expand=true sort=score+desc,availability_facet+asc My guess is that it¹s performing so badly because it¹s only using 4% of the memory? And searches require disk access. Rebecca From: Shawn Heisey [apa...@elyograg.org] Sent: Tuesday, February 24, 2015 5:23 PM To: solr-user@lucene.apache.org Subject: Re: how to debug solr performance degradation On 2/24/2015 5:45 PM, Tang, Rebecca wrote: We gave the machine 180G mem to see if it improves performance. However, after we increased the memory, Solr started using only 5% of the physical memory. It has always used 90-something%. 
What could be causing solr to not grab all the physical memory (grabbing so little of the physical memory)? I would like to know what memory numbers in which program you are looking at, and why you believe those numbers are a problem. The JVM has a very different view of memory than the operating system. Numbers in top mean different things than numbers on the dashboard of the admin UI, or the numbers in jconsole. If you're on Windows, then replace top with task manager, process explorer, resource monitor, etc. Please provide as many details as you can about the things you are looking at. Thanks, Shawn
Re: New leader/replica solution for HDFS
I am also confused on this. Is adding replicas going to increase search performance? I'm not sure I see the point of any replicas when using HDFS. Is there one? Thank you! -Joe On 2/25/2015 10:57 AM, Erick Erickson wrote: bq: And the data sync between leader/replica is always a problem Not quite sure what you mean by this. There shouldn't need to be any synching in the sense that the index gets replicated, the incoming documents should be sent to each node (and indexed to HDFS) as they come in. bq: There is duplicate index computing on Replilca side. Yes, that's the design of SolrCloud, explicitly to provide data safety. If you instead rely on the leader to index and somehow pull that indexed form to the replica, then you will lose data if the leader goes down before sending the indexed form. bq: My thought is that the leader and the replica all bind to the same data index directory. This is unsafe. They would both then try to _write_ to the same index, which can easily corrupt indexes and/or all but the first one to access the index would be locked out. All that said, the HDFS triple-redundancy compounded with the Solr leaders/replicas redundancy means a bunch of extra storage. You can turn the HDFS replication down to 1, but that has other implications. Best, Erick On Tue, Feb 24, 2015 at 11:12 PM, longsan longsan...@sina.com wrote: We used HDFS as our Solr index storage and we really have a heavy update load. We had met much problems with current leader/replica solution. There is duplicate index computing on Replilca side. And the data sync between leader/replica is always a problem. As HDFS already provides data replication on data layer, could Solr provide just service layer replication? My thought is that the leader and the replica all bind to the same data index directory. And the leader will build up index for new request, the replica will just keep update the index version with the leader(such as a soft commit periodically? ). If the leader lost then the replica will take the duty immediately. Thanks for any suggestion of this idea. -- View this message in context: http://lucene.472066.n3.nabble.com/New-leader-replica-solution-for-HDFS-tp4188735.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: how to debug solr performance degradation
Sorry, I should have been more specific. I was referring to the solr admin UI page. Today we started up an AWS instance with 240 G of memory to see if we fit all of our index (183G) in the memory and have enough for the JMV, could it improve the performance. I attached the admin UI screen shot with the email. The top bar is ³Physical Memory² and we have 240.24 GB, but only 4% 9.52 GB is used. The next bar is Swap Space and it¹s at 0.00 MB. The bottom bar is JVM Memory which is at 2.67 GB and the max is 26G. My understanding is that when Solr starts up, it reserves some memory for the JVM, and then it tries to use up as much of the remaining physical memory as possible. And I used to see the physical memory at anywhere between 70% to 90+%. Is this understanding correct? And now, even with 240G of memory, our index is performing at 10 - 20 seconds for a query. Granted that our queries have fq¹s and highlighting and faceting, I think with a machine this powerful I should be able to get the queries executed under 5 seconds. This is what we send to Solr: q=(phillip%20morris) wt=json start=0 rows=50 facet=true facet.mincount=0 facet.pivot=industry,collection_facet facet.pivot=availability_facet,availabilitystatus_facet facet.field=dddate fq%3DNOT(pg%3A1%20AND%20(dt%3A%22blank%20document%22%20OR%20dt%3A%22blank% 20page%22%20OR%20dt%3A%22file%20folder%22%20OR%20dt%3A%22file%20folder%20be gin%22%20OR%20dt%3A%22file%20folder%20cover%22%20OR%20dt%3A%22file%20folder %20end%22%20OR%20dt%3A%22file%20folder%20label%22%20OR%20dt%3A%22file%20she et%22%20OR%20dt%3A%22file%20sheet%20beginning%22%20OR%20dt%3A%22tab%20page% 22%20OR%20dt%3A%22tab%20sheet%22)) facet.field=dt_facet facet.field=brd_facet facet.field=dg_facet hl=true hl.simple.pre=%3Ch1%3E hl.simple.post=%3C%2Fh1%3E hl.requireFieldMatch=false hl.preserveMulti=true hl.fl=ot,ti f.ot.hl.fragsize=300 f.ot.hl.alternateField=ot f.ot.hl.maxAlternateFieldLength=300 f.ti.hl.fragsize=300 f.ti.hl.alternateField=ti f.ti.hl.maxAlternateFieldLength=300 fq={!collapse%20field=signature} expand=true sort=score+desc,availability_facet+asc My guess is that it¹s performing so badly because it¹s only using 4% of the memory? And searches require disk access. Rebecca From: Shawn Heisey [apa...@elyograg.org] Sent: Tuesday, February 24, 2015 5:23 PM To: solr-user@lucene.apache.org Subject: Re: how to debug solr performance degradation On 2/24/2015 5:45 PM, Tang, Rebecca wrote: We gave the machine 180G mem to see if it improves performance. However, after we increased the memory, Solr started using only 5% of the physical memory. It has always used 90-something%. What could be causing solr to not grab all the physical memory (grabbing so little of the physical memory)? I would like to know what memory numbers in which program you are looking at, and why you believe those numbers are a problem. The JVM has a very different view of memory than the operating system. Numbers in top mean different things than numbers on the dashboard of the admin UI, or the numbers in jconsole. If you're on Windows, then replace top with task manager, process explorer, resource monitor, etc. Please provide as many details as you can about the things you are looking at. Thanks, Shawn
Re: Add fields without manually editing Schema.xml.
Thanks a lot Alex... I thought about dynamic fields and will also explore the suggested options... On Wed, Feb 25, 2015 at 1:40 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: Several ways. Reading through tutorials should help to get the details. But in short: 1) Map them to dynamic fields using prefixes and/or suffixes. 2) Use dynamic schema which will guess the types and creates the fields based on first use Something like SIREn might also be of interest: http://siren.solutions/siren/overview/ Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 25 February 2015 at 13:26, Vishal Swaroop vishal@gmail.com wrote: Hi, Just wondering if there is a way to handle this use-case in SOLR without manually editing Schema.xml. Scenario : We have xml data with some elements/ attributes which we plan to index. As we move forward there can be addition of xml elements. Is there a way to handle this with out manually adding fields /changing in schema.xml ? Thanks V
Re: Add fields without manually editing Schema.xml.
Several ways. Reading through tutorials should help to get the details. But in short: 1) Map them to dynamic fields using prefixes and/or suffixes. 2) Use dynamic schema which will guess the types and creates the fields based on first use Something like SIREn might also be of interest: http://siren.solutions/siren/overview/ Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 25 February 2015 at 13:26, Vishal Swaroop vishal@gmail.com wrote: Hi, Just wondering if there is a way to handle this use-case in SOLR without manually editing Schema.xml. Scenario : We have xml data with some elements/ attributes which we plan to index. As we move forward there can be addition of xml elements. Is there a way to handle this with out manually adding fields /changing in schema.xml ? Thanks V
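A hedged illustration of option 1), using suffix-mapped dynamic fields (the suffixes and types are illustrative, loosely following the example schema's conventions):

<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
<dynamicField name="*_dt" type="date" indexed="true" stored="true"/>
<dynamicField name="*_txt" type="text_general" indexed="true" stored="true"/>

New XML attributes can then be indexed as, say, color_s or pages_i without touching schema.xml again.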
Re: Facet By Distance
Hi, Thank you for your reply. I added a filter query to the query in two ways as follows: fq={!geofilt}sfield=start_stationpt=40.71754834,-74.01322069facet.query={!frange l=0.0 u=0.1}geodist()facet.query={!frange l=0.10001 u=0.2}geodist()d=0.2 -- returns 0 docs q=*:*fq={!geofilt}sfield=start_stationpt=40.71754834,-74.01322069d=0.2 -- returns 1484 docs Not sure why the first query with returns 0 documents On Wed, Feb 25, 2015 at 8:46 PM, david.w.smi...@gmail.com david.w.smi...@gmail.com wrote: Hi, This will return all the documents in the index because you did nothing to filter them out. Your query is *:* (everything) and there are no filter queries. ~ David Smiley Freelance Apache Lucene/Solr Search Consultant/Developer http://www.linkedin.com/in/davidwsmiley On Wed, Feb 25, 2015 at 12:27 PM, Ahmed Adel ahmed.a...@badrit.com wrote: Hello, I'm trying to get Facet By Distance working on an index with LatLonType fields. The schema is as follows: fields ... field name=trip_duration type=int indexed=true stored=true/ field name=start_station type=location indexed=true stored=true / field name=end_station type=location indexed=true stored=true / field name=birth_year type=int stored=true/ field name=gender type=int stored=true / ... /fields And the query I'm running is: q=*:*sfield=start_stationpt=40.71754834,-74.01322069facet.query={!frange l=0.0 u=0.1}geodist()facet.query={!frange l=0.10001 u=0.2}geodist() But it returns all the documents in the index so it seems something is missing. I'm using Solr 4.9.0. -- A. Adel A. Adel
Re: Stop solr query
No. You can, but only search (collecting results) and not a query expansion. As I said, debugQuery=true, and the stacktrace or sampling can help to understand the reason. On Wed, Feb 25, 2015 at 5:45 PM, Moshe Recanati mos...@kmslh.com wrote: HI Mikhail, We're using 4.7.1. This means I can't stop the search. I think this is mandatory feature. Regards, Moshe Recanati SVP Engineering Office + 972-73-2617564 Mobile + 972-52-6194481 Skype: recanati More at: www.kmslh.com | LinkedIn | FB -Original Message- From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com] Sent: Wednesday, February 25, 2015 3:42 PM To: solr-user Subject: Re: Stop solr query Moshe, if you take a thread dump while a particular query stuck (via jstack of in SolrAdmin tab), it may explain where exactly it's stalled, just check the longest stack trace. FWIW, in 4.x timeallowed is checked only while documents are collected, and in 5 it's also checked during query expansion (see http://lucidworks.com/blog/solr-5-0/ now cut-offs requests https://issues.apache.org/jira/browse/SOLR-5986 during the query-expansion stage as well ). however I'm not sure it has place (long query expansion) with hon-synonyms. On Wed, Feb 25, 2015 at 3:21 PM, Moshe Recanati mos...@kmslh.com wrote: Hi Shawn, We checked this option and it didn't solve our problem. We're using https://github.com/healthonnet/hon-lucene-synonyms for query based synonyms. While running query with high number of words that have high number of synonyms the query got stuck and solr memory is exhausted. We tried to use this parameter suggested by you however it didn't stop the query and solve the issue. Please let me know if there is other option to tackle it. Today it might be high number of words that cause the issue and tomorrow it might be other something wrong. We can't rely only on user input check. Thank you in advance. Regards, Moshe Recanati SVP Engineering Office + 972-73-2617564 Mobile + 972-52-6194481 Skype: recanati More at: www.kmslh.com | LinkedIn | FB -Original Message- From: Shawn Heisey [mailto:apa...@elyograg.org] Sent: Monday, February 23, 2015 5:49 PM To: solr-user@lucene.apache.org Subject: Re: Stop solr query On 2/23/2015 7:23 AM, Moshe Recanati wrote: Recently there were some scenarios in which queries that user sent to solr got stuck and increased our solr heap. Is there any option to kill or timeout query that wasn't returned from solr by external command? The best thing you can do is examine all user input and stop such queries before they execute, especially if they are the kind of query that will cause your heap to grow out of control. The timeAllowed parameter can abort a query that takes too long in certain phases of the query. In recent months, Solr has been modified so that timeAllowed will take effect during more query phases. It is not a perfect solution, but it can be better than nothing. http://wiki.apache.org/solr/CommonQueryParameters#timeAllowed Be aware that sometimes legitimate queries will be slow, and using timeAllowed may cause those queries to fail. Thanks, Shawn -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
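For reference, the timeAllowed safeguard discussed above is just another request parameter, given in milliseconds, e.g. (query and value are illustrative):

q=some query&timeAllowed=2000

with the caveat already mentioned: in 4.x it is only checked while results are being collected, so a request that stalls during query expansion (such as a very large synonym expansion) is not cut off by it.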
Re: Facet By Distance
Hi, This will “return all the documents in the index” because you did nothing to filter them out. Your query is *:* (everything) and there are no filter queries. ~ David Smiley Freelance Apache Lucene/Solr Search Consultant/Developer http://www.linkedin.com/in/davidwsmiley On Wed, Feb 25, 2015 at 12:27 PM, Ahmed Adel ahmed.a...@badrit.com wrote: Hello, I'm trying to get Facet By Distance working on an index with LatLonType fields. The schema is as follows: fields ... field name=trip_duration type=int indexed=true stored=true/ field name=start_station type=location indexed=true stored=true / field name=end_station type=location indexed=true stored=true / field name=birth_year type=int stored=true/ field name=gender type=int stored=true / ... /fields And the query I'm running is: q=*:*sfield=start_stationpt=40.71754834,-74.01322069facet.query={!frange l=0.0 u=0.1}geodist()facet.query={!frange l=0.10001 u=0.2}geodist() But it returns all the documents in the index so it seems something is missing. I'm using Solr 4.9.0. -- A. Adel
Add fields without manually editing Schema.xml.
Hi, Just wondering if there is a way to handle this use-case in SOLR without manually editing Schema.xml. Scenario : We have xml data with some elements/ attributes which we plan to index. As we move forward there can be addition of xml elements. Is there a way to handle this with out manually adding fields /changing in schema.xml ? Thanks V
Re: Connect Solr with ODBC to Excel
Thanks for the two links. The first one could be helpful if it works. Regarding the second one, I think it's quite similar to using /select to return json format. 2015-02-25 19:10 GMT+01:00 Mikhail Khludnev mkhlud...@griddynamics.com: Some time ago I encounter https://github.com/kawasima/solr-jdbc never tried it.Anyway, it doesn't help to connect from odbc. On top of my head, is https://cwiki.apache.org/confluence/display/solr/Exporting+Result+Sets but it returns only JSON, not csv. That's I wonder why. Seems like a dead end so far. On Wed, Feb 25, 2015 at 6:15 PM, Hakim Benoudjit h.benoud...@gmail.com wrote: Thanks for your answer. For a one-off it seems like a nice way to import my data. For an ODBC connection, the only solution I found is to replicate my Solr data in Apache Hive (or Cassandra...), and then connect to that database through ODBC. 2015-02-25 15:49 GMT+01:00 Alexandre Rafalovitch arafa...@gmail.com: Which direction? You want import data from Solr into Excel? One off or repeatedly? For one off Solr - Excel, you could probably use Excel's Open from Web and load data directly from Solr using CSV output format. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 25 February 2015 at 08:52, Hakim Benoudjit h.benoud...@gmail.com wrote: Hi there, I'm looking for a library to connect Solr throught ODBC to Excel in order to do some reporting on my Solr data? Anybody knows a library for that? Thanks. -- Cordialement, Best regards, Hakim Benoudjit -- Cordialement, Best regards, Hakim Benoudjit -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com -- Cordialement, Best regards, Hakim Benoudjit
problems retrieving term vectors using RealTimeGetHandler
I’m working with term vectors via solr. Is there a way to configure the RealTimeGetHandler to return tv info? Here is my environment info: Scotts-MacBook-Air-2:solr_jetty scottccote$ java -version java version 1.8.0_31 Java(TM) SE Runtime Environment (build 1.8.0_31-b13) Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode) Scotts-MacBook-Air-2:solr_jetty scottccote$ uname -a Darwin Scotts-MacBook-Air-2.local 14.1.0 Darwin Kernel Version 14.1.0: Mon Dec 22 23:10:38 PST 2014; root:xnu-2782.10.72~2/RELEASE_X86_64 x86_64 solr-4.10.3 Here is my attempted configuration searchComponent name=tvComponent class=org.apache.solr.handler.component.TermVectorComponent/ requestHandler name=/get class=solr.RealTimeGetHandler lst name=defaults str name=omitHeadertrue/str bool name=tvtrue/bool /lst arr name=last-components strtvComponent/str /arr /requestHandler Here is my request on the Solr Admin panel qt is set to … /get Raw Query Parameters are set to … id=7tv=truetv.all=true http://localhost:8983/solr/question/get?wt=jsonindent=trueid=7tv=truetv.all=true http://localhost:8983/solr/question/get?wt=jsonindent=trueid=7tv=truetv.all=true which generates the following response (with error) { doc: { id: 7, classId: class1, studentId: fdsfsd, originalText: sing for raj, filteredText: [ sing for raj ], _version_: 1493662750219436000 }, termVectors: [ uniqueKeyFieldName, id ], error: { trace: java.lang.NullPointerException\n\tat org.apache.solr.handler.component.TermVectorComponent.process(TermVectorComponent.java:251)\n\tat org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:218)\n\tat org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)\n\tat org.apache.solr.core.SolrCore.execute(SolrCore.java:1976)\n\tat org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)\n\tat org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)\n\tat org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)\n\tat org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)\n\tat org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)\n\tat org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)\n\tat org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)\n\tat org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)\n\tat org.eclipse.jetty.server.Server.handle(Server.java:368)\n\tat org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)\n\tat org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)\n\tat 
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)\n\tat org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)\n\tat org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)\n\tat org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)\n\tat org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)\n\tat org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)\n\tat java.lang.Thread.run(Thread.java:745)\n, code: 500 } } and stack trace on the server 8702 [qtp24433162-15] INFO org.apache.solr.servlet.SolrDispatchFilter – [admin] webapp=null path=/admin/info/system params={wt=json_=1424895828590} status=0 QTime=34 1645307 [qtp24433162-15] ERROR org.apache.solr.core.SolrCore – java.lang.NullPointerException at org.apache.solr.handler.component.TermVectorComponent.process(TermVectorComponent.java:251) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:218) at
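Restoring the markup that the mail formatting strips, the attempted configuration above reads roughly as follows, and for comparison the stock Solr 4.10 example config exposes term vectors through a SearchHandler (/tvrh) rather than the RealTimeGetHandler, which is the documented way to use TermVectorComponent (the df value is only a placeholder):

    <searchComponent name="tvComponent"
                     class="org.apache.solr.handler.component.TermVectorComponent"/>

    <!-- attempted: term vectors wired into /get -->
    <requestHandler name="/get" class="solr.RealTimeGetHandler">
      <lst name="defaults">
        <str name="omitHeader">true</str>
        <bool name="tv">true</bool>
      </lst>
      <arr name="last-components">
        <str>tvComponent</str>
      </arr>
    </requestHandler>

    <!-- stock example: term vectors via a SearchHandler -->
    <requestHandler name="/tvrh" class="solr.SearchHandler" startup="lazy">
      <lst name="defaults">
        <str name="df">text</str>
        <bool name="tv">true</bool>
      </lst>
      <arr name="last-components">
        <str>tvComponent</str>
      </arr>
    </requestHandler>

Requesting term vectors through the SearchHandler-based handler may also avoid the NullPointerException shown above, since that is the path the component is designed for.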
Re: Facet By Distance
If ‘q’ is absent, then you always match nothing (there may be exceptions?); so it’s sort of required, in effect. I wish it defaulted to *:*. ~ David Smiley Freelance Apache Lucene/Solr Search Consultant/Developer http://www.linkedin.com/in/davidwsmiley On Wed, Feb 25, 2015 at 2:28 PM, Ahmed Adel ahmed.a...@badrit.com wrote: Hi, Thank you for your reply. I added a filter query to the query in two ways as follows: fq={!geofilt}sfield=start_stationpt=40.71754834,-74.01322069facet.query={!frange l=0.0 u=0.1}geodist()facet.query={!frange l=0.10001 u=0.2}geodist()d=0.2 -- returns 0 docs q=*:*fq={!geofilt}sfield=start_stationpt=40.71754834,-74.01322069d=0.2 -- returns 1484 docs Not sure why the first query with returns 0 documents On Wed, Feb 25, 2015 at 8:46 PM, david.w.smi...@gmail.com david.w.smi...@gmail.com wrote: Hi, This will return all the documents in the index because you did nothing to filter them out. Your query is *:* (everything) and there are no filter queries. ~ David Smiley Freelance Apache Lucene/Solr Search Consultant/Developer http://www.linkedin.com/in/davidwsmiley On Wed, Feb 25, 2015 at 12:27 PM, Ahmed Adel ahmed.a...@badrit.com wrote: Hello, I'm trying to get Facet By Distance working on an index with LatLonType fields. The schema is as follows: fields ... field name=trip_duration type=int indexed=true stored=true/ field name=start_station type=location indexed=true stored=true / field name=end_station type=location indexed=true stored=true / field name=birth_year type=int stored=true/ field name=gender type=int stored=true / ... /fields And the query I'm running is: q=*:*sfield=start_stationpt=40.71754834,-74.01322069facet.query={!frange l=0.0 u=0.1}geodist()facet.query={!frange l=0.10001 u=0.2}geodist() But it returns all the documents in the index so it seems something is missing. I'm using Solr 4.9.0. -- A. Adel A. Adel
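The '&' separators in the query strings above are stripped by the mail formatting, which makes them hard to read. Re-separated, one form of the query that should return both documents and the distance-range facet counts (same sfield, pt and d values as in the thread, with q=*:* added as David suggests and facet=true added, which facet.query requires) is roughly:

    q=*:*
    &fq={!geofilt}
    &sfield=start_station
    &pt=40.71754834,-74.01322069
    &d=0.2
    &facet=true
    &facet.query={!frange l=0.0 u=0.1}geodist()
    &facet.query={!frange l=0.10001 u=0.2}geodist()

The 0-document result of the first query quoted above is consistent with 'q' being absent: with nothing matched, the frange facet queries have nothing to count.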
Re: Add fields without manually editing Schema.xml.
Solr also now has a schema API to dynamically edit the schema without the need to manually edit the schema file: https://cwiki.apache.org/confluence/display/solr/Schema+API#SchemaAPI-AddaDynamicFieldRule -- Jack Krupansky On Wed, Feb 25, 2015 at 3:15 PM, Vishal Swaroop vishal@gmail.com wrote: Thanks a lot Alex... I thought about dynamic fields and will also explore the suggested options... On Wed, Feb 25, 2015 at 1:40 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: Several ways. Reading through tutorials should help to get the details. But in short: 1) Map them to dynamic fields using prefixes and/or suffixes. 2) Use dynamic schema which will guess the types and creates the fields based on first use Something like SIREn might also be of interest: http://siren.solutions/siren/overview/ Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 25 February 2015 at 13:26, Vishal Swaroop vishal@gmail.com wrote: Hi, Just wondering if there is a way to handle this use-case in SOLR without manually editing Schema.xml. Scenario : We have xml data with some elements/ attributes which we plan to index. As we move forward there can be addition of xml elements. Is there a way to handle this with out manually adding fields /changing in schema.xml ? Thanks V
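For option 1 in Alex's list (mapping unknown XML elements onto dynamic fields by suffix), a minimal schema.xml sketch — the name patterns and types here are only illustrative — is:

    <dynamicField name="*_txt" type="text_general" indexed="true" stored="true"/>
    <dynamicField name="*_i"   type="int"          indexed="true" stored="true"/>

New elements can then be indexed as, for example, description_txt or year_i without further schema edits; the Schema API link above covers adding such rules at runtime instead of editing schema.xml by hand.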
Re: Connect Solr with ODBC to Excel
On Wed, Feb 25, 2015 at 10:31 PM, Hakim Benoudjit h.benoud...@gmail.com wrote: Thanks for the two links. The first one could be helpful if it works. Regarding the second one, I think it's quite similar to using /select to return json format. not really. /export yields much more data faster. Also, if you are interested in relatively short result set, you can /selectwt=csv no facets in this case, sadly. Just fyi. 2015-02-25 19:10 GMT+01:00 Mikhail Khludnev mkhlud...@griddynamics.com: Some time ago I encounter https://github.com/kawasima/solr-jdbc never tried it.Anyway, it doesn't help to connect from odbc. On top of my head, is https://cwiki.apache.org/confluence/display/solr/Exporting+Result+Sets but it returns only JSON, not csv. That's I wonder why. Seems like a dead end so far. On Wed, Feb 25, 2015 at 6:15 PM, Hakim Benoudjit h.benoud...@gmail.com wrote: Thanks for your answer. For a one-off it seems like a nice way to import my data. For an ODBC connection, the only solution I found is to replicate my Solr data in Apache Hive (or Cassandra...), and then connect to that database through ODBC. 2015-02-25 15:49 GMT+01:00 Alexandre Rafalovitch arafa...@gmail.com: Which direction? You want import data from Solr into Excel? One off or repeatedly? For one off Solr - Excel, you could probably use Excel's Open from Web and load data directly from Solr using CSV output format. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 25 February 2015 at 08:52, Hakim Benoudjit h.benoud...@gmail.com wrote: Hi there, I'm looking for a library to connect Solr throught ODBC to Excel in order to do some reporting on my Solr data? Anybody knows a library for that? Thanks. -- Cordialement, Best regards, Hakim Benoudjit -- Cordialement, Best regards, Hakim Benoudjit -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com -- Cordialement, Best regards, Hakim Benoudjit -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
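For reference, the two options Mikhail contrasts look roughly like this (collection name and field list are placeholders; /export in 4.10 requires a sort parameter and docValues on the exported fields, and as noted it returns only JSON):

    http://localhost:8983/solr/mycollection/select?q=*:*&wt=csv&rows=100000&fl=id,name,price
    http://localhost:8983/solr/mycollection/export?q=*:*&sort=id+asc&fl=id,name,price

Excel's open-from-web import, as Alexandre describes, can consume the first URL directly for a one-off load.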
RE: Collations are not working fine.
Hi Rajesh, That was very helpful. Based on your experience, I dug deeper into it and figured out that it does attempt to return collations for single term queries in my configuration as well. However, in the test cases I have been using, the suggested correction never gets any hits. Again, this is based on our use cases that always have at least one filter query present. As soon as I dropped the filter query, sure enough, collations were returned for the single term. But this still doesn't solve my original problem: The original term is never included in the collation results (or validated with a query like the suggested corrections). Thus, if it is a valid term, we don't want to throw it away. It would be great to have the collator validate it as a term (perhaps conditionally, based on the exactMatchFirst component dictionary parameter). But, at this point, I'm happy to just consult the origFreq value in the extended results. Thanks, Charlie -Original Message- From: Rajesh Hazari [mailto:rajeshhaz...@gmail.com] Sent: Monday, February 23, 2015 11:14 AM To: solr-user@lucene.apache.org Subject: Re: Collations are not working fine. Hi, we have used spellcheck component the below configs to get a best collation (exact collation) when a query has either single term or multiple terms. As charles, mentioned above we do have a check for getOriginalFrequency() for each term in our service before we send spellcheck response to client, this may not be the case for you, hope this helps request-handler name=/select class=solr.SearchHandler !-- default values for query parameters can be specified, these will be overridden by parameters in the request -- lst name=defaults str name=echoParamsexplicit/str int name=rows100/int str name=dftextSpell/str str name=spellchecktrue/str str name=spellcheck.dictionarydefault/str str name=spellcheck.dictionarywordbreak/str int name=spellcheck.count5/int * str name=spellcheck.alternativeTermCount15/str * * str name=spellcheck.collatetrue/str* * str name=spellcheck.onlyMorePopularfalse/str* * str name=spellcheck.extendedResultstrue/str* * str name =spellcheck.maxCollations100/str* * str name=spellcheck.collateParam.mm http://spellcheck.collateParam.mm100%/str* * str name=spellcheck.collateParam.q.opAND/str* * str name=spellcheck.maxCollationTries1000/str* str name=q.opOR/str . . .. /lst /request-handler . . . searchComponent name=spellcheck class=solr.SpellCheckComponent lst name=spellchecker str name=namewordbreak/str str name=classnamesolr.WordBreakSolrSpellChecker/str str name=fieldtextSpell/str str name=combineWordstrue/str str name=breakWordsfalse/str int name=maxChanges5/int /lst lst name=spellchecker str name=namedefault/str str name=fieldtextSpell/str str name=classnamesolr.IndexBasedSpellChecker/str !-- str name=classnamesolr.DirectSolrSpellChecker/str -- str name=spellcheckIndexDir./spellchecker/str !-- str name=distanceMeasureorg.apache.lucene.search.spell.JaroWinklerDistance/str-- str name=accuracy0.75/str float name=thresholdTokenFrequency0.01/float str name=buildOnCommittrue/str str name=spellcheck.maxResultsForSuggest5/str /lst /searchComponent *Rajesh**.* On Fri, Feb 20, 2015 at 8:42 AM, Nitin Solanki nitinml...@gmail.com wrote: How to get only the best collations whose hits are more and need to sort them? On Wed, Feb 18, 2015 at 3:53 AM, Reitzel, Charles charles.reit...@tiaa-cref.org wrote: Hi Nitin, I was trying many different options for a couple different queries. In fact, I have collations working ok now with the Suggester and WFSTLookup. 
The problem may have been due to a different dictionary and/or lookup implementation and the specific options I was sending. In general, we're using spellcheck for search suggestions. The Suggester component (vs. Suggester spellcheck implementation), doesn't handle all of our cases. But we can get things working using the spellcheck interface. What gives us particular troubles are the cases where a term may be valid by itself, but also be the start of longer words. The specific terms are acronyms specific to our business. But I'll attempt to show generic examples. E.g. a partial term like fo can expand to fox, fog, etc. and a full term like brown can also expand to something like brownstone. And, yes, the collation brownstone fox is nonsense. But assume, for the sake of argument, it appears in our documents somewhere. For multiple term query with a spelling error (or partially typed term): brown fo We get collations in order of hits, descending like ... brown fox, brown fog, brownstone fox. So far, so good. For a single term query, brown, we get a single suggestion, brownstone and no collations. So, we don't know to keep the term brown! At this point, we need spellcheck.extendedResults=true and
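Restoring the stripped markup, the collation-related defaults from Rajesh's handler above are roughly the following (values as quoted in the thread; whether they suit a given index is a separate question):

    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">default</str>
    <str name="spellcheck.dictionary">wordbreak</str>
    <int name="spellcheck.count">5</int>
    <str name="spellcheck.alternativeTermCount">15</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.onlyMorePopular">false</str>
    <str name="spellcheck.extendedResults">true</str>
    <str name="spellcheck.maxCollations">100</str>
    <str name="spellcheck.collateParam.mm">100%</str>
    <str name="spellcheck.collateParam.q.op">AND</str>
    <str name="spellcheck.maxCollationTries">1000</str>

spellcheck.extendedResults=true is what makes the origFreq value Charles mentions available in the response, so a client can check whether the original term itself occurs in the index before discarding it.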
Re: how to debug solr performance degradation
Before diving in too deeply, try attaching debug=timing to the query. Near the bottom of the response there'll be a list of the time taken by each _component_. So there'll be separate entries for query, highlighting, etc. This may not show any surprises, you might be spending all your time scoring. But it's worth doing as a check and might save you from going down some dead-ends. I mean if your query winds up spending 80% of its time in the highlighter you know where to start looking.. Best, Erick On Wed, Feb 25, 2015 at 12:01 PM, Boogie Shafer boogie.sha...@proquest.com wrote: rebecca, you probably need to dig into your queries, but if you want to force/preload the index into memory you could try doing something like cat `find /path/to/solr/index` /dev/null if you haven't already reviewed the following, you might take a look here https://wiki.apache.org/solr/SolrPerformanceProblems perhaps going back to a very vanilla/default solr configuration and building back up from that baseline to better isolate what might specific setting be impacting your environment From: Tang, Rebecca rebecca.t...@ucsf.edu Sent: Wednesday, February 25, 2015 11:44 To: solr-user@lucene.apache.org Subject: RE: how to debug solr performance degradation Sorry, I should have been more specific. I was referring to the solr admin UI page. Today we started up an AWS instance with 240 G of memory to see if we fit all of our index (183G) in the memory and have enough for the JMV, could it improve the performance. I attached the admin UI screen shot with the email. The top bar is ³Physical Memory² and we have 240.24 GB, but only 4% 9.52 GB is used. The next bar is Swap Space and it¹s at 0.00 MB. The bottom bar is JVM Memory which is at 2.67 GB and the max is 26G. My understanding is that when Solr starts up, it reserves some memory for the JVM, and then it tries to use up as much of the remaining physical memory as possible. And I used to see the physical memory at anywhere between 70% to 90+%. Is this understanding correct? And now, even with 240G of memory, our index is performing at 10 - 20 seconds for a query. Granted that our queries have fq¹s and highlighting and faceting, I think with a machine this powerful I should be able to get the queries executed under 5 seconds. This is what we send to Solr: q=(phillip%20morris) wt=json start=0 rows=50 facet=true facet.mincount=0 facet.pivot=industry,collection_facet facet.pivot=availability_facet,availabilitystatus_facet facet.field=dddate fq%3DNOT(pg%3A1%20AND%20(dt%3A%22blank%20document%22%20OR%20dt%3A%22blank% 20page%22%20OR%20dt%3A%22file%20folder%22%20OR%20dt%3A%22file%20folder%20be gin%22%20OR%20dt%3A%22file%20folder%20cover%22%20OR%20dt%3A%22file%20folder %20end%22%20OR%20dt%3A%22file%20folder%20label%22%20OR%20dt%3A%22file%20she et%22%20OR%20dt%3A%22file%20sheet%20beginning%22%20OR%20dt%3A%22tab%20page% 22%20OR%20dt%3A%22tab%20sheet%22)) facet.field=dt_facet facet.field=brd_facet facet.field=dg_facet hl=true hl.simple.pre=%3Ch1%3E hl.simple.post=%3C%2Fh1%3E hl.requireFieldMatch=false hl.preserveMulti=true hl.fl=ot,ti f.ot.hl.fragsize=300 f.ot.hl.alternateField=ot f.ot.hl.maxAlternateFieldLength=300 f.ti.hl.fragsize=300 f.ti.hl.alternateField=ti f.ti.hl.maxAlternateFieldLength=300 fq={!collapse%20field=signature} expand=true sort=score+desc,availability_facet+asc My guess is that it¹s performing so badly because it¹s only using 4% of the memory? And searches require disk access. 
Rebecca From: Shawn Heisey [apa...@elyograg.org] Sent: Tuesday, February 24, 2015 5:23 PM To: solr-user@lucene.apache.org Subject: Re: how to debug solr performance degradation On 2/24/2015 5:45 PM, Tang, Rebecca wrote: We gave the machine 180G mem to see if it improves performance. However, after we increased the memory, Solr started using only 5% of the physical memory. It has always used 90-something%. What could be causing solr to not grab all the physical memory (grabbing so little of the physical memory)? I would like to know what memory numbers in which program you are looking at, and why you believe those numbers are a problem. The JVM has a very different view of memory than the operating system. Numbers in top mean different things than numbers on the dashboard of the admin UI, or the numbers in jconsole. If you're on Windows, then replace top with task manager, process explorer, resource monitor, etc. Please provide as many details as you can about the things you are looking at. Thanks, Shawn
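As a concrete form of Erick's suggestion, appending debug=timing to the existing query (host and core name are placeholders, other parameters as in Rebecca's message) looks like:

    http://localhost:8983/solr/collection1/select?q=(phillip%20morris)&rows=50&wt=json&debug=timing

The timing section near the bottom of the response then breaks QTime down per component (query, facet, highlight, expand, debug), which shows whether the 10-20 seconds is going to scoring, faceting, or highlighting. Note also that the cat command quoted above appears to have lost its redirection in transit; it was presumably cat `find /path/to/solr/index` > /dev/null, i.e. reading the index files once to warm the OS page cache.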
Re: Facet By Distance
In the examples it used to default to *:* with default params, which caused even more confusion. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 25 February 2015 at 15:21, david.w.smi...@gmail.com david.w.smi...@gmail.com wrote: If ‘q’ is absent, then you always match nothing (there may be exceptions?); so it’s sort of required, in effect. I wish it defaulted to *:*. ~ David Smiley Freelance Apache Lucene/Solr Search Consultant/Developer http://www.linkedin.com/in/davidwsmiley On Wed, Feb 25, 2015 at 2:28 PM, Ahmed Adel ahmed.a...@badrit.com wrote: Hi, Thank you for your reply. I added a filter query to the query in two ways as follows: fq={!geofilt}sfield=start_stationpt=40.71754834,-74.01322069facet.query={!frange l=0.0 u=0.1}geodist()facet.query={!frange l=0.10001 u=0.2}geodist()d=0.2 -- returns 0 docs q=*:*fq={!geofilt}sfield=start_stationpt=40.71754834,-74.01322069d=0.2 -- returns 1484 docs Not sure why the first query with returns 0 documents On Wed, Feb 25, 2015 at 8:46 PM, david.w.smi...@gmail.com david.w.smi...@gmail.com wrote: Hi, This will return all the documents in the index because you did nothing to filter them out. Your query is *:* (everything) and there are no filter queries. ~ David Smiley Freelance Apache Lucene/Solr Search Consultant/Developer http://www.linkedin.com/in/davidwsmiley On Wed, Feb 25, 2015 at 12:27 PM, Ahmed Adel ahmed.a...@badrit.com wrote: Hello, I'm trying to get Facet By Distance working on an index with LatLonType fields. The schema is as follows: fields ... field name=trip_duration type=int indexed=true stored=true/ field name=start_station type=location indexed=true stored=true / field name=end_station type=location indexed=true stored=true / field name=birth_year type=int stored=true/ field name=gender type=int stored=true / ... /fields And the query I'm running is: q=*:*sfield=start_stationpt=40.71754834,-74.01322069facet.query={!frange l=0.0 u=0.1}geodist()facet.query={!frange l=0.10001 u=0.2}geodist() But it returns all the documents in the index so it seems something is missing. I'm using Solr 4.9.0. -- A. Adel A. Adel
Re: Basic Multilingual search capability
Hi Rishi, As others have indicated Multilingual search is very difficult to do well. At HathiTrust we've been using the ICUTokenizer and ICUFilterFactory to deal with having materials in 400 languages. We also added the CJKBigramFilter to get better precision on CJK queries. We don't use stop words because stop words in one language are content words in another. For example die in German is a stopword but it is a content word in English. Putting multiple languages in one index can affect word frequency statistics which make relevance ranking less accurate. So for example for the English query Die Hard the word die would get a low idf score because it occurs so frequently in German. We realize that our approach does not produce the best results, but given the 400 languages, and limited resources, we do our best to make search not suck for non-English languages. When we have the resources we are thinking about doing special processing for a small fraction of the top 20 languages. We plan to select those languages that most need special processing and relatively easy to disambiguate from other languages. If you plan on identifying languages (rather than scripts), you should be aware that most language detection libraries don't work well on short texts such as queries. If you know that you have scripts for which you have content in only one language, you can use script detection instead of language detection. If you have German, a filter length of 25 might be too low (Because of compounding). You might want to analyze a sample of your German text to find a good length. Tom http://www.hathitrust.org/blogs/Large-scale-Search On Wed, Feb 25, 2015 at 10:31 AM, Rishi Easwaran rishi.easwa...@aol.com wrote: Hi Alex, Thanks for the suggestions. These steps will definitely help out with our use case. Thanks for the idea about the lengthFilter to protect our system. Thanks, Rishi. -Original Message- From: Alexandre Rafalovitch arafa...@gmail.com To: solr-user solr-user@lucene.apache.org Sent: Tue, Feb 24, 2015 8:50 am Subject: Re: Basic Multilingual search capability Given the limited needs, I would probably do something like this: 1) Put a language identifier in the UpdateRequestProcessor chain during indexing and route out at least known problematic languages, such as Chinese, Japanese, Arabic into individual fields 2) Put everything else together into one field with ICUTokenizer, maybe also ICUFoldingFilter 3) At the very end of that joint filter, stick in LengthFilter with some high number, e.g. 25 characters max. This will ensure that super-long words from non-space languages and edge conditions do not break the rest of your system. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 23 February 2015 at 23:14, Walter Underwood wun...@wunderwood.org wrote: I understand relevancy, stemming etc becomes extremely complicated with multilingual support, but our first goal is to be able to tokenize and provide basic search capability for any language. Ex: When the document contains hello or здравствуйте, the analyzer creates tokens and provides exact match search results.
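A minimal schema.xml sketch of the joint field Alex describes in steps 2-3 (ICU tokenization and folding plus a length cap) might look like the following; it assumes the analysis-extras contrib jars are on the classpath, and the max of 25 is just the value discussed in the thread (Tom's caveat about German compounds applies):

    <fieldType name="text_multi" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
        <filter class="solr.LengthFilterFactory" min="1" max="25"/>
      </analyzer>
    </fieldType>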
Re: New leader/replica solution for HDFS
bq: Is adding replicas going to increase search performance? Absolutely, assuming you've maxed out Solr. You can scale the SOLR query/second rate nearly linearly by adding replicas regardless of whether it's over HDFS or not. Having multiple replicas per shard _also_ increases fault tolerance, so you get both. Even with HDFS, though, a single replica (just a leader) per shard means that you don't have any redundancy if the motherboard on that server dies even though HDFS has multiple copies of the _data_. Best, Erick On Wed, Feb 25, 2015 at 12:01 PM, Joseph Obernberger j...@lovehorsepower.com wrote: I am also confused on this. Is adding replicas going to increase search performance? I'm not sure I see the point of any replicas when using HDFS. Is there one? Thank you! -Joe On 2/25/2015 10:57 AM, Erick Erickson wrote: bq: And the data sync between leader/replica is always a problem Not quite sure what you mean by this. There shouldn't need to be any synching in the sense that the index gets replicated, the incoming documents should be sent to each node (and indexed to HDFS) as they come in. bq: There is duplicate index computing on Replilca side. Yes, that's the design of SolrCloud, explicitly to provide data safety. If you instead rely on the leader to index and somehow pull that indexed form to the replica, then you will lose data if the leader goes down before sending the indexed form. bq: My thought is that the leader and the replica all bind to the same data index directory. This is unsafe. They would both then try to _write_ to the same index, which can easily corrupt indexes and/or all but the first one to access the index would be locked out. All that said, the HDFS triple-redundancy compounded with the Solr leaders/replicas redundancy means a bunch of extra storage. You can turn the HDFS replication down to 1, but that has other implications. Best, Erick On Tue, Feb 24, 2015 at 11:12 PM, longsan longsan...@sina.com wrote: We used HDFS as our Solr index storage and we really have a heavy update load. We had met much problems with current leader/replica solution. There is duplicate index computing on Replilca side. And the data sync between leader/replica is always a problem. As HDFS already provides data replication on data layer, could Solr provide just service layer replication? My thought is that the leader and the replica all bind to the same data index directory. And the leader will build up index for new request, the replica will just keep update the index version with the leader(such as a soft commit periodically? ). If the leader lost then the replica will take the duty immediately. Thanks for any suggestion of this idea. -- View this message in context: http://lucene.472066.n3.nabble.com/New-leader-replica-solution-for-HDFS-tp4188735.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: New leader/replica solution for HDFS
Thank you! I'm mainly concerned about facet performance. When we have indexing turned on, our facet performance suffers significantly. I will add replicas and measure the performance change. -Joe Obernberger On 2/25/2015 4:31 PM, Erick Erickson wrote: bq: Is adding replicas going to increase search performance? Absolutely, assuming you've maxed out Solr. You can scale the SOLR query/second rate nearly linearly by adding replicas regardless of whether it's over HDFS or not. Having multiple replicas per shard _also_ increases fault tolerance, so you get both. Even with HDFS, though, a single replica (just a leader) per shard means that you don't have any redundancy if the motherboard on that server dies even though HDFS has multiple copies of the _data_. Best, Erick On Wed, Feb 25, 2015 at 12:01 PM, Joseph Obernberger j...@lovehorsepower.com wrote: I am also confused on this. Is adding replicas going to increase search performance? I'm not sure I see the point of any replicas when using HDFS. Is there one? Thank you! -Joe On 2/25/2015 10:57 AM, Erick Erickson wrote: bq: And the data sync between leader/replica is always a problem Not quite sure what you mean by this. There shouldn't need to be any synching in the sense that the index gets replicated, the incoming documents should be sent to each node (and indexed to HDFS) as they come in. bq: There is duplicate index computing on Replilca side. Yes, that's the design of SolrCloud, explicitly to provide data safety. If you instead rely on the leader to index and somehow pull that indexed form to the replica, then you will lose data if the leader goes down before sending the indexed form. bq: My thought is that the leader and the replica all bind to the same data index directory. This is unsafe. They would both then try to _write_ to the same index, which can easily corrupt indexes and/or all but the first one to access the index would be locked out. All that said, the HDFS triple-redundancy compounded with the Solr leaders/replicas redundancy means a bunch of extra storage. You can turn the HDFS replication down to 1, but that has other implications. Best, Erick On Tue, Feb 24, 2015 at 11:12 PM, longsan longsan...@sina.com wrote: We used HDFS as our Solr index storage and we really have a heavy update load. We had met much problems with current leader/replica solution. There is duplicate index computing on Replilca side. And the data sync between leader/replica is always a problem. As HDFS already provides data replication on data layer, could Solr provide just service layer replication? My thought is that the leader and the replica all bind to the same data index directory. And the leader will build up index for new request, the replica will just keep update the index version with the leader(such as a soft commit periodically? ). If the leader lost then the replica will take the duty immediately. Thanks for any suggestion of this idea. -- View this message in context: http://lucene.472066.n3.nabble.com/New-leader-replica-solution-for-HDFS-tp4188735.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: how to debug solr performance degradation
Lots of suggestions here already. +1 for those JVM params from Boogie and for looking at JMX. Rebecca, try SPM http://sematext.com/spm (will look at JMX for you, among other things), it may save you time figuring out JVM/heap/memory/performance issues. If you can't tell what's slow via SPM, we can have a look at your metrics (charts are sharable) and may be able to help you faster than guessing. Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr Elasticsearch Support * http://sematext.com/ On Wed, Feb 25, 2015 at 4:27 PM, Erick Erickson erickerick...@gmail.com wrote: Before diving in too deeply, try attaching debug=timing to the query. Near the bottom of the response there'll be a list of the time taken by each _component_. So there'll be separate entries for query, highlighting, etc. This may not show any surprises, you might be spending all your time scoring. But it's worth doing as a check and might save you from going down some dead-ends. I mean if your query winds up spending 80% of its time in the highlighter you know where to start looking.. Best, Erick On Wed, Feb 25, 2015 at 12:01 PM, Boogie Shafer boogie.sha...@proquest.com wrote: rebecca, you probably need to dig into your queries, but if you want to force/preload the index into memory you could try doing something like cat `find /path/to/solr/index` /dev/null if you haven't already reviewed the following, you might take a look here https://wiki.apache.org/solr/SolrPerformanceProblems perhaps going back to a very vanilla/default solr configuration and building back up from that baseline to better isolate what might specific setting be impacting your environment From: Tang, Rebecca rebecca.t...@ucsf.edu Sent: Wednesday, February 25, 2015 11:44 To: solr-user@lucene.apache.org Subject: RE: how to debug solr performance degradation Sorry, I should have been more specific. I was referring to the solr admin UI page. Today we started up an AWS instance with 240 G of memory to see if we fit all of our index (183G) in the memory and have enough for the JMV, could it improve the performance. I attached the admin UI screen shot with the email. The top bar is ³Physical Memory² and we have 240.24 GB, but only 4% 9.52 GB is used. The next bar is Swap Space and it¹s at 0.00 MB. The bottom bar is JVM Memory which is at 2.67 GB and the max is 26G. My understanding is that when Solr starts up, it reserves some memory for the JVM, and then it tries to use up as much of the remaining physical memory as possible. And I used to see the physical memory at anywhere between 70% to 90+%. Is this understanding correct? And now, even with 240G of memory, our index is performing at 10 - 20 seconds for a query. Granted that our queries have fq¹s and highlighting and faceting, I think with a machine this powerful I should be able to get the queries executed under 5 seconds. 
This is what we send to Solr: q=(phillip%20morris) wt=json start=0 rows=50 facet=true facet.mincount=0 facet.pivot=industry,collection_facet facet.pivot=availability_facet,availabilitystatus_facet facet.field=dddate fq%3DNOT(pg%3A1%20AND%20(dt%3A%22blank%20document%22%20OR%20dt%3A%22blank% 20page%22%20OR%20dt%3A%22file%20folder%22%20OR%20dt%3A%22file%20folder%20be gin%22%20OR%20dt%3A%22file%20folder%20cover%22%20OR%20dt%3A%22file%20folder %20end%22%20OR%20dt%3A%22file%20folder%20label%22%20OR%20dt%3A%22file%20she et%22%20OR%20dt%3A%22file%20sheet%20beginning%22%20OR%20dt%3A%22tab%20page% 22%20OR%20dt%3A%22tab%20sheet%22)) facet.field=dt_facet facet.field=brd_facet facet.field=dg_facet hl=true hl.simple.pre=%3Ch1%3E hl.simple.post=%3C%2Fh1%3E hl.requireFieldMatch=false hl.preserveMulti=true hl.fl=ot,ti f.ot.hl.fragsize=300 f.ot.hl.alternateField=ot f.ot.hl.maxAlternateFieldLength=300 f.ti.hl.fragsize=300 f.ti.hl.alternateField=ti f.ti.hl.maxAlternateFieldLength=300 fq={!collapse%20field=signature} expand=true sort=score+desc,availability_facet+asc My guess is that it¹s performing so badly because it¹s only using 4% of the memory? And searches require disk access. Rebecca From: Shawn Heisey [apa...@elyograg.org] Sent: Tuesday, February 24, 2015 5:23 PM To: solr-user@lucene.apache.org Subject: Re: how to debug solr performance degradation On 2/24/2015 5:45 PM, Tang, Rebecca wrote: We gave the machine 180G mem to see if it improves performance. However, after we increased the memory, Solr started using only 5% of the physical memory. It has always used 90-something%. What could be causing solr to not grab all the physical memory (grabbing so little of the physical memory)? I would like to know what memory numbers in which program you are looking at, and why you
Re: [ANNOUNCE] Luke 4.10.3 released
Hi Tomoko, Thanks for the link. Do you have build instructions somewhere? When I executed ant with no params, I get: BUILD FAILED /home/dmitry/projects/svn/luke/build.xml:40: /home/dmitry/projects/svn/luke/lib-ivy does not exist. On Thu, Feb 26, 2015 at 2:27 AM, Tomoko Uchida tomoko.uchida.1...@gmail.com wrote: Thanks! Would you announce at LUCENE-2562 to me and all watchers interested in this issue, when the branch is ready? :) As you know, current pivots's version (that supports Lucene 4.10.3) is here. http://svn.apache.org/repos/asf/lucene/sandbox/luke/ Regards, Tomoko 2015-02-25 18:37 GMT+09:00 Dmitry Kan solrexp...@gmail.com: Ok, sure. The plan is to make the pivot branch in the current github repo and update its structure accordingly. Once it is there, I'll let you know. Thank you, Dmitry On Tue, Feb 24, 2015 at 5:26 PM, Tomoko Uchida tomoko.uchida.1...@gmail.com wrote: Hi Dmitry, Thank you for the detailed clarification! Recently, I've created a few patches to Pivot version(LUCENE-2562), so I'd like to some more work and keep up to date it. If you would like to work on the Pivot version, may I suggest you to fork the github's version? The ultimate goal is to donate this to Apache, but at least we will have the common plate. :) Yes, I love to the idea about having common code base. I've looked at both codes of github's (thinlet's) and Pivot's, Pivot's version has very different structure from github's (I think that is mainly for UI framework's requirement.) So it seems to be difficult to directly fork github's version to develop Pivot's version..., but I think I (or any other developers) could catch up changes in github's version. There's long way to go for Pivot's version, of course, I'd like to also make pull requests to enhance github's version if I can. Thanks, Tomoko 2015-02-24 23:34 GMT+09:00 Dmitry Kan solrexp...@gmail.com: Hi, Tomoko! Thanks for being a fan of luke! Current status of github's luke (https://github.com/DmitryKey/luke) is that it has releases for all the major lucene versions since 4.3.0, excluding 4.4.0 (luke 4.5.0 should be able open indices of 4.4.0) and the latest -- 5.0.0. Porting the github's luke to ALv2 compliant framework (GWT or Pivot) is a long standing goal. With GWT I had issues related to listing and reading the index directory. So this effort has been parked. Most recently I have been approaching the Pivot. Mark Miller has done an initial port, that I took as the basis. I'm hoping to continue on this track as time permits. If you would like to work on the Pivot version, may I suggest you to fork the github's version? The ultimate goal is to donate this to Apache, but at least we will have the common plate. :) Thanks, Dmitry On Tue, Feb 24, 2015 at 4:02 PM, Tomoko Uchida tomoko.uchida.1...@gmail.com wrote: Hi, I'm an user / fan of Luke, so deeply appreciate your work. I've carefully read the readme, noticed the (one of) project's goal: To port the thinlet UI to an ASL compliant license framework so that it can be contributed back to Apache Lucene. Current work is done with GWT 2.5.1. There has been GWT based, ASL compliant Luke supporting the latest Lucene ? I've recently got in with LUCENE-2562. Currently, Apache Pivot based port is going. But I do not know so much about Luke's long (and may be slightly complex) history, so I would grateful if anybody clear the association of the Luke project (now on Github) and the Jira issue. Or, they can be independent of each other. 
https://issues.apache.org/jira/browse/LUCENE-2562 I don't have any opinions, just want to understand current status and avoid duplicate works. Apologize for a bit annoying post. Many thanks, Tomoko 2015-02-24 0:00 GMT+09:00 Dmitry Kan solrexp...@gmail.com: Hello, Luke 4.10.3 has been released. Download it here: https://github.com/DmitryKey/luke/releases/tag/luke-4.10.3 The release has been tested against the solr-4.10.3 based index. Issues fixed in this release: #13 https://github.com/DmitryKey/luke/pull/13 Apache License 2.0 abbreviation changed from ASL 2.0 to ALv2 Thanks to respective contributors! P.S. waiting for lucene 5.0 artifacts to hit public maven repositories for the next major release of luke. -- Dmitry Kan Luke Toolbox: http://github.com/DmitryKey/luke Blog: http://dmitrykan.blogspot.com Twitter: http://twitter.com/dmitrykan SemanticAnalyzer: www.semanticanalyzer.info
Customized search handler components and cloud
We have a pair of customized search components which we used successfully with SolrCloud some releases back (4.x). In 4.10.3, I am trying to find the point of departure in debugging why we get no results back when querying to them with a sharded index. If I query the regular /select, all is swell. Obviously, there's a debugger in my future, but I wonder if this rings any bells for anyone. Here's what we add to solrconfig.xml. searchComponent name=name-indexing-query class=com.basistech.rni.solr.NameIndexingQueryComponent / searchComponent name=name-indexing-rescore class=com.basistech.rni.solr.NameIndexingRescoreComponent/ requestHandler name=/RNI class=solr.SearchHandler default=false arr name=first-components strname-indexing-query/str strname-indexing-rescore/str /arr /requestHandler
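Restoring the stripped markup, the solrconfig.xml addition above reads:

    <searchComponent name="name-indexing-query"
                     class="com.basistech.rni.solr.NameIndexingQueryComponent"/>
    <searchComponent name="name-indexing-rescore"
                     class="com.basistech.rni.solr.NameIndexingRescoreComponent"/>

    <requestHandler name="/RNI" class="solr.SearchHandler" default="false">
      <arr name="first-components">
        <str>name-indexing-query</str>
        <str>name-indexing-rescore</str>
      </arr>
    </requestHandler>

Because only first-components is declared, the stock components (query, facet, highlight, debug) still run after the custom ones. One thing worth checking in the sharded case is how the custom components behave during the distributed stages of the request, which the single-node /select path never exercises.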
Re: Connect Solr with ODBC to Excel
I'll need to use /export since I retrieve large amount of data. And I don't really need facets, so it won't be an issue. Thanks again for your help. 2015-02-25 21:26 GMT+01:00 Mikhail Khludnev mkhlud...@griddynamics.com: On Wed, Feb 25, 2015 at 10:31 PM, Hakim Benoudjit h.benoud...@gmail.com wrote: Thanks for the two links. The first one could be helpful if it works. Regarding the second one, I think it's quite similar to using /select to return json format. not really. /export yields much more data faster. Also, if you are interested in relatively short result set, you can /selectwt=csv no facets in this case, sadly. Just fyi. 2015-02-25 19:10 GMT+01:00 Mikhail Khludnev mkhlud...@griddynamics.com : Some time ago I encounter https://github.com/kawasima/solr-jdbc never tried it.Anyway, it doesn't help to connect from odbc. On top of my head, is https://cwiki.apache.org/confluence/display/solr/Exporting+Result+Sets but it returns only JSON, not csv. That's I wonder why. Seems like a dead end so far. On Wed, Feb 25, 2015 at 6:15 PM, Hakim Benoudjit h.benoud...@gmail.com wrote: Thanks for your answer. For a one-off it seems like a nice way to import my data. For an ODBC connection, the only solution I found is to replicate my Solr data in Apache Hive (or Cassandra...), and then connect to that database through ODBC. 2015-02-25 15:49 GMT+01:00 Alexandre Rafalovitch arafa...@gmail.com : Which direction? You want import data from Solr into Excel? One off or repeatedly? For one off Solr - Excel, you could probably use Excel's Open from Web and load data directly from Solr using CSV output format. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 25 February 2015 at 08:52, Hakim Benoudjit h.benoud...@gmail.com wrote: Hi there, I'm looking for a library to connect Solr throught ODBC to Excel in order to do some reporting on my Solr data? Anybody knows a library for that? Thanks. -- Cordialement, Best regards, Hakim Benoudjit -- Cordialement, Best regards, Hakim Benoudjit -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com -- Cordialement, Best regards, Hakim Benoudjit -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com -- Cordialement, Best regards, Hakim Benoudjit
Re: [ANNOUNCE] Luke 4.10.3 released
Thanks! Would you announce at LUCENE-2562 to me and all watchers interested in this issue, when the branch is ready? :) As you know, current pivots's version (that supports Lucene 4.10.3) is here. http://svn.apache.org/repos/asf/lucene/sandbox/luke/ Regards, Tomoko 2015-02-25 18:37 GMT+09:00 Dmitry Kan solrexp...@gmail.com: Ok, sure. The plan is to make the pivot branch in the current github repo and update its structure accordingly. Once it is there, I'll let you know. Thank you, Dmitry On Tue, Feb 24, 2015 at 5:26 PM, Tomoko Uchida tomoko.uchida.1...@gmail.com wrote: Hi Dmitry, Thank you for the detailed clarification! Recently, I've created a few patches to Pivot version(LUCENE-2562), so I'd like to some more work and keep up to date it. If you would like to work on the Pivot version, may I suggest you to fork the github's version? The ultimate goal is to donate this to Apache, but at least we will have the common plate. :) Yes, I love to the idea about having common code base. I've looked at both codes of github's (thinlet's) and Pivot's, Pivot's version has very different structure from github's (I think that is mainly for UI framework's requirement.) So it seems to be difficult to directly fork github's version to develop Pivot's version..., but I think I (or any other developers) could catch up changes in github's version. There's long way to go for Pivot's version, of course, I'd like to also make pull requests to enhance github's version if I can. Thanks, Tomoko 2015-02-24 23:34 GMT+09:00 Dmitry Kan solrexp...@gmail.com: Hi, Tomoko! Thanks for being a fan of luke! Current status of github's luke (https://github.com/DmitryKey/luke) is that it has releases for all the major lucene versions since 4.3.0, excluding 4.4.0 (luke 4.5.0 should be able open indices of 4.4.0) and the latest -- 5.0.0. Porting the github's luke to ALv2 compliant framework (GWT or Pivot) is a long standing goal. With GWT I had issues related to listing and reading the index directory. So this effort has been parked. Most recently I have been approaching the Pivot. Mark Miller has done an initial port, that I took as the basis. I'm hoping to continue on this track as time permits. If you would like to work on the Pivot version, may I suggest you to fork the github's version? The ultimate goal is to donate this to Apache, but at least we will have the common plate. :) Thanks, Dmitry On Tue, Feb 24, 2015 at 4:02 PM, Tomoko Uchida tomoko.uchida.1...@gmail.com wrote: Hi, I'm an user / fan of Luke, so deeply appreciate your work. I've carefully read the readme, noticed the (one of) project's goal: To port the thinlet UI to an ASL compliant license framework so that it can be contributed back to Apache Lucene. Current work is done with GWT 2.5.1. There has been GWT based, ASL compliant Luke supporting the latest Lucene ? I've recently got in with LUCENE-2562. Currently, Apache Pivot based port is going. But I do not know so much about Luke's long (and may be slightly complex) history, so I would grateful if anybody clear the association of the Luke project (now on Github) and the Jira issue. Or, they can be independent of each other. https://issues.apache.org/jira/browse/LUCENE-2562 I don't have any opinions, just want to understand current status and avoid duplicate works. Apologize for a bit annoying post. Many thanks, Tomoko 2015-02-24 0:00 GMT+09:00 Dmitry Kan solrexp...@gmail.com: Hello, Luke 4.10.3 has been released. 
Download it here: https://github.com/DmitryKey/luke/releases/tag/luke-4.10.3 The release has been tested against the solr-4.10.3 based index. Issues fixed in this release: #13 https://github.com/DmitryKey/luke/pull/13 Apache License 2.0 abbreviation changed from ASL 2.0 to ALv2 Thanks to respective contributors! P.S. waiting for lucene 5.0 artifacts to hit public maven repositories for the next major release of luke. -- Dmitry Kan Luke Toolbox: http://github.com/DmitryKey/luke Blog: http://dmitrykan.blogspot.com Twitter: http://twitter.com/dmitrykan SemanticAnalyzer: www.semanticanalyzer.info -- Dmitry Kan Luke Toolbox: http://github.com/DmitryKey/luke Blog: http://dmitrykan.blogspot.com Twitter: http://twitter.com/dmitrykan SemanticAnalyzer: www.semanticanalyzer.info -- Dmitry Kan Luke Toolbox: http://github.com/DmitryKey/luke Blog: http://dmitrykan.blogspot.com Twitter: http://twitter.com/dmitrykan SemanticAnalyzer: www.semanticanalyzer.info
Solr takes time to start
Hello, why is Solr taking so much time to start all of its nodes/ports?
Solr resource usage/Clustering
Hi, We have a single Solr instance serving queries to the client throughout the day and being indexed twice a day by scheduled jobs. The scheduled jobs, which sync databases from data collection machines to the master database, can make many indexing calls: usually about 50k-100k records are synced on each iteration, and we send them to Solr in batches of 1000 documents. During the sync process, Solr throws 503 (service not available) quite frequently and responds very slowly to indexing requests. I have checked CPU and memory usage during the sync, and it never exceeds 40-50% CPU and 10-20% RAM. My question is how to increase indexing performance so that the sync process runs faster. -- Regards, Vikas Agarwal 91 – 9928301411 InfoObjects, Inc. Execution Matters http://www.infoobjects.com 2041 Mission College Boulevard, #280 Santa Clara, CA 95054 +1 (408) 988-2000 Work +1 (408) 716-2726 Fax
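One thing worth ruling out (it is not stated in the message, so this is only a guess) is an explicit commit being issued with every 1000-document batch; frequent commits are a common cause of slow or rejected updates. Letting the server handle commits via the updateHandler section of solrconfig.xml, with values that are purely illustrative, looks like:

    <autoCommit>
      <maxTime>60000</maxTime>          <!-- hard commit every 60s, durability only -->
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <maxTime>300000</maxTime>         <!-- new searcher at most every 5 minutes -->
    </autoSoftCommit>

Erick's questions below about how the indexing is done (SolrJ, DIH, or something else) and what the Solr log shows around the 503s are the other half of the picture.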
Re: New leader/replica solution for HDFS
Use DocValues. On Wed, Feb 25, 2015 at 3:14 PM, Joseph Obernberger j...@lovehorsepower.com wrote: Thank you! I'm mainly concerned about facet performance. When we have indexing turned on, our facet performance suffers significantly. I will add replicas and measure the performance change. -Joe Obernberger On 2/25/2015 4:31 PM, Erick Erickson wrote: bq: Is adding replicas going to increase search performance? Absolutely, assuming you've maxed out Solr. You can scale the SOLR query/second rate nearly linearly by adding replicas regardless of whether it's over HDFS or not. Having multiple replicas per shard _also_ increases fault tolerance, so you get both. Even with HDFS, though, a single replica (just a leader) per shard means that you don't have any redundancy if the motherboard on that server dies even though HDFS has multiple copies of the _data_. Best, Erick On Wed, Feb 25, 2015 at 12:01 PM, Joseph Obernberger j...@lovehorsepower.com wrote: I am also confused on this. Is adding replicas going to increase search performance? I'm not sure I see the point of any replicas when using HDFS. Is there one? Thank you! -Joe On 2/25/2015 10:57 AM, Erick Erickson wrote: bq: And the data sync between leader/replica is always a problem Not quite sure what you mean by this. There shouldn't need to be any synching in the sense that the index gets replicated, the incoming documents should be sent to each node (and indexed to HDFS) as they come in. bq: There is duplicate index computing on Replilca side. Yes, that's the design of SolrCloud, explicitly to provide data safety. If you instead rely on the leader to index and somehow pull that indexed form to the replica, then you will lose data if the leader goes down before sending the indexed form. bq: My thought is that the leader and the replica all bind to the same data index directory. This is unsafe. They would both then try to _write_ to the same index, which can easily corrupt indexes and/or all but the first one to access the index would be locked out. All that said, the HDFS triple-redundancy compounded with the Solr leaders/replicas redundancy means a bunch of extra storage. You can turn the HDFS replication down to 1, but that has other implications. Best, Erick On Tue, Feb 24, 2015 at 11:12 PM, longsan longsan...@sina.com wrote: We used HDFS as our Solr index storage and we really have a heavy update load. We had met much problems with current leader/replica solution. There is duplicate index computing on Replilca side. And the data sync between leader/replica is always a problem. As HDFS already provides data replication on data layer, could Solr provide just service layer replication? My thought is that the leader and the replica all bind to the same data index directory. And the leader will build up index for new request, the replica will just keep update the index version with the leader(such as a soft commit periodically? ). If the leader lost then the replica will take the duty immediately. Thanks for any suggestion of this idea. -- View this message in context: http://lucene.472066.n3.nabble.com/New-leader-replica- solution-for-HDFS-tp4188735.html Sent from the Solr - User mailing list archive at Nabble.com. -- Bill Bell billnb...@gmail.com cell 720-256-8076
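Following up on Bill's suggestion: enabling docValues is a schema.xml change on the fields being faceted (the field name here is only an example) and requires reindexing:

    <field name="category" type="string" indexed="true" stored="true" docValues="true"/>

With docValues, faceting reads column-oriented structures from the index instead of un-inverting the field onto the heap after each commit, which is usually what hurts facet latency while indexing is active.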
Re: Collations are not working fine.
Hi Rajesh, What configuration had you set in your schema.xml? On Sat, Feb 14, 2015 at 2:18 AM, Rajesh Hazari rajeshhaz...@gmail.com wrote: Hi Nitin, Can u try with the below config, we have these config seems to be working for us. searchComponent name=spellcheck class=solr.SpellCheckComponent str name=queryAnalyzerFieldTypetext_general/str lst name=spellchecker str name=namewordbreak/str str name=classnamesolr.WordBreakSolrSpellChecker/str str name=fieldtextSpell/str str name=combineWordstrue/str str name=breakWordsfalse/str int name=maxChanges5/int /lst lst name=spellchecker str name=namedefault/str str name=fieldtextSpell/str str name=classnamesolr.IndexBasedSpellChecker/str str name=spellcheckIndexDir./spellchecker/str str name=accuracy0.75/str float name=thresholdTokenFrequency0.01/float str name=buildOnCommittrue/str str name=spellcheck.maxResultsForSuggest5/str /lst /searchComponent str name=spellchecktrue/str str name=spellcheck.dictionarydefault/str str name=spellcheck.dictionarywordbreak/str int name=spellcheck.count5/int str name=spellcheck.alternativeTermCount15/str str name=spellcheck.collatetrue/str str name=spellcheck.onlyMorePopularfalse/str str name=spellcheck.extendedResultstrue/str str name =spellcheck.maxCollations100/str str name=spellcheck.collateParam.mm100%/str str name=spellcheck.collateParam.q.opAND/str str name=spellcheck.maxCollationTries1000/str *Rajesh.* On Fri, Feb 13, 2015 at 1:01 PM, Dyer, James james.d...@ingramcontent.com wrote: Nitin, Can you post the full spellcheck response when you query: q=gram_ci:gone wthh thes wintwt=jsonindent=trueshards.qt=/spell James Dyer Ingram Content Group -Original Message- From: Nitin Solanki [mailto:nitinml...@gmail.com] Sent: Friday, February 13, 2015 1:05 AM To: solr-user@lucene.apache.org Subject: Re: Collations are not working fine. Hi James Dyer, I did the same as you told me. Used WordBreakSolrSpellChecker instead of shingles. But still collations are not coming or working. For instance, I tried to get collation of gone with the wind by searching gone wthh thes wint on field=gram_ci but didn't succeed. Even, I am getting the suggestions of wtth as *with*, thes as *the*, wint as *wind*. Also I have documents which contains gone with the wind having 167 times in the documents. I don't know that I am missing something or not. 
Please check my below solr configuration: *URL: *localhost:8983/solr/wikingram/spell?q=gram_ci:gone wthh thes wintwt=jsonindent=trueshards.qt=/spell *solrconfig.xml:* searchComponent name=spellcheck class=solr.SpellCheckComponent str name=queryAnalyzerFieldTypetextSpellCi/str lst name=spellchecker str name=namedefault/str str name=fieldgram_ci/str str name=classnamesolr.DirectSolrSpellChecker/str str name=distanceMeasureinternal/str float name=accuracy0.5/float int name=maxEdits2/int int name=minPrefix0/int int name=maxInspections5/int int name=minQueryLength2/int float name=maxQueryFrequency0.9/float str name=comparatorClassfreq/str /lst lst name=spellchecker str name=namewordbreak/str str name=classnamesolr.WordBreakSolrSpellChecker/str str name=fieldgram/str str name=combineWordstrue/str str name=breakWordstrue/str int name=maxChanges5/int /lst /searchComponent requestHandler name=/spell class=solr.SearchHandler startup=lazy lst name=defaults str name=dfgram_ci/str str name=spellcheck.dictionarydefault/str str name=spellcheckon/str str name=spellcheck.extendedResultstrue/str str name=spellcheck.count25/str str name=spellcheck.onlyMorePopulartrue/str str name=spellcheck.maxResultsForSuggest1/str str name=spellcheck.alternativeTermCount25/str str name=spellcheck.collatetrue/str str name=spellcheck.maxCollations50/str str name=spellcheck.maxCollationTries50/str str name=spellcheck.collateExtendedResultstrue/str /lst arr name=last-components strspellcheck/str /arr /requestHandler *Schema.xml: * field name=gram_ci type=textSpellCi indexed=true stored=true multiValued=false/ /fieldTypefieldType name=textSpellCi class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ /analyzer analyzer type=query tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ /analyzer /fieldType
Re: Solr resource usage/Clustering
How are you indexing? SolrJ? DIH? some other process? And what, if anything, comes out in the Solr logs when this happens? 'cause this is pretty odd so I'm grasping at straws. Best, Erick On Wed, Feb 25, 2015 at 9:10 PM, Vikas Agarwal vi...@infoobjects.com wrote: Hi, We have a single solr instance serving queries to the client through out the day and being indexed twice a day using scheduled jobs. During the scheduled jobs, which actually syncs databases from data collection machines to the master database, it can make many indexing calls. It is usually about 50k-100k records that are synced on each iteration of sync and we make calls to solr in batch of 1000 documents. Now, during the sync process, solr throws 503 (service not available message) quite frequently and in fact it responds very slow to index the documents. I have checked the cpu and memory usage during the sync process and it never consumed more than 40-50 % of CPU and 10-20% of RAM. My question is how to increase the performance of indexing to increase the speed up the sync process. -- Regards, Vikas Agarwal 91 – 9928301411 InfoObjects, Inc. Execution Matters http://www.infoobjects.com 2041 Mission College Boulevard, #280 Santa Clara, CA 95054 +1 (408) 988-2000 Work +1 (408) 716-2726 Fax
Facet on TopDocs
We are trying to limit the facets returned to only the top 100 docs rather than the complete result set. Is there a way to access topDocs in a custom faceting component? Or can the scores of the docIDs in the result set be accessed in the facet component? -- View this message in context: http://lucene.472066.n3.nabble.com/Facet-on-TopDocs-tp4188767.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: [ANNOUNCE] Luke 4.10.3 released
Ok, sure. The plan is to make the pivot branch in the current github repo and update its structure accordingly. Once it is there, I'll let you know. Thank you, Dmitry On Tue, Feb 24, 2015 at 5:26 PM, Tomoko Uchida tomoko.uchida.1...@gmail.com wrote: Hi Dmitry, Thank you for the detailed clarification! Recently, I've created a few patches to Pivot version(LUCENE-2562), so I'd like to some more work and keep up to date it. If you would like to work on the Pivot version, may I suggest you to fork the github's version? The ultimate goal is to donate this to Apache, but at least we will have the common plate. :) Yes, I love to the idea about having common code base. I've looked at both codes of github's (thinlet's) and Pivot's, Pivot's version has very different structure from github's (I think that is mainly for UI framework's requirement.) So it seems to be difficult to directly fork github's version to develop Pivot's version..., but I think I (or any other developers) could catch up changes in github's version. There's long way to go for Pivot's version, of course, I'd like to also make pull requests to enhance github's version if I can. Thanks, Tomoko 2015-02-24 23:34 GMT+09:00 Dmitry Kan solrexp...@gmail.com: Hi, Tomoko! Thanks for being a fan of luke! Current status of github's luke (https://github.com/DmitryKey/luke) is that it has releases for all the major lucene versions since 4.3.0, excluding 4.4.0 (luke 4.5.0 should be able open indices of 4.4.0) and the latest -- 5.0.0. Porting the github's luke to ALv2 compliant framework (GWT or Pivot) is a long standing goal. With GWT I had issues related to listing and reading the index directory. So this effort has been parked. Most recently I have been approaching the Pivot. Mark Miller has done an initial port, that I took as the basis. I'm hoping to continue on this track as time permits. If you would like to work on the Pivot version, may I suggest you to fork the github's version? The ultimate goal is to donate this to Apache, but at least we will have the common plate. :) Thanks, Dmitry On Tue, Feb 24, 2015 at 4:02 PM, Tomoko Uchida tomoko.uchida.1...@gmail.com wrote: Hi, I'm an user / fan of Luke, so deeply appreciate your work. I've carefully read the readme, noticed the (one of) project's goal: To port the thinlet UI to an ASL compliant license framework so that it can be contributed back to Apache Lucene. Current work is done with GWT 2.5.1. There has been GWT based, ASL compliant Luke supporting the latest Lucene ? I've recently got in with LUCENE-2562. Currently, Apache Pivot based port is going. But I do not know so much about Luke's long (and may be slightly complex) history, so I would grateful if anybody clear the association of the Luke project (now on Github) and the Jira issue. Or, they can be independent of each other. https://issues.apache.org/jira/browse/LUCENE-2562 I don't have any opinions, just want to understand current status and avoid duplicate works. Apologize for a bit annoying post. Many thanks, Tomoko 2015-02-24 0:00 GMT+09:00 Dmitry Kan solrexp...@gmail.com: Hello, Luke 4.10.3 has been released. Download it here: https://github.com/DmitryKey/luke/releases/tag/luke-4.10.3 The release has been tested against the solr-4.10.3 based index. Issues fixed in this release: #13 https://github.com/DmitryKey/luke/pull/13 Apache License 2.0 abbreviation changed from ASL 2.0 to ALv2 Thanks to respective contributors! P.S. waiting for lucene 5.0 artifacts to hit public maven repositories for the next major release of luke. 
-- Dmitry Kan Luke Toolbox: http://github.com/DmitryKey/luke Blog: http://dmitrykan.blogspot.com Twitter: http://twitter.com/dmitrykan SemanticAnalyzer: www.semanticanalyzer.info
Creating a collection/core on HDFS with SolrCloud
Hello, I'm trying to create a collection on HDFS with Solr 5.0.0. I have my solrconfig.xml with the HDFS parameters, following the Confluence guidelines. When creating the collection with the bin/solr script (bin/solr create -c collectionHDFS -d /my/conf/) I get this error:

failure:{:org.apache.solr.client.solrj.SolrServerException:IOException occured when talking to server at: https://192.168.200.32:8983/solr}}

With the GUI on the SolrCloud server, I get this one:

Error CREATEing SolrCore 'collectionHDFS': Unable to create core [collectionHDFS] Caused by: hadoop.security.authentication set to: simple, not kerberos, but attempting to connect to HDFS via kerberos

In my /my/conf/solrconfig.xml, I have already double-checked that:

  <bool name="solr.hdfs.security.kerberos.enabled">true</bool>
  <str name="solr.hdfs.security.kerberos.keytabfile">/my/conf/solr.keytab</str>
  <str name="solr.hdfs.security.kerberos.principal">solr/@CLUSTER.HADOOP</str>

and in Hadoop's core-site.xml, my hadoop.security.authentication parameter is set to kerberos. Am I missing something? Thank you very much for your input, have a great day. Simon M.
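For reference, here is a minimal sketch of how those Kerberos settings usually sit inside the directoryFactory definition in solrconfig.xml when running on HDFS. The solr.hdfs.home URI, the confdir path, and the host part of the principal below are placeholders, not values taken from Simon's setup:

  <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
    <!-- where Solr stores index data on HDFS (placeholder URI) -->
    <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
    <!-- directory holding the cluster's core-site.xml / hdfs-site.xml (placeholder path) -->
    <str name="solr.hdfs.confdir">/etc/hadoop/conf</str>
    <bool name="solr.hdfs.security.kerberos.enabled">true</bool>
    <str name="solr.hdfs.security.kerberos.keytabfile">/my/conf/solr.keytab</str>
    <!-- placeholder principal -->
    <str name="solr.hdfs.security.kerberos.principal">solr/somehost@CLUSTER.HADOOP</str>
  </directoryFactory>

Since the error claims hadoop.security.authentication is "simple", one thing that may be worth checking is whether solr.hdfs.confdir (or whatever Hadoop configuration ends up on Solr's classpath) actually points at the core-site.xml that sets it to kerberos, since that is where Solr picks up the Hadoop client settings.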
Re: highlighting the boolean query
Erick, Erik and Mike, Thanks for your help and ideas. It sounds like we'd need to do a bit of revamping in the highlighter. Perhaps even PostingsHighlighter should be taken as the baseline, since it is faster. It uses the same extractTerms() method that Erik has shown. The user story here is that the user is led to believe, judging from the highlights, that the boolean query did not work correctly. The issue is minor otherwise, since the search *does* work as expected. Dmitry

On Tue, Feb 24, 2015 at 8:19 PM, Michael Sokolov msoko...@safaribooksonline.com wrote: There is also PostingsHighlighter -- I recommend it, if only for the performance improvement, which is substantial, but I'm not completely sure how it handles this issue. The one drawback I *am* aware of is that it is insensitive to positions (so words from phrases get highlighted even in isolation). -Mike

On 02/24/2015 12:46 PM, Erik Hatcher wrote: BooleanQuery's extractTerms looks like this:

  public void extractTerms(Set<Term> terms) {
    for (BooleanClause clause : clauses) {
      if (clause.isProhibited() == false) {
        clause.getQuery().extractTerms(terms);
      }
    }
  }

That's generally the method called by the Highlighter to determine which terms should be highlighted. So even if a term didn't match the document, the query that the term was in matched the document, and it just blindly highlights all the terms (minus prohibited ones). That at least explains the behavior you're seeing, but it's not ideal. I've seen specialized highlighters that convert to spans, which are accurate to the exact matches within the document. It's been a while since I dug into the HighlightComponent, so maybe there are some other options available out of the box? -- Erik Hatcher, Senior Solutions Architect, http://www.lucidworks.com

On Feb 24, 2015, at 3:16 AM, Dmitry Kan solrexp...@gmail.com wrote: Erick, Our default operator is AND. Both queries below parse the same: a OR (b c) OR d and a OR (b AND c) OR d. The parsed query: <str name="parsedquery_toString">Contents:a (+Contents:b +Contents:c) Contents:d</str> So this part is consistent with our expectation. On "I'm a bit puzzled by your statement that c didn't contribute to the score": what I meant was that the term c was not hit by the scorer; the explain section does not refer to it. I'm using made-up terms here, but the concept holds. The code suggests that we could benefit from storing term offsets and positions: http://grepcode.com/file/repo1.maven.org/maven2/org.apache.solr/solr-core/4.3.1/org/apache/solr/highlight/DefaultSolrHighlighter.java#470 Is that a correct assumption?

On Mon, Feb 23, 2015 at 8:29 PM, Erick Erickson erickerick...@gmail.com wrote: Highlighting is such a pain... what does the parsed query look like? If the default operator is OR, then this seems correct, as both 'd' and 'c' appear in the doc. So I'm a bit puzzled by your statement that c didn't contribute to the score. If the parsed query is, indeed, a +b +c d, then it does look like something with the highlighter. Whether other highlighters are better for this case... no clue ;( Best, Erick

On Mon, Feb 23, 2015 at 9:36 AM, Dmitry Kan solrexp...@gmail.com wrote: Erick, nope, we are using the standard Lucene qparser with some customizations that do not affect the boolean query parsing logic. Should we try some other highlighter?

On Mon, Feb 23, 2015 at 6:57 PM, Erick Erickson erickerick...@gmail.com wrote: Are you using edismax?

On Mon, Feb 23, 2015 at 3:28 AM, Dmitry Kan solrexp...@gmail.com wrote: Hello!
In Solr 4.3.1 there seems to be some inconsistency with the highlighting of the boolean query: a OR (b c) OR d. This returns a proper hit, which shows that only d was included in the document score calculation. But the highlighter returns both d and c in <em> tags. Is this a known issue of the standard highlighter? Can it be mitigated?

-- Dmitry Kan Luke Toolbox: http://github.com/DmitryKey/luke Blog: http://dmitrykan.blogspot.com Twitter: http://twitter.com/dmitrykan SemanticAnalyzer: www.semanticanalyzer.info
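As an aside on the suggestion to store term offsets and positions: below is a minimal sketch of what that usually looks like for a field in schema.xml, together with the request parameters for a vector-based highlighter. The field name Contents and the text_general field type are assumptions carried over from the examples above, and FastVectorHighlighter is just one option rather than something this thread settled on:

  <!-- term vectors with positions and offsets let a vector-based
       highlighter work from stored match positions instead of
       re-analyzing the field text -->
  <field name="Contents" type="text_general" indexed="true" stored="true"
         termVectors="true" termPositions="true" termOffsets="true"/>

  q=a OR (b c) OR d&hl=true&hl.fl=Contents&hl.useFastVectorHighlighter=true

Term vectors only take effect after a full reindex and they increase index size; whether this actually changes which terms get wrapped in <em> tags for the boolean case above would still need to be verified.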
Problem with queries that includes NOT
Hello, We have problems with some queries. All of them include the NOT operator, and in my opinion, the results don't make any sense.

First problem: the query NOT Proc:ID01 returns 95806 results, however this one, NOT Proc:ID01 OR FileType:PDF_TEXT, returns 11484 results. But it's impossible that adding an OR clause reduces the number of results.

Second problem, caused by the parentheses and the NOT operator: the query (NOT Proc:ID01 AND NOT FileType:PDF_TEXT) AND sys_FileType:PROTOTIPE returns 0 documents, but this query, (NOT Proc:ID01 AND NOT FileType:PDF_TEXT AND sys_FileType:PROTOTIPE), returns 53 documents, which is correct. So the problem is the position of the parentheses. I have checked the same queries without NOTs, and they work fine, returning the same number of results in both cases. So I think the problem is the combination of parenthesis placement and the NOT operator. This second problem is less important, but the queries come from a web page that I would have to change, so I need to know whether the problem lies in Solr or not.

This is the part of the schema that applies: <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>

Thank you very much, David Dávila DIT - 915828763
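For readers hitting the same thing, here is a sketch of how the standard lucene query parser typically interprets these two cases. The parsed forms are my own reconstruction, assuming OR as the default operator; they were not taken from David's index:

  NOT Proc:ID01 OR FileType:PDF_TEXT
    parses roughly to:  -Proc:ID01 FileType:PDF_TEXT
    With no MUST clause present, a document has to match the single SHOULD
    clause FileType:PDF_TEXT and must not match Proc:ID01, so the added OR
    term actually narrows the result set.

  (NOT Proc:ID01 AND NOT FileType:PDF_TEXT) AND sys_FileType:PROTOTIPE
    parses roughly to:  +(-Proc:ID01 -FileType:PDF_TEXT) +sys_FileType:PROTOTIPE
    The parenthesized sub-query contains only negative clauses, and a nested
    purely-negative boolean query matches nothing, which would explain the
    0 documents.

A common workaround is to put a match-all term inside the purely negative group, for example (*:* -Proc:ID01 -FileType:PDF_TEXT) AND sys_FileType:PROTOTIPE, though whether that is practical here depends on how the web page builds its queries.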