Facets for fields in subdocuments with block join, is it possible?
Hello, I'm testing block join in Solr 4.6.1 and wondering: is it possible to get facets for fields in subdocuments, with the number of hits based on ROOT documents? See the example below:

<doc>
  <documentPart>ROOT</documentPart>
  <text>testing 123</text>
  <title>title</title>
  <group>GRP</group>
  <subdocument>
    <field3>khat</field3>
    <field4>7000</field4>
    <field5>purchase</field5>
  </subdocument>
  <subdocument>
    <field3>cannabis</field3>
    <field4>500</field4>
    <field5>sale</field5>
  </subdocument>
</doc>

My query looks like this:

solrQuery.setQuery("text:testing");
solrQuery.setFilterQueries("{!parent which=\"documentPart:ROOT\"}field3:khat");
solrQuery.setFacet(true);
solrQuery.addFacetField("group", "field5");

This does not give me any facets for the subdocument fields, so I'm thinking: could a solution be to execute a second query to get the facets for the subdocuments, joining from parent to child with a {!child of=} query, like this:

solrQuery.setQuery("{!child of=\"documentPart:ROOT\"}text:testing");
solrQuery.setFilterQueries("field3:khat");
solrQuery.setFacet(true);
solrQuery.addFacetField("field5", "field4", "field3");

The problem with this method is that the facet counts will be based on subdocuments and not on ROOT/parent documents... Is there a silver bullet for this kind of requirement?

Yours faithfully, Henning Solberg
Re: Group.Facet issue in Sharded Solr Setup
Quick follow-up on my question below: is anyone using group.facet in a sharded Solr setup? Based on further testing, the group.facet counts don't seem reliable at all for the less popular items in the facet list. -- View this message in context: http://lucene.472066.n3.nabble.com/Group-Facet-issue-in-Sharded-Solr-Setup-tp4116077p4116635.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Facets for fields in subdocuments with block join, is it possible?
Hello Henning, There is no open-source facet component for the child level of block join. There isn't even an open JIRA for this, so I don't think that approach helps.

On 11.02.2014 12:22, Henning Ivan Solberg h...@lovdata.no wrote: Hello, I'm testing block join in Solr 4.6.1 and wondering: is it possible to get facets for fields in subdocuments, with the number of hits based on ROOT documents?
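Pending such a component, one way to picture the client-side workaround is to roll child-field values up to their parent and count each value at most once per parent. The sketch below is illustrative only (plain Java over an already-fetched parent-to-children map, not SolrJ API):

```java
import java.util.*;

// Sketch: parent-level counts for a child field. Each parent maps to the
// child-field values of its subdocuments; a value is counted once per parent,
// no matter how many children of that parent carry it.
public class ParentLevelFacets {
    static Map<String, Integer> parentCounts(Map<String, List<String>> childValuesByParent) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> childValues : childValuesByParent.values()) {
            // dedupe within one parent so counts are per-parent, not per-child
            for (String v : new HashSet<>(childValues)) {
                counts.merge(v, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, List<String>> byParent = new HashMap<>();
        byParent.put("doc1", Arrays.asList("purchase", "sale"));
        byParent.put("doc2", Arrays.asList("sale", "sale")); // two children, same value
        System.out.println(parentCounts(byParent)); // "sale" counts once for doc2
    }
}
```

This is exactly the parent-based counting the {!child of=} facets fail to give you; the cost of fetching all children for the result set is the obvious downside.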
Set up embedded Solr container and cores programmatically to read their configs from the classpath
Hi, I have an application with an embedded Solr instance (and I want to keep it embedded). So far I have been setting up my Solr installation programmatically, using folder paths to specify where the specific container or core configs are. I have used the CoreContainer methods createAndLoad and create with File arguments, and this works fine.

However, now I want to change this so that all configuration files are loaded from certain locations via the classloader, but I have not been able to get this to work. E.g. I want to have my Solr config located in the classpath at my/base/package/solr/conf and the core configs at my/base/package/solr/cores/core1/conf, my/base/package/solr/cores/core2/conf, etc. Is this possible at all? Looking through the source code, it seems that specifying classpath resources in such a qualified way is not supported, but I may be wrong.

I could get this to work for the container by supplying my own implementation of SolrResourceLoader that allows a base path to be specified for the resources to be loaded. (I first thought that would happen already when specifying instanceDir accordingly, but looking at the code it does not: for resources loaded through the classloader, instanceDir is not prepended.) However, I am then stuck with the loading of the cores' resources, as the respective code (see org.apache.solr.core.CoreContainer#createFromLocal) instantiates a SolrResourceLoader internally.

Thanks for any help with this (be it a clarification that it is not possible). Robert
How to Learn Linked Configuration for SolrCloud at Zookeeper
Hi; I've written code that can upload a file to Zookeeper for SolrCloud. Currently I have many configurations in Zookeeper for SolrCloud. I want to update the synonyms.txt file, so I need to know the currently linked configuration (I will update the synonyms.txt file under the appropriate configuration folder). How can I find it out? Thanks; Furkan KAMACI
Re: How to Learn Linked Configuration for SolrCloud at Zookeeper
For a particular collection or core? There should be a collection.configName property specified for the core or collection, which tells you which ZK config directory is being used. Alan Woodward www.flax.co.uk

On 11 Feb 2014, at 11:49, Furkan KAMACI wrote: Hi; I've written code that can upload a file to Zookeeper for SolrCloud.
Re: How to Learn Linked Configuration for SolrCloud at Zookeeper
I am looking for it for a particular collection.

2014-02-11 13:55 GMT+02:00 Alan Woodward a...@flax.co.uk: For a particular collection or core? There should be a collection.configName property specified for the core or collection, which tells you which ZK config directory is being used.
Re: How to Learn Linked Configuration for SolrCloud at Zookeeper
Hi; OK, I've checked the source code and implemented this:

public String readConfigName(SolrZkClient zkClient, String collection)
    throws KeeperException, InterruptedException {
  String configName = null;
  String path = ZkStateReader.COLLECTIONS_ZKNODE + "/" + collection;
  LOGGER.info("Load collection config from: " + path);
  byte[] data = zkClient.getData(path, null, null, true);
  if (data != null) {
    ZkNodeProps props = ZkNodeProps.load(data);
    configName = props.getStr(CONFIGNAME_PROP);
  }
  if (configName != null && !zkClient.exists(CONFIGS_ZKNODE + "/" + configName, true)) {
    LOGGER.error("Specified config does not exist in ZooKeeper: " + configName);
    throw new ZooKeeperException(SolrException.ErrorCode.SERVER_ERROR,
        "Specified config does not exist in ZooKeeper: " + configName);
  }
  return configName;
}

So I can get the linked configuration name. Thanks; Furkan KAMACI

2014-02-11 13:57 GMT+02:00 Furkan KAMACI furkankam...@gmail.com: I am looking for it for a particular collection.
Re: Lowering query time
I'd like to thank you for lending a hand on my query time problem with SolrCloud. By switching to a single-shard-with-replicas setup, I've reduced my query time to 18 msec. My full ingestion of 300k+ documents went down from 2 hours 50 minutes to 1 hour 40 minutes. There are some code changes going in that should help a bit as well. Big thanks to everyone who had suggestions.

On Tue, Feb 4, 2014 at 8:11 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: I suspect faceting is the issue here. The actual query you have shown seems to bring back a single document (or a single set of documents for a product): fq=id:(320403401). On the other hand, you are asking for 4 field facets: facet.field=q_virtualCategory_ss facet.field=q_brand_s facet.field=q_color_s facet.field=q_category_ss AND 2 range facets, both clustered/grouped: facet.range=daysSinceStart_i facet.range=activePrice_l (e.g. f.activePrice_l.facet.range.gap=5000). And for all facets you have asked to bring back ALL of the results: facet.limit=-1. Plus, you are doing a complex sort: sort=popularity_i desc,popularity_i desc. So, you are probably spending quite a bit of time counting (especially in a sharded setup) and then quite a bit more sending the response back. I would check the size of the result document (HTTP result) and see how large it is. Maybe you don't need all of the stuff that's coming back. I assume you are not actually querying Solr from the client's machine (that is, I hope it is inside your data centre, close to your web server); otherwise I would say to look at automatic content compression as well to minimize on-wire document size. Finally, if your documents have many stored fields (stored=true in schema.xml) but you only return small subsets of them during search, you could look into using the enableLazyFieldLoading flag in the solrconfig. Regards, Alex. P.s. As others said, you don't seem to have too many documents. Perhaps you want replication instead of sharding for improved performance.
Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Wed, Feb 5, 2014 at 6:31 AM, Alexey Kozhemiakin alexey_kozhemia...@epam.com wrote: Btw, timing for distributed requests is broken at the moment; it doesn't combine values from requests to shards. I'm working on a patch. https://issues.apache.org/jira/browse/SOLR-3644

-----Original Message----- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Tuesday, February 04, 2014 22:00 To: solr-user@lucene.apache.org Subject: Re: Lowering query time Add the debug=true parameter to some test queries and look at the timing section to see which search components are taking the time. Traditionally, highlighting for large documents was a top culprit. Are you returning a lot of data or field values? Sometimes reducing the amount of data processed can help. Any multivalued fields with lots of values? -- Jack Krupansky

-----Original Message----- From: Joel Cohen Sent: Tuesday, February 4, 2014 1:43 PM To: solr-user@lucene.apache.org Subject: Re: Lowering query time 1. We are faceting. I'm not a developer, so I'm not quite sure how we're doing it. How can I measure? 2. I'm not sure how we'd force this kind of document partitioning. I can see how my shards are partitioned by looking at the clusterstate.json from Zookeeper, but I don't have a clue on how to get documents into specific shards. Would I be better off with fewer shards, given the small size of my indexes?

On Tue, Feb 4, 2014 at 12:32 PM, Yonik Seeley yo...@heliosearch.com wrote: On Tue, Feb 4, 2014 at 12:12 PM, Joel Cohen joel.co...@bluefly.com wrote: I'm trying to get the query time down to ~15 msec. Anyone have any tuning recommendations? I guess it depends on what the slowest part of the query currently is. If you are faceting, it's often that.
Also, it's often a big win if you can somehow partition documents such that requests can normally be serviced from a single shard. -Yonik http://heliosearch.org - native off-heap filters and fieldcache for solr

-- joel cohen, senior system engineer e joel.co...@bluefly.com p 212.944.8000 x276 bluefly, inc. 42 w. 39th st. new york, ny 10018 www.bluefly.com http://www.bluefly.com/?referer=autosig | *fly since 2013...*
Urgent Help. Best Way to have multiple OR Conditions for same field in SOLR
Hi, I am new to Solr. We have CRM data for contacts and companies, numbering in the millions, and we have switched to Solr for fast search results.

PROBLEM: We have large inclusion and exclusion lists with names of companies or contacts. Ex: Include or Exclude: Company A, Company B, Company C, ... Company n, where n can run into the thousands. What would be the best way to do this kind of query using Solr?

WHAT I HAVE TRIED: Setting q = field_name:(companyA OR companyB ... OR companyN). This works only for a list of 400 odd entries.

Looking forward to assistance on this. Thank You, Rajeev.

-- View this message in context: http://lucene.472066.n3.nabble.com/Urgent-Help-Best-Way-to-have-multiple-OR-Conditions-for-same-field-in-SOLR-tp4116681.html Sent from the Solr - User mailing list archive at Nabble.com.
solr-query with NOT and OR operator
Hi, my Solr request contains the following filter query: fq=((-(field1:value1)))+OR+(field2:value2). I expect Solr to deliver documents matching ((-(field1:value1))) as well as documents matching (field2:value2). But Solr delivers only the documents that are the result of (field2:value2). I do receive several documents if I request only ((-(field1:value1))). Thanks! Johannes
Re: solr-query with NOT and OR operator
http://wiki.apache.org/solr/CommonQueryParameters#debugQuery and http://wiki.apache.org/solr/CommonQueryParameters#explainOther usually help a lot.

On Tue, Feb 11, 2014 at 7:57 PM, Johannes Siegert johannes.sieg...@marktjagd.de wrote: Hi, my Solr request contains the following filter query: fq=((-(field1:value1)))+OR+(field2:value2).

-- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Tf-Idf for a specific query
Hi Erick, Slower queries for getting facets can be tolerated, as long as they don't affect those without facets. The requirement is for a separate query which can get me both term vectors and facet counts. One issue I am facing is that, for a search query, I only want the term vectors and facet counts, but not the results/docs. If I set rows=0, then term vectors are not returned. Could you suggest some way to achieve the above? It would also be helpful to have a way to get the aggregate TF of a term (across all docs in the query). Regards, David

On Sat, Feb 8, 2014 at 10:49 AM, Erick Erickson erickerick...@gmail.com wrote: David: If you're, say, faceting on fields with lots of unique values, this will be quite expensive. No idea whether you can tolerate slower queries or not, just sayin'. Erick

On Fri, Feb 7, 2014 at 5:35 PM, David Miller davthehac...@gmail.com wrote: Thanks Mikhail, It seems that this was what I was looking for. Being new to this, I wasn't aware of such a use of facets. Now I can probably combine the term vectors and facets to fit my scenario. Regards, Dave

On Fri, Feb 7, 2014 at 2:43 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: David, I can imagine that DF for a resultset is facets! On Fri, Feb 7, 2014 at 11:26 PM, David Miller davthehac...@gmail.com wrote: Hi Mikhail, The DF seems to be based on the entire document set. What I require is based on the results of a single query. Suppose my Solr query returns a set of 50K documents from a superset of 10 million documents; I need to calculate the DF based on just those 50K documents. But currently it seems to be calculated on the entire doc set. So, is there any way to get the DF or IDF based only on the docs returned by the query?
Regards, Dave

On Fri, Feb 7, 2014 at 5:15 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Hello Dave, you can get DF from http://wiki.apache.org/solr/TermsComponent (invert it yourself); then, for a certain term, you can get the number of occurrences per document via http://wiki.apache.org/solr/FunctionQuery#tf

On Fri, Feb 7, 2014 at 3:58 AM, David Miller davthehac...@gmail.com wrote: Hi guys, I need to obtain the tf-idf score from Solr for a certain set of documents. But the catch is that I need the IDF (or DF) to be calculated on the documents returned by the specific query and not on the entire corpus. Please give me some hint on whether Solr has this feature, or whether I can use the Lucene API directly to achieve this. Thanks in advance, Dave

-- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
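If Solr won't compute DF over just the result subset, it can be done client-side once you have the terms of each returned document (e.g. from term vectors). A minimal sketch of that subset-restricted DF/IDF, assuming the per-document term sets have already been fetched (names are illustrative):

```java
import java.util.*;

// Sketch: document frequency restricted to a result subset. Given the set of
// terms present in each returned document, count in how many of those
// documents each term appears, ignoring the rest of the corpus.
public class SubsetDf {
    static Map<String, Integer> df(List<Set<String>> termsPerDoc) {
        Map<String, Integer> df = new HashMap<>();
        for (Set<String> terms : termsPerDoc) {
            for (String t : terms) {
                df.merge(t, 1, Integer::sum);
            }
        }
        return df;
    }

    // IDF over the subset, using the classic 1 + ln(N / df) form
    static double idf(int subsetSize, int docFreq) {
        return 1.0 + Math.log((double) subsetSize / docFreq);
    }

    public static void main(String[] args) {
        List<Set<String>> docs = Arrays.asList(
            new HashSet<>(Arrays.asList("solr", "facet")),
            new HashSet<>(Arrays.asList("solr")));
        System.out.println(df(docs)); // "solr" appears in both docs, "facet" in one
    }
}
```

For 50K documents this costs one pass over their term vectors, which is the price of not having corpus-independent statistics server-side.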
Re: solr-query with NOT and OR operator
With so many parentheses in there, I wonder what you are really trying to do. Try expressing your query in simple English first so that we can understand your goal. But generally, a purely negative nested query must have a *:* term to apply the exclusion against: fq=((*:* -(field1:value1)))+OR+(field2:value2). -- Jack Krupansky

-----Original Message----- From: Johannes Siegert Sent: Tuesday, February 11, 2014 10:57 AM To: solr-user@lucene.apache.org Subject: solr-query with NOT and OR operator Hi, my Solr request contains the following filter query: fq=((-(field1:value1)))+OR+(field2:value2).
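The behavior Jack describes can be modeled with plain set operations: OR collects the matches of each clause, and a clause that is only an exclusion matches nothing on its own unless it is anchored against the full document set with *:*. A toy model (not Lucene code) of the two filter queries:

```java
import java.util.*;

// Toy model of why "(-A) OR B" returns only B's matches, while
// "(*:* -A) OR B" returns the expected union.
public class NegativeClauseDemo {
    static Set<Integer> union(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new HashSet<>(a);
        r.addAll(b);
        return r;
    }

    static Set<Integer> minus(Set<Integer> all, Set<Integer> excluded) {
        Set<Integer> r = new HashSet<>(all);
        r.removeAll(excluded);
        return r;
    }

    public static void main(String[] args) {
        Set<Integer> allDocs = new HashSet<>(Arrays.asList(1, 2, 3, 4));
        Set<Integer> matchesA = new HashSet<>(Arrays.asList(1, 2)); // field1:value1
        Set<Integer> matchesB = new HashSet<>(Arrays.asList(2, 3)); // field2:value2

        // "(-A) OR B": the pure-negative clause contributes nothing on its own.
        Set<Integer> broken = union(Collections.<Integer>emptySet(), matchesB);

        // "(*:* -A) OR B": the exclusion is applied against all documents first.
        Set<Integer> fixed = union(minus(allDocs, matchesA), matchesB);

        System.out.println(broken); // only B's docs
        System.out.println(fixed);  // docs not matching A, plus B's docs
    }
}
```

The `broken` set is exactly what Johannes observed: only the (field2:value2) results come back.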
Re: Lowering query time
Hmmm, I'm still a little puzzled, BTW. 300K documents, unless they're huge, shouldn't be taking 100 minutes. I can index 11M documents (a Wikipedia dump) on my laptop in 45 minutes, for instance. Of course that's a single core, not cloud, and no replicas... So possibly it's on the data acquisition side? Is your Solr CPU pegged? YMMV of course. Erick

On Tue, Feb 11, 2014 at 6:40 AM, Joel Cohen joel.co...@bluefly.com wrote: I'd like to thank you for lending a hand on my query time problem with SolrCloud.
Re: Urgent Help. Best Way to have multiple OR Conditions for same field in SOLR
Right, 10K Boolean clauses are not very efficient. You can actually up the limit here, but still... Consider a post filter; here's a place to start: http://lucene.apache.org/solr/4_3_1/solr-core/org/apache/solr/search/PostFilter.html Best, Erick

On Tue, Feb 11, 2014 at 6:47 AM, rajeev.nadgauda rajeev.nadga...@leadenrich.com wrote: Hi, I am new to Solr. We have CRM data for contacts and companies, numbering in the millions, and we have switched to Solr for fast search results.
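To illustrate the idea behind Erick's post-filter suggestion (this is not actual Solr PostFilter code, just the core data-structure choice it rests on): instead of a query with thousands of OR clauses, keep the include/exclude list in a hash set and test each candidate document with an O(1) lookup.

```java
import java.util.*;

// Illustration of the post-filter idea: a large include/exclude company list
// held in a HashSet, tested per candidate document. A real Solr PostFilter
// would apply this check in its collect() method after the main query matches.
public class CompanyListFilter {
    private final Set<String> companies;
    private final boolean exclude;

    CompanyListFilter(Collection<String> companies, boolean exclude) {
        this.companies = new HashSet<>(companies);
        this.exclude = exclude;
    }

    boolean accept(String companyName) {
        boolean inList = companies.contains(companyName);
        return exclude ? !inList : inList;
    }

    public static void main(String[] args) {
        CompanyListFilter include =
            new CompanyListFilter(Arrays.asList("Company A", "Company B"), false);
        System.out.println(include.accept("Company A")); // in the include list
        System.out.println(include.accept("Company Z")); // not in the list
    }
}
```

Newer Solr releases also grew a {!terms} query parser aimed at exactly this many-values-on-one-field case, but for the 4.6-era setup in this thread a post filter is the suggested route.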
Re: solr-query with NOT and OR operator
Solr/Lucene is not strictly Boolean logic; this trips up a lot of people. Excellent blog on the subject here: http://searchhub.org/dev/2011/12/28/why-not-and-or-and-not/ Best, Erick

On Tue, Feb 11, 2014 at 8:22 AM, Jack Krupansky j...@basetechnology.com wrote: With so many parentheses in there, I wonder what you are really trying to do.
Re: Is 'optimize' necessary for a 45-segment Solr 4.6 index?
On 2/11/2014 3:27 AM, Jäkel, Guido wrote: Dear Shawn, On 2/9/2014 11:41 PM, Arun Rangarajan wrote: I have a 28 GB Solr 4.6 index with 45 segments. Optimize failed with an 'out of memory' error. Is optimize really necessary, since I read that Lucene is able to handle multiple segments well now?

It seems I'm currently running into the same problem while migrating from Solr 1.4 to Solr 4.6.1. I run into OOM problems -- after running a full, fresh re-index of our catalogue data -- while optimizing an ~80GB core on a 16GB JVM. After about one hour the heap explodes within a minute while creating the compound file _5b2.cfs. How do I deal with this? Does it happen because there are too many small segments (about 30 at 1-4GB each) before the optimize? It seems they are limited to this size by the defaults of the TieredMergePolicy. And, of course: is optimize deprecated? Because it takes about an hour to reach the point of the problem, any hints or explanations will help me save a lot of time!

Replying to a privately sent email on this thread: I can't be sure that there are no memory leaks in Solr's program code, but it is a rare thing, and I'm running 4.6.1 on a large system with a smaller heap than yours without problems, so a memory leak is unlikely. My setup DOES do index optimizes. I have two guesses; it could be either or both. They are similar but not identical. There might be something else entirely, but these are the most likely:

One guess is that you don't have enough RAM, leading to a performance issue that compounds itself. Adding the optimize pushes the system over a threshold, everything slows down enough that the system tries to do too much simultaneously, and it uses all the heap. Assuming there's nothing else running on the machine, with an 80GB index and a 16GB heap, a perfectly ideal server for this index would have 96GB of RAM. You might be able to get really good performance with 48GB, but more would be better. If it were me, I don't think I'd try it with less than 64GB. http://wiki.apache.org/solr/SolrPerformanceProblems#RAM

The other guess is that your Solr config and your request/index characteristics are resulting in a lot of heap usage, so when you add an optimize on top of it, 16GB is not enough. http://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap

Thanks, Shawn
Re: solr-query with NOT and OR operator
Hi Jack, thanks! fq=((*:* -(field1:value1)))+OR+(field2:value2) is the solution. Johannes

On 11.02.2014 17:22, Jack Krupansky wrote: With so many parentheses in there, I wonder what you are really trying to do.

-- Johannes Siegert, Software Developer. Phone: 0351 - 418 894 -73 Fax: 0351 - 418 894 -99 E-Mail: johannes.sieg...@marktjagd.de Xing: https://www.xing.com/profile/Johannes_Siegert2 Website: http://www.marktjagd.de Blog: http://blog.marktjagd.de Facebook: http://www.facebook.com/marktjagd Twitter: http://twitter.com/Marktjagd __ Marktjagd GmbH | Schützenplatz 14 | D - 01067 Dresden. Managing Director: Jan Großmann. Registered in Dresden | Amtsgericht Dresden | HRB 28678
Re: Is 'optimize' necessary for a 45-segment Solr 4.6 index?
Dear Shawn, Thanks for your reply. For now, I did the merges in steps with the maxSegments param (using HOST:PORT/CORE/update?optimize=true&maxSegments=10). First I merged the 45 segments down to 10, and then from 10 to 5. (Merging from 5 to 2 again caused an out-of-memory exception.) Now I have a 5-segment index with all segments roughly of equal size. I will try using that and see if it is good enough for us.

On Sun, Feb 9, 2014 at 11:22 PM, Shawn Heisey s...@elyograg.org wrote: On 2/9/2014 11:41 PM, Arun Rangarajan wrote: I have a 28 GB Solr 4.6 index with 45 segments. Optimize failed with an 'out of memory' error. Is optimize really necessary, since I read that Lucene is able to handle multiple segments well now? I have had indexes with more than 45 segments because of the merge settings that I use. My large index shards are about 16GB at the moment. Out-of-memory errors are very rare because I use a fairly large heap, at 6GB for a machine that hosts three of these large shards. When I was still experimenting with my memory settings, I did see occasional out-of-memory errors during normal segment merging. Increasing your heap size is pretty much required at this point. I've condensed some very basic information about heap sizing here: http://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap As for whether optimizing on 4.x is necessary: I do not have any hard numbers for you, but I can tell you that an optimized index does seem noticeably faster than one that is freshly built and has a large number of relatively large segments. I optimize my index shards on a schedule, but it is relatively infrequent -- one large shard per night. Most of the time what I have is one really large segment and a bunch of super-small segments, and that does not seem to suffer from performance issues compared to a fully optimized index. The situation is different right after a fresh rebuild, which produces a handful of very large segments and a bunch of smaller segments of varying sizes.
Interesting but probably irrelevant details: Although I don't use mergeFactor any more, the TieredMergePolicy settings that I use are equivalent to a mergeFactor of 35. I chose this number back in the 1.4.1 days because it resulted in synchronicity between merges and Lucene segment names when LogByteSizeMergePolicy was still in use. Segments _0 through _z would be merged into segment _10, and so on. Thanks, Shawn
handleSelect=true with SolrCloud
I'm working on a port of a Solr service to SolrCloud. (Targeting v4.6.0 at present.) The old query style relied on using /solr/select?qt=foo to select the proper requestHandler. I know handleSelect=true is deprecated now, but it'd be pretty handy for testing to be able to be backwards compatible, at least until some time after the initial release. So in my SolrCloud configuration, I set requestDispatcher handleSelect="true" and deleted the /select requestHandler as suggested here: http://wiki.apache.org/solr/SolrRequestHandler#Old_handleSelect.3Dtrue_Resolution_.28qt_param.29 However, my /solr/collection1/select?qt=foo query throws an "unknown handler: null" error with this configuration. Has anyone successfully tried handleSelect=true with the collections api? Thanks.
boost group doclist members
Without falling into the x/y problem area, I'll explain what I want to do: I would like to group my result set by a field, f1, and within each group, I'd like to boost the score of the most appropriate member of the group so it appears first in the doc list. The most appropriate member is defined by the content of other fields (e.g., f2, f3). So basically, I'd like to boost based on the values in fields f2 and f3. If there is a better way to achieve this, I'm all ears. But I was thinking this could be achieved by using a function query as the sortspec to group.sort. Example content:

<doc>
  <field name="f1">4181770</field> <!-- integer -->
  <field name="f2">x_val</field> <!-- text -->
  <field name="f3">100</field> <!-- integer -->
</doc>
<doc>
  <field name="f1">4181770</field>
  <field name="f2">y_val</field>
  <field name="f3">100</field>
</doc>
<doc>
  <field name="f1">4181770</field>
  <field name="f2">z_val</field>
  <field name="f3">100</field>
</doc>

All 3 of the above documents will be grouped into a doclist with groupValue=4181770. My question is then: how do I make the document with f2=y_val appear first in the doclist? I've been playing with group.field=f1 group.sort=query({!dismax qf=f2 bq=f2:y_val^100}) asc ... but I'm getting: org.apache.solr.common.SolrException: Can't determine a Sort Order (asc or desc) in sort spec 'query({!dismax qf=f2 bq=f2:y_val^100.0}) asc', pos=14. Can anyone point to some examples of this? thanks David
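[One hedged idea, not verified against this exact case: the sort-spec parser appears to be tripping over the spaces inside query(...). Solr function queries support parameter dereferencing, so moving the embedded query into its own request parameter keeps the sort spec itself free of internal spaces. A sketch, with an illustrative parameter name (gsq); note also that putting the highest-scoring member first would mean sorting desc, not asc:

```
group=true&group.field=f1
&group.sort=query($gsq) desc
&gsq={!dismax qf=f2 bq=f2:y_val^100}
```

This is only a sketch of the dereferencing pattern, not a tested solution for the grouping case above.]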
Re: handleSelect=true with SolrCloud
On 2/11/2014 10:21 AM, Jeff Wartes wrote: I'm working on a port of a Solr service to SolrCloud. (Targeting v4.6.0 at present.) The old query style relied on using /solr/select?qt=foo to select the proper requestHandler. I know handleSelect=true is deprecated now, but it'd be pretty handy for testing to be able to be backwards compatible, at least until some time after the initial release. So in my SolrCloud configuration, I set requestDispatcher handleSelect="true" and deleted the /select requestHandler as suggested here: http://wiki.apache.org/solr/SolrRequestHandler#Old_handleSelect.3Dtrue_Resolution_.28qt_param.29 However, my /solr/collection1/select?qt=foo query throws an "unknown handler: null" error with this configuration. Has anyone successfully tried handleSelect=true with the collections api? I'm pretty sure that if you don't have a handler named /select, then you need to have default=true as an attribute on one of your other handler definitions. See line 715 of the example solrconfig.xml for Solr 3.5: http://svn.apache.org/viewvc/lucene/dev/branches/lucene_solr_3_5/solr/example/solr/conf/solrconfig.xml?view=annotate Thanks, Shawn
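[For reference, the kind of handler definition Shawn describes looks roughly like this in solrconfig.xml; this is a sketch modeled on the 3.x example config, and exact attribute placement may differ across versions:

```xml
<!-- With handleSelect="true" and no /select handler, requests that
     don't name a handler fall back to the one marked default="true". -->
<requestDispatcher handleSelect="true" />

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
  </lst>
</requestHandler>
```
]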
Re: handleSelect=true with SolrCloud
Got it in one. Thanks! On 2/11/14, 9:50 AM, Shawn Heisey s...@elyograg.org wrote: On 2/11/2014 10:21 AM, Jeff Wartes wrote: I'm working on a port of a Solr service to SolrCloud. (Targeting v4.6.0 at present.) The old query style relied on using /solr/select?qt=foo to select the proper requestHandler. I know handleSelect=true is deprecated now, but it'd be pretty handy for testing to be able to be backwards compatible, at least until some time after the initial release. So in my SolrCloud configuration, I set requestDispatcher handleSelect="true" and deleted the /select requestHandler as suggested here: http://wiki.apache.org/solr/SolrRequestHandler#Old_handleSelect.3Dtrue_Resolution_.28qt_param.29 However, my /solr/collection1/select?qt=foo query throws an "unknown handler: null" error with this configuration. Has anyone successfully tried handleSelect=true with the collections api? I'm pretty sure that if you don't have a handler named /select, then you need to have default=true as an attribute on one of your other handler definitions. See line 715 of the example solrconfig.xml for Solr 3.5: http://svn.apache.org/viewvc/lucene/dev/branches/lucene_solr_3_5/solr/example/solr/conf/solrconfig.xml?view=annotate Thanks, Shawn
Re: USER NAME Baruch Labunski
Hello Wiki admin, I would like to add some value links. Can you please add me? My user name is Baruch Labunski. Thank You, Baruch! On Thursday, January 16, 2014 2:12:32 PM, Baruch bar...@rogers.com wrote: Hello Wiki admin, I would like to add some value links. Can you please add me? My user name is Baruch Labunski. Thank You, Baruch!
Re: Lowering query time
It's a custom ingestion process. It does a big DB query and then inserts stuff in batches. The batch size is tuneable. On Tue, Feb 11, 2014 at 11:23 AM, Erick Erickson erickerick...@gmail.com wrote: Hmmm, I'm still a little puzzled BTW. 300K documents, unless they're huge, shouldn't be taking 100 minutes. I can index 11M documents on my laptop (Wikipedia dump) in 45 minutes, for instance. Of course that's a single core, not cloud and not replicas... So possibly it's on the data acquisition side? Is your Solr CPU pegged? YMMV of course. Erick On Tue, Feb 11, 2014 at 6:40 AM, Joel Cohen joel.co...@bluefly.com wrote: I'd like to thank you for lending a hand on my query time problem with SolrCloud. By switching to a single shard with replicas setup, I've reduced my query time to 18 msec. My full ingestion of 300k+ documents went down from 2 hours 50 minutes to 1 hour 40 minutes. There are some code changes that are going in that should help a bit as well. Big thanks to everyone that had suggestions. On Tue, Feb 4, 2014 at 8:11 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: I suspect faceting is the issue here. The actual query you have shown seem to bring back a single document (or a single set of document for a product): fq=id:(320403401) On the other hand, you are asking for 4 field facets: facet.field=q_virtualCategory_ss facet.field=q_brand_s facet.field=q_color_s facet.field=q_category_ss AND 2 range facets, both clustered/grouped: facet.range=daysSinceStart_i facet.range=activePrice_l (e.g. f.activePrice_l.facet.range.gap=5000) And for all facets you have asked to bring back ALL of the results: facet.limit=-1 Plus, you are doing a complex sort: sort=popularity_i desc,popularity_i desc So, you are probably spending quite a bit of time counting (especially in a shared setup) and then quite a bit more sending the response back. I would check the size of the result document (HTTP result) and see how large it is.
Maybe you don't need all of the stuff that's coming back. I assume you are not actually querying Solr from the client's machine (that is, I hope it is inside your data centre close to your web server), otherwise I would say to look at automatic content compression as well to minimize on-wire document size. Finally, if your documents have many stored fields (stored="true" in schema.xml) but you only return small subsets of them during search, you could look into using the enableLazyFieldLoading flag in the solrconfig. Regards, Alex. P.S. As others said, you don't seem to have too many documents. Perhaps you want replication instead of sharding for improved performance. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Wed, Feb 5, 2014 at 6:31 AM, Alexey Kozhemiakin alexey_kozhemia...@epam.com wrote: Btw, timing for distributed requests is broken at this moment; it doesn't combine values from requests to shards. I'm working on a patch. https://issues.apache.org/jira/browse/SOLR-3644 -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Tuesday, February 04, 2014 22:00 To: solr-user@lucene.apache.org Subject: Re: Lowering query time Add the debug=true parameter to some test queries and look at the timing section to see which search components are taking the time. Traditionally, highlighting for large documents was a top culprit. Are you returning a lot of data or field values? Sometimes reducing the amount of data processed can help. Any multivalued fields with lots of values? -- Jack Krupansky -Original Message- From: Joel Cohen Sent: Tuesday, February 4, 2014 1:43 PM To: solr-user@lucene.apache.org Subject: Re: Lowering query time 1. We are faceting. I'm not a developer so I'm not quite sure how we're doing it. How can I measure? 2.
I'm not sure how we'd force this kind of document partitioning. I can see how my shards are partitioned by looking at the clusterstate.json from Zookeeper, but I don't have a clue on how to get documents into specific shards. Would I be better off with fewer shards given the small size of my indexes? On Tue, Feb 4, 2014 at 12:32 PM, Yonik Seeley yo...@heliosearch.com wrote: On Tue, Feb 4, 2014 at 12:12 PM, Joel Cohen joel.co...@bluefly.com wrote: I'm trying to get the query time down to ~15 msec. Anyone have any tuning recommendations? I guess it depends on what the slowest part of the query currently is. If you are faceting, it's often that.
Re: handleSelect=true with SolrCloud
Jeff, I believe the shards.qt parameter is what you're looking for. For example, when using the /elevate handler with SolrCloud I use the following url to tell Solr to use the /elevate handler on the shards: http://localhost:8983/solr/collection1/elevate?q=ipod&wt=json&indent=true&shards.qt=/elevate Joel Bernstein Search Engineer at Heliosearch On Tue, Feb 11, 2014 at 1:01 PM, Jeff Wartes jwar...@whitepages.com wrote: Got it in one. Thanks! On 2/11/14, 9:50 AM, Shawn Heisey s...@elyograg.org wrote: On 2/11/2014 10:21 AM, Jeff Wartes wrote: I'm working on a port of a Solr service to SolrCloud. (Targeting v4.6.0 at present.) The old query style relied on using /solr/select?qt=foo to select the proper requestHandler. I know handleSelect=true is deprecated now, but it'd be pretty handy for testing to be able to be backwards compatible, at least until some time after the initial release. So in my SolrCloud configuration, I set requestDispatcher handleSelect="true" and deleted the /select requestHandler as suggested here: http://wiki.apache.org/solr/SolrRequestHandler#Old_handleSelect.3Dtrue_Resolution_.28qt_param.29 However, my /solr/collection1/select?qt=foo query throws an "unknown handler: null" error with this configuration. Has anyone successfully tried handleSelect=true with the collections api? I'm pretty sure that if you don't have a handler named /select, then you need to have default=true as an attribute on one of your other handler definitions. See line 715 of the example solrconfig.xml for Solr 3.5: http://svn.apache.org/viewvc/lucene/dev/branches/lucene_solr_3_5/solr/example/solr/conf/solrconfig.xml?view=annotate Thanks, Shawn
RE: handleSelect=true with SolrCloud
Hi Jeff, this is not about elevation; I am asking along the lines of relevancy / boost / score. Select productid from products where SKU = 101 Select Productid from products where ManufactureSKU = 101 Select Productid from product where SKU Like '101%' Select Productid from Product where ManufactureSKU like '101%' Select Productid from product where Name Like '101%' Select Productid from Product where Description like '%101%' Is there any way for Solr to search the exact-match, starts-with, and match-anywhere cases in a single Solr query? -Original Message- From: Joel Bernstein [mailto:joels...@gmail.com] Sent: Tuesday, February 11, 2014 3:11 PM To: solr-user@lucene.apache.org Subject: Re: handleSelect=true with SolrCloud Jeff, I believe the shards.qt parameter is what you're looking for. For example when using the /elevate handler with SolrCloud I use the following url to tell Solr to use the /elevate handler on the shards: http://localhost:8983/solr/collection1/elevate?q=ipod&wt=json&indent=true&shards.qt=/elevate Joel Bernstein Search Engineer at Heliosearch On Tue, Feb 11, 2014 at 1:01 PM, Jeff Wartes jwar...@whitepages.com wrote: Got it in one. Thanks! On 2/11/14, 9:50 AM, Shawn Heisey s...@elyograg.org wrote: On 2/11/2014 10:21 AM, Jeff Wartes wrote: I'm working on a port of a Solr service to SolrCloud. (Targeting v4.6.0 at present.) The old query style relied on using /solr/select?qt=foo to select the proper requestHandler. I know handleSelect=true is deprecated now, but it'd be pretty handy for testing to be able to be backwards compatible, at least until some time after the initial release. So in my SolrCloud configuration, I set requestDispatcher handleSelect="true" and deleted the /select requestHandler as suggested here: http://wiki.apache.org/solr/SolrRequestHandler#Old_handleSelect.3Dtrue_Resolution_.28qt_param.29 However, my /solr/collection1/select?qt=foo query throws an "unknown handler: null" error with this configuration.
Has anyone successfully tried handleSelect=true with the collections api? I'm pretty sure that if you don't have a handler named /select, then you need to have default=true as an attribute on one of your other handler definitions. See line 715 of the example solrconfig.xml for Solr 3.5: http://svn.apache.org/viewvc/lucene/dev/branches/lucene_solr_3_5/solr/example/solr/conf/solrconfig.xml?view=annotate Thanks, Shawn
Solr Autosuggest - Strange issue with leading numbers in query
I have a strange issue with Autosuggest. Whenever I query for a keyword along with leading numbers, it returns the suggestion corresponding to the alphabetic part (ignoring the numbers). I was under the assumption that it would return an empty result back. I am not sure what I am doing wrong. Can someone help?

*Query:*

/autocomplete?qt=/lucid&req_type=auto_complete&spellcheck.maxCollations=10&q=12342343243242ga&spellcheck.count=10

*Result:*

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
  </lst>
  <lst name="spellcheck">
    <lst name="suggestions">
      <lst name="ga">
        <int name="numFound">1</int>
        <int name="startOffset">15</int>
        <int name="endOffset">17</int>
        <arr name="suggestion">
          <str>galaxy</str>
        </arr>
      </lst>
      <str name="collation">12342343243242galaxy</str>
    </lst>
  </lst>
</response>

*My field configuration is as below:*

<fieldType class="solr.TextField" name="textSpell_word" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_autosuggest.txt"/>
  </analyzer>
</fieldType>

*SolrConfig.xml*

<searchComponent class="solr.SpellCheckComponent" name="autocomplete">
  <lst name="spellchecker">
    <str name="name">autocomplete</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">autocomplete_word</str>
    <str name="storeDir">autocomplete</str>
    <str name="buildOnCommit">true</str>
    <float name="threshold">.005</float>
  </lst>
</searchComponent>

<requestHandler class="org.apache.solr.handler.component.SearchHandler" name="/autocomplete">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">autocomplete</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.count">10</str>
    <str name="spellcheck.onlyMorePopular">false</str>
  </lst>
  <arr name="components">
    <str>autocomplete</str>
  </arr>
</requestHandler>

--
View this message in context:
http://lucene.472066.n3.nabble.com/Solr-Autosuggest-Strange-issue-with-leading-numbers-in-query-tp4116751.html Sent from the Solr - User mailing list archive at Nabble.com.
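[A small sketch of what appears to be happening, judging only from the response shown above (not from the spellchecker internals): the component treats just the trailing alphabetic token "ga" as the word to correct, and the collation is assembled by splicing the suggestion back into the original query string:

```python
# Sketch: reproduce the collation seen in the response above.
# The suggester corrected only the trailing "ga" token to "galaxy";
# the numeric prefix passed through untouched.
query = "12342343243242ga"
corrected_token = "galaxy"  # the suggestion returned in the response

# Replace the corrected span (here, the trailing "ga") with the suggestion.
collation = query[:-len("ga")] + corrected_token
print(collation)
```
]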
Re: Indexing question on individual field update
Eric, Thanks for your reply. I should have given a better context. I'm currently running an incremental crawl daily on this particular source and indexing the documents. The incremental crawl looks for any change since the last crawl date, based on the document publish date. But there's no way for me to know if a document has been deleted. To ensure that, I run a full crawl on a weekend, which basically re-indexes the entire content. After the full index is over, I call a purge script, which deletes any content which is more than 24 hours old, based on the indextimestamp field. The issue with atomic update is that it doesn't alter the indextimestamp field. So even if I run a full crawl with atomic updates, the timestamp will stick to its old value. Unfortunately, I can't rely on another date field coming from the source, as they are not consistent. That translates to the fact that I can't remove stale content. Let me know if I'm missing something here. - Thanks, Shamik -- View this message in context: http://lucene.472066.n3.nabble.com/Indexing-question-on-individual-field-update-tp4116605p4116757.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Solr server requirements for 100+ million documents
Hi Otis, Just to confirm, the 3 servers you mean here are 2 for shards/nodes and 1 for Zookeeper. Is that correct? Thanks, Susheel -Original Message- From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] Sent: Friday, January 24, 2014 5:21 PM To: solr-user@lucene.apache.org Subject: Re: Solr server requirements for 100+ million documents Hi Susheel, Like Erick said, it's impossible to give precise recommendations, but making a few assumptions and combining them with experience (+ a licked finger in the air): * 3 servers * 32 GB * 2+ CPU cores * Linux Assuming docs are not bigger than a few KB, that they are not being reindexed over and over, that you don't have a search rate higher than a few dozen QPS, assuming your queries are not a page long, etc. assuming best practices are followed, the above should be sufficient. I hope this helps. Otis -- Performance Monitoring * Log Analytics * Search Analytics Solr & Elasticsearch Support * http://sematext.com/ On Fri, Jan 24, 2014 at 1:10 PM, Susheel Kumar susheel.ku...@thedigitalgroup.net wrote: Hi, Currently we are indexing 10 million documents from a database (10 db data entities); index size is around 8 GB on a Windows virtual box. Indexing in one shot takes 12+ hours, while indexing in parallel in separate cores and merging them together takes 4+ hours. We are looking to scale to 100+ million documents and looking for recommendations on server requirements on the below parameters for a production environment. There can be 200+ users performing search at the same time. No of physical servers (considering solr cloud) Memory requirement Processor requirement (# cores) Linux as OS as opposed to Windows Thanks in advance. Susheel
Re: Solr server requirements for 100+ million documents
Hi Susheel, No, we wouldn't want to go with just 1 ZK. :) Otis -- Performance Monitoring * Log Analytics * Search Analytics Solr & Elasticsearch Support * http://sematext.com/ On Tue, Feb 11, 2014 at 5:18 PM, Susheel Kumar susheel.ku...@thedigitalgroup.net wrote: Hi Otis, Just to confirm, the 3 servers you mean here are 2 for shards/nodes and 1 for Zookeeper. Is that correct? Thanks, Susheel -Original Message- From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] Sent: Friday, January 24, 2014 5:21 PM To: solr-user@lucene.apache.org Subject: Re: Solr server requirements for 100+ million documents Hi Susheel, Like Erick said, it's impossible to give precise recommendations, but making a few assumptions and combining them with experience (+ a licked finger in the air): * 3 servers * 32 GB * 2+ CPU cores * Linux Assuming docs are not bigger than a few KB, that they are not being reindexed over and over, that you don't have a search rate higher than a few dozen QPS, assuming your queries are not a page long, etc. assuming best practices are followed, the above should be sufficient. I hope this helps. Otis -- Performance Monitoring * Log Analytics * Search Analytics Solr & Elasticsearch Support * http://sematext.com/ On Fri, Jan 24, 2014 at 1:10 PM, Susheel Kumar susheel.ku...@thedigitalgroup.net wrote: Hi, Currently we are indexing 10 million documents from a database (10 db data entities); index size is around 8 GB on a Windows virtual box. Indexing in one shot takes 12+ hours, while indexing in parallel in separate cores and merging them together takes 4+ hours. We are looking to scale to 100+ million documents and looking for recommendations on server requirements on the below parameters for a production environment. There can be 200+ users performing search at the same time. No of physical servers (considering solr cloud) Memory requirement Processor requirement (# cores) Linux as OS as opposed to Windows Thanks in advance. Susheel
RE: Solr server requirements for 100+ million documents
Thanks, Otis, for the quick reply. So for ZK do you recommend separate servers, and if so, how many for an initial SolrCloud cluster setup? -Original Message- From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] Sent: Tuesday, February 11, 2014 4:21 PM To: solr-user@lucene.apache.org Subject: Re: Solr server requirements for 100+ million documents Hi Susheel, No, we wouldn't want to go with just 1 ZK. :) Otis -- Performance Monitoring * Log Analytics * Search Analytics Solr & Elasticsearch Support * http://sematext.com/ On Tue, Feb 11, 2014 at 5:18 PM, Susheel Kumar susheel.ku...@thedigitalgroup.net wrote: Hi Otis, Just to confirm, the 3 servers you mean here are 2 for shards/nodes and 1 for Zookeeper. Is that correct? Thanks, Susheel -Original Message- From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] Sent: Friday, January 24, 2014 5:21 PM To: solr-user@lucene.apache.org Subject: Re: Solr server requirements for 100+ million documents Hi Susheel, Like Erick said, it's impossible to give precise recommendations, but making a few assumptions and combining them with experience (+ a licked finger in the air): * 3 servers * 32 GB * 2+ CPU cores * Linux Assuming docs are not bigger than a few KB, that they are not being reindexed over and over, that you don't have a search rate higher than a few dozen QPS, assuming your queries are not a page long, etc. assuming best practices are followed, the above should be sufficient. I hope this helps. Otis -- Performance Monitoring * Log Analytics * Search Analytics Solr & Elasticsearch Support * http://sematext.com/ On Fri, Jan 24, 2014 at 1:10 PM, Susheel Kumar susheel.ku...@thedigitalgroup.net wrote: Hi, Currently we are indexing 10 million documents from a database (10 db data entities); index size is around 8 GB on a Windows virtual box. Indexing in one shot takes 12+ hours, while indexing in parallel in separate cores and merging them together takes 4+ hours.
We are looking to scale to 100+ million documents and looking for recommendations on server requirements on the below parameters for a production environment. There can be 200+ users performing search at the same time. No of physical servers (considering solr cloud) Memory requirement Processor requirement (# cores) Linux as OS as opposed to Windows Thanks in advance. Susheel
Re: Indexing question on individual field update
Ok, I was wrong here. I can always set the indextimestamp field with the current time (NOW) for every atomic update. On a similar note, is there any performance constraint with updates compared to adds? -- View this message in context: http://lucene.472066.n3.nabble.com/Indexing-question-on-individual-field-update-tp4116605p4116772.html Sent from the Solr - User mailing list archive at Nabble.com.
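[Setting the timestamp during an atomic update might look like the sketch below, which just builds the JSON payload for the /update endpoint. The field names and id are made up, and whether "NOW" is interpreted as date math depends on indextimestamp being a date field:

```python
import json

# Sketch of an atomic-update document: the "set" modifier replaces a
# field's value, and "NOW" is Solr date math for the current time
# (valid only for date-typed fields).
update_doc = {
    "id": "doc-123",                   # made-up unique key
    "indextimestamp": {"set": "NOW"},  # refresh the timestamp on each update
}
payload = json.dumps([update_doc])     # /update accepts a JSON array of docs
print(payload)
```
]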
Re: Solr server requirements for 100+ million documents
ZK needs a quorum to stay functional, so 3 servers handle one failure and 5 handle 2 node failures. If you run Solr with 1 replica per shard, then stick to 3 ZK nodes. If you use 2 replicas, use 5 ZK nodes.
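[The majority rule svante describes is simple integer arithmetic; a minimal sketch of the mapping from ensemble size to tolerated failures:

```python
# ZooKeeper stays available while a strict majority (quorum) of the
# ensemble is up, so an ensemble of n nodes tolerates floor((n-1)/2)
# simultaneous node failures.
def tolerated_failures(ensemble_size):
    return (ensemble_size - 1) // 2

for n in (1, 3, 5):
    print(n, "nodes ->", tolerated_failures(n), "failure(s) tolerated")
```

Note this quorum math depends only on the ZooKeeper ensemble size, not on the Solr replica count.]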
Replica node down but zookeeper clusterstate not updated
Solr = 4.6.1, attached solrcloud admin console view Zookeeper 3.4.5 = 3 node ensemble In my test setup, I have a 3 node SolrCloud setup with 2 shards. Today we had a power failure and all nodes went down. I started the 3 node ZooKeeper ensemble first, then followed with the 3 node SolrCloud. One replica's IP address was changed due to dynamic IP allocation, but the ZooKeeper clusterstate was not updated with the new IP address; it was still holding the old IP address for that bad node. Do I need to manually update the clusterstate in ZooKeeper? What are my options if this happens in production? Bad node: old IP: 10.249.132.35 (still exists in ZooKeeper) new IP: 10.249.133.10 Log from Node1: 11:26:25,242 INFO [STDOUT] 49170786 [Thread-2-EventThread] INFO org.apache.solr.common.cloud.ZkStateReader - A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 3) 11:26:41,072 INFO [STDOUT] 49186615 [RecoveryThread] INFO org.apache.solr.cloud.ZkController - publishing core=genre_shard1_replica1 state=recovering 11:26:41,079 INFO [STDOUT] 49186622 [RecoveryThread] ERROR org.apache.solr.cloud.RecoveryStrategy - Error while trying to recover.
core=genre_shard1_replica1:org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://10.249.132.35:8080/solr 11:26:41,079 INFO [STDOUT] at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:496) 11:26:41,079 INFO [STDOUT] at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197) 11:26:41,079 INFO [STDOUT] at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:221) 11:26:41,079 INFO [STDOUT] at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:367) 11:26:41,079 INFO [STDOUT] at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:244) 11:26:41,079 INFO [STDOUT] Caused by: org.apache.http.conn.HttpHostConnectException: Connection to http://10.249.132.35:8080 refused 11:27:14,036 INFO [STDOUT] 49219580 [RecoveryThread] ERROR org.apache.solr.cloud.RecoveryStrategy - Recovery failed - trying again... (9) core=geo_shard1_replica1 11:27:14,037 INFO [STDOUT] 49219581 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy - Wait 600.0 seconds before trying to recover again (10) 11:27:14,958 INFO [STDOUT] 49220498 [Thread-40] INFO org.apache.solr.common.cloud.ZkStateReader - Updating cloud state from ZooKeeper... Log from bad node with new ip address: 11:06:29,551 INFO [STDOUT] 6234 [coreLoadExecutor-4-thread-10] INFO org.apache.solr.cloud.ShardLeaderElectionContext - Enough replicas found to continue.
11:06:29,552 INFO [STDOUT] 6236 [coreLoadExecutor-4-thread-10] INFO org.apache.solr.cloud.ShardLeaderElectionContext - I may be the new leader - try and sync 11:06:29,554 INFO [STDOUT] 6237 [coreLoadExecutor-4-thread-10] INFO org.apache.solr.cloud.SyncStrategy - Sync replicas to http://10.249.132.35:8080/solr/venue_shard2_replica2/ 11:06:29,555 INFO [STDOUT] 6239 [coreLoadExecutor-4-thread-10] INFO org.apache.solr.update.PeerSync - PeerSync: core=venue_shard2_replica2 url=http://10.249.132.35:8080/solr START replicas=[ http://10.249.132.56:8080/solr/venue_shard2_replica1/] nUpdates=100 11:06:29,556 INFO [STDOUT] 6240 [coreLoadExecutor-4-thread-10] INFO org.apache.solr.update.PeerSync - PeerSync: core=venue_shard2_replica2 url=http://10.249.132.35:8080/solr DONE. We have no versions. sync failed. 11:06:29,556 INFO [STDOUT] 6241 [coreLoadExecutor-4-thread-10] INFO org.apache.solr.cloud.SyncStrategy - Leader's attempt to sync with shard failed, moving to the next candidate 11:06:29,558 INFO [STDOUT] 6241 [coreLoadExecutor-4-thread-10] INFO org.apache.solr.cloud.ShardLeaderElectionContext - We failed sync, but we have no versions - we can't sync in that case - we were active before, so become leader anyway 11:06:29,559 INFO [STDOUT] 6243 [coreLoadExecutor-4-thread-10] INFO org.apache.solr.cloud.ShardLeaderElectionContext - I am the new leader: http://10.249.132.35:8080/solr/venue_shard2_replica2/ shard2 11:06:29,561 INFO [STDOUT] 6245 [coreLoadExecutor-4-thread-10] INFO org.apache.solr.common.cloud.SolrZkClient - makePath: /collections/venue/leaders/shard2 11:06:29,577 INFO [STDOUT] 6261 [Thread-2-EventThread] INFO org.apache.solr.update.PeerSync - PeerSync: core=event_shard2_replica2 url=http://10.249.132.35:8080/solr Received 18 versions from 10.249.132.56:8080/solr/event_shard2_replica1/ 11:06:29,578 INFO [STDOUT] 6263 [Thread-2-EventThread] INFO org.apache.solr.update.PeerSync - PeerSync: core=event_shard2_replica2 url=http://10.249.132.35:8080/solr Requesting
updates from 10.249.132.56:8080/solr/event_shard2_replica1/ n=10 versions=[1457764666067386368, 1456709993140060160, 1456709989863260160, 1456709986075803648, 1456709971758546944, 1456709179685208064, 1456709137524064256,
Re: Indexing question on individual field update
On 2/11/2014 2:37 PM, shamik wrote: Eric, Thanks for your reply. I should have given a better context. I'm currently running an incremental crawl daily on this particular source and indexing the documents. The incremental crawl looks for any change since the last crawl date, based on the document publish date. But there's no way for me to know if a document has been deleted. To ensure that, I ran a full crawl on a weekend, which basically re-indexes the entire content. After the full index is over, I call a purge script, which deletes any content which is more than 24 hours old, based on the indextimestamp field. The issue with atomic update is that it doesn't alter the indextimestamp field. So even if I run a full crawl with atomic updates, the timestamp will stick to its old value. Unfortunately, I can't rely on another date field coming from the source as they are not consistent. That translates to the fact that I can't remove stale content. One possibility is this: When you send the atomic update to Solr, include a new value for the indextimestamp field. Another option: You can write a custom update processor plugin for Solr. When the custom code is used, it will be executed on each incoming document. Depending on what it finds in the update request, it can make appropriate changes, like updating indextimestamp. You can do pretty much anything. http://wiki.apache.org/solr/UpdateRequestProcessor Writing an update processor in Java typically gives the best results in terms of flexibility and performance, but there is also a way to use other programming languages: http://wiki.apache.org/solr/ScriptUpdateProcessor Thanks, Shawn
Re: Solr server requirements for 100+ million documents
Whether you use the same machines as Solr or separate machines is a matter suited to taste. If you are the CTO, then you should make this decision. If not, inform management that risk conditions are greater when you share function and control on a single piece of hardware. A single failure of a replica + zookeeper node will be more impactful than a single failure of a replica *or* a zookeeper node. Let them earn the big bucks to make the risk decision. The good news is, zookeeper hardware can be extremely lightweight for Solr Cloud. Commodity hardware should work just fine…and thus scaling to 5 nodes for zookeeper is not that hard at all. Jason On Feb 11, 2014, at 3:00 PM, svante karlsson s...@csi.se wrote: ZK needs a quorum to keep functional so 3 servers handles one failure. 5 handles 2 node failures. If you Solr with 1 replica per shard then stick to 3 ZK. If you use 2 replicas use 5 ZK
Re: Solr server requirements for 100+ million documents
On 2/11/2014 3:28 PM, Susheel Kumar wrote: Thanks, Otis for quick reply. So for ZK do you recommend separate servers and if so how many for initial Solr cloud cluster setup. In a minimal 3-server setup, all servers would run zookeeper and two of them would also run Solr. With this setup, you can survive the failure of any of those three machines, even if it dies completely. If the third machine is only running zookeeper, two fast CPU cores and 2GB of RAM would be plenty. For 100 million documents, I would personally recommend at least 8 CPU cores on the machines running Solr, ideally provided by at least two separate physical CPUs. Otis recommended 32GB of RAM as a starting point. You would very likely want more. One copy of my 90 million document index uses two servers to run all the shards. Because I have two copies of the index, I have four servers. Each server has 64GB of RAM. This is **NOT** running SolrCloud, but if it were, I would have zookeeper running on three of those servers. Thanks, Shawn
Re: FuzzyLookupFactory with exactMatchFirst not giving the exact match.
I've tried the new SuggestComponent; however, it doesn't work quite as expected. It returns the full field value rather than a list of corrections for the specific term. I can see how SuggestComponent would be excellent for phrase suggestions and document lookups, but it doesn't seem to be suitable for per-word spelling suggestions. Correct me if I'm wrong. I'm taking another look at solr.SpellCheckComponent. I've switched on `spellcheck.extendedResults` but the response `correctlySpelled` is always false, regardless of other settings. It seems it's an example of SOLR-4278. In that ticket James Dyer says: You can tell if the user's keywords exist in the index on a term-by-term basis by specifying spellcheck.extendedResults=true. Then look under each <lst name="ORIG_KEYWORD"> for <int name="origFreq">0</int>. This would suit me perfectly - but `origFreq` does not appear in the response at all. I'm looking at that code, but tracing down how the token frequency is added is leading me down a deep and dark rabbit hole :). Am I missing something basic here? On Tue, Feb 11, 2014 at 3:59 PM, Areek Zillur areek...@gmail.com wrote: Don't worry about the analysis chain; I realized you are using the spellcheck component for suggestions. The suggestion gets returned from the Lucene layer, but unfortunately the Spellcheck component strips the suggestion out, as it is mainly built for spell checking (when the query token == suggestion, spelling is correct, so why suggest it!). You can try out the SuggestComponent (SOLR-5378); it does the right thing in this situation. On Mon, Feb 10, 2014 at 9:30 PM, Areek Zillur areek...@gmail.com wrote: That should not be the case. Maybe the analysis chain of 'text_spell' is doing something before the key hits the suggester (you want to use something like KeywordTokenizerFactory)? Also, maybe specify the queryAnalyzerFieldType in the suggest component config? 
you might want to do something similar to this solr-config: ( https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/test-files/solr/collection1/conf/solrconfig-phrasesuggest.xml ) [look at the suggest_analyzing component] and schema: ( https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/test-files/solr/collection1/conf/schema-phrasesuggest.xml ) [look at the phrase_suggest field type]. On Mon, Feb 10, 2014 at 8:44 PM, Hamish Campbell hamish.campb...@koordinates.com wrote: Same issue with AnalyzingLookupFactory - I'll get autocomplete suggestions but not the original query. On Tue, Feb 11, 2014 at 1:57 PM, Areek Zillur areek...@gmail.com wrote: The FuzzyLookupFactory should accept all the same options as the AnalyzingLookupFactory ( http://lucene.apache.org/solr/4_2_1/solr-core/org/apache/solr/spelling/suggest/fst/AnalyzingLookupFactory.html ). [FuzzySuggester is a direct subclass of AnalyzingSuggester in Lucene.] Have you tried exactMatchFirst with the AnalyzingLookupFactory? Does AnalyzingLookup have the same problem with the exactMatchFirst option? On Mon, Feb 10, 2014 at 6:00 PM, Hamish Campbell hamish.campb...@koordinates.com wrote: Looking at: http://lucene.apache.org/solr/4_2_1/solr-core/org/apache/solr/spelling/suggest/fst/FuzzyLookupFactory.html It seems that exactMatchFirst is not a valid option for FuzzyLookupFactory. Potential workarounds? On Mon, Feb 10, 2014 at 5:04 PM, Hamish Campbell hamish.campb...@koordinates.com wrote: Hi all, I've got a FuzzyLookupFactory spellchecker with exactMatchFirst enabled. A query like tes will return test and testing, but a query for test will *not* return test even though it is clearly in the dictionary. Why would this be? 
Relevant config follows:

  <searchComponent class="solr.SpellCheckComponent" name="suggest">
    <lst name="spellchecker">
      <str name="name">suggest</str>
      <!-- Implementation -->
      <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
      <str name="lookupImpl">org.apache.solr.spelling.suggest.fst.FuzzyLookupFactory</str>
      <!-- Properties -->
      <bool name="preserveSep">false</bool>
      <bool name="exactMatchFirst">true</bool>
      <str name="suggestAnalyzerFieldType">text_spell</str>
      <float name="threshold">0.005</float>
      <!-- Do not build on each commit, bad for performance. See cron.
      <str name="buildOnCommit">false</str>
      -->
      <!-- Source -->
      <str name="field">suggest</str>
    </lst>
  </searchComponent>
  <requestHandler name="/suggest" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="spellcheck">true</str>
      <str name="spellcheck.dictionary">suggest</str>
      <str name="spellcheck.onlyMorePopular">true</str>
      <str
Re: FuzzyLookupFactory with exactMatchFirst not giving the exact match.
Ah, I think the term frequency is only available for the spellcheckers rather than the suggesters, so I tried a DirectSolrSpellChecker. This gave me good spelling suggestions for misspelt terms, but if the term is spelled correctly I again get no term information and correctlySpelled is false. Back to square one. On Wed, Feb 12, 2014 at 12:37 PM, Hamish Campbell hamish.campb...@koordinates.com wrote: I've tried the new SuggestComponent; however, it doesn't work quite as expected. ...
Solr performance on a very huge data set
Hello Dear, I have 1000 GB of data that I want to index. Assume I have enough space for storing the indexes on a single machine. *I would like to get an idea of Solr search performance on a huge data set. Do I need to use shards to improve Solr search efficiency, or is it OK to search without sharding?* I will use SolrCloud for high availability and fault tolerance, with the help of ZooKeeper. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-performance-on-a-very-huge-data-set-tp4116792.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: USER NAME Baruch Labunski
Baruch: Is that your Wiki ID? We need that. But sure, we'll be happy to add you to the list... On Tue, Feb 11, 2014 at 11:03 AM, Baruch bar...@rogers.com wrote: Hello Wiki admin, I would like to add some valuable links. Can you please add me? My user name is Baruch Labunski. Thank You, Baruch! On Thursday, January 16, 2014 2:12:32 PM, Baruch bar...@rogers.com wrote: Hello Wiki admin, I would like to add some valuable links. Can you please add me? My user name is Baruch Labunski. Thank You, Baruch!
Re: Lowering query time
So my guess is you're spending by far the largest portion of your time doing the DB query(ies), which makes sense. On Tue, Feb 11, 2014 at 11:50 AM, Joel Cohen joel.co...@bluefly.com wrote: It's a custom ingestion process. It does a big DB query and then inserts stuff in batches. The batch size is tuneable. On Tue, Feb 11, 2014 at 11:23 AM, Erick Erickson erickerick...@gmail.com wrote: Hmmm, I'm still a little puzzled BTW. 300K documents, unless they're huge, shouldn't be taking 100 minutes. I can index 11M documents on my laptop (Wikipedia dump) in 45 minutes, for instance. Of course that's a single core, not cloud and not replicas... So possibly it's on the data acquisition side? Is your Solr CPU pegged? YMMV of course. Erick On Tue, Feb 11, 2014 at 6:40 AM, Joel Cohen joel.co...@bluefly.com wrote: I'd like to thank you for lending a hand on my query time problem with SolrCloud. By switching to a single-shard-with-replicas setup, I've reduced my query time to 18 msec. My full ingestion of 300k+ documents went down from 2 hours 50 minutes to 1 hour 40 minutes. There are some code changes going in that should help a bit as well. Big thanks to everyone that had suggestions. On Tue, Feb 4, 2014 at 8:11 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: I suspect faceting is the issue here. The actual query you have shown seems to bring back a single document (or a single set of documents for a product): fq=id:(320403401) On the other hand, you are asking for 4 field facets: facet.field=q_virtualCategory_ss facet.field=q_brand_s facet.field=q_color_s facet.field=q_category_ss AND 2 range facets, both clustered/grouped: facet.range=daysSinceStart_i facet.range=activePrice_l (e.g. 
f.activePrice_l.facet.range.gap=5000) And for all facets you have asked to bring back ALL of the results: facet.limit=-1 Plus, you are doing a complex sort: sort=popularity_i desc,popularity_i desc So, you are probably spending quite a bit of time counting (especially in a sharded setup) and then quite a bit more sending the response back. I would check the size of the result document (HTTP result) and see how large it is. Maybe you don't need all of the stuff that's coming back. I assume you are not actually querying Solr from the client's machine (that is, I hope it is inside your data centre close to your web server); otherwise I would say to look at automatic content compression as well to minimize on-wire document size. Finally, if your documents have many stored fields (stored="true" in schema.xml) but you only return small subsets of them during search, you could look into using the enableLazyFieldLoading flag in solrconfig. Regards, Alex. P.s. As others said, you don't seem to have too many documents. Perhaps you want replication instead of sharding for improved performance. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Wed, Feb 5, 2014 at 6:31 AM, Alexey Kozhemiakin alexey_kozhemia...@epam.com wrote: Btw, timing for distributed requests is broken at the moment; it doesn't combine values from requests to shards. I'm working on a patch. https://issues.apache.org/jira/browse/SOLR-3644 -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Tuesday, February 04, 2014 22:00 To: solr-user@lucene.apache.org Subject: Re: Lowering query time Add the debug=true parameter to some test queries and look at the timing section to see which search components are taking the time. Traditionally, highlighting for large documents was a top culprit. 
Are you returning a lot of data or field values? Sometimes reducing the amount of data processed can help. Any multivalued fields with lots of values? -- Jack Krupansky -Original Message- From: Joel Cohen Sent: Tuesday, February 4, 2014 1:43 PM To: solr-user@lucene.apache.org Subject: Re: Lowering query time 1. We are faceting. I'm not a developer so I'm not quite sure how we're doing it. How can I measure? 2. I'm not sure how we'd force this kind of document partitioning. I can see how my shards are partitioned by looking at the clusterstate.json from Zookeeper, but I don't have a clue on how to get documents into specific shards. Would I be better off with fewer shards given the small size of my indexes? On Tue, Feb 4, 2014 at 12:32 PM, Yonik Seeley
Re: Solr Autosuggest - Strange issue with leading numbers in query
Hmmm, the example you post seems correct to me; the returned suggestion is really close to the term. What are you expecting here? The example is inconsistent with "it returns the suggestion corresponding to the alphabets (ignoring the numbers)". It looks like it's considering the numbers just fine, which is what makes the returned suggestion close to the term, I think. Best, Erick On Tue, Feb 11, 2014 at 1:01 PM, Developer bbar...@gmail.com wrote: I have a strange issue with Autosuggest. Whenever I query for a keyword along with leading numbers, it returns the suggestion corresponding to the alphabets (ignoring the numbers). I was under the assumption that it would return an empty result. I am not sure what I am doing wrong. Can someone help?

*Query:*

  /autocomplete?qt=/lucid&req_type=auto_complete&spellcheck.maxCollations=10&q=12342343243242ga&spellcheck.count=10

*Result:*

  <response>
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">1</int>
    </lst>
    <lst name="spellcheck">
      <lst name="suggestions">
        <lst name="ga">
          <int name="numFound">1</int>
          <int name="startOffset">15</int>
          <int name="endOffset">17</int>
          <arr name="suggestion">
            <str>galaxy</str>
          </arr>
        </lst>
        <str name="collation">12342343243242galaxy</str>
      </lst>
    </lst>
  </response>

*My field configuration is as below:*

  <fieldType class="solr.TextField" name="textSpell_word" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_autosuggest.txt"/>
    </analyzer>
  </fieldType>

*SolrConfig.xml*

  <searchComponent class="solr.SpellCheckComponent" name="autocomplete">
    <lst name="spellchecker">
      <str name="name">autocomplete</str>
      <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
      <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
      <str name="field">autocomplete_word</str>
      <str name="storeDir">autocomplete</str>
      <str name="buildOnCommit">true</str>
      <float name="threshold">.005</float>
    </lst>
  </searchComponent>
  <requestHandler class="org.apache.solr.handler.component.SearchHandler" name="/autocomplete">
    <lst name="defaults">
      <str name="spellcheck">true</str>
      <str name="spellcheck.dictionary">autocomplete</str>
      <str name="spellcheck.collate">true</str>
      <str name="spellcheck.count">10</str>
      <str name="spellcheck.onlyMorePopular">false</str>
    </lst>
    <arr name="components">
      <str>autocomplete</str>
    </arr>
  </requestHandler>

-- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Autosuggest-Strange-issue-with-leading-numbers-in-query-tp4116751.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Indexing question on individual field update
Update and add are basically the same thing if there's an existing document. There will be some performance consequence, since you're getting the stored fields on the server as opposed to getting the full input from the external source and handing it to Solr. However, I know of at least one situation where the atomic update rate is sky-high and it works, so I wouldn't worry about it unless and until I saw a problem. Best, Erick On Tue, Feb 11, 2014 at 3:03 PM, Shawn Heisey s...@elyograg.org wrote: On 2/11/2014 2:37 PM, shamik wrote: Erick, thanks for your reply. I should have given better context. I'm currently running an incremental crawl daily on this particular source and indexing the documents. The incremental crawl looks for any change since the last crawl date, based on the document publish date. But there's no way for me to know if a document has been deleted. To ensure that, I run a full crawl on a weekend, which basically re-indexes the entire content. After the full index is over, I call a purge script, which deletes any content that is more than 24 hours old, based on the indextimestamp field. The issue with atomic update is that it doesn't alter the indextimestamp field. So even if I run a full crawl with atomic updates, the timestamp will stick to its old value. Unfortunately, I can't rely on another date field coming from the source, as they are not consistent. That translates to the fact that I can't remove stale content. One possibility is this: When you send the atomic update to Solr, include a new value for the indextimestamp field. Another option: You can write a custom update processor plugin for Solr. When the custom code is used, it will be executed on each incoming document. Depending on what it finds in the update request, it can make appropriate changes, like updating indextimestamp. You can do pretty much anything. 
http://wiki.apache.org/solr/UpdateRequestProcessor Writing an update processor in Java typically gives the best results in terms of flexibility and performance, but there is also a way to use other programming languages: http://wiki.apache.org/solr/ScriptUpdateProcessor Thanks, Shawn
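[Editor's note] The update-processor route Shawn describes could look roughly like this in solrconfig.xml. Solr ships a TimestampUpdateProcessorFactory that fills a field with NOW when the incoming document lacks it; the chain name here is invented, and whether it fires the way you need on atomic updates depends on where it sits relative to the distributed processor, so treat this as a starting sketch, not a drop-in fix:

```xml
<updateRequestProcessorChain name="set-indextimestamp">
  <!-- sets indextimestamp to NOW if the incoming document has no value for it -->
  <processor class="solr.TimestampUpdateProcessorFactory">
    <str name="fieldName">indextimestamp</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain would then be selected per request with update.chain=set-indextimestamp, or made the default on the update handler.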
Re: Solr performance on a very huge data set
Can't answer that; there are just too many variables. Here's a helpful resource: http://searchhub.org/dev/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/ Best, Erick On Tue, Feb 11, 2014 at 5:23 PM, neerajp neeraj_star2...@yahoo.com wrote: Hello Dear, I have 1000 GB of data that I want to index. ...
Re: Need feedback: Browsing and searching solr-user list emails
Hi Durgam, You are asking a hard question. Yes, the idea looks interesting as an experiment. Possibly even useful in some ways. And I love the fact that you are eating your own dogfood (running Solr). And the interface looks nice (I guess this is your hosted Nimeyo offering underneath). Yet, I am having trouble seeing it stick around long term. Here are my reasons: *) This offering feels like an inverse of StackExchange. SE is a primary source of data, and they actually get most of their search traffic from Google. This proposal has the data coming from somewhere else and is trying to add a search on top of it. *) Furthermore, the SE voting/participation is heavily gamified, and they spend a lot of time and manpower keeping the balance of that gamification vs. abuse. I think it is a lot harder to provide incentives to vote in your approach. *) There are other dogfood-eating search websites. http://search-lucene.com/ is one of them. *) There are also other mailing-list navigational websites with the gateway ability to post messages in. They suck, both in interface and in monetisation around the interface. In fact, they feel like SPAM farms similar to those republishing Wikipedia. I am not saying this is relevant to your effort directly, but it is an issue related to discovery of a good search website in the sea of bad ones. search-lucene, for example, is discoverable because it is one of the search engines on the Apache website. Even then, it took me (at least) a very long time to discover it. *) In general, discoverability is a b*tch (try to multiterm this, Solr! :-) as you need very significant traction for people to use your site before it becomes useful to more people. A bit of a catch-22. Again, SE did it by having a large audience on StackOverflow and then branching off into topics that people on SO were also interested in. And even that was an issue (see area51 for how they do it). 
You have people (who read the mailing list), but are they the people who need to search the archives? I think the mailing list is more of a 'flow' interface to most of the people. *) You have Google Analytics - did you get much traction yet? I suspect not, given the lack of replies on the mailing list. I would step back and evaluate: *) Who specifically is the target audience? I, for example, do star some posts on the mailing list because they are just so good that I will want to refer to them later. But even then, I would have no incentive right now to do it in public. Nor would I do the 3-4 steps necessary to go from an email I like to some alternative interface to find the same email again just to vote for it. And how do I find my voted emails later? Requiring an account (to track) is even harder to swallow. *) Again, who specifically is the target audience? Is it beginners? Intermediates? Advanced? What are the pain points of those different groups you are trying to solve? *) What can you offer to the first user before the voting actually works (bootstrap phase)? Pure search? Others do that already. *) How would people find your service (SEO, etc.)? *) Why are you doing it? It may not be a lot of effort to set it up, but to actually grow any crowd-sourced resource is a significant task. What does this build towards that will make it sustainable for you? And I really hope it is not page ads. *) From Nimeyo's home page, you are targeting enterprises; are you sure the offering maps to a public resource with a dynamic, transient audience in the same way? Now, if you do want to help the Solr community, that would be great. I am trying to do that in my own way and really welcome anybody trying to assist beyond their own needs. Grow the community, and so on. Here is an example of how I thought of the above issues myself: *) I just released the full list of UpdateRequestProcessor Factories ( http://www.solr-start.com/update-request-processor/4.6.1/ ). 
*) This is information that anybody can discover for themselves, but it takes a lot of searching and clicking and getting lost. I discovered that problem on my own when writing my Solr book, and it stuck with me as a problem to be solved. So I solved it (in a very basic way for this version), and I have more similar things on the way. *) My target audience, just as with my book, is people trying to skill up from beginner to intermediate. My goal is to reduce the barrier of entry to more advanced Solr knowledge. *) My SEO (we'll see if it works) is to provide information that does not exist anywhere else in one place and to be discoverable when people search for the particular names of URPs. *) I also have an incentive to keep it going (version 4.7, 4.8, other resources) because I want people to be on my mailing list for when I do the next REALLY exciting Solr project (a Github-based interactive Solr training would be a strong hint). So these resources are my bootstrapping strategy as well. Now, there are plenty of other things that can be done to assist the Solr community. Some of them would
Re: Join Scoring
Hi Anand. Solr's JOIN query, {!join}, produces constant scores. It's simpler, faster, and more memory efficient (particularly the worst-case memory use) to implement the JOIN query without scoring, so that's why. Of course, you might want it to score and pay whatever penalty is involved. For that you'll need to write a Solr QueryParser that might use Lucene's join module, which has scoring variants. I've taken this approach before. You asked a specific question about the purpose of JoinScorer when it doesn't actually score. Lucene's Query produces a Weight, which in turn produces a Scorer that is a DocIdSetIterator plus it returns a score. So Queries have to have a Scorer to match any document, even if the score is always 1. Solr does indeed have a lot of caching; that may be in play here when comparing against a quick attempt at using Lucene directly. In particular, the matching documents are likely to end up in Solr's DocumentCache. Returning stored fields that come back in search results is one of the more expensive things Lucene/Solr does. I also think you noted that the fields on documents from the from side of the query are not available to be returned in search results, just the to side. Yup; that's true. To remedy this, you might write a Solr SearchComponent that adds fields from the from side. That could be tricky to do; it would probably need to re-run the from-side query, but filtered to the matching top-N documents being returned. ~ David anand chandak wrote: Resending, if somebody can please respond. Thanks, Anand On 2/5/2014 6:26 PM, anand chandak wrote: Hi, Having a question on join scoring: why doesn't the Solr join query return the scores? Looking at the code, I see there's a JoinScorer defined in the JoinQParserPlugin class. If it's not used for scoring, where is it actually used? 
Also, to evaluate the performance of the Solr join plugin vs. Lucene's JoinUtil, I fired the same join query against the same data set and the same schema, and in the results I am always seeing the QTime for Solr much lower than Lucene's. What is the reason behind this? Solr doesn't return scores; could that cause so much difference? My guess is Solr has a very sophisticated caching mechanism and that might be coming into play - is that true? Or is there a difference in the way the JOIN happens in the two approaches? If I understand correctly, both implementations use a two-pass approach - first collecting all the terms from the fromField, and then returning all documents that have matching terms in the toField. If somebody can throw some light, I would highly appreciate it. Thanks, Anand - Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/Join-Scoring-tp4115539p4116818.html Sent from the Solr - User mailing list archive at Nabble.com.
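[Editor's note] For reference, the scoring variants David mentions live in Lucene's join module (org.apache.lucene.search.join.JoinUtil). A custom QParser wrapping it would do roughly the following - pseudocode sketch against the Lucene 4.x API, untested; the variable names are made up:

```
// inside a custom QParserPlugin's parse():
//   fromQuery = the user's query on the "from" side
//   searcher  = the core's IndexSearcher
joinQuery = JoinUtil.createJoinQuery(
    "fromField",     // join field on the matched ("from") documents
    false,           // multipleValuesPerDocument
    "toField",       // join field on the documents to return
    fromQuery,
    searcher,
    ScoreMode.Max)   // Max / Avg / Total give real scores; None is constant
return joinQuery
```

The ScoreMode argument is what {!join} effectively lacks: it controls how the from-side hit scores are aggregated onto the to-side documents.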
Re: Join Scoring
Thanks David, really helpful response. You mentioned that if we have to add scoring support in Solr, a possible approach would be a custom QueryParser, perhaps using Lucene's join module. Curious: is it possible instead to enhance Solr's existing JoinQParserPlugin and add the scoring support in the same class? Do you think that is feasible and recommended? If yes, what would it take in terms of code changes, any pointers? Thanks, Anand On 2/12/2014 10:31 AM, David Smiley (@MITRE.org) wrote: Hi Anand. Solr's JOIN query, {!join}, produces constant scores. ...
Re: Spatial Score by overlap area
Hi, BBoxStrategy is still only in "trunk" (not the 4x branch). And furthermore, the Solr portion, a FieldType, is over in Spatial-Solr-Sandbox: https://github.com/ryantxu/spatial-solr-sandbox/blob/master/LSE/src/main/java/org/apache/solr/spatial/pending/BBoxFieldType.java It should be quite easy to port to 4x and put independently into a JAR file plug-in to Solr 4. It's lacking better tests, and until your question I hadn't seen interest from users. Ryan McKinley ported it from GeoServer. ~ David On 2/10/14, 12:53 AM, geoport tb.rost...@gmail.com wrote: Hi, I am using Solr 4.6 and I've indexed bounding boxes. Now I want to test the area-overlap sorting described at http://de.slideshare.net/lucenerevolution/lucene-solr-4-spatial-extended-deep-dive (slide 23). Does anyone have an example for me? Thanks for helping me. -- View this message in context: http://lucene.472066.n3.nabble.com/Spatial-Score-by-overlap-area-tp4116439.html Sent from the Solr - User mailing list archive at Nabble.com.
Unable to index mysql table
Hi, I downloaded Solr and, without any changes to the directory structure, I just followed the Solr wiki and tried to import a MySQL table, but was unable to do so. Actually, I'm using the directory as-is in the example folder, but copied the contrib jar files and lib tags here and there where required. Please help me index my MySQL table. NOTE: I'm using a remote Linux server via ssh and am able to start the Solr server. --- Regards *Tarun Sharma*
Re: Unable to index mysql table
What does "unable to do" actually translate to? Are you having trouble writing a particular config file? Are you getting an error message? Are you getting only some of the data in? Tell us exactly where you are stuck. Better, google first for exactly what you are stuck on; maybe it's already been answered. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Wed, Feb 12, 2014 at 12:52 PM, Tarun Sharma tarunsharma1...@gmail.com wrote: Hi, I downloaded Solr and, without any changes to the directory structure, I just followed the Solr wiki and tried to import a MySQL table, but was unable to do so. ...
Re: Indexing question on individual field update
Thanks Erick and Shawn, appreciate your help. -- View this message in context: http://lucene.472066.n3.nabble.com/Indexing-question-on-individual-field-update-tp4116605p4116831.html Sent from the Solr - User mailing list archive at Nabble.com.