Re: issue with highlighting in solr 4.10.2
Hi Erick, The Contents field contains one sentence only and no watch exists in it. Plus we use a quite large snippet size that surely covers the whole field. Dmitry On Sat, Jun 27, 2015 at 6:16 PM, Erick Erickson erickerick...@gmail.com wrote: Does watch exist in the Contents field somewhere outside the snippet size you've specified? Shot in the dark, Erick On Fri, Jun 26, 2015 at 3:22 AM, Dmitry Kan solrexp...@gmail.com wrote: Hi, When highlighting hits for the following query: (+Contents:apple +Contents:watch) Contents:iphone I expect the standard solr highlighter to highlight either iphone alone, or iphone AND apple only if watch is present. However, solr highlights iphone together with apple alone. Is this a bug or a known feature? Is there any way to debug the highlighter using solr admin? -- Dmitry Kan Luke Toolbox: http://github.com/DmitryKey/luke Blog: http://dmitrykan.blogspot.com Twitter: http://twitter.com/dmitrykan SemanticAnalyzer: www.semanticanalyzer.info
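Not an answer to the highlighting question itself, but one way to poke at the highlighter from curl or the admin UI is to split the highlight query off from the main query with the hl.q parameter, so each clause can be tested in isolation. A rough sketch; the host and core name "mycore" are assumptions, and this only builds/echoes the URL (the curl line needs a running Solr):

```shell
# Run the original boolean query, but highlight only against an explicit
# hl.q clause, so you can see which clause drives the highlight marks.
# Host and core name "mycore" are assumptions; adjust to your setup.
HOST="http://localhost:8983/solr/mycore"
Q='(%2BContents:apple%20%2BContents:watch)%20Contents:iphone'
URL="$HOST/select?q=$Q&hl=true&hl.fl=Contents&hl.q=Contents:iphone&debugQuery=true"
echo "$URL"
# curl "$URL"   # requires a running Solr instance
```

Swapping different clauses into hl.q (e.g. Contents:apple) should show whether the highlighter or the main query parsing is at fault.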
SolrCloud Document Update Problem
Hi, I set up a SolrCloud with 2 shards, each having 2 replicas, with a 3-node ZooKeeper ensemble. We add and update documents from a web app. While updating, we delete the document and add the same document with updated values under the same unique id. I am facing a very strange issue: sometimes 2 documents have the same unique ID, one document with the old values and another one with the new values. It happens only when we update the document. Please suggest or guide... Rgds
Re: Reading indexed data from solr 5.1.0 using admin/luke?
Not quite sure what you mean by compressed values. admin/luke doesn't show the results of the compression of the stored values; there's no way I know of to do that. Best, Erick On Mon, Jun 29, 2015 at 8:20 AM, dinesh naik dineshkumarn...@gmail.com wrote: Hi all, Is there a way to read the indexed data for a field on which the analysis/processing has been done? I know using the admin GUI we can see field-wise analysis, but how can I get hold of the complete document using admin/luke, or any other way?
Re: optimize status
Steven: Yes, but first, here's Mike McCandless's excellent blog on segment merging: http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html I think the third animation is the TieredMergePolicy. In short, yes, an optimize will reclaim disk space. But as you update, this is done for you anyway. About the only time optimizing is at all beneficial is when you have a relatively static index. If you're continually updating documents, and by that I mean replacing some existing documents, then you'll immediately start generating holes in your index. And if you _do_ optimize, you wind up with a huge segment. And since the default policy tries to merge segments of roughly the same size, it accumulates deletes for quite a while before they're merged away. And if you don't update existing docs or delete docs, then there's no wasted space anyway. Summer: First off, why do you care about not updating during optimizing? There's no good reason you have to worry about that; you can freely update while optimizing. But frankly I have to agree with Upayavira that on the face of it you're doing a lot of extra work. See above: you optimize while indexing, so immediately you're rather defeating the purpose. Personally I'd only optimize relatively static indexes and, by definition, your index isn't static since the second process is just waiting to modify it. Best, Erick On Mon, Jun 29, 2015 at 8:15 AM, Steven White swhite4...@gmail.com wrote: Hi Upayavira, This is news to me that we should not optimize an index. What about disk space saving: isn't optimization meant to reclaim disk space, or does Solr somehow do that? Where can I read more about this? I'm on Solr 5.1.0 (may switch to 5.2.1) Thanks Steve On Mon, Jun 29, 2015 at 4:16 AM, Upayavira u...@odoko.co.uk wrote: I'm afraid I don't understand. You're saying that optimising is causing performance issues? Simple solution: DO NOT OPTIMIZE! Optimisation is very badly named. What it does is squash all segments in your index into one segment, removing all deleted documents. It is good to get rid of deletes - in that sense the index is optimized. However, future merges become very expensive. The best way to handle this topic is to leave it to Lucene/Solr to do it for you. Pretend the optimize option never existed. This is, of course, assuming you are using something like Solr 3.5+. Upayavira On Mon, Jun 29, 2015, at 08:08 AM, Summer Shire wrote: Have to, because of performance issues. Just want to know if there is a way to tap into the status. On Jun 28, 2015, at 11:37 PM, Upayavira u...@odoko.co.uk wrote: Bigger question, why are you optimizing? Since 3.6 or so, it generally hasn't been required and may even be a bad thing. Upayavira On Sun, Jun 28, 2015, at 09:37 PM, Summer Shire wrote: Hi All, I have two indexers (independent processes) writing to a common solr core. If one indexer process issues an optimize on the core, I want the second indexer to wait before adding docs until the optimize has finished. Are there ways I can do this programmatically? Pinging the core while the optimize is happening returns OK, because technically solr allows you to update while an optimize is happening. Any suggestions? thanks, Summer
RE: Correcting text at index time
Hi Markus Thanks for the reply. I'm already using the Synonyms filter and it is working fine (i.e., when I search for customer, it also returns documents containing cst.). What the synonyms filter does not do is to actually replace the word cst. with customer in the document. Just to be clearer: in the returned results, I do not want to see the word cst. any more (it should be permanently replaced with customer). I want to only see the expanded form. Cheers A. -- View this message in context: http://lucene.472066.n3.nabble.com/Correcting-text-at-index-time-tp4214636p4214643.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SolrCloud Document Update Problem
On Mon, Jun 29, 2015 at 4:37 PM, Amit Jha shanuu@gmail.com wrote: Hi, I set up a SolrCloud with 2 shards, each having 2 replicas, with a 3-node ZooKeeper ensemble. We add and update documents from a web app. While updating, we delete the document and add the same document with updated values under the same unique id. I am not sure why you delete the document. If you use the same unique key and send the whole document again (with some other fields changed), Solr will automatically overwrite the old document with the new one. I am facing a very strange issue: sometimes 2 documents have the same unique ID, one document with the old values and another one with the new values. It happens only when we update the document. Please suggest or guide... Rgds -- Regards, Shalin Shekhar Mangar.
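Shalin's point in practice: re-sending the whole document under the same uniqueKey replaces the old version in a single operation, with no separate delete step. A rough sketch; the collection name "mycollection" and field names are invented, and the curl line is commented out since it needs a running Solr:

```shell
# Same id as the existing doc: Solr treats this as an overwrite, not an
# add, so no delete-then-add window exists where both versions coexist.
# Collection name and field names are invented for illustration.
DOC='[{"id":"doc-42","title_s":"updated title","price_f":9.99}]'
echo "$DOC"
# curl -H 'Content-Type: application/json' \
#   "http://localhost:8983/solr/mycollection/update?commit=true" -d "$DOC"
```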
Architectural advice questions on using Solr XML DataImport Handlers (and Nutch) for a Vertical Search engine.
Please bear with me here, I'm pretty new to Solr with most of my DB experience being of the relational variety. I'm planning a new project, which I believe Solr (and Nutch) will solve well. Although I've installed Solr 5.2 and Nutch 1.10 (on Centos) and tinkered about a bit, I'd be grateful for advice and tips regarding my plan. I'm looking to build a vertical search engine to cover a very specific and narrow dataset. Sources will number in the hundreds and be mostly managed by hand; these will be a mixture of forums and product-based e-commerce sites. For some of these I was hoping to leverage the Solr DataImportHandler system with their RSS feeds, primarily for the ease of acquiring clean, reasonably sanitised and well structured data. For the rest, I'm going to fall back to Nutch crawling them, with some heavy regulation via regex of URLs. So to sum up: a Solr DB populated through a couple of different ways, then searched via some custom user-facing PHP webpages. Finally, a cronjob script would delete any docs older than X weeks, to keep on top of data retention. Does that sound sensible at all? Regarding RSS feeds: many only provide a limited number of recent items; however, I'd like to retain items for many weeks. I've already discovered the clean=false param on DataImport, after wondering why old RSS items vanished! Question 1) is there an easy way to filter items to import in the URLDataSource entity? Or is it best to go down the route of XSLT preprocessing? Question 2) Multiple URLDataSources: reference all in one DataImport handler? Or have multiple DataImport handlers? What's the best approach to supplement imported data with additional static fields/keywords associated with the source feed or crawled site? e.g. all docs from sites A, B, C are of subcategory Foo. I'm guessing with RSS feeds this would be straightforward via the XSLT preprocessor. But for Nutch-submitted docs - I've no idea?
Scheduling import: Do people just cron up a curl post request (or shell execute of Nutch crawl script)? Or is there a more elegant solution available? Any other more general tips and advice on the above greatly appreciated. -- Arthur Yarwood
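On the scheduling question, the usual low-tech answer is exactly that: a cron entry hitting the DataImportHandler HTTP API. A sketch, assuming a core named "mycore" and the default /dataimport handler path (both assumptions); the script only builds and prints the crontab line:

```shell
# Kick off an import every 30 minutes via cron. clean=false keeps
# previously imported docs (e.g. RSS items that have since dropped
# out of the feed). Core name and handler path are assumptions.
CMD='curl -s "http://localhost:8983/solr/mycore/dataimport?command=full-import&clean=false&commit=true"'
CRON_LINE="*/30 * * * * $CMD"
echo "$CRON_LINE"
```

The same pattern works for a Nutch crawl: a cron entry that runs the crawl script, then posts to Solr.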
set the param [facet.offset] for EVERY [facet.pivot]
Hi All: I need pagination with a facet offset. There are two or more fields in facet.pivot, but only one value for facet.offset, e.g.: facet.offset=10&facet.pivot=field_1,field_2. In this condition, both field_1 and field_2 get an offset of 10. But what I want is an offset of 1 for field_2 and an offset of 10 for field_1. How can I fix this problem, or is there another way to accomplish it? Any help is appreciated!
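Facet parameters can generally be overridden per field with the f.&lt;fieldname&gt;.&lt;param&gt; syntax; whether facet.pivot honors a per-field facet.offset at every pivot level depends on the Solr version, so treat this as something to verify against your release rather than a guaranteed fix. A sketch of the request parameters:

```shell
# Per-field overrides: offset 10 for field_1, offset 1 for field_2.
# Whether pivot faceting honors f.<field>.facet.offset at each level
# depends on the Solr version -- verify against your release.
PARAMS="facet=true&facet.pivot=field_1,field_2"
PARAMS="$PARAMS&f.field_1.facet.offset=10&f.field_2.facet.offset=1"
echo "$PARAMS"
# curl "http://localhost:8983/solr/mycore/select?q=*:*&rows=0&$PARAMS"
```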
Reading indexed data from solr 5.1.0 using admin/luke?
Hi all, Is there a way to read the indexed data for a field on which the analysis/processing has been done? I know using the admin GUI we can see field-wise analysis, but how can I get hold of the complete document using admin/luke, or any other way? For example, say I have 2 fields called name and compressedname. name has values like apple, green-apple, red-apple; compressedname has values like apple, greenapple, redapple. Even though I make both these fields indexed=true and stored=true, I am not able to see the compressed values using admin/luke?id=mydocid. In the response I see something like this:

<lst name="name">
  <str name="type">string</str>
  <str name="schema">ITS--</str>
  <str name="flags">ITS--</str>
  <str name="value">GREEN-APPLE</str>
  <str name="internal">GREEN-APPLE</str>
  <float name="boost">1.0</float>
  <int name="docFreq">0</int>
</lst>
<lst name="compressedname">
  <str name="type">string</str>
  <str name="schema">ITS--</str>
  <str name="flags">ITS--</str>
  <str name="value">GREEN-APPLE</str>
  <str name="internal">GREEN-APPLE</str>
  <float name="boost">1.0</float>
  <int name="docFreq">0</int>
</lst>

-- Best Regards, Dinesh Naik
Correcting text at index time
Hi everyone I'm wondering if it's possible in Solr to correct text at indexing time, based on a synonyms-like list. This would be great for expanding undesirable abbreviations (for example, cst. instead of customer). I've been searching the Solr docs and the web quite thoroughly I believe, but haven't found anything to do this. I guess if there really isn't anything like this, I could implement it as a custom Filter... Thanks! A. -- View this message in context: http://lucene.472066.n3.nabble.com/Correcting-text-at-index-time-tp4214636.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Correcting text at index time
Hello - why not just use synonyms or StemmerOverrideFilter? Markus -Original message- From:hossmaa andreea.hossm...@gmail.com Sent: Monday 29th June 2015 14:08 To: solr-user@lucene.apache.org Subject: Correcting text at index time Hi everyone I'm wondering if it's possible in Solr to correct text at indexing time, based on a synonyms-like list. This would be great for expanding undesirable abbreviations (for example, cst. instead of customer). I've been searching the Solr docs and the web quite thoroughly I believe, but haven't found anything to do this. I guess if there really isn't anything like this, I could implement it as a custom Filter... Thanks! A. -- View this message in context: http://lucene.472066.n3.nabble.com/Correcting-text-at-index-time-tp4214636.html Sent from the Solr - User mailing list archive at Nabble.com.
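If the goal really is to change the stored value itself (so cst. never appears in results), one option is an update request processor that rewrites the field before the document is stored and indexed. A rough sketch using the stock RegexReplaceProcessorFactory; the chain name and the fieldRegex are assumptions, and a long abbreviation list would need one processor entry per abbreviation (or a small custom processor):

```xml
<!-- solrconfig.xml: rewrite "cst." to "customer" before the doc is stored.
     Chain name and fieldRegex are assumptions; adjust to your schema. -->
<updateRequestProcessorChain name="expand-abbreviations" default="true">
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldRegex">.*_txt</str>
    <str name="pattern">\bcst\.</str>
    <str name="replacement">customer</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

Unlike a synonym filter, this changes what is stored, so the returned documents show only the expanded form. It also means the original text is gone unless you keep a copy in another field.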
Re: optimize status
Hi Upayavira, This is news to me that we should not optimize an index. What about disk space saving: isn't optimization meant to reclaim disk space, or does Solr somehow do that? Where can I read more about this? I'm on Solr 5.1.0 (may switch to 5.2.1) Thanks Steve On Mon, Jun 29, 2015 at 4:16 AM, Upayavira u...@odoko.co.uk wrote: I'm afraid I don't understand. You're saying that optimising is causing performance issues? Simple solution: DO NOT OPTIMIZE! Optimisation is very badly named.
Re: SolrCloud Document Update Problem
It was because of the issues. Rgds AJ On Jun 29, 2015, at 6:52 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: I am not sure why you delete the document. If you use the same unique key and send the whole document again (with some other fields changed), Solr will automatically overwrite the old document with the new one. -- Regards, Shalin Shekhar Mangar.
Jetty Plus for Solr 4.10.4
We are planning to go to production with Solr 4.10.4. The documentation recommends using the full Jetty package that includes JettyPlus. I'm not able to find instructions for doing this. Can someone point me in the right direction? Thanks, Magesh
Re: optimize status
“Optimize” is a manual full merge. Solr automatically merges segments as needed. This also expunges deleted documents. We really need to rename “optimize” to “force merge”. Is there a Jira for that? wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On Jun 29, 2015, at 5:15 AM, Steven White swhite4...@gmail.com wrote: Hi Upayavira, This is news to me that we should not optimize an index. What about disk space saving: isn't optimization meant to reclaim disk space, or does Solr somehow do that? Where can I read more about this? I'm on Solr 5.1.0 (may switch to 5.2.1) Thanks Steve
cursorMark and timeAllowed are mutually exclusive?
Hi list, while just trying cursorMark I got the following search response: error: { msg: Can not search using both cursorMark and timeAllowed, code: 400 } Yes, I'm using timeAllowed, which is set in my requestHandler as an invariant to 60000 (60 seconds) as a limit on killer searches. I have found nothing in the ref guide, docs, wiki, or examples about these mutually exclusive parameters. Is this a bug or a feature, and if it is a feature, what is the sense of it? Regards Bernd
Re: cursorMark and timeAllowed are mutually exclusive?
On 6/29/2015 9:12 AM, Bernd Fehling wrote: while just trying cursorMark I got the following search response: error: { msg: Can not search using both cursorMark and timeAllowed, code: 400 } Yes, I'm using timeAllowed, which is set in my requestHandler as an invariant to 60000 (60 seconds) as a limit on killer searches. I have found nothing in the ref guide, docs, wiki, or examples about these mutually exclusive parameters. Is this a bug or a feature, and if it is a feature, what is the sense of it? It appears to have been disallowed almost from the beginning of the cursorMark feature. It was not present in the first versions of the patch, but it was already incorporated before anything got committed to SVN. https://issues.apache.org/jira/browse/SOLR-5463 The reasons for the incompatibility are not clear from the issue notes, so either hossman or sarowe may need to comment about what makes the two features fundamentally incompatible, and that info needs to go into the documentation. Thanks, Shawn
RE: optimize status
Is there really a good reason to consolidate down to a single segment? Any incremental query performance benefit is tiny compared to the loss of manageability. I.e., shouldn't segments _always_ be kept small enough to facilitate re-balancing data across shards? Even in non-cloud instances this is true. When a collection grows, you may want to shard/split an existing index by adding a node and moving some segments around. Isn't this the direction Solr is going? With many, smaller segments, this is feasible. With one big segment, the collection must always be reindexed. Thus, optimize would mean "get rid of all deleted records" and would, in fact, optimize queries by eliminating wasted I/O. Perhaps worth it for slowly changing indexes. Seems like the Tiered merge policy is 90% there ... Or am I all wet (again)? -Original Message- From: Walter Underwood [mailto:wun...@wunderwood.org] Sent: Monday, June 29, 2015 10:39 AM To: solr-user@lucene.apache.org Subject: Re: optimize status Optimize is a manual full merge. Solr automatically merges segments as needed. This also expunges deleted documents. We really need to rename optimize to force merge. Is there a Jira for that? wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) * This e-mail may contain confidential or privileged information. If you are not the intended recipient, please notify the sender immediately and then delete it. TIAA-CREF *
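One practical middle ground for the single-segment concern: optimize accepts a maxSegments parameter, so you can merge down to a handful of segments instead of one, and expungeDeletes on a commit is a lighter option that only rewrites segments containing deletions. A sketch; the core name is an assumption and the curl lines are commented out since they need a running Solr:

```shell
# Merge down to at most 8 segments instead of 1, which keeps segments
# small enough to move around while still reclaiming most deleted docs.
# expungeDeletes is lighter still. Core name is an assumption.
URL="http://localhost:8983/solr/mycore/update?optimize=true&maxSegments=8"
ALT="http://localhost:8983/solr/mycore/update?commit=true&expungeDeletes=true"
echo "$URL"
echo "$ALT"
# curl "$URL"   # requires a running Solr instance
```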
Re: Jetty Plus for Solr 4.10.4
On 6/29/2015 8:44 AM, Tarala, Magesh wrote: We are planning to go to production with Solr 4.10.4. Documentation recommends to use full Jetty package that includes JettyPlus. I'm not able to find the instructions to do this. Can someone point me in the right direction? I found the official page that talks about JettyPlus. https://wiki.apache.org/solr/SolrJetty Note at the top of the page where it says that info is outdated for Jetty 8. Solr has been using Jetty 8 since version 4.0-ALPHA -- for nearly three years now. Typical use cases for Solr do *not* require a full Jetty install. Even most non-typical use cases do not require it. Solr 4.10 includes the bin/solr script for startup, which runs the Jetty that's included in the Solr download. Solr 5.x makes those scripts even better. If you haven't made it to production yet, you should probably consider upgrading to Solr 5.2.1. If you are not going to use the Jetty included with Solr, then you're pretty much on your own. You can take the war file from the dist directory, the logging jars from the example/lib/ext directory, and the logging config from example/resources, and install it in most of the available servlet containers. Starting with 5.0, the included Jetty is the only officially supported way to start Solr, and the war is no longer included in the dist directory in the download. https://wiki.apache.org/solr/WhyNoWar Thanks, Shawn
Re: cursorMark and timeAllowed are mutually exclusive?
: Have nothing found in the ref guides, docs, wiki, examples about this mutually : exclusive parameters. : : Is this a bug or a feature and if it is a feature, where is the sense of this? The problem is that the timeAllowed code doesn't really know about the cursorMark code, so if a timeAllowed-exceeded situation pops up, you might not get a nextCursorMark in your response -- or the one you get might be wrong, and could trigger infinite looping -- which I considered unacceptable. If you ask for a cursorMark, you get a cursor mark; if you ask for a cursor mark and include other options that make it impossible for us to do that, it's an error. With a bit of work, both could probably be supported in combination -- but for now it's untested, and thus unsupported, so we put in that error message to make it clear and to guard end users against the risk of nonsensical results. Yes, I'm using timeAllowed which is set in my requestHandler as invariant to 60000 (60 seconds) as a limit on killer searches. Your best bet is probably to confine your cursorMark searches to an alternate request handler, not used by your normal arbitrary queries, that doesn't have the timeAllowed invariant. -Hoss http://www.lucidworks.com/
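Following Hoss's suggestion of an alternate request handler, the configuration could look roughly like this; the handler name is invented, and the only point is that it carries no timeAllowed invariant while the normal /select keeps its limit:

```xml
<!-- solrconfig.xml: a separate handler used only for cursorMark deep
     paging. Handler name is invented; note there is NO timeAllowed
     invariant here, since cursorMark and timeAllowed cannot be combined. -->
<requestHandler name="/cursor" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="rows">100</str>
  </lst>
</requestHandler>
```

Arbitrary user queries keep going to the handler with the timeAllowed invariant; only controlled export-style jobs use /cursor.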
Questions regarding autosuggest (Solr 5.2.1)
A friend and I are trying to develop some software using Solr in the background, and with that come a lot of changes. We're used to older versions (4.3 and below). We especially have problems with the autosuggest feature. This is the field definition (schema.xml) for our autosuggest field:

<field name="autosuggest" type="autosuggest" indexed="true" stored="true" required="false" multiValued="true"/>
...
<copyField source="name" dest="autosuggest"/>
...
<fieldType name="autosuggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="0" catenateAll="0" preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3" maxSubwordSize="30" onlyLongestMatch="false"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="30"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="0" catenateAll="0" preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Afterwards, we defined an autosuggest component to use this field, like this (solrconfig.xml):

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="storeDir">suggester_fuzzy_dir</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">suggest</str>
    <str name="suggestAnalyzerFieldType">autosuggest</str>
    <str name="buildOnStartup">false</str>
    <str name="buildOnCommit">false</str>
  </lst>
</searchComponent>

And added a request handler to test out the functionality:

<requestHandler name="/suggesthandler" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
    <str name="suggest.dictionary">mySuggester</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

However, trying to start the core that has this configuration, a long exception occurs, telling us this: Error in configuration: autosuggest is not defined in the schema. Now, that seems to be wrong. Any idea how to fix that?
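Hard to say for certain without the full stack trace, but if I'm reading the pasted config right, one mismatch stands out: the suggester's field parameter points at a field named suggest, while the schema defines the field as autosuggest. A corrected fragment might look like this, assuming autosuggest really is the field the suggestions should come from:

```xml
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="storeDir">suggester_fuzzy_dir</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <!-- was "suggest" -- must match the field name defined in schema.xml -->
    <str name="field">autosuggest</str>
    <str name="suggestAnalyzerFieldType">autosuggest</str>
    <str name="buildOnStartup">false</str>
    <str name="buildOnCommit">false</str>
  </lst>
</searchComponent>
```

DocumentDictionaryFactory also requires the source field to be stored, which autosuggest is, so after fixing the name the suggester should be able to build.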
RE: Jetty Plus for Solr 4.10.4
Hi Shawn - Thank you for the quick and detailed response!! Good to hear that the Jetty 8 installation bundled with Solr does not need to be modified for typical uses. I believe what we have is a typical use case. We will be installing Solr on 3 nodes in our Hadoop cluster and will use Hadoop's ZooKeeper: one collection with 3 shards and 2 replicas each. We have not benchmarked performance, so we may need more shards, nodes, ... Data volume and user volumes are not very high, but we are using a nested document structure and are concerned that it may introduce performance issues. Will check it out. Regarding your recommendation to upgrade to Solr 5.2.1: we have Hortonworks HDP 2.2 in place and they support 4.10. Will revisit the decision. Thanks, Magesh -Original Message- From: Shawn Heisey [mailto:apa...@elyograg.org] Sent: Monday, June 29, 2015 11:50 AM To: solr-user@lucene.apache.org Subject: Re: Jetty Plus for Solr 4.10.4 On 6/29/2015 8:44 AM, Tarala, Magesh wrote: We are planning to go to production with Solr 4.10.4. Documentation recommends to use full Jetty package that includes JettyPlus. I'm not able to find the instructions to do this. Can someone point me in the right direction? I found the official page that talks about JettyPlus. https://wiki.apache.org/solr/SolrJetty
RE: Reading indexed data from solr 5.1.0 using admin/luke?
Hi Erick, By compressed value I meant the value of a field after removing special characters. In my example it's "-": the compressed form of red-apple is redapple. I wanted to know if we can see the analyzed version of fields. For example, if I use ngram on a field, how do I see the analyzed values in the index? -Original Message- From: Erick Erickson erickerick...@gmail.com Sent: 29-06-2015 18:12 To: solr-user@lucene.apache.org solr-user@lucene.apache.org Subject: Re: Reading indexed data from solr 5.1.0 using admin/luke? Not quite sure what you mean by compressed values. admin/luke doesn't show the results of the compression of the stored values, there's no way I know of to do that. Best, Erick
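Two ways to look at what actually got indexed, as opposed to the stored value (which is always the raw input that admin/luke shows): the Analysis screen in the admin UI shows the token stream for pasted text, and the TermsComponent lists the indexed terms of a field. A sketch of the latter; the core name is an assumption, and /terms is wired up in the stock example configs:

```shell
# List the first 20 indexed terms of the field -- these are the
# post-analysis tokens (e.g. "redapple"), not the stored values.
# Core name "mycore" is an assumption.
URL="http://localhost:8983/solr/mycore/terms?terms.fl=compressedname&terms.limit=20&wt=json"
echo "$URL"
# curl "$URL"   # requires a running Solr instance
```

Note this lists terms across the whole index; it won't tie a term back to a single document the way admin/luke?id=... does for stored values.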
Re: optimize status
Thank you guys, this was very helpful. I was always under the impression that the index needs to be optimized periodically to reclaim disk space, otherwise it will just keep growing and growing (was that the case in Lucene 2.x and prior days?). I agree with Walter: renaming optimize to something else, even “force merge”, is better. However, make sure it has proper documentation explaining what it does and why it's not worthwhile for live data. Steve On Mon, Jun 29, 2015 at 12:37 PM, Reitzel, Charles charles.reit...@tiaa-cref.org wrote: Is there really a good reason to consolidate down to a single segment? Any incremental query performance benefit is tiny compared to the loss of manageability. I.e. shouldn't segments _always_ be kept small enough to facilitate re-balancing data across shards? Even in non-cloud instances this is true. When a collection grows, you may want to shard/split an existing index by adding a node and moving some segments around. Isn't this the direction Solr is going? With many, smaller segments, this is feasible. With one big segment, the collection must always be reindexed. Thus, optimize would mean get rid of all deleted records and would, in fact, optimize queries by eliminating wasted I/O. Perhaps worth it for slowly changing indexes. Seems like the Tiered merge policy is 90% there... Or am I all wet (again)? -Original Message- From: Walter Underwood [mailto:wun...@wunderwood.org] Sent: Monday, June 29, 2015 10:39 AM To: solr-user@lucene.apache.org Subject: Re: optimize status Optimize is a manual full merge. Solr automatically merges segments as needed. This also expunges deleted documents. We really need to rename optimize to force merge. Is there a Jira for that? wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On Jun 29, 2015, at 5:15 AM, Steven White swhite4...@gmail.com wrote: Hi Upayavira, This is news to me that we should not optimize an index. 
What about disk space saving: isn't optimization needed to reclaim disk space, or does Solr somehow do that on its own? Where can I read more about this? I'm on Solr 5.1.0 (may switch to 5.2.1) Thanks Steve On Mon, Jun 29, 2015 at 4:16 AM, Upayavira u...@odoko.co.uk wrote: I'm afraid I don't understand. You're saying that optimising is causing performance issues? Simple solution: DO NOT OPTIMIZE! Optimisation is very badly named. What it does is squash all segments in your index into one segment, removing all deleted documents. It is good to get rid of deletes - in that sense the index is optimized. However, future merges become very expensive. The best way to handle this topic is to leave it to Lucene/Solr to do it for you. Pretend the optimize option never existed. This is, of course, assuming you are using something like Solr 3.5+. Upayavira On Mon, Jun 29, 2015, at 08:08 AM, Summer Shire wrote: Have to because of performance issues. Just want to know if there is a way to tap into the status. On Jun 28, 2015, at 11:37 PM, Upayavira u...@odoko.co.uk wrote: Bigger question: why are you optimizing? Since 3.6 or so, it generally hasn't been required; indeed, it can be a bad thing. Upayavira On Sun, Jun 28, 2015, at 09:37 PM, Summer Shire wrote: Hi All, I have two indexers (independent processes) writing to a common solr core. If one indexer process issues an optimize on the core, I want the second indexer to wait adding docs until the optimize has finished. Are there ways I can do this programmatically? Pinging the core when the optimize is happening returns OK because technically solr allows you to update while an optimize is happening. Any suggestions? thanks, Summer * This e-mail may contain confidential or privileged information. If you are not the intended recipient, please notify the sender immediately and then delete it. TIAA-CREF *
solr suggester build issues
Solr: 4.9.x, with simple solr cloud on jetty. JDK 1.7 num of replicas: 4, one replica for each shard num of shards: 1 Hi All, I have been facing the below issues with the solr suggester introduced in 4.7.x. Does anyone have a good working solution? The buildOnCommit=true property is recommended against for indexes with more frequent soft commits, as noted in the documentation https://cwiki.apache.org/confluence/display/solr/Suggester So we disabled it (buildOnCommit=false) and started using buildOnOptimize=true, which did not help us have the latest document suggestions (with frequent soft commits), as there was hardly one optimize each day. (we have the default optimize settings in solrconfig) So we disabled buildOnOptimize as well (buildOnOptimize=false). As of now, we came up with cron jobs to build the suggester every hour. These jobs are doing their job, i.e., we have the latest suggestions available every hour; below are the issues we have with this implementation. Issue #1: The suggest build URL, i.e., http://$solrnode:8983/solr/collection1/suggest?suggest.build=true, if issued to one replica of solr cloud, does not build the suggesters in all of the replicas in solrcloud. Resolution: we set up separate cron jobs on each solr instance making the build call; below is a rough pictorial representation of this implementation (which is not the best one and has many flaws):

http://$solrnode:8983/solr/collection1/suggest?suggest.build=true
  |
  |-- suggestcron.job.sh (on solr1.aws.instance)
http://$solrnode:8983/solr/collection1/suggest?suggest.build=true
  |
  |-- suggestcron.job.sh (on solr2.aws.instance)
  ..
similar for other solr nodes. We will come up with a single script to do this for all collections later. We were a bit happy that we had an updated suggester on all of the instances, which we do not! Issue #2: the suggesters built on the solr nodes were not consistent, as the solr core in each replica differs in max-docs and num-docs (which is quite normal with frequent soft commits when updates mostly touch the same documents with different data, I guess; correct me if I'm wrong). When we query

curl -i "http://$solrnode:8983/solr/liveaodfuture/suggest?q=Nirvana&wt=json&indent=true"

one of the solr nodes returns

{
  "responseHeader":{"status":0,"QTime":0},
  "suggest":{
    "AnalyzingSuggester":{
      "Nirvana":{
        "numFound":1,
        "suggestions":[{"term":"nirvana","weight":6,"payload":""}]}},
    "DictionarySuggester":{
      "Nirvana":{
        "numFound":0,
        "suggestions":[]}}}}

and the /admin/luke call for the collection reports

"index":{"numDocs":90564,"maxDoc":94583,"deletedDocs":4019, ...}

while the other 3 solr nodes return

{
  "responseHeader":{"status":0,"QTime":1},
  "suggest":{
    "AnalyzingSuggester":{
      "Nirvana":{
        "numFound":2,
        "suggestions":[{"term":"nirvana","weight":163,"payload":""},
                       {"term":"nirvana cover","weight":11,"payload":""}]}},
    "DictionarySuggester":{
      "Nirvana":{
        "numFound":0,
        "suggestions":[]}}}}

and the /admin/luke call on those 3 nodes shows a different maxDoc than the node above:

"index":{"numDocs":90564,"maxDoc":156760, ...}

When I check the build time for the suggest directory of the collection, each solr node has the same time:

ls -lah /mnt/solrdrive/solr/cores/*/data/suggest_analyzing/*
-rw-r--r-- 1 root root 3.0M May 20 16:00 /mnt/solrdrive/solr/cores/collection1_shard1_replica3/data/suggest_analyzing/wfsta.bin

Questions: Does the suggest build URL, i.e., http://$solrnode:8983/solr/collection1/suggest?suggest.build=true, consider maxDocs or deleted docs as well? Is the suggester built from solr/collection1/suggest?suggest.build=true different from one built via the buildOnCommit=true property? 
Does anyone have a better solution to keep the suggester current with the contents of the index under frequent soft commits? Does Solr have a scheduler component, similar to cron, to schedule the suggest build and the daily optimize? Thanks, Rajesh.
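For reference, the buildOnCommit/buildOnOptimize knobs discussed above live on the SuggestComponent configuration in solrconfig.xml. A minimal sketch of the setup being described (the suggester name matches the AnalyzingSuggester seen in the responses above; the source field title is illustrative, not taken from the actual config):

```xml
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">AnalyzingSuggester</str>
    <str name="lookupImpl">AnalyzingLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">title</str>
    <!-- disabled: not recommended with frequent soft commits -->
    <str name="buildOnCommit">false</str>
    <!-- disabled: optimizes happen too rarely (about once a day) to keep suggestions fresh -->
    <str name="buildOnOptimize">false</str>
  </lst>
</searchComponent>
```

With both flags off, each replica only rebuilds when it receives an explicit suggest.build=true request, which is why the hourly cron has to hit every node individually.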
RE: optimize status
Is there really a good reason to consolidate down to a single segment? Archiving (as one example). Come July 1, the collection for log entries/transactions in June will never be changed, so optimizing is actually a good thing to do. Kind of getting away from OP's question on this, but I don't think the ability to move data between shards in SolrCloud (such as shard splitting) has much to do with the Lucene segments under the hood. I'm just guessing, but I'd think the main issue with shard splitting would be to ensure that document route ranges are handled properly, and I don't think the value used for routing has anything to do with what segment they happen to be stored into. -Original Message- From: Reitzel, Charles [mailto:charles.reit...@tiaa-cref.org] Sent: Monday, June 29, 2015 11:38 AM To: solr-user@lucene.apache.org Subject: RE: optimize status Is there really a good reason to consolidate down to a single segment? Any incremental query performance benefit is tiny compared to the loss of manageability. I.e. shouldn't segments _always_ be kept small enough to facilitate re-balancing data across shards? Even in non-cloud instances this is true. When a collection grows, you may want to shard/split an existing index by adding a node and moving some segments around. Isn't this the direction Solr is going? With many, smaller segments, this is feasible. With one big segment, the collection must always be reindexed. Thus, optimize would mean get rid of all deleted records and would, in fact, optimize queries by eliminating wasted I/O. Perhaps worth it for slowly changing indexes. Seems like the Tiered merge policy is 90% there... Or am I all wet (again)? -Original Message- From: Walter Underwood [mailto:wun...@wunderwood.org] Sent: Monday, June 29, 2015 10:39 AM To: solr-user@lucene.apache.org Subject: Re: optimize status Optimize is a manual full merge. Solr automatically merges segments as needed. This also expunges deleted documents. 
We really need to rename optimize to force merge. Is there a Jira for that? wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On Jun 29, 2015, at 5:15 AM, Steven White swhite4...@gmail.com wrote: Hi Upayavira, This is news to me that we should not optimize an index. What about disk space saving: isn't optimization needed to reclaim disk space, or does Solr somehow do that on its own? Where can I read more about this? I'm on Solr 5.1.0 (may switch to 5.2.1) Thanks Steve On Mon, Jun 29, 2015 at 4:16 AM, Upayavira u...@odoko.co.uk wrote: I'm afraid I don't understand. You're saying that optimising is causing performance issues? Simple solution: DO NOT OPTIMIZE! Optimisation is very badly named. What it does is squash all segments in your index into one segment, removing all deleted documents. It is good to get rid of deletes - in that sense the index is optimized. However, future merges become very expensive. The best way to handle this topic is to leave it to Lucene/Solr to do it for you. Pretend the optimize option never existed. This is, of course, assuming you are using something like Solr 3.5+. Upayavira On Mon, Jun 29, 2015, at 08:08 AM, Summer Shire wrote: Have to because of performance issues. Just want to know if there is a way to tap into the status. On Jun 28, 2015, at 11:37 PM, Upayavira u...@odoko.co.uk wrote: Bigger question: why are you optimizing? Since 3.6 or so, it generally hasn't been required; indeed, it can be a bad thing. Upayavira On Sun, Jun 28, 2015, at 09:37 PM, Summer Shire wrote: Hi All, I have two indexers (independent processes) writing to a common solr core. If one indexer process issues an optimize on the core, I want the second indexer to wait adding docs until the optimize has finished. Are there ways I can do this programmatically? Pinging the core when the optimize is happening returns OK because technically solr allows you to update while an optimize is happening. Any suggestions? 
thanks, Summer
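One middle ground between never optimizing and a full single-segment merge, touched on by Walter's remark that merging "also expunges deleted documents": the XML commit command accepts an expungeDeletes flag, which merges only segments containing deletions rather than collapsing everything into one segment. A sketch of the update message (still I/O-heavy, so worth using sparingly):

```xml
<!-- POSTed to /solr/<core>/update: reclaims space from deleted docs
     by merging only segments that contain deletions, not the whole index -->
<commit expungeDeletes="true"/>
```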
Re: optimize status
Reitzel, Charles charles.reit...@tiaa-cref.org wrote: Question, Toke: in your immutable cases, don't the benefits of optimizing come mostly from eliminating deleted records? Not for us. We have about 1 deleted document for every 1,000 or 10,000 standard documents. Is there any material difference in heap, CPU, etc. between 1, 5 or 10 segments? I.e. at how many segments/shard do you see a noticeable performance hit? It really is either 1 or more than 1 segment, coupled with 0 deleted records or more than 0. Having 1 segment means that String faceting benefits from not having to map between segment ordinals and global ordinals. That's a speed increase (just a null check instead of a memory lookup) as well as a heap requirement reduction: we save 2GB+ heap per shard on that account (our current heap size is 8GB). Granted, we facet on 600M values for one of the fields, which I don't think is very common. 0 deleted records is related, as the usual bitmap of deleted documents is null, meaning faster checks. Most of the performance benefit probably comes from the freed memory. We have 25 shards/machine, so sparing 2GB gives us an extra 50GB of disk cache. The performance increase for that is 20-40%, guesstimated from some previous tests where we varied the disk cache size. I doubt that there is much difference between 2, 5, 10 or even 20 segments. The people at UKWA are running some tests on different degrees of optimization of their 30-shard TB-class index. You'll have to dig a bit, but there might be relevant results: https://github.com/ukwa/shine/tree/master/python/test-logs Also, I'm curious if you have experimented much with the maxMergedSegmentMB and reclaimDeletesWeight properties of the TieredMergePolicy? I have zero experience with that: we build the shards one at a time and don't touch them after that. 90% of our building power goes to Tika analysis, so there hasn't been an apparent need for tuning Solr's indexing. - Toke Eskildsen
Re: optimize status
Reitzel, Charles charles.reit...@tiaa-cref.org wrote: Is there really a good reason to consolidate down to a single segment? In the scenario spawning this thread it does not seem to be the best choice. Speaking more broadly, there are Solr setups out there that deal with immutable data, often tied to a point in time, e.g. log data. We have such a setup (harvested web resources) and are able to lower heap requirements significantly and increase speed by building fully optimized and immutable shards. Any incremental query performance benefit is tiny compared to the loss of manageability. True in many cases, and I agree that the Optimize wording is a bit of a trap. While technically correct, it implies that one should do it occasionally to keep any index fit. A different wording and maybe a tooltip saying something like Only recommended for non-changing indexes might be better. Turning it around: to minimize the risk of occasional performance-degrading large merges, one might want an index where all the shards are below a certain size. Splitting larger shards into smaller ones would in that case also be an optimization, just towards a different goal. - Toke Eskildsen
Re: Reading indexed data from solr 5.1.0 using admin/luke?
Use the schema browser on the admin UI, and click the load term info button. It'll show you the terms in your index. You can also use the analysis tab, which will show you how it would tokenise stuff for a specific field. Upayavira On Mon, Jun 29, 2015, at 06:53 PM, Dinesh Naik wrote: Hi Erick, By compressed value I meant the value of a field after removing special characters. In my example it's "-": the compressed form of red-apple is redapple. I wanted to know if we can see the analyzed version of fields. For example, if I use ngram on a field, how do I see the analyzed values in the index? -Original Message- From: Erick Erickson erickerick...@gmail.com Sent: 29-06-2015 18:12 To: solr-user@lucene.apache.org solr-user@lucene.apache.org Subject: Re: Reading indexed data from solr 5.1.0 using admin/luke? Not quite sure what you mean by compressed values. admin/luke doesn't show the results of the compression of the stored values, there's no way I know of to do that. Best, Erick On Mon, Jun 29, 2015 at 8:20 AM, dinesh naik dineshkumarn...@gmail.com wrote: Hi all, Is there a way to read the indexed data for a field on which the analysis/processing has been done? I know using the admin GUI we can see field-wise analysis. But how can I get hold of the complete document using admin/luke? or any other way? For example, if I have 2 fields called name and compressedname. 
name has values like apple, green-apple, red-apple; compressedname has values like apple, greenapple, redapple. Even though I make both these fields indexed=true and stored=true, I am not able to see the compressed values using admin/luke?id=mydocid. In the response I see something like this:

<lst name="name">
  <str name="type">string</str>
  <str name="schema">ITS--</str>
  <str name="flags">ITS--</str>
  <str name="value">GREEN-APPLE</str>
  <str name="internal">GREEN-APPLE</str>
  <float name="boost">1.0</float>
  <int name="docFreq">0</int>
</lst>
<lst name="compressedname">
  <str name="type">string</str>
  <str name="schema">ITS--</str>
  <str name="flags">ITS--</str>
  <str name="value">GREEN-APPLE</str>
  <str name="internal">GREEN-APPLE</str>
  <float name="boost">1.0</float>
  <int name="docFreq">0</int>
</lst>

-- Best Regards, Dinesh Naik
Re: Questions regarding autosuggest (Solr 5.2.1)
Try not putting it in double quotes? Best, Erick On Mon, Jun 29, 2015 at 12:22 PM, Thomas Michael Engelke thomas.enge...@posteo.de wrote: A friend and I are trying to develop some software using Solr in the background, and with that comes a lot of changes. We're used to older versions (4.3 and below). We especially have problems with the autosuggest feature. This is the field definition (schema.xml) for our autosuggest field:

<field name="autosuggest" type="autosuggest" indexed="true" stored="true" required="false" multiValued="true"/>
...
<copyField source="name" dest="autosuggest"/>
...
<fieldType name="autosuggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="0" catenateAll="0" preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3" maxSubwordSize="30" onlyLongestMatch="false"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="30"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="0" catenateAll="0" preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Afterwards, we defined an autosuggest component to use this field, like this (solrconfig.xml):

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="storeDir">suggester_fuzzy_dir</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">suggest</str>
    <str name="suggestAnalyzerFieldType">autosuggest</str>
    <str name="buildOnStartup">false</str>
    <str name="buildOnCommit">false</str>
  </lst>
</searchComponent>

And we add a request handler to test out the functionality:

<requestHandler name="/suggesthandler" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
    <str name="suggest.dictionary">mySuggester</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

However, trying to start the core that has this configuration, a long exception occurs, telling us this: Error in configuration: autosuggest is not defined in the schema. Now, that seems to be wrong. Any idea how to fix that?
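One detail worth double-checking in the thread above (an observation from reading the posted config, not a confirmed fix): the suggester's field parameter is set to suggest, while the schema defines the field as autosuggest. If the lookup is failing on the field reference, the sketch of the fix would be:

```xml
<!-- hypothetical fix: point the suggester at the field actually defined in schema.xml -->
<str name="field">autosuggest</str>
```

The DocumentDictionaryFactory reads suggestion terms from a stored field, so the field named here must exist in the schema and be stored.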
Re: Correcting text at index time
Yes, do this in an update request processor before it gets to the analyzer chain. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On Jun 29, 2015, at 3:19 PM, Erick Erickson erickerick...@gmail.com wrote: Hmmm, very hard to do currently. The _point_ of stored fields is that an exact, verbatim copy of the input is returned in fl lists, and this would be violating that promise. I suppose some kind of custom update processor could work, but it's really roll-your-own functionality, I think. Best, Erick On Mon, Jun 29, 2015 at 8:38 AM, hossmaa andreea.hossm...@gmail.com wrote: Hi Markus, Thanks for the reply. I'm already using the Synonyms filter and it is working fine (i.e., when I search for customer, it also returns documents containing cst.). What the synonyms filter does not do is actually replace the word cst. with customer in the document. Just to be clearer: in the returned results, I do not want to see the word cst. any more (it should be permanently replaced with customer). I want to only see the expanded form. Cheers A. -- View this message in context: http://lucene.472066.n3.nabble.com/Correcting-text-at-index-time-tp4214636p4214643.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: optimize status
For the sake of history, somewhere around Solr/Lucene 3.2 a new MergePolicy was introduced. The old one merged simply based upon age, or index generation, meaning the older the segment, the less likely it would get merged, hence needing optimize to clear out deletes from your older segments. The new MergePolicy, the TieredMergePolicy, uses a more intelligent algorithm to decide which segments to merge, and is the single reason why optimization isn't recommended anymore. According to the javadocs: For normal merging, this policy first computes a budget of how many segments are allowed to be in the index. If the index is over-budget, then the policy sorts segments by decreasing size (pro-rating by percent deletes), and then finds the least-cost merge. Merge cost is measured by a combination of the skew of the merge (size of largest segment divided by smallest segment), total merge size and percent deletes reclaimed, so that merges with lower skew, smaller size and those reclaiming more deletes, are favored. If a merge will produce a segment that's larger than setMaxMergedSegmentMB(double), then the policy will merge fewer segments (down to 1 at once, if that one has deletions) to keep the segment size under budget. Upayavira On Mon, Jun 29, 2015, at 08:55 PM, Toke Eskildsen wrote: Reitzel, Charles charles.reit...@tiaa-cref.org wrote: Is there really a good reason to consolidate down to a single segment? In the scenario spawning this thread it does not seem to be the best choice. Speaking more broadly, there are Solr setups out there that deal with immutable data, often tied to a point in time, e.g. log data. We have such a setup (harvested web resources) and are able to lower heap requirements significantly and increase speed by building fully optimized and immutable shards. Any incremental query performance benefit is tiny compared to the loss of manageability. True in many cases, and I agree that the Optimize wording is a bit of a trap. 
While technically correct, it implies that one should do it occasionally to keep any index fit. A different wording and maybe a tooltip saying something like Only recommended for non-changing indexes might be better. Turning it around: To minimize the risk of occasional performance-degrading large merges, one might want an index where all the shards are below a certain size. Splitting larger shards into smaller ones would in that case also be an optimization, just towards a different goal. - Toke Eskildsen
Re: Correcting text at index time
Hmmm, very hard to do currently. The _point_ of stored fields is that an exact, verbatim copy of the input is returned in fl lists, and this would be violating that promise. I suppose some kind of custom update processor could work, but it's really roll-your-own functionality, I think. Best, Erick On Mon, Jun 29, 2015 at 8:38 AM, hossmaa andreea.hossm...@gmail.com wrote: Hi Markus, Thanks for the reply. I'm already using the Synonyms filter and it is working fine (i.e., when I search for customer, it also returns documents containing cst.). What the synonyms filter does not do is actually replace the word cst. with customer in the document. Just to be clearer: in the returned results, I do not want to see the word cst. any more (it should be permanently replaced with customer). I want to only see the expanded form. Cheers A. -- View this message in context: http://lucene.472066.n3.nabble.com/Correcting-text-at-index-time-tp4214636p4214643.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Correcting text at index time
The regex replace processor can be used to do this: https://lucene.apache.org/solr/5_2_0/solr-core/org/apache/solr/update/processor/RegexReplaceProcessorFactory.html -- Jack Krupansky On Mon, Jun 29, 2015 at 6:20 PM, Walter Underwood wun...@wunderwood.org wrote: Yes, do this in an update request processor before it gets to the analyzer chain. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On Jun 29, 2015, at 3:19 PM, Erick Erickson erickerick...@gmail.com wrote: Hmmm, very hard to do currently. The _point_ of stored fields is that an exact, verbatim copy of the input is returned in fl lists, and this would be violating that promise. I suppose some kind of custom update processor could work, but it's really roll-your-own functionality, I think. Best, Erick On Mon, Jun 29, 2015 at 8:38 AM, hossmaa andreea.hossm...@gmail.com wrote: Hi Markus, Thanks for the reply. I'm already using the Synonyms filter and it is working fine (i.e., when I search for customer, it also returns documents containing cst.). What the synonyms filter does not do is actually replace the word cst. with customer in the document. Just to be clearer: in the returned results, I do not want to see the word cst. any more (it should be permanently replaced with customer). I want to only see the expanded form. Cheers A. -- View this message in context: http://lucene.472066.n3.nabble.com/Correcting-text-at-index-time-tp4214636p4214643.html Sent from the Solr - User mailing list archive at Nabble.com.
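To make Jack's pointer concrete, here is a sketch of an update request processor chain using RegexReplaceProcessorFactory to rewrite cst. to customer before the document is stored and analyzed. The chain name, field name, and pattern are illustrative, taken from the example in the thread rather than anyone's actual config:

```xml
<updateRequestProcessorChain name="expand-abbreviations">
  <!-- rewrites matching text in the named field(s) of each incoming document -->
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">content</str>
    <str name="pattern">\bcst\.</str>
    <str name="replacement">customer</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain is then attached to the update handler (e.g. via update.chain in the /update handler's defaults). Because the replacement runs before RunUpdateProcessorFactory, the stored copy of the field no longer contains cst., which is what the original poster wanted.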
RE: optimize status
Question, Toke: in your immutable cases, don't the benefits of optimizing come mostly from eliminating deleted records? Is there any material difference in heap, CPU, etc. between 1, 5 or 10 segments? I.e. at how many segments/shard do you see a noticeable performance hit? Also, I'm curious if you have experimented much with the maxMergedSegmentMB and reclaimDeletesWeight properties of the TieredMergePolicy? For frequently updated indexes, would setting maxMergedSegmentMB lower (say 512 or 1024 MB, depending on total index size) and reclaimDeletesWeight higher (say 2.5?) be a good best practice? -Original Message- From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk] Sent: Monday, June 29, 2015 3:56 PM To: solr-user@lucene.apache.org Subject: Re: optimize status Reitzel, Charles charles.reit...@tiaa-cref.org wrote: Is there really a good reason to consolidate down to a single segment? In the scenario spawning this thread it does not seem to be the best choice. Speaking more broadly, there are Solr setups out there that deal with immutable data, often tied to a point in time, e.g. log data. We have such a setup (harvested web resources) and are able to lower heap requirements significantly and increase speed by building fully optimized and immutable shards. Any incremental query performance benefit is tiny compared to the loss of manageability. True in many cases, and I agree that the Optimize wording is a bit of a trap. While technically correct, it implies that one should do it occasionally to keep any index fit. A different wording and maybe a tooltip saying something like Only recommended for non-changing indexes might be better. Turning it around: to minimize the risk of occasional performance-degrading large merges, one might want an index where all the shards are below a certain size. Splitting larger shards into smaller ones would in that case also be an optimization, just towards a different goal. 
- Toke Eskildsen
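For anyone wanting to try the two TieredMergePolicy properties Charles asks about, they are set under indexConfig in solrconfig.xml. A sketch for the Solr 5.x era discussed here, using the values Charles floats (illustrative starting points, not a verified best practice):

```xml
<indexConfig>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <!-- cap on merged segment size; smaller caps keep segments easier to move around -->
    <double name="maxMergedSegmentMB">1024.0</double>
    <!-- values above 1.0 bias merge selection toward segments with many deleted docs -->
    <double name="reclaimDeletesWeight">2.5</double>
  </mergePolicy>
</indexConfig>
```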
Re: Reading indexed data from solr 5.1.0 using admin/luke?
You can also use the TermsComponent; that'll read the values from the indexed fields. That gets the raw terms, but they aren't grouped, and you don't get the document. Reconstructing the doc from the postings lists is actually quite tedious. The Luke program (not the request handler) has a function that does this; it's not fast though, more for troubleshooting than trying to do anything in a production environment. That said, I'm not quite sure what the current state of Luke is... Best, Erick On Mon, Jun 29, 2015 at 5:25 PM, Upayavira u...@odoko.co.uk wrote: Use the schema browser on the admin UI, and click the load term info button. It'll show you the terms in your index. You can also use the analysis tab, which will show you how it would tokenise stuff for a specific field. Upayavira On Mon, Jun 29, 2015, at 06:53 PM, Dinesh Naik wrote: Hi Erick, By compressed value I meant the value of a field after removing special characters. In my example it's "-": the compressed form of red-apple is redapple. I wanted to know if we can see the analyzed version of fields. For example, if I use ngram on a field, how do I see the analyzed values in the index? -Original Message- From: Erick Erickson erickerick...@gmail.com Sent: 29-06-2015 18:12 To: solr-user@lucene.apache.org solr-user@lucene.apache.org Subject: Re: Reading indexed data from solr 5.1.0 using admin/luke? Not quite sure what you mean by compressed values. admin/luke doesn't show the results of the compression of the stored values, there's no way I know of to do that. Best, Erick On Mon, Jun 29, 2015 at 8:20 AM, dinesh naik dineshkumarn...@gmail.com wrote: Hi all, Is there a way to read the indexed data for a field on which the analysis/processing has been done? I know using the admin GUI we can see field-wise analysis. But how can I get hold of the complete document using admin/luke? or any other way? For example, if I have 2 fields called name and compressedname. 
name has values like apple, green-apple, red-apple; compressedname has values like apple, greenapple, redapple. Even though I make both these fields indexed=true and stored=true, I am not able to see the compressed values using admin/luke?id=mydocid. In the response I see something like this:

<lst name="name">
  <str name="type">string</str>
  <str name="schema">ITS--</str>
  <str name="flags">ITS--</str>
  <str name="value">GREEN-APPLE</str>
  <str name="internal">GREEN-APPLE</str>
  <float name="boost">1.0</float>
  <int name="docFreq">0</int>
</lst>
<lst name="compressedname">
  <str name="type">string</str>
  <str name="schema">ITS--</str>
  <str name="flags">ITS--</str>
  <str name="value">GREEN-APPLE</str>
  <str name="internal">GREEN-APPLE</str>
  <float name="boost">1.0</float>
  <int name="docFreq">0</int>
</lst>

-- Best Regards, Dinesh Naik
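To follow Erick's TermsComponent suggestion, a minimal sketch of registering it with a dedicated handler (most example solrconfig.xml files already ship something equivalent):

```xml
<!-- reads raw terms straight from the inverted index -->
<searchComponent name="terms" class="solr.TermsComponent"/>

<requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <bool name="terms">true</bool>
  </lst>
  <arr name="components">
    <str>terms</str>
  </arr>
</requestHandler>
```

Querying /terms?terms.fl=compressedname&terms.prefix=green would then list the indexed (post-analysis) terms for the compressedname field from Dinesh's example, showing whether greenapple was actually produced at index time.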
RE: optimize status
Hi Garth, Yes, I'm straying from the OP's question (I think Steve is all set). But his question, quite naturally, comes up often, and a similar discussion ensues each time. I take your point about shards and segments being different things. I understand that the hash ranges per segment are not kept in ZK. I guess I wish they were. In this regard, I liked MongoDB, which uses a 2-level sharding scheme. Each shard manages a list of chunks, each with its own hash range, which is kept in the cluster state. If data needs to be balanced across nodes, it works at the chunk level. No record/doc-level I/O is necessary. Much more targeted, and only the data that needs to move is touched. Solr does most things better than Mongo, imo. But this is one area where Mongo got it right. As for your example, what benefit does an application gain by reducing 10 segments, say, down to 1? Even if the index never changes? The gain _might_ be measurable, but it will be small compared to performance gains that can be had by maintaining a good data balance across nodes. Your example is based on implicit routing, so dynamic management of shards is less applicable. I just hope you get similar volumes of data every year. Otherwise, some years will perform better than others due to unbalanced data distribution! best, Charlie -Original Message- From: Garth Grimm [mailto:gdgr...@yahoo.com.INVALID] Sent: Monday, June 29, 2015 1:15 PM To: solr-user@lucene.apache.org Subject: RE: optimize status Is there really a good reason to consolidate down to a single segment? Archiving (as one example). Come July 1, the collection for log entries/transactions in June will never be changed, so optimizing is actually a good thing to do. Kind of getting away from OP's question on this, but I don't think the ability to move data between shards in SolrCloud (such as shard splitting) has much to do with the Lucene segments under the hood. 
I'm just guessing, but I'd think the main issue with shard splitting would be to ensure that document route ranges are handled properly, and I don't think the value used for routing has anything to do with what segment the documents happen to be stored in. -Original Message- From: Reitzel, Charles [mailto:charles.reit...@tiaa-cref.org] Sent: Monday, June 29, 2015 11:38 AM To: solr-user@lucene.apache.org Subject: RE: optimize status Is there really a good reason to consolidate down to a single segment? Any incremental query performance benefit is tiny compared to the loss of manageability. I.e. shouldn't segments _always_ be kept small enough to facilitate re-balancing data across shards? Even in non-cloud instances this is true. When a collection grows, you may want to shard/split an existing index by adding a node and moving some segments around. Isn't this the direction Solr is going? With many, smaller segments, this is feasible. With one big segment, the collection must always be reindexed. Thus, optimize would mean "get rid of all deleted records" and would, in fact, optimize queries by eliminating wasted I/O. Perhaps worth it for slowly changing indexes. Seems like the Tiered merge policy is 90% there... Or am I all wet (again)? -Original Message- From: Walter Underwood [mailto:wun...@wunderwood.org] Sent: Monday, June 29, 2015 10:39 AM To: solr-user@lucene.apache.org Subject: Re: optimize status Optimize is a manual full merge. Solr automatically merges segments as needed. This also expunges deleted documents. We really need to rename optimize to force merge. Is there a Jira for that? wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On Jun 29, 2015, at 5:15 AM, Steven White swhite4...@gmail.com wrote: Hi Upayavira, This is news to me that we should not optimize an index. What about disk space savings: isn't optimization meant to reclaim disk space, or does Solr do that somehow? Where can I read more about this? 
I'm on Solr 5.1.0 (may switch to 5.2.1) Thanks Steve On Mon, Jun 29, 2015 at 4:16 AM, Upayavira u...@odoko.co.uk wrote: I'm afraid I don't understand. You're saying that optimising is causing performance issues? Simple solution: DO NOT OPTIMIZE! Optimisation is very badly named. What it does is squash all segments in your index into one segment, removing all deleted documents. It is good to get rid of deletes - in that sense the index is optimized. However, future merges become very expensive. The best way to handle this topic is to leave it to Lucene/Solr to do it for you. Pretend the optimize option never existed. This is, of course, assuming you are using something like Solr 3.5+. Upayavira On Mon, Jun 29, 2015, at 08:08 AM, Summer Shire wrote: Have to, because of performance issues. Just want to know if there is a way to tap into the status. On Jun 28, 2015, at 11:37 PM, Upayavira u...@odoko.co.uk wrote: Bigger
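To make the shard-splitting side of the thread concrete: in SolrCloud, SPLITSHARD is a Collections API call that divides a shard's hash range and reindexes its documents into two sub-shards, so it operates at the document-routing level rather than by moving Lucene segments between nodes. A minimal sketch of building the request; the collection and shard names are placeholders:

```python
from urllib.parse import urlencode

def split_shard_url(base_url, collection, shard):
    # SPLITSHARD divides the target shard's hash range in two and routes its
    # documents into new sub-shards -- per-document work, independent of how
    # those documents happen to be laid out in Lucene segments.
    params = {
        "action": "SPLITSHARD",
        "collection": collection,
        "shard": shard,
        "wt": "json",
    }
    return base_url.rstrip("/") + "/admin/collections?" + urlencode(params)

# Placeholder collection/shard names:
url = split_shard_url("http://localhost:8983/solr", "logs_june", "shard1")
print(url)
```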
SOLR 5.1.0 DB dataimport handler from orientdb
Hi everyone! I want to import data from OrientDB into Solr 5.1.0. Here is my configuration:

*data-config.xml*

<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.orientechnologies.orient.jdbc.OrientJdbcDriver"
              url="jdbc:orient:remote:localhost/emallates_combine"
              user="root" password="root" batchSize="-1"/>
  <document>
    <entity name="item"
            query="select * from sellings"
            deltaQuery="select * from sellings where updatedAt &gt; '${dataimporter.last_index_time}'">
      <field name="id" column="price" />
      <field name="title" column="status" />
    </entity>
  </document>
</dataConfig>

*JDBC*: driver link http://orientdb.com/download/ — I pasted this driver into *{solr_install_dir}/dist/orientdb-jdbc-2.0.5.jar*

My configuration shows no error, but also no output. Here is the Solr log after a full/delta import call:

INFO - 2015-06-29 12:37:24.894; [ DB] org.apache.solr.handler.dataimport.DataImporter; Loading DIH Configuration: db-data-config.xml
INFO - 2015-06-29 12:37:24.899; [ DB] org.apache.solr.handler.dataimport.DataImporter; Data Configuration loaded successfully
INFO - 2015-06-29 12:37:24.900; [ DB] org.apache.solr.core.SolrCore; [DB] webapp=/solr path=/dataimport params={debug=false&optimize=false&indent=true&commit=true&clean=true&wt=json&command=full-import&verbose=false} status=0 QTime=7
INFO - 2015-06-29 12:37:24.902; [ DB] org.apache.solr.handler.dataimport.DataImporter; Starting Full Import
*WARN - 2015-06-29 12:37:24.912; [ DB] org.apache.solr.handler.dataimport.SimplePropertiesWriter; Unable to read: dataimport.properties*
INFO - 2015-06-29 12:37:24.914; [ DB] org.apache.solr.core.SolrCore; [DB] webapp=/solr path=/dataimport params={indent=true&wt=json&command=status&_=1435567044911} status=0 QTime=1
INFO - 2015-06-29 12:37:24.942; [ DB] org.apache.solr.handler.dataimport.JdbcDataSource$1; Creating a connection for entity item with URL: jdbc:orient:remote:localhost/emallates_combine
INFO - 2015-06-29 12:37:24.942; [ DB] org.apache.solr.update.processor.LogUpdateProcessor; [DB] webapp=/solr path=/dataimport params={debug=false&optimize=false&indent=true&commit=true&clean=true&wt=json&command=full-import&verbose=false} status=0 QTime=7 {deleteByQuery=*:* (-1505301149686693888)} 0 49
INFO - 2015-06-29 12:37:32.992; [ DB] org.apache.solr.core.SolrCore; [DB] webapp=/solr path=/dataimport params={wt=json&command=abort&_=1435567052987} status=0 QTime=1
INFO - 2015-06-29 12:37:33.000; [ DB] org.apache.solr.core.SolrCore; [DB] webapp=/solr path=/dataimport params={indent=true&wt=json&command=status&_=1435567052997} status=0 QTime=0

Solr is not importing any data. What am I doing wrong? And second, why am I getting the above warning?

Thank you
Regards
Nauman Ramzan
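For reference, in DIH each <field> element maps a result-set column (column=) to a schema field (name=). In the config above, the price column is mapped into the id field and status into title, which may be a transcription slip; a more conventional mapping (column names here are illustrative, not from the OrientDB schema) would look like:

```xml
<entity name="item" query="select * from sellings">
  <!-- column = the JDBC result-set column, name = the Solr schema field -->
  <field column="id"    name="id" />
  <field column="title" name="title" />
</entity>
```

The "Unable to read: dataimport.properties" warning is normal on the very first import: that file records the last index time and only exists after one import has completed.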
Re: optimize status
I'm afraid I don't understand. You're saying that optimising is causing performance issues? Simple solution: DO NOT OPTIMIZE! Optimisation is very badly named. What it does is squash all segments in your index into one segment, removing all deleted documents. It is good to get rid of deletes - in that sense the index is optimized. However, future merges become very expensive. The best way to handle this topic is to leave it to Lucene/Solr to do it for you. Pretend the optimize option never existed. This is, of course, assuming you are using something like Solr 3.5+. Upayavira On Mon, Jun 29, 2015, at 08:08 AM, Summer Shire wrote: Have to, because of performance issues. Just want to know if there is a way to tap into the status. On Jun 28, 2015, at 11:37 PM, Upayavira u...@odoko.co.uk wrote: Bigger question, why are you optimizing? Since 3.6 or so, it generally hasn't been required; it can even be a bad thing. Upayavira On Sun, Jun 28, 2015, at 09:37 PM, Summer Shire wrote: Hi All, I have two indexers (independent processes) writing to a common Solr core. If one indexer process issues an optimize on the core, I want the second indexer to wait to add docs until the optimize has finished. Are there ways I can do this programmatically? Pinging the core while the optimize is happening returns OK, because technically Solr allows you to update while an optimize is happening. Any suggestions? thanks, Summer
Re: need advice on parent child mulitple category
Hello, any advice please? -- View this message in context: http://lucene.472066.n3.nabble.com/need-advice-on-parent-child-mulitple-category-tp4214140p4214602.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: optimize status
Have to, because of performance issues. Just want to know if there is a way to tap into the status. On Jun 28, 2015, at 11:37 PM, Upayavira u...@odoko.co.uk wrote: Bigger question, why are you optimizing? Since 3.6 or so, it generally hasn't been required; it can even be a bad thing. Upayavira On Sun, Jun 28, 2015, at 09:37 PM, Summer Shire wrote: Hi All, I have two indexers (independent processes) writing to a common Solr core. If one indexer process issues an optimize on the core, I want the second indexer to wait to add docs until the optimize has finished. Are there ways I can do this programmatically? Pinging the core while the optimize is happening returns OK, because technically Solr allows you to update while an optimize is happening. Any suggestions? thanks, Summer
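For "tapping into the status", one hedged option is to poll the CoreAdmin STATUS endpoint and inspect the index section of the response. The response shape sketched below is based on the stock CoreAdmin JSON output; verify the exact keys against your own Solr version before relying on them:

```python
from urllib.parse import urlencode

def core_status_url(base_url, core):
    # CoreAdmin STATUS request; the JSON response includes an "index" section
    # with stats such as segmentCount and hasDeletions that a second indexer
    # could poll before resuming updates.
    params = {"action": "STATUS", "core": core, "wt": "json"}
    return base_url.rstrip("/") + "/admin/cores?" + urlencode(params)

def looks_optimized(status_json, core):
    # Hypothetical response shape sketched from the stock CoreAdmin output;
    # check the keys against the actual response from your Solr version.
    index = status_json.get("status", {}).get(core, {}).get("index", {})
    return index.get("segmentCount") == 1 and not index.get("hasDeletions", True)

url = core_status_url("http://localhost:8983/solr", "core1")
sample = {"status": {"core1": {"index": {"segmentCount": 1, "hasDeletions": False}}}}
print(url, looks_optimized(sample, "core1"))
```

Note that this only reveals the index state after the fact; it does not tell you whether an optimize is currently running, which is part of why the thread leans on hooks instead.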
Re: need advice on parent child mulitple category
http://wiki.apache.org/solr/HierarchicalFaceting On Mon, Jun 29, 2015 at 11:27 AM, Darniz rnizamud...@edmunds.com wrote: hello any advice please -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: optimize status
Bigger question, why are you optimizing? Since 3.6 or so, it generally hasn't been required; it can even be a bad thing. Upayavira On Sun, Jun 28, 2015, at 09:37 PM, Summer Shire wrote: Hi All, I have two indexers (independent processes) writing to a common Solr core. If one indexer process issues an optimize on the core, I want the second indexer to wait to add docs until the optimize has finished. Are there ways I can do this programmatically? Pinging the core while the optimize is happening returns OK, because technically Solr allows you to update while an optimize is happening. Any suggestions? thanks, Summer
RE: optimize status
I see what you mean. Many thanks for the details. -Original Message- From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk] Sent: Monday, June 29, 2015 6:36 PM To: solr-user@lucene.apache.org Subject: Re: optimize status Reitzel, Charles charles.reit...@tiaa-cref.org wrote: Question, Toke: in your immutable cases, don't the benefits of optimizing come mostly from eliminating deleted records? Not for us. We have about 1 deleted document for every 1000 or 10.000 standard documents. Is there any material difference in heap, CPU, etc. between 1, 5 or 10 segments? I.e. at how many segments/shard do you see a noticeable performance hit? It really is either 1 or more than 1 segment, coupled with 0 deleted records or more than 0. Having 1 segment means that String faceting benefits from not having to map between segment ordinals and global ordinals. That's a speed increase (just a null check instead of a memory lookup) as well as a heap requirement reduction: We save 2GB+ heap per shard on that account (our current heap size is 8GB). Granted, we facet on 600M values for one of the fields, which I don't think is very common. 0 deleted records is related, as the usual bitmap of deleted documents is null, meaning faster checks. Most of the performance benefit probably comes from the freed memory. We have 25 shards/machine, so sparing 2GB gives us an extra 50GB of disk cache. The performance increase for that is 20-40%, guesstimated from some previous tests where we varied the disk cache size. I doubt that there is much difference between 2, 5, 10 or even 20 segments. The people at UKWA are running some tests on different degrees of optimization of their 30-shard TB-class index. You'll have to dig a bit, but there might be relevant results: https://github.com/ukwa/shine/tree/master/python/test-logs Also, I'm curious whether you have experimented much with the maxMergedSegmentMB and reclaimDeletesWeight properties of the TieredMergePolicy? 
I have zero experience with that: we build the shards one at a time and don't touch them after that. 90% of our building power goes to Tika analysis, so there hasn't been an apparent need for tuning Solr's indexing. - Toke Eskildsen
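For reference, the two merge-policy properties Charles asks about can be set in solrconfig.xml under <indexConfig>. This sketch uses the Solr 4.x/5.x-era <mergePolicy> syntax; the values shown are illustrative examples, not recommendations:

```xml
<indexConfig>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <!-- Cap the size of merged segments (MB); a smaller cap keeps more,
         smaller segments around, which eases rebalancing. -->
    <double name="maxMergedSegmentMB">5120.0</double>
    <!-- Bias merge selection toward segments carrying many deleted docs. -->
    <double name="reclaimDeletesWeight">3.0</double>
  </mergePolicy>
</indexConfig>
```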
Re: optimize status
Hi Upayavira and Erick, There are two things we are talking about here. First: Why am I optimizing? If I don't, our SEARCH (NOT INDEXING) performance is 100% worse. The problem lies in the total number of segments. We have to keep max segments at 1 or 2. I have done intensive performance-related tests around the number of segments, the merge factor, and changing the merge policy. Second: Solr does not perform better for me without an optimize. So now that I have to optimize, the second issue is updating concurrently during an optimize. If I update while an optimize is happening, the optimize takes 5 times as long as a normal optimize. So is there any way other than creating a postOptimize hook, writing the status to a file, and somehow making it available to the indexer? All of this just sounds traumatic :) Thanks Summer On Jun 29, 2015, at 5:40 AM, Erick Erickson erickerick...@gmail.com wrote: Steven: Yes, but first, here's Mike McCandless' excellent blog on segment merging: http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html I think the third animation is the TieredMergePolicy. In short, yes, an optimize will reclaim disk space. But as you update, this is done for you anyway. About the only time optimizing is at all beneficial is when you have a relatively static index. If you're continually updating documents, and by that I mean replacing some existing documents, then you'll immediately start generating holes in your index. And if you _do_ optimize, you wind up with a huge segment. And since the default policy tries to merge segments of roughly the same size, it accumulates deletes for quite a while before they're merged away. And if you don't update existing docs or delete docs, then there's no wasted space anyway. Summer: First off, why do you care about not updating during optimizing? There's no good reason you have to worry about that, you can freely update while optimizing. 
But frankly I have to agree with Upayavira that on the face of it you're doing a lot of extra work. See above, but you optimize while indexing, so immediately you're rather defeating the purpose. Personally I'd only optimize relatively static indexes and, by definition, your index isn't static since the second process is just waiting to modify it. Best, Erick On Mon, Jun 29, 2015 at 8:15 AM, Steven White swhite4...@gmail.com wrote: Hi Upayavira, This is news to me that we should not optimize an index. What about disk space savings: isn't optimization meant to reclaim disk space, or does Solr do that somehow? Where can I read more about this? I'm on Solr 5.1.0 (may switch to 5.2.1) Thanks Steve On Mon, Jun 29, 2015 at 4:16 AM, Upayavira u...@odoko.co.uk wrote: I'm afraid I don't understand. You're saying that optimising is causing performance issues? Simple solution: DO NOT OPTIMIZE! Optimisation is very badly named. What it does is squash all segments in your index into one segment, removing all deleted documents. It is good to get rid of deletes - in that sense the index is optimized. However, future merges become very expensive. The best way to handle this topic is to leave it to Lucene/Solr to do it for you. Pretend the optimize option never existed. This is, of course, assuming you are using something like Solr 3.5+. Upayavira On Mon, Jun 29, 2015, at 08:08 AM, Summer Shire wrote: Have to, because of performance issues. Just want to know if there is a way to tap into the status. On Jun 28, 2015, at 11:37 PM, Upayavira u...@odoko.co.uk wrote: Bigger question, why are you optimizing? Since 3.6 or so, it generally hasn't been required; it can even be a bad thing. Upayavira On Sun, Jun 28, 2015, at 09:37 PM, Summer Shire wrote: Hi All, I have two indexers (independent processes) writing to a common Solr core. If one indexer process issues an optimize on the core, I want the second indexer to wait to add docs until the optimize has finished. 
Are there ways I can do this programmatically? Pinging the core while the optimize is happening returns OK, because technically Solr allows you to update while an optimize is happening. Any suggestions? thanks, Summer
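The postOptimize hook Summer mentions does exist as a solrconfig.xml event listener. A sketch using RunExecutableListener; the script path is a placeholder, and the script itself (e.g. touching a marker file that the second indexer checks before resuming updates) is left to the application:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Runs after an explicit optimize completes on this core. -->
  <listener event="postOptimize" class="solr.RunExecutableListener">
    <str name="exe">bin/mark-optimize-done.sh</str>  <!-- placeholder script -->
    <str name="dir">.</str>
    <bool name="wait">false</bool>
  </listener>
</updateHandler>
```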