Re: Show all fields in Solr highlighting output
Hi Edwin,

I think the highlighting behaviour of those types shifts over time. Maybe we should do the reverse and move the snippets into the main response: https://issues.apache.org/jira/browse/SOLR-3479

Ahmet

On Thursday, June 11, 2015 11:23 AM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote:

Hi Ahmet, I've tried that, but it's still not able to show. Those fields are actually of type=float, type=date and type=int. Are those field types not able to be highlighted by default? Regards, Edwin

On 11 June 2015 at 15:03, Ahmet Arslan iori...@yahoo.com.invalid wrote:

Hi Edwin, hl.alternateField is probably what you are looking for. ahmet

On Thursday, June 11, 2015 5:38 AM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote:

Hi, Is it possible to list all the fields in the highlighting portion of the output? Currently, even when I set <str name="hl.fl">*</str>, it only shows fields where highlighting is possible; fields where highlighting is not possible are not shown. I would like the output to show all the fields together, regardless of whether highlighting is possible or not. Regards, Edwin
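[Editor's note] A sketch of the hl.alternateField suggestion above, set in a handler's defaults; the handler name and the per-field override for a hypothetical price field are assumptions, not from the thread:

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="hl">true</str>
    <str name="hl.fl">*</str>
    <!-- when a field yields no snippet, fall back to its raw stored value -->
    <str name="f.price.hl.alternateField">price</str>
  </lst>
</requestHandler>
```

With a per-field hl.alternateField like this, a field that cannot produce a snippet (e.g. float, date, int) should still appear in the highlighting section with its stored value.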
Re: Problem with german hyphenated words not being found
The next thing to do is add debugQuery=true to your URL (or enable it in the query pane of the admin UI). Then look for the parsed query info. On the standard text_en field, which includes an English stop word filter, I ran a query on Jack and Jill's House, which showed this output:

rawquerystring: text_en:(Jack and Jill's House)
querystring: text_en:(Jack and Jill's House)
parsedquery: text_en:jack text_en:jill text_en:hous
parsedquery_toString: text_en:jack text_en:jill text_en:hous

You can see that the parsed query is formed *after* analysis, so you can see exactly what is being queried for. Also, as a corollary to this, you can use the schema browser (or faceting, for that matter) to view what terms are being indexed, to see if they should match.

HTH
Upayavira

Am 11.06.2015 12:00 schrieb Upayavira:

Have you used the analysis tab in the admin UI? You can type in sentences for both index and query time and see how they would be analysed by various fields/field types. Once you have got index time and query time to result in the same tokens at the end of the analysis chain, you should start seeing matches in your queries. Upayavira

On Thu, Jun 11, 2015, at 10:26 AM, Thomas Michael Engelke wrote:

Hey, in German, you can string most nouns together by using hyphens, like this:

Industrie = industry
Anhänger = trailer
Industrie-Anhänger = trailer for industrial use

Here [1], you can see me querying Industrieanhänger from the name field (name:Industrieanhänger), to make sure the index actually contains the word. Our data is structured so that products are listed without the hyphen. Now, customers can come around and use the hyphenated version as a search term (i.e. industrie-anhänger), and of course we want them to find what they are looking for. I've set it up so that the WordDelimiterFilterFactory uses catenateWords=1, so that these words are catenated. An analysis of Industrieanhänger as index and industrie-anhänger as query can be seen here [2].
You can see that both word parts are found. However, querying for industrie-anhänger does not yield results; only when the hyphen is removed do I get results, as you can see here [3]. I'm not sure how to proceed from here, as the results of the analysis have so far always lined up with what I could see when querying.

Here's the schema definition for text, the field type for the name field:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="0" catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3" maxSubwordSize="30" onlyLongestMatch="false"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="0" catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3" maxSubwordSize="30" onlyLongestMatch="false"/> -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

I've also thought it might be a problem with URL encoding not encoding the hyphen, but replacing it with %2D didn't change the outcome (and was probably wrong anyway). Any help is greatly appreciated.

Links:
[1] http://imgur.com/2oEC5vz
[2] http://i.imgur.com/H0AhEsF.png
[3] http://imgur.com/dzmMe7t
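[Editor's note] The debugQuery check Upayavira describes earlier in the thread is just an extra request parameter; a sketch of such a request, with host and core name assumed (the ä would be percent-encoded on the wire):

```
http://localhost:8983/solr/collection1/select?q=name:industrie-anhänger&debugQuery=true
```

The parsedquery entries in the debug section of the response then show the query as it looks after analysis.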
Re: DocTransformers for restructuring output, e.g. Highlighting
Yes! It only needs to be done!

On Thu, Jun 11, 2015, at 11:38 AM, Ahmet Arslan wrote:

Hi Upayavira, I was going to suggest SOLR-3479 to Edwin when I saw your old post. Regarding your suggestion, there is an existing ticket: https://issues.apache.org/jira/browse/SOLR-3479 I think SOLR-7665 is also relevant to your question. Ahmet

On Sunday, June 23, 2013 9:54 PM, Upayavira u...@odoko.co.uk wrote:

I've just taken a peek at the src for DocTransformers. They get given a TransformContext. That context contains the query and a few other bits and pieces. If it contained the response, DocTransformers would be able to do output restructuring. The best example is hit highlighting. If you did:

hl=on&hl.fl=name&fl=*,[highlight:name]

you would no longer need to seek the highlighted strings in another part of the output. The conceptual downside of this approach is that we might expect the highlighting to be done inside the DocTransformer, not a search component, i.e. not needing the hl=on&hl.fl=name bit. That is, this would be a great change for existing Solr users, but might be confusing for new Solr users. I did try to move the highlighting code itself into the DocTransformer, but stalled at the point at which it needed to be CoreAware, as DocTransformers aren't allowed to be. Without that, it isn't possible to access the Highlighter components in the core's configuration. Thoughts? Is this a useful feature? Upayavira
Re: Problem with german hyphenated words not being found
Have you used the analysis tab in the admin UI? You can type in sentences for both index and query time and see how they would be analysed by various fields/field types. Once you have got index time and query time to result in the same tokens at the end of the analysis chain, you should start seeing matches in your queries. Upayavira

On Thu, Jun 11, 2015, at 10:26 AM, Thomas Michael Engelke wrote:

Hey, in German, you can string most nouns together by using hyphens, like this:

Industrie = industry
Anhänger = trailer
Industrie-Anhänger = trailer for industrial use

Here [1], you can see me querying Industrieanhänger from the name field (name:Industrieanhänger), to make sure the index actually contains the word. Our data is structured so that products are listed without the hyphen. Now, customers can come around and use the hyphenated version as a search term (i.e. industrie-anhänger), and of course we want them to find what they are looking for. I've set it up so that the WordDelimiterFilterFactory uses catenateWords=1, so that these words are catenated. An analysis of Industrieanhänger as index and industrie-anhänger as query can be seen here [2]. You can see that both word parts are found. However, querying for industrie-anhänger does not yield results, only when the hyphen is removed, as you can see here [3]. I'm not sure how to proceed from here, as the results of the analysis have so far always lined up with what I could see when querying.
Here's the schema definition for text, the field type for the name field:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="0" catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3" maxSubwordSize="30" onlyLongestMatch="false"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="0" catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3" maxSubwordSize="30" onlyLongestMatch="false"/> -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

I've also thought it might be a problem with URL encoding not encoding the hyphen, but replacing it with %2D didn't change the outcome (and was probably wrong anyway). Any help is greatly appreciated.
Links:
[1] http://imgur.com/2oEC5vz
[2] http://i.imgur.com/H0AhEsF.png
[3] http://imgur.com/dzmMe7t
Re: DocTransformers for restructuring output, e.g. Highlighting
Hi Upayavira, I was going to suggest SOLR-3479 to Edwin when I saw your old post. Regarding your suggestion, there is an existing ticket: https://issues.apache.org/jira/browse/SOLR-3479 I think SOLR-7665 is also relevant to your question. Ahmet

On Sunday, June 23, 2013 9:54 PM, Upayavira u...@odoko.co.uk wrote:

I've just taken a peek at the src for DocTransformers. They get given a TransformContext. That context contains the query and a few other bits and pieces. If it contained the response, DocTransformers would be able to do output restructuring. The best example is hit highlighting. If you did:

hl=on&hl.fl=name&fl=*,[highlight:name]

you would no longer need to seek the highlighted strings in another part of the output. The conceptual downside of this approach is that we might expect the highlighting to be done inside the DocTransformer, not a search component, i.e. not needing the hl=on&hl.fl=name bit. That is, this would be a great change for existing Solr users, but might be confusing for new Solr users. I did try to move the highlighting code itself into the DocTransformer, but stalled at the point at which it needed to be CoreAware, as DocTransformers aren't allowed to be. Without that, it isn't possible to access the Highlighter components in the core's configuration. Thoughts? Is this a useful feature? Upayavira
Problem with german hyphenated words not being found
Hey, in German, you can string most nouns together by using hyphens, like this:

Industrie = industry
Anhänger = trailer
Industrie-Anhänger = trailer for industrial use

Here [1], you can see me querying Industrieanhänger from the name field (name:Industrieanhänger), to make sure the index actually contains the word. Our data is structured so that products are listed without the hyphen. Now, customers can come around and use the hyphenated version as a search term (i.e. industrie-anhänger), and of course we want them to find what they are looking for. I've set it up so that the WordDelimiterFilterFactory uses catenateWords=1, so that these words are catenated. An analysis of Industrieanhänger as index and industrie-anhänger as query can be seen here [2]. You can see that both word parts are found. However, querying for industrie-anhänger does not yield results, only when the hyphen is removed, as you can see here [3]. I'm not sure how to proceed from here, as the results of the analysis have so far always lined up with what I could see when querying.
Here's the schema definition for text, the field type for the name field:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="0" catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3" maxSubwordSize="30" onlyLongestMatch="false"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="0" catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3" maxSubwordSize="30" onlyLongestMatch="false"/> -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

I've also thought it might be a problem with URL encoding not encoding the hyphen, but replacing it with %2D didn't change the outcome (and was probably wrong anyway). Any help is greatly appreciated.
Links:
[1] http://imgur.com/2oEC5vz
[2] http://i.imgur.com/H0AhEsF.png
[3] http://imgur.com/dzmMe7t
Re: Indexing issue - index get deleted
Hi Chris,

Amazing analysis! I actually did not investigate the log at first, because I was trying to get more information from the user. We are running full-import and delta-import crons:

full index: once a day
delta index: every 10 mins

"last night my index automatically deleted (numdocs=0). attaching logs for review."

Reading the user's initial mail more carefully, he does a full import as well (and at that point, cleaning the index). Not sure there is any practical reason to do that; the user will clarify that to us. So after the clean happened, something prevented the full import from proceeding, and we got the weird behaviour seen in the logs. Really curious to understand this better :)

2015-06-11 1:36 GMT+01:00 Chris Hostetter hossman_luc...@fucit.org:

: The guy was using delta import anyway, so maybe the problem is
: different and not related to the clean.

That's not what the logs say. Here's what I see...

The log begins with server startup @ Jun 10, 2015 11:14:56 AM. The DeletionPolicy for the shopclue_prod core is initialized at Jun 10, 2015 11:15:04 AM, and we see a few interesting things here that we note for the future as we keep reading:

1) There is currently commits:num=1 commits on disk
2) the current index dir in use is index.20150311161021822
3) the current segment generation is segFN=segments_1a,generation=46

Immediately after this, we see some searcher warming using a searcher with this same segments file, and then this searcher is registered (Jun 10, 2015 11:15:05 AM) and the core is registered. Next we see some replication polling, and what look like some simple monitoring requests for q=* which return hits=85898, repeated over and over. At Jun 10, 2015 11:16:30 AM we see some requests for /dataimport that look like they are coming from the UI, and then at Jun 10, 2015 11:17:01 AM we see a request for a full import started.
We have no idea what the data import configuration file looks like, so we have no idea if clean=false is being used or not. It's certainly not specified in the URL. We see some more monitoring URLs returning hits=85898 and some more /replication status calls, and then @ Jun 10, 2015 11:18:02 AM we see the first commit executed since the server started up. There's no indication that this commit came from an external request (e.g. /update), so it was probably made by some internal request. One possibility is that it came from DIH finishing -- but I doubt it; I'm fairly sure that would have involved more logging than this. A more probable scenario is that it came from an autoCommit setting -- the fact that it is almost exactly 60 seconds after DIH started, and almost exactly 60 seconds after DIH may have done a deleteAll query due to clean=true, makes it seem very likely that this was a 1-minute autoCommit. (But since we have neither the data import config nor the solrconfig.xml, we have no way of knowing -- it's all just guesswork.)

Very importantly, note that this commit is not opening a new searcher:

Jun 10, 2015 11:18:02 AM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}

Here are some other interesting things to note from the logging that comes from the DeletionPolicy when this commit happens:

1) it now notes that there are commits:num=2 on disk
2) the current index dir hasn't changed (index.20150311161021822), so some weird replication command didn't swap the world out from under us
3) the newest segment/generation is segFN=segments_1b,generation=47
4) the newest commit has no other files in it besides the segments file. This means, without a doubt, that there are no documents in this commit's view of the index; they have all been deleted by something.
At this point the *old* searcher (for commit generation 46) is still in use, however -- nothing has done an openSearcher=true. We see more /dataimport status requests, and other requests that appear to come from the Solr UI, and more monitoring queries that still return hits=85898 because the same searcher is in use. At Jun 10, 2015 11:27:04 AM we see another commit happen -- again, no indication that this came from an outside /update request, so it might be from DIH, or it might be from an autoCommit setting. The fact that it is nearly exactly 10 minutes after DIH started (and probably did a clean=true deleteAll query) makes it seem extremely likely that this is an autoSoftCommit setting kicking in.

Very importantly, note that this soft commit *does* open a new searcher:

Jun 10, 2015 11:27:04 AM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=true,prepareCommit=false}

In less than a second, this new searcher is warmed up, and the next time we see a q=* monitoring query get
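[Editor's note] The autoCommit / autoSoftCommit guesswork above corresponds to solrconfig.xml settings along these lines; the 60-second and 10-minute values are inferred from the log timestamps in the analysis, not confirmed from the user's config:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- hard commit every 60s; openSearcher=false matches the 11:18:02 log line -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- soft commit every 10 min; opens a new searcher, matching the 11:27:04 log line -->
  <autoSoftCommit>
    <maxTime>600000</maxTime>
  </autoSoftCommit>
</updateHandler>
```

With settings like these, a clean=true full import that stalls after its initial deleteAll leaves an empty index visible as soon as the first searcher-opening commit fires.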
Re: Adding applicative cache to SolrSearcher
Works great, thanks guys! Missed the leafReader because I looked at IndexSearcher instead of SolrIndexSearcher... -- View this message in context: http://lucene.472066.n3.nabble.com/Adding-applicative-cache-to-SolrSearcher-tp4211012p4211183.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Merging Sets of Data from Two Different Sources
On 11/06/2015 14:57, Paden wrote:

So you're saying that Tika can parse the text OUTSIDE of Solr. So I would still be able to process my PDFs with Tika, just outside of Solr specifically, correct?

-- View this message in context: http://lucene.472066.n3.nabble.com/Merging-Sets-of-Data-from-Two-Different-Sources-tp4211166p4211172.html Sent from the Solr - User mailing list archive at Nabble.com.

Yes.

Charlie

--
Charlie Hull
Flax - Open Source Enterprise Search
tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
web: www.flax.co.uk
Re: Merging Sets of Data from Two Different Sources
You were very VERY helpful. Thank you very much. If I could bug you for one last question: do you know where the documentation is that would help me write my own indexer? -- View this message in context: http://lucene.472066.n3.nabble.com/Merging-Sets-of-Data-from-Two-Different-Sources-tp4211166p4211180.html Sent from the Solr - User mailing list archive at Nabble.com.
DocValues memory consumption thoughts
I am using DocValues and I am wondering how to configure the Solr process's Java heap size:

Do DocValues use the system cache (off-heap memory) or heap memory? Should I take DocValues into consideration when I calculate heap parameters (xmx, xmn, xms...)?

-- View this message in context: http://lucene.472066.n3.nabble.com/DocValues-memory-consumption-thoughts-tp4211187.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Merging Sets of Data from Two Different Sources
I agree with all the ideas explained so far, but I would actually have suggested the DIH (Data Import Handler) as a first plan. It already allows indexing from different datasources out of the box. It supports JDBC datasources with extensive processors, and it also supports a filesystem datasource with the possibility of using the TikaEntityProcessor. So the user's requirement can be met directly with a single DIH configuration and a proper schema design. Of course, if the situation gets more complicated, it will be necessary to customise some DIH component or proceed to write a custom indexer.

Cheers

2015-06-11 16:20 GMT+01:00 Erick Erickson erickerick...@gmail.com:

Here's a skeleton that uses Tika from a SolrJ client. It mixes in a database too, but the parts are pretty separate. https://lucidworks.com/blog/indexing-with-solrj/ Best, Erick

On Thu, Jun 11, 2015 at 7:14 AM, Paden rumsey...@gmail.com wrote:

You were very VERY helpful. Thank you very much. If I could bug you for one last question: do you know where the documentation is that would help me write my own indexer? -- View this message in context: http://lucene.472066.n3.nabble.com/Merging-Sets-of-Data-from-Two-Different-Sources-tp4211166p4211180.html Sent from the Solr - User mailing list archive at Nabble.com.

--
Benedetti Alessandro
Visiting card: http://about.me/alessandro_benedetti

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience -1794 England
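[Editor's note] A minimal data-config.xml along the lines Alessandro describes, combining a JdbcDataSource with the TikaEntityProcessor; the driver, connection details, query, and field names are all assumptions for illustration, not from the thread:

```xml
<dataConfig>
  <dataSource name="db" type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/docs" user="solr" password="secret"/>
  <dataSource name="files" type="BinFileDataSource"/>
  <document>
    <!-- outer entity: one row per document's metadata -->
    <entity name="meta" dataSource="db" query="SELECT id, title, filepath FROM documents">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <!-- inner entity: pull the PDF body for each row via Tika, joined on the filepath column -->
      <entity name="pdf" dataSource="files" processor="TikaEntityProcessor"
              url="${meta.filepath}" format="text">
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

The nested-entity join on filepath is what merges the two sources into a single Solr document per row.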
Exact phrase search not working
This is my field definition:

<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="com.lucidworks.analysis.AutoPhrasingTokenFilterFactory" collection="default-collection" includeTokens="true" replaceWhitespaceWith="_"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.HunspellStemFilterFactory" dictionary="en_US.dic" affix="en_US.aff" ignoreCase="false" longestOnly="false"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="com.lucidworks.analysis.AutoPhrasingTokenFilterFactory" collection="default-collection" includeTokens="true" replaceWhitespaceWith="_"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ManagedSynonymFilterFactory" managed="english"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.HunspellStemFilterFactory" dictionary="en_US.dic" affix="en_US.aff" ignoreCase="false" longestOnly="false"/>
  </analyzer>
</fieldType>

Then I query for this exact phrase (which I can see in various documents) and get no results:

my_field:"baltimore police force"

This is the output of the debugQuery part of the result set:

rawquerystring: "baltimore police force"
querystring: "baltimore police force"
parsedquery: PhraseQuery(search_text:"baltimore ? police ? ? force")
parsedquery_toString: search_text:"baltimore ? police ? ? force"
QParser: LuceneQParser

Thanks, Mike
RE: The best way to exclude seen results from search queries
Thanks a lot Charles, this seems to be what I'm looking for. Do you know if a join over this number of documents per user will still give good query performance? Also, are there any limitations on the Solr architecture once the join method is used (i.e. sharding)?

Many thanks, Ami

-- View this message in context: http://lucene.472066.n3.nabble.com/The-best-way-to-exclude-seen-results-from-search-queries-tp4211022p4211223.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: The best way to exclude seen results from search queries
So long as the fields are indexed, I think performance should be OK. Personally, I would also look at using a single document per user with a multi-valued field for recommendation ID. Assuming only a small fraction of all recommendation IDs are ever presented to any single user, this schema would be physically much smaller and require only a single document per user.

I don't know the answer to your sharding question. The join query is available out of the box, so it should be quick work to set up a two-shard sample and test the distributed sub-query. That said, with the scales you are talking about, I question whether sharding is necessary. You can still use replication for load balancing without sharding.

-Original Message- From: amid [mailto:a...@donanza.com] Sent: Thursday, June 11, 2015 12:36 PM To: solr-user@lucene.apache.org Subject: RE: The best way to exclude seen results from search queries

Thanks a lot Charles, this seems to be what I'm looking for. Do you know if a join over this number of documents per user will still give good query performance? Also, are there any limitations on the Solr architecture once the join method is used (i.e. sharding)? Many thanks, Ami

-- View this message in context: http://lucene.472066.n3.nabble.com/The-best-way-to-exclude-seen-results-from-search-queries-tp4211022p4211223.html Sent from the Solr - User mailing list archive at Nabble.com.

* This e-mail may contain confidential or privileged information. If you are not the intended recipient, please notify the sender immediately and then delete it. TIAA-CREF *
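[Editor's note] For reference, the out-of-the-box join Charles refers to is the {!join} query parser. A sketch of an exclusion filter for already-seen documents, where the collection name and field names (seen_recommendations, doc_id, user_id) are assumptions for illustration:

```
q=*:*&fq=-{!join from=doc_id to=id fromIndex=seen_recommendations}user_id:12345
```

The negated join filter removes every document whose id appears as a doc_id in that user's seen records. Note that the fromIndex form joins against another core on the same node, which is part of why the sharding question above matters.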
RE: Show all fields in Solr highlighting output
Moving the highlighted snippets to the main response is a bad thing for some applications. E.g. if you do any sorting or searching on the returned fields, you need to use the original values. The same is true if any of the values are used as a key into some other system or table lookup. Specifically, the insertion of markup into the text changes values that affect sorting and matching. Thus the wisdom of the current design, which returns highlighting results separately.

Of course, it is very simple to merge the highlighting results into the returned documents. The highlighting results have been thoughtfully arranged as a lookup table using the unique ID field as the key. In SolrJ, this is a Map. Thus, you can loop over the result documents, look up the highlight results for each document, and overwrite the original value with the highlighted value. Be sure to set your snippet size bigger than the largest value you expect! Anyway, this type of thing is better handled by the application than by Solr, per se.

static int nDocs( QueryResponse response ) {
  int nReturned = 0;
  if ( null != response && null != response.getResults() ) {
    nReturned = response.getResults().size();
  }
  return nReturned;
}

static boolean hasHighlight( QueryResponse response ) {
  boolean hasHL = false;
  if ( null != response && null != response.getHighlighting() ) {
    hasHL = response.getHighlighting().size() > 0;
  }
  return hasHL;
}

protected void mergeHighlightResults( QueryResponse response, String uniqueIdField ) {
  if ( nDocs(response) > 0 && hasHighlight(response) ) {
    for ( SolrDocument result : response.getResults() ) {
      Map<String, List<String>> hlDoc = response.getHighlighting().get( result.getFirstValue(uniqueIdField) );
      if ( null != hlDoc && hlDoc.size() > 0 ) {
        for ( String fieldName : hlDoc.keySet() ) {
          List<String> hlValues = hlDoc.get( fieldName );
          // This is the only tricky bit: this logic may not work all that well for multi-valued fields.
          // You cannot reliably match the altered values to an original value. So, if any HL values
          // are returned, just replace all values with HL values. This will not work 100% of the time.
          int ix = 0;
          for ( String hlVal : hlValues ) {
            if ( 0 == ix++ ) {
              result.setField( fieldName, hlVal );
            } else {
              result.addField( fieldName, hlVal );
            }
          }
        }
      }
    }
  }
}

-Original Message- From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID] Sent: Thursday, June 11, 2015 6:43 AM To: solr-user@lucene.apache.org Subject: Re: Show all fields in Solr highlighting output

Hi Edwin, I think the highlighting behaviour of those types shifts over time. Maybe we should do the reverse and move the snippets into the main response: https://issues.apache.org/jira/browse/SOLR-3479 Ahmet

On Thursday, June 11, 2015 11:23 AM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote:

Hi Ahmet, I've tried that, but it's still not able to show. Those fields are actually of type=float, type=date and type=int. Are those field types not able to be highlighted by default? Regards, Edwin

On 11 June 2015 at 15:03, Ahmet Arslan iori...@yahoo.com.invalid wrote:

Hi Edwin, hl.alternateField is probably what you are looking for. ahmet

On Thursday, June 11, 2015 5:38 AM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote:

Hi, Is it possible to list all the fields in the highlighting portion of the output? Currently, even when I set <str name="hl.fl">*</str>, it only shows fields where highlighting is possible; fields where highlighting is not possible are not shown. I would like the output to show all the fields together, regardless of whether highlighting is possible or not. Regards, Edwin
How to index/search without whitespace but hightlight with whitespace?
Hey everyone! I'm trying to set up a Solr instance on some free-text clinical data. This data has a lot of whitespace formatting; for example, I might have a document that contains unstructured bulleted lists or section titles:

blah blah blah...

MEDICATIONS:
* Xanax
* Phenobritrol

DIAGNOSIS:

blah blah blah...

When indexing (and thus querying) this document, I use a text field with tokenization, stemming, etc.; let's call it text. Unfortunately, when I try to print highlighted results, the newlines and whitespace are obviously not preserved. In an attempt to get around this, I created a second field in the index that stores the full content of each document as a string, thus preserving the whitespace, called text_raw. If I set up the search page to search on the text field but highlight on the text_raw field, then the highlighted matches don't always line up. Is there a way to somehow project the stemmed matches from the text field onto the text_raw field when displaying highlighting?

Thank you for your time, Travis
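[Editor's note] The two-field setup Travis describes might look like this in the schema; field names follow the message, and the string type for the raw copy is an assumption:

```xml
<field name="text" type="text" indexed="true" stored="true"/>
<!-- unanalyzed copy so the stored value keeps its original whitespace -->
<field name="text_raw" type="string" indexed="false" stored="true"/>
<copyField source="text" dest="text_raw"/>
```

Note that highlighting operates on a field's *stored* value, which already preserves whitespace, so whether a separate raw copy is needed at all depends on how the display layer renders the snippets.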
RE: The best way to exclude seen results from search queries
Thanks Charles, We thought of using a multi-valued field, but got the feeling it will not stay small as our data grows. Another issue with a multi-valued field is that you can't create a complex join query, while using a different collection whose documents have more than one field (e.g. recommendation_date) can help us easily delete or limit how long a recommendation will not be shown again. Thanks for your answer; seems like replication load balancing will be good enough for now :) Thanks a lot, Ami -- View this message in context: http://lucene.472066.n3.nabble.com/The-best-way-to-exclude-seen-results-from-search-queries-tp4211022p4211239.html Sent from the Solr - User mailing list archive at Nabble.com.
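For reference, the simplest baseline for excluding seen results (and the thing that stops scaling as the seen list grows, which is what the thread is weighing against a separate collection) is a negative filter query built from the seen IDs. A minimal sketch, assuming a field named id:

```python
def exclusion_fq(seen_ids):
    """Build a Solr fq that filters out already-seen document IDs."""
    if not seen_ids:
        return "*:*"  # nothing seen yet: match everything
    # e.g. seen_ids ["12", "34"] -> "-id:(12 OR 34)"
    return "-id:(" + " OR ".join(seen_ids) + ")"
```

The resulting string would be passed as an fq parameter on each search request.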
Lucene/Solr Revolution 2015 Voting
Hey Folks, If you're interested in going to Lucene/Solr Revolution this year in Austin, please vote for the sessions you would like to see! https://lucenerevolution.uservoice.com/ -Yonik
Re: Merging Sets of Data from Two Different Sources
I do have a link between both sets of data, and that would be the filepath, which could be indexed by both. I do, however, have large PDFs that need to be indexed. So just for clarification: could I write an indexer that used both the DIH and SolrCell to submit a combined record to Solr, or would there be a different process if I used these methods instead? -- View this message in context: http://lucene.472066.n3.nabble.com/Merging-Sets-of-Data-from-Two-Different-Sources-tp4211166p4211169.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Merging Sets of Data from Two Different Sources
On 11/06/2015 14:38, Paden wrote: I do have a link between both sets of data and that would be the filepath that could be indexed by both. Great. I do, however, have large PDF's that do need to be indexed. So just for clarification, I could write an indexer that used both the DIH and SolrCell to submit a combined record to Solr or would there be a different process if I used these methods instead? No, I'm suggesting you write an indexer that doesn't use either DIH or SolrCell. Charlie -- View this message in context: http://lucene.472066.n3.nabble.com/Merging-Sets-of-Data-from-Two-Different-Sources-tp4211166p4211169.html Sent from the Solr - User mailing list archive at Nabble.com. -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: Separate network interfaces for inter-node communication and update/search requests?
On 6/11/2015 6:47 AM, MOIS Martin (MORPHO) wrote: is it possible to separate the network interface for inter-node communication from the network interface for update/search requests? If so I could put two network cards in each machine and route the index and search traffic over the first interface and the traffic for the inter-node communication (sending documents to replicas) over the second interface. Assuming you are using SolrCloud, you would do this by using the name or IP address of the internal communication interface on the host parameter in your solr.xml file (or -Dhost=foo on the startup commandline). This will cause each node to register itself with zookeeper using that interface. Note that what I've said above probably will not work with a cloud-aware client like CloudSolrClient/CloudSolrServer in SolrJ, because that client will obtain the server/port for each node from zookeeper and try to contact each one directly. The necessary routing probably will not be in place. If it's not SolrCloud, then the shards parameter that you are using for distributed search would need internal names/addresses. The other interface, for queries and updates, would be the one with the default gateway. Thanks, Shawn
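As a sketch of what Shawn describes (assuming SolrCloud; the address below is a hypothetical internal-interface IP), the node's registered address is set in solr.xml:

```xml
<solr>
  <solrcloud>
    <!-- Register this node in ZooKeeper under the internal interface,
         so inter-node traffic (e.g. documents sent to replicas) uses it. -->
    <str name="host">10.0.1.5</str>
    <int name="hostPort">8983</int>
  </solrcloud>
</solr>
```

The same value can be supplied at startup with -Dhost=10.0.1.5; external search/update clients would keep using the other interface's address, with the caveats about cloud-aware clients noted above.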
Re: Merging Sets of Data from Two Different Sources
On 11/06/2015 14:19, Paden wrote: I'm trying to figure out if Solr is a good fit for my project. I have two sets of data. On the one hand there is a bunch of files sitting in a local file system in a Linux file system. On the other is a set of metadata FOR the files that is located in a MySQL database. I need a program that can merge BOTH sets of data into one index. Meaning that the metadata in the database will attach/merge with the file data(the text) from the file system to create one searchable indexed item for each document in the file system. The metadata located in the database contains information that is vital to a faceted search of the documents located in the file system. Would Solr accomplish my goals? And if so, what tools can it provide to do so? If you can link the files and the metadata easily, then this shouldn't be hard (i.e. you have some common identifier). We would write an indexer in Python that extracted data from MySQL, crawled the filesystem and used Apache Tika to extract plain text from the files, then submitted a combined record to Solr for indexing. You'll need to decide on a schema for the combined record of course. There are alternatives (DataImportHandler for the database, SolrCell for submitting the files directly) but we prefer to keep the file handling in particular outside of Solr (as large PDFs for example can kill Tika and thus Solr itself). Cheers Charlie -- View this message in context: http://lucene.472066.n3.nabble.com/Merging-Sets-of-Data-from-Two-Different-Sources-tp4211166.html Sent from the Solr - User mailing list archive at Nabble.com. -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
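The indexer Charlie describes can be sketched as follows. This is a hedged stand-in, not a complete implementation: a dict plays the part of the MySQL metadata table and another dict stands in for the filesystem crawl plus Tika text extraction; a real version would use a MySQL driver, Apache Tika, and a Solr client. All names and values here are invented.

```python
# Hypothetical metadata rows as they might come back from MySQL,
# keyed by file path (the common identifier between both sources).
metadata_by_path = {
    "/docs/report.pdf": {"author": "J. Smith", "department": "Legal"},
    "/docs/notes.pdf": {"author": "A. Jones"},
}

# Stand-in for crawling the filesystem and running Tika on each file.
extracted_text = {
    "/docs/report.pdf": "Quarterly results were strong.",
    "/docs/notes.pdf": "Meeting notes from Tuesday.",
}

def build_documents(texts, metadata):
    """Merge each file's extracted text with its database metadata
    into one combined record per file, ready for indexing."""
    docs = []
    for path, content in sorted(texts.items()):
        doc = {"id": path, "content": content}
        doc.update(metadata.get(path, {}))  # attach DB fields when present
        docs.append(doc)
    return docs

docs = build_documents(extracted_text, metadata_by_path)
# A real indexer would now POST these documents to Solr's /update handler.
```

Each combined record is a single searchable document, so the database fields (for faceting) and the file text (for full-text search) live side by side.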
Separate network interfaces for inter-node communication and update/search requests?
Hello, is it possible to separate the network interface for inter-node communication from the network interface for update/search requests? If so I could put two network cards in each machine and route the index and search traffic over the first interface and the traffic for the inter-node communication (sending documents to replicas) over the second interface. Best Regards, Martin Mois # This e-mail and any attached documents may contain confidential or proprietary information. If you are not the intended recipient, you are notified that any dissemination, copying of this e-mail and any attachments thereto or use of their contents by any means whatsoever is strictly prohibited. If you have received this e-mail in error, please advise the sender immediately and delete this e-mail and all attached documents from your computer system. #
Merging Sets of Data from Two Different Sources
I'm trying to figure out if Solr is a good fit for my project. I have two sets of data. On the one hand there is a bunch of files sitting in a local file system in a Linux file system. On the other is a set of metadata FOR the files that is located in a MySQL database. I need a program that can merge BOTH sets of data into one index. Meaning that the metadata in the database will attach/merge with the file data(the text) from the file system to create one searchable indexed item for each document in the file system. The metadata located in the database contains information that is vital to a faceted search of the documents located in the file system. Would Solr accomplish my goals? And if so, what tools can it provide to do so? -- View this message in context: http://lucene.472066.n3.nabble.com/Merging-Sets-of-Data-from-Two-Different-Sources-tp4211166.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Merging Sets of Data from Two Different Sources
So you're saying that Tika can parse the text OUTSIDE of Solr, so I would still be able to process my PDFs with Tika, just outside of Solr specifically, correct? -- View this message in context: http://lucene.472066.n3.nabble.com/Merging-Sets-of-Data-from-Two-Different-Sources-tp4211166p4211172.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Problem with german hyphenated words not being found
Thank you for your input. Here's how the query looks with debugQuery=true: rawquerystring: name:industrie-anhänger, querystring: name:industrie-anhänger, parsedquery: MultiPhraseQuery(name:(industrie-anhang industri) (anhang industrieanhang)), parsedquery_toString: name:(industrie-anhang industri) (anhang industrieanhang), It looks like there are some rules applied, expressed by the parentheses. What's the correct interpretation of that? The default operator is OR, yet this looks like the terms inside the parentheses are grouped using AND. On 11.06.2015 12:40, Upayavira wrote: The next thing to do is add debugQuery=true to your URL (or enable it in the query pane of the admin UI). Then look for the parsed query info. On the standard text_en field, which includes an English stop word filter, I ran a query on Jack and Jill's House which showed this output: rawquerystring: text_en:(Jack and Jill's House), querystring: text_en:(Jack and Jill's House), parsedquery: text_en:jack text_en:jill text_en:hous, parsedquery_toString: text_en:jack text_en:jill text_en:hous, You can see that the parsed query is formed *after* analysis, so you can see exactly what is being queried for. Also, as a corollary to this, you can use the schema browser (or faceting for that matter) to view what terms are being indexed, to see if they should match. HTH Upayavira On 11.06.2015 12:00, Upayavira wrote: Have you used the analysis tab in the admin UI? You can type in sentences for both index and query time and see how they would be analysed by various fields/field types. Once you have got index time and query time to result in the same tokens at the end of the analysis chain, you should start seeing matches in your queries. 
Upayavira On Thu, Jun 11, 2015, at 10:26 AM, Thomas Michael Engelke wrote: Hey, in German, you can string most nouns together by using hyphens, like this: Industrie = industry Anhänger = trailer Industrie-Anhänger = trailer for industrial use Here [1], you can see me querying Industrieanhänger on the name field (name:Industrieanhänger), to make sure the index actually contains the word. Our data is structured so that products are listed without the hyphen. Now, customers can come around and use the hyphenated version as a search term (i.e. industrie-anhänger), and of course we want them to find what they are looking for. I've set it up so that the WordDelimiterFilterFactory uses catenateWords=1, so that these words are catenated. An analysis of Industrieanhänger as index and industrie-anhänger as query can be seen here [2]. You can see that both word parts are found. However, querying for industrie-anhänger does not yield results, only when the hyphen is removed, as you can see here [3]. I'm not sure how to proceed from here, as the results of the analysis have so far always lined up with what I could see when querying. 
Here's the schema definition for text, the field type for the name field:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="0" catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3" maxSubwordSize="30" onlyLongestMatch="false"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="0" catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3" maxSubwordSize="30" onlyLongestMatch="false"/> -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

I've also thought it might be a problem with URL encoding not encoding the hyphen, but replacing it with %2D didn't change the outcome (and was probably wrong anyway). Any help is greatly appreciated. 
Links: [1] http://imgur.com/2oEC5vz [2]
Re: Phrase Highlighter + Surround Query Parser
Picking up this thread again... When you said 'stock one', did you mean the built-in surround query parser or a customized one? We already use usePhrasehighlighter=true. On Mon, Aug 4, 2014 at 10:38 AM, Ahmet Arslan iori...@yahoo.com.invalid wrote: Hi, You are using a customized surround query parser, right? Did you check/try with the stock one? If I recall correctly, usePhrasehighlighter=true was working in the past for surround. Ahmet On Monday, August 4, 2014 8:25 AM, Salman Akram salman.ak...@northbaysolutions.net wrote: Anyone? On Fri, Aug 1, 2014 at 12:31 PM, Salman Akram salman.ak...@northbaysolutions.net wrote: We are having an issue in the phrase highlighter with the Surround Query Parser, e.g. *first thing w/100 you must* brings correct results but also highlights individual words of the phrase - first, thing are highlighted where they come separately as well. Any idea how this can be fixed? -- Regards, Salman Akram -- Regards, Salman Akram -- Regards, Salman Akram
Re: Separate network interfaces for inter-node communication and update/search requests?
Modern network interfaces are pretty capable. I doubt this optimization would yield any performance improvements. I would love to see some test results which prove me wrong. Is performance the primary reason for this, or do you have any other reasons? -Ani On Thu, Jun 11, 2015 at 9:04 AM, Shawn Heisey apa...@elyograg.org wrote: On 6/11/2015 6:47 AM, MOIS Martin (MORPHO) wrote: is it possible to separate the network interface for inter-node communication from the network interface for update/search requests? If so I could put two network cards in each machine and route the index and search traffic over the first interface and the traffic for the inter-node communication (sending documents to replicas) over the second interface. Assuming you are using SolrCloud, you would do this by using the name or IP address of the internal communication interface on the host parameter in your solr.xml file (or -Dhost=foo on the startup commandline). This will cause each node to register itself with zookeeper using that interface. Note that what I've said above probably will not work with a cloud-aware client like CloudSolrClient/CloudSolrServer in SolrJ, because that client will obtain the server/port for each node from zookeeper and try to contact each one directly. The necessary routing probably will not be in place. If it's not SolrCloud, then the shards parameter that you are using for distributed search would need internal names/addresses. The other interface, for queries and updates, would be the one with the default gateway. Thanks, Shawn -- Anirudha P. Jadhav
Re: Merging Sets of Data from Two Different Sources
The filepath is the key in both the filesystem and the database -- View this message in context: http://lucene.472066.n3.nabble.com/Merging-Sets-of-Data-from-Two-Different-Sources-tp4211166p4211253.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Merging Sets of Data from Two Different Sources
Both sources, the filesystem and the database, contain the file path for each individual file -- View this message in context: http://lucene.472066.n3.nabble.com/Merging-Sets-of-Data-from-Two-Different-Sources-tp4211166p4211251.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Merging Sets of Data from Two Different Sources
One question is which source defines the key - do you crawl the files and then look up the file name in the database, or scan the database and there is a field to specify the file name? IOW, given a database key, is there a fixed method to determine the file name path? And vice versa. -- Jack Krupansky On Thu, Jun 11, 2015 at 11:48 AM, Alessandro Benedetti benedetti.ale...@gmail.com wrote: I agree with all the ideas so far explained, but actually I would have suggested the DIH ( Data Import Handler) as a first plan. It does already allow out of the box indexing from different datasources. It supports Jdbc datasources with extensive processors and it does support also a file system datasource with the possibility of using the TikaEntityProcessor. So actually the requirement of the user can be reached directly with a single configuration of the DIH and a proper schema design. Of course if the situation gets more complicated there will be the necessity of customising some DIH component or proceeding writing a custom Indexer. Cheers 2015-06-11 16:20 GMT+01:00 Erick Erickson erickerick...@gmail.com: Here's a skeleton that uses Tika from a SolrJ client. It mixes in a database too, but the parts are pretty separate. https://lucidworks.com/blog/indexing-with-solrj/ Best, Erick On Thu, Jun 11, 2015 at 7:14 AM, Paden rumsey...@gmail.com wrote: You were very VERY helpful. Thank you very much. If I could bug you for one last question. Do you know where the documentation is that would help me write my own indexer? -- View this message in context: http://lucene.472066.n3.nabble.com/Merging-Sets-of-Data-from-Two-Different-Sources-tp4211166p4211180.html Sent from the Solr - User mailing list archive at Nabble.com. -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
Re: Merging Sets of Data from Two Different Sources
So you're saying I could merge both the metadata in the database and the files in the file system into one queryable item in Solr just by customizing the DIH correctly and getting the right schema? (I'm sorry if this sounds like a redundant question, but I've been trying to find an answer for the past couple of days and it seems like people sometimes misunderstand what I'm asking.) -- View this message in context: http://lucene.472066.n3.nabble.com/Merging-Sets-of-Data-from-Two-Different-Sources-tp4211166p4211248.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Merging Sets of Data from Two Different Sources
Yes. Typically, the content file is used to populate a single field in each document, e.g. content. Typically, this field is the primary target for searches. Sometimes, additional metadata (title, author, etc.) can be extracted from the source files. But the idea remains the same: the two sources (database record + file) are merged into a single searchable document in Solr. If you write your own indexer using SolrJ, you have more control over the loading process and, imo, the approach is clearer. All the pieces come together in one place. But Alessandro says the same result is achievable using DataImportHandler. Probably worth a try before writing code... -----Original Message----- From: Paden [mailto:rumsey...@gmail.com] Sent: Thursday, June 11, 2015 4:14 PM To: solr-user@lucene.apache.org Subject: Re: Merging Sets of Data from Two Different Sources So you're saying I could merge both the metadata in the database and their files in the file system into one query-able item in solr by just customizing the DIH correctly and getting the right schema? (I'm sorry this sounds like a redundant question but I've been trying to find an answer for the past couple of days and it seems like people sometimes misunderstand what I'm asking) -- View this message in context: http://lucene.472066.n3.nabble.com/Merging-Sets-of-Data-from-Two-Different-Sources-tp4211166p4211248.html Sent from the Solr - User mailing list archive at Nabble.com.
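For reference, the DIH configuration Alessandro and Charles are discussing would look roughly like the following. This is a hedged sketch: the table, column, and connection details are invented, but the structure follows DIH's pattern of a JdbcDataSource entity with a nested TikaEntityProcessor entity reading the file named by each database row:

```xml
<dataConfig>
  <!-- MySQL connection for the metadata (all values hypothetical) -->
  <dataSource name="db" type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/meta"
              user="user" password="pass"/>
  <!-- Binary file source that TikaEntityProcessor reads from -->
  <dataSource name="bin" type="BinFileDataSource"/>
  <document>
    <entity name="record" dataSource="db"
            query="SELECT filepath, author, department FROM docs">
      <!-- Nested entity: extract text from the file named by the DB row -->
      <entity name="file" dataSource="bin" processor="TikaEntityProcessor"
              url="${record.filepath}" format="text">
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

Each outer-entity row plus its extracted file text becomes one combined Solr document, which is exactly the merge being asked about.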
Re: Index optimize runs in background.
Why would you care when the forced merge (not an “optimize”) is done? Start it and get back to work. Or even better, never force merge and let the algorithm take care of it. Seriously, I’ve been giving this advice since before Lucene was written, because Ultraseek had the same approach for managing index segments. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On Jun 10, 2015, at 10:35 PM, Erick Erickson erickerick...@gmail.com wrote: If I knew, I would fix it ;). The sub-optimizes (i.e. the ones sent out to each replica) should be sent in parallel and then each thread should wait for completion from the replicas. There is no real check for optimize, I believe that the return from the call is considered sufficient. If we can track down if there are conditions under which this is not true we can fix it. But until there's a way to reproduce it, it's pretty much speculation. Best, Erick On Wed, Jun 10, 2015 at 10:14 PM, Modassar Ather modather1...@gmail.com wrote: Hi, There are 5 cores and a separate server for indexing on this solrcloud. Can you please share your suggestions on: How can indexer know that the optimize has completed even if the commit/optimize runs in background without going to the solr servers may be by using any solrj or other API? I tried but could not find any API/handler to check if the optimizations is completed. Kindly share your inputs. Thanks, Modassar On Thu, Jun 4, 2015 at 9:36 PM, Erick Erickson erickerick...@gmail.com wrote: Can't get any failures to happen on my end so I really haven't a clue. Best, Erick On Thu, Jun 4, 2015 at 3:17 AM, Modassar Ather modather1...@gmail.com wrote: Hi, Please provide your inputs on optimize and commit running as background. Your suggestion will be really helpful. Thanks, Modassar On Tue, Jun 2, 2015 at 6:05 PM, Modassar Ather modather1...@gmail.com wrote: Erick! I could not find any underlying setting of 10 minutes. 
It is not only optimize; commit is also behaving in the same fashion and is taking less time than it usually did. As per my observation, both are running in the background. On Fri, May 29, 2015 at 7:21 PM, Erick Erickson erickerick...@gmail.com wrote: I'm not talking about you setting a timeout, but the underlying connection timing out... The "10 minutes then the indexer exits" comment points in that direction. Best, Erick On Thu, May 28, 2015 at 11:43 PM, Modassar Ather modather1...@gmail.com wrote: I have not added any timeout in the indexer except the ZK client timeout, which is 30 seconds. I am simply calling client.close() at the end of indexing. The same code was not running in the background for optimize with solr-4.10.3 and org.apache.solr.client.solrj.impl.CloudSolrServer. On Fri, May 29, 2015 at 11:13 AM, Erick Erickson erickerick...@gmail.com wrote: Are you timing out on the client request? The theory here is that it's still a synchronous call, but you're just timing out at the client level. At that point, the optimize is still running; it's just that the connection has been dropped. Shot in the dark. Erick On Thu, May 28, 2015 at 10:31 PM, Modassar Ather modather1...@gmail.com wrote: I did not notice it before, but a commit which used to take around 2 minutes is now taking around 8 seconds. I think this is also running in the background. On Fri, May 29, 2015 at 10:52 AM, Modassar Ather modather1...@gmail.com wrote: The indexer takes almost 2 hours to optimize. It has a multi-threaded add of batches of documents to org.apache.solr.client.solrj.impl.CloudSolrClient. Once all the documents are indexed, it invokes commit and optimize. I have seen that the optimize goes into the background after 10 minutes and the indexer exits. I am not sure why it hangs on the indexer for these 10 minutes. This behavior I have seen in multiple iterations of indexing the same data. There is nothing significant in the logs which I can share. I can see the following in the log. 
org.apache.solr.update.DirectUpdateHandler2; start commit{,optimize=true,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false} On Wed, May 27, 2015 at 10:59 PM, Erick Erickson erickerick...@gmail.com wrote: All strange of course. What do your Solr logs show when this happens? And how reproducible is this? Best, Erick On Wed, May 27, 2015 at 4:00 AM, Upayavira u...@odoko.co.uk wrote: In this case, optimising makes sense, once the index is generated, you are not updating It. Upayavira On Wed, May 27, 2015, at 06:14 AM, Modassar Ather wrote: Our index has almost 100M documents running on SolrCloud of 5 shards and each shard has an index size of about 170+GB (for the record, we are not using stored fields - our documents are pretty large). We perform a full indexing every weekend and during the week
Re: Indexing issue - index gets deleted
Thanks for replying. Please find the data-config. On Thu, Jun 11, 2015 at 6:06 AM, Chris Hostetter hossman_luc...@fucit.org wrote: : The guy was using delta import anyway, so maybe the problem is : different and not related to the clean. that's not what the logs say. Here's what I see... Log begins with server startup @ Jun 10, 2015 11:14:56 AM The DeletionPolicy for the shopclue_prod core is initialized at Jun 10, 2015 11:15:04 AM and we see a few interesting things here we note for the future as we keep reading... 1) There are currently commits:num=1 commits on disk 2) the current index dir in use is index.20150311161021822 3) the current segment generation is segFN=segments_1a,generation=46 Immediately after this, we see some searcher warming using a searcher with this same segments file, and then this searcher is registered (Jun 10, 2015 11:15:05 AM) and the core is registered. Next we see some replication polling, and we see what look like some simple monitoring requests for q=* which return hits=85898 being repeated over and over. At Jun 10, 2015 11:16:30 AM we see some requests for /dataimport that look like they are coming from the UI, and then at Jun 10, 2015 11:17:01 AM we see a request for a full import started. We have no idea what the data import configuration file looks like, so we have no idea if clean=false is being used or not. It's certainly not specified in the URL. We see some more monitoring URLs returning hits=85898 and some more /replication status calls, and then @ Jun 10, 2015 11:18:02 AM we see the first commit executed since the server started up. There's no indication that this commit came from an external request (e.g. /update), so it probably was made by some internal request. One possibility is that it came from DIH finishing -- but I doubt it; I'm fairly sure that would have involved more logging than this. 
A more probable scenario is that it came from an autoCommit setting -- the fact that it is almost exactly 60 seconds after DIH started -- and almost exactly 60 seconds after DIH may have done a deleteAll query due to clean=true -- makes it seem very likely that this was a 1-minute autoCommit. (But since we don't have either the data import config or the solrconfig.xml, we have no way of knowing -- it's all just guesswork.) Very importantly, note that this commit is not opening a new searcher... Jun 10, 2015 11:18:02 AM org.apache.solr.update.DirectUpdateHandler2 commit INFO: start commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false} Here are some other interesting things to note from the logging that comes from the DeletionPolicy when this commit happens... 1) it now notes that there are commits:num=2 on disk 2) the current index dir hasn't changed (index.20150311161021822), so some weird replication command didn't swap the world out from under us 3) the newest segment/generation is segFN=segments_1b,generation=47 4) the newest commit has no other files in it besides the segments file. This means, without a doubt, there are no documents in this commit's view of the index. They have all been deleted by something. At this point the *old* searcher (for commit generation 46) is still in use, however -- nothing has done an openSearcher=true. We see more /dataimport status requests, and other requests that appear to come from the Solr UI, and more monitoring queries that still return hits=85898 because the same searcher is in use. At Jun 10, 2015 11:27:04 AM we see another commit happen -- again, no indication that this came from an outside /update request, so it might be from DIH, or it might be from an autoCommit setting. 
the fact that it is nearly exactly 10 minutes after DIH started (and probably did a clean=true deleteAll query) makes it seem extremely likely this is an autoSoftCommit setting kicking in. Very importantly, note that this softCommit *does* open a new searcher... Jun 10, 2015 11:27:04 AM org.apache.solr.update.DirectUpdateHandler2 commit INFO: start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=true,prepareCommit=false} In less than a second, this new searcher is warmed up, and the next time we see a q=* monitoring query get logged, it returns hits=0. Note that at no point in the logs, after the DataImporter is started, do we see it log anything other than that it has initiated the request to MySQL -- we do see some logs starting ~ Jun 10, 2015 11:41:19 AM indicating that someone was using the Web UI to look at the dataimport handler's status report. It would be really nice to know what that person saw at that point -- because my guess is DIH was still running and was stalled waiting for MySQL, and hadn't even started adding docs to Solr (if it had, I'm certain there would have been some log of it). So instead, the combination of a
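The timing Chris infers would correspond to solrconfig.xml settings along these lines (a hypothetical sketch; the values are chosen only to match the 60-second and 10-minute intervals seen in the log):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit every 60s; does not open a new searcher,
       matching the openSearcher=false commit at 11:18:02 -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Soft commit every 10 minutes; opens a new searcher,
       which is the moment hits dropped from 85898 to 0 -->
  <autoSoftCommit>
    <maxTime>600000</maxTime>
  </autoSoftCommit>
</updateHandler>
```

This illustrates why a clean=true import can make an index "disappear" long before the import finishes: the deleteAll is made durable by the hard commit, then made visible by the soft commit, while DIH is still waiting on the database.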
RE: How to assign shard to specific node?
Thank you for your quick answer. The two parameters createNodeSet and createNodeSet.shuffle seem to solve the problem: http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=3&router.name=implicit&shards=shard1,shard2,shard3&router.field=shard&createNodeSet=node1,node2,node3&createNodeSet.shuffle=false Best Regards, Martin Mois -----Original Message----- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Wednesday, June 10, 2015 17:45 To: solr-user@lucene.apache.org Subject: Re: How to assign shard to specific node? Take a look at the collections API CREATE command in more detail here: https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api1 Admittedly this is 5.2, but you didn't mention what version of Solr you're using. In particular the createNodeSet and createNodeSet.shuffle parameters. Best, Erick On Wed, Jun 10, 2015 at 8:31 AM, MOIS Martin (MORPHO) martin.m...@morpho.com wrote: Hello, I have a cluster with 3 nodes (node1, node2 and node3). Now I want to create a new collection with 3 shards using `implicit` routing: http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=3&router.name=implicit&shards=shard1,shard2,shard3&router.field=shard How can I control on which node each shard gets created? The goal is to create shard1 on node1, shard2 on node2, etc. The background is that the actual raw data the index is created for should reside on the same host. That means I have a raw record composed of different data (documents, images, meta-data, etc.) for which I compute a Lucene document that gets indexed. In order to reduce network traffic I want to process the raw record on node1 and insert the resulting Lucene document into shard1 that resides on node1. If shard1 resided on node2, the Lucene document would have to be sent from node1 to node2, which causes a lot of inter-node communication for big record sets. Thanks in advance. 
Best Regards, Martin Mois # This e-mail and any attached documents may contain confidential or proprietary information. If you are not the intended recipient, you are notified that any dissemination, copying of this e-mail and any attachments thereto or use of their contents by any means whatsoever is strictly prohibited. If you have received this e-mail in error, please advise the sender immediately and delete this e-mail and all attached documents from your computer system. #
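When assembling the CREATE call above by hand it is easy to drop a parameter or mangle the query string. A minimal sketch of building the Collections API URL with plain Java; the class name is made up for illustration, and the host and node names are taken from the thread's example, not verified against a real cluster:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class CreateCollectionUrl {

    // Joins the parameter map into a Collections API CREATE request,
    // keeping insertion order so the resulting URL is reproducible.
    static String buildCreateUrl(String solrBase, Map<String, String> params) {
        String query = params.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("&"));
        return solrBase + "/solr/admin/collections?action=CREATE&" + query;
    }

    public static void main(String[] args) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("name", "mycollection");
        p.put("numShards", "3");
        p.put("router.name", "implicit");
        p.put("shards", "shard1,shard2,shard3");
        p.put("router.field", "shard");
        p.put("createNodeSet", "node1,node2,node3");   // restricts placement to these nodes
        p.put("createNodeSet.shuffle", "false");       // keeps shard1 -> node1 ordering
        System.out.println(buildCreateUrl("http://localhost:8983", p));
    }
}
```

Note that in a real deployment the createNodeSet entries are node names as registered in ZooKeeper (typically host:port_solr), not bare hostnames.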
Re: Show all fields in Solr highlighting output
Hi Edwin, hl.alternateField is probably what you are looking for. ahmet On Thursday, June 11, 2015 5:38 AM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: Hi, Is it possible to list all the fields in the highlighting portion of the output? Currently, even when I set <str name="hl.fl">*</str>, it only shows fields where highlighting is possible; fields for which highlighting is not possible are not shown. I would like the output to show all the fields together, regardless of whether highlighting is possible or not. Regards, Edwin
Re: Index optimize runs in background.
Until somewhere around Lucene 3.5, you needed to optimise, because the merge strategy used wasn't that clever and left lots of deletes in your largest segment. Around that point, the TieredMergePolicy became the default. Because its algorithm is much more sophisticated, it took away the need to optimize in the majority of scenarios. In fact, it transformed optimizing from being a necessary thing to being a bad thing in most cases. So yes, let the algorithm take care of it, so long as you are using the TieredMergePolicy, which has been the default for over 2 years. Upayavira On Thu, Jun 11, 2015, at 07:01 AM, Walter Underwood wrote: Why would you care when the forced merge (not an “optimize”) is done? Start it and get back to work. Or even better, never force merge and let the algorithm take care of it. Seriously, I’ve been giving this advice since before Lucene was written, because Ultraseek had the same approach for managing index segments. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On Jun 10, 2015, at 10:35 PM, Erick Erickson erickerick...@gmail.com wrote: If I knew, I would fix it ;). The sub-optimizes (i.e. the ones sent out to each replica) should be sent in parallel and then each thread should wait for completion from the replicas. There is no real check for optimize, I believe that the return from the call is considered sufficient. If we can track down if there are conditions under which this is not true we can fix it. But until there's a way to reproduce it, it's pretty much speculation. Best, Erick On Wed, Jun 10, 2015 at 10:14 PM, Modassar Ather modather1...@gmail.com wrote: Hi, There are 5 cores and a separate server for indexing on this solrcloud. Can you please share your suggestions on: How can indexer know that the optimize has completed even if the commit/optimize runs in background without going to the solr servers may be by using any solrj or other API? 
I tried but could not find any API/handler to check if the optimization is completed. Kindly share your inputs. Thanks, Modassar On Thu, Jun 4, 2015 at 9:36 PM, Erick Erickson erickerick...@gmail.com wrote: Can't get any failures to happen on my end so I really haven't a clue. Best, Erick On Thu, Jun 4, 2015 at 3:17 AM, Modassar Ather modather1...@gmail.com wrote: Hi, Please provide your inputs on optimize and commit running in the background. Your suggestion will be really helpful. Thanks, Modassar On Tue, Jun 2, 2015 at 6:05 PM, Modassar Ather modather1...@gmail.com wrote: Erick! I could not find any underlying setting of 10 minutes. It is not only optimize; commit is also behaving in the same fashion and is taking less time than it usually had. As per my observation both are running in the background. On Fri, May 29, 2015 at 7:21 PM, Erick Erickson erickerick...@gmail.com wrote: I'm not talking about you setting a timeout, but the underlying connection timing out... The "10 minutes then the indexer exits" comment points in that direction. Best, Erick On Thu, May 28, 2015 at 11:43 PM, Modassar Ather modather1...@gmail.com wrote: I have not added any timeout in the indexer except the zk client timeout, which is 30 seconds. I am simply calling client.close() at the end of indexing. The same code did not run optimize in the background with solr-4.10.3 and org.apache.solr.client.solrj.impl.CloudSolrServer. On Fri, May 29, 2015 at 11:13 AM, Erick Erickson erickerick...@gmail.com wrote: Are you timing out on the client request? The theory here is that it's still a synchronous call, but you're just timing out at the client level. At that point, the optimize is still running; it's just that the connection has been dropped. Shot in the dark. Erick On Thu, May 28, 2015 at 10:31 PM, Modassar Ather modather1...@gmail.com wrote: I could not notice it, but from my past experience a commit which used to take around 2 minutes is now taking around 8 seconds.
I think this is also running in the background. On Fri, May 29, 2015 at 10:52 AM, Modassar Ather modather1...@gmail.com wrote: The indexer takes almost 2 hours to optimize. It does a multi-threaded add of batches of documents to org.apache.solr.client.solrj.impl.CloudSolrClient. Once all the documents are indexed it invokes commit and optimize. I have seen that the optimize goes into the background after 10 minutes and the indexer exits. I am not sure why it hangs on the indexer for these 10 minutes. I have seen this behavior in multiple iterations of indexing the same data. There is nothing significant I found in the log which I can share. I can see the following in the log: org.apache.solr.update.DirectUpdateHandler2; start
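One way to test Erick's client-timeout theory is to issue the optimize as a plain HTTP request with an explicit, generous read timeout, so the call stays synchronous for the whole merge. A hedged sketch using only the JDK; the class name, host, collection, and timeout are assumptions, while optimize and waitSearcher are standard update-handler parameters:

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class OptimizeAndWait {

    // Update-handler URL that triggers a forced merge; waitSearcher=true keeps
    // the request open until the new searcher is available.
    static String optimizeUrl(String solrBase, String collection) {
        return solrBase + "/solr/" + collection + "/update?optimize=true&waitSearcher=true&wt=json";
    }

    // Issues the request with a read timeout long enough for the whole merge,
    // so the HTTP layer does not silently drop a still-running optimize.
    static int runOptimize(String solrBase, String collection, int readTimeoutMs) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(optimizeUrl(solrBase, collection)).openConnection();
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(readTimeoutMs); // e.g. several hours for a 2-hour optimize
        return conn.getResponseCode();      // blocks until Solr answers
    }

    public static void main(String[] args) {
        System.out.println(optimizeUrl("http://localhost:8983", "collection1"));
    }
}
```

If this call returns only after the full 2 hours, the earlier 10-minute exit was almost certainly a connection timeout in the SolrJ client stack rather than Solr finishing early.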
Increase the suggester len size
Hi, I'm facing some issues with my suggester for the content field. As my content is indexed from rich-text documents and is quite large, I got the following error when I tried to build the suggester using /suggesthandler?suggest.build=true: <lst name="error"><str name="msg">len must be <= 32767; got 35578</str></lst> Is there any way to increase the len limit beyond 32767? I might have documents that are even bigger in the future. Regards, Edwin
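The 32767 figure is Lucene's hard cap on the byte length of a single indexed term, so it cannot simply be raised in configuration. One common workaround (an assumption on my part, not from this thread) is to feed the suggester a truncated copy of the field. A minimal sketch of UTF-8-safe truncation; the class and constant names are made up for illustration:

```java
import java.nio.charset.StandardCharsets;

public class SuggestInputTruncate {

    static final int MAX_TERM_BYTES = 32766; // stay under Lucene's per-term byte limit

    // Trims text so its UTF-8 encoding fits within maxBytes, backing off so a
    // multi-byte character is never cut in half.
    static String truncateUtf8(String text, int maxBytes) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxBytes) {
            return text;
        }
        int end = maxBytes;
        while (end > 0 && (bytes[end] & 0xC0) == 0x80) {
            end--; // 0b10xxxxxx bytes are UTF-8 continuation bytes; back off past them
        }
        return new String(bytes, 0, end, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(truncateUtf8("abcdef", 3)); // prints "abc"
    }
}
```

The truncated copy would be written to a separate field at indexing time (for example in the client, or in an update processor), and the suggester pointed at that field instead of the full content.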
Re: Show all fields in Solr highlighting output
Thank you for the info, will try to implement it. Regards, Edwin On 12 June 2015 at 01:32, Reitzel, Charles charles.reit...@tiaa-cref.org wrote: Moving the highlighted snippets into the main response is a bad thing for some applications. E.g. if you do any sorting or searching on the returned fields, you need to use the original values. The same is true if any of the values are used as a key into some other system or table lookup. Specifically, the insertion of markup into the text changes values that affect sorting and matching. Thus the wisdom of the current design, which returns highlighting results separately. Of course, it is very simple to merge the highlighting results into the returned documents. The highlighting results have been thoughtfully arranged as a lookup table using the unique ID field as the key. In SolrJ, this is a Map<String, Map<String, List<String>>>. Thus, you can loop over the result documents, look up the highlight results for each document and overwrite the original value with the highlighted value. Be sure to set your snippet size bigger than the largest value you expect! Anyway, this type of thing is better handled by the application than by Solr, per se.
static int nDocs( QueryResponse response ) {
    int nReturned = 0;
    if ( null != response && null != response.getResults() ) {
        nReturned = response.getResults().size();
    }
    return nReturned;
}

static boolean hasHighlight( QueryResponse response ) {
    boolean hasHL = false;
    if ( null != response && null != response.getHighlighting() ) {
        hasHL = response.getHighlighting().size() > 0;
    }
    return hasHL;
}

protected void mergeHighlightResults( QueryResponse response, String uniqueIdField ) {
    if ( nDocs(response) > 0 && hasHighlight(response) ) {
        for ( SolrDocument result : response.getResults() ) {
            Map<String, List<String>> hlDoc = response.getHighlighting().get( result.getFirstValue(uniqueIdField) );
            if ( null != hlDoc && hlDoc.size() > 0 ) {
                for ( String fieldName : hlDoc.keySet() ) {
                    List<String> hlValues = hlDoc.get( fieldName );
                    // This is the only tricky bit: this logic may not work all that well for multi-valued fields.
                    // You cannot reliably match the altered values to an original value. So, if any HL values
                    // are returned, just replace all values with HL values.
                    // This will not work 100% of the time.
                    int ix = 0;
                    for ( String hlVal : hlValues ) {
                        if ( 0 == ix++ ) {
                            result.setField( fieldName, hlVal );
                        } else {
                            result.addField( fieldName, hlVal );
                        }
                    }
                }
            }
        }
    }
}

-----Original Message----- From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID] Sent: Thursday, June 11, 2015 6:43 AM To: solr-user@lucene.apache.org Subject: Re: Show all fields in Solr highlighting output Hi Edwin, I think the highlighting behaviour of those types shifts over time. Maybe we should do the reverse and move snippets into the main response: https://issues.apache.org/jira/browse/SOLR-3479 Ahmet On Thursday, June 11, 2015 11:23 AM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: Hi Ahmet, I've tried that, but it's still not able to show. Those fields are actually of type=float, type=date and type=int. Are those field types not highlightable by default?
Regards, Edwin On 11 June 2015 at 15:03, Ahmet Arslan iori...@yahoo.com.invalid wrote: Hi Edwin, hl.alternateField is probably what you are looking for. ahmet On Thursday, June 11, 2015 5:38 AM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: Hi, Is it possible to list all the fields in the highlighting portion of the output? Currently, even when I set <str name="hl.fl">*</str>, it only shows fields where highlighting is possible; fields for which highlighting is not possible are not shown. I would like the output to show all the fields together, regardless of whether highlighting is possible or not. Regards, Edwin
Re: Show all fields in Solr highlighting output
Hi Ahmet, I've tried that, but it's still not able to show. Those fields are actually of type=float, type=date and type=int. Are those field types not highlightable by default? Regards, Edwin On 11 June 2015 at 15:03, Ahmet Arslan iori...@yahoo.com.invalid wrote: Hi Edwin, hl.alternateField is probably what you are looking for. ahmet On Thursday, June 11, 2015 5:38 AM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: Hi, Is it possible to list all the fields in the highlighting portion of the output? Currently, even when I set <str name="hl.fl">*</str>, it only shows fields where highlighting is possible; fields for which highlighting is not possible are not shown. I would like the output to show all the fields together, regardless of whether highlighting is possible or not. Regards, Edwin
Re: DocValues memory consumption thoughts
DocValues actually is an un-inverted structure that is built as part of the segment. This means that it has the same behaviour as the other segment files. Assuming you are not using a compound segment file but a classic multi-file segment in an NRTCachingDirectory, the segment is built in memory, and when it reaches the ramBufferSizeMB limit or a hard commit it is flushed to disk. This means that, in my opinion, there is no particular memory degradation to observe when using DocValues. I would actually say that using DocValues instead of the old FieldCache decreases memory consumption, as FieldCaches live completely in memory (with the expensive un-inverting process). From the Solr wiki: In Lucene 4.0, a new approach was introduced. DocValue fields are now column-oriented fields with a document-to-value mapping built at index time. This approach promises to relieve some of the memory requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster. I would size the memory according to the other features you will use! Let me know if I satisfied your curiosity! Cheers 2015-06-11 15:38 GMT+01:00 adfel70 adfe...@gmail.com: I am using DocValues and I am wondering how to configure the Solr process's Java heap size: does DocValues use the system cache (off-heap memory) or heap memory? Should I take DocValues into consideration when I calculate heap parameters (xmx, xmn, xms...)? -- View this message in context: http://lucene.472066.n3.nabble.com/DocValues-memory-consumption-thoughts-tp4211187.html Sent from the Solr - User mailing list archive at Nabble.com. -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
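As a concrete illustration of the column-oriented approach described above, docValues is enabled per field in the schema. A hypothetical schema.xml fragment; the field names and types are examples, not from the thread:

```xml
<!-- Enabling docValues builds a column-oriented, on-disk structure at index
     time (read via the OS page cache), instead of un-inverting the field into
     the heap-resident FieldCache at search time. -->
<field name="price" type="float" indexed="true" stored="true" docValues="true"/>
<field name="manu_exact" type="string" indexed="true" stored="false" docValues="true"/>
```

This is also the practical answer to the heap-sizing question: docValues data is paged in by the operating system, so it benefits from free RAM outside the JVM heap rather than requiring a larger -Xmx.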
Re: Merging Sets of Data from Two Different Sources
Here's a skeleton that uses Tika from a SolrJ client. It mixes in a database too, but the parts are pretty separate. https://lucidworks.com/blog/indexing-with-solrj/ Best, Erick On Thu, Jun 11, 2015 at 7:14 AM, Paden rumsey...@gmail.com wrote: You were very VERY helpful. Thank you very much. If I could bug you for one last question. Do you know where the documentation is that would help me write my own indexer? -- View this message in context: http://lucene.472066.n3.nabble.com/Merging-Sets-of-Data-from-Two-Different-Sources-tp4211166p4211180.html Sent from the Solr - User mailing list archive at Nabble.com.