Re: documentCache not used in 4.3.1?
We see similar results: we softCommit every 1s (trying to get as NRT as we can), and we very rarely get any hits in our caches. As an unscheduled test last week, we shut down indexing and noticed about an 80% hit rate in the caches (and average query time dropped from ~1s to 100ms!), so I think we are in the same position as you. I appreciate that with such a frequent soft commit the caches get invalidated, but I was expecting cache warming to help, though it doesn't appear to. We *don't* currently run a warming query; my impression of NRT was that it was better not to do that, as otherwise you spend more time warming the searcher and caches, and by the time you've done all that, the searcher is invalidated anyway!

On 30 June 2013 01:58, Tim Vaillancourt t...@elementspace.com wrote: That's a good idea, I'll try that next week. Thanks! Tim

On 29/06/13 12:39 PM, Erick Erickson wrote: Tim: Yeah, this doesn't make much sense to me either since, as you say, you should be seeing some metrics upon occasion. But do note that the underlying cache only gets filled when getting documents to return in query results; since there's no autowarming going on, it may come and go. But you can test this pretty quickly by lengthening your autocommit interval or just not indexing anything for a while, then run a bunch of queries and look at your cache stats. That'll at least tell you whether it works at all. You'll have to have hard commits turned off (or openSearcher set to 'false') for that check too. Best Erick

On Sat, Jun 29, 2013 at 2:48 PM, Vaillancourt, Tim tvaillanco...@ea.com wrote: Yes, we are softCommit'ing every 1000ms, but that should be enough time to see metrics, right? For example, I still get non-cumulative metrics from the other caches (which are also thrown away). I've also curl/sampled enough that I probably should have seen a value by now. If anyone else can reproduce this on 4.3.1 I will feel less crazy :). Cheers, Tim

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: Saturday, June 29, 2013 10:13 AM
To: solr-user@lucene.apache.org
Subject: Re: documentCache not used in 4.3.1?

It's especially weird that the hit ratio is so high and you're not seeing anything in the cache. Are you perhaps soft committing frequently? Soft commits throw away all the top-level caches including documentCache I think... Erick

On Fri, Jun 28, 2013 at 7:23 PM, Tim Vaillancourt tim@elementspace.com wrote: Thanks Otis, Yeah I realized after sending my e-mail that doc cache does not warm, however I'm still lost on why there are no other metrics. Thanks! Tim

On 28 June 2013 16:22, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi Tim, Not sure about the zeros in 4.3.1, but in SPM we see all these numbers are non-0, though I haven't had the chance to confirm with Solr 4.3.1. Note that you can't really autowarm the document cache... Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm

On Fri, Jun 28, 2013 at 7:14 PM, Tim Vaillancourt t...@elementspace.com wrote: Hey guys, This has to be a stupid question/I must be doing something wrong, but after frequent load testing with documentCache enabled under Solr 4.3.1 with autoWarmCount=150, I'm noticing that my documentCache metrics are always zero for non-cumulative. At first I thought my commit rate is fast enough that I just never see the non-cumulative result, but after 100s of samples I still always get zero values.
Here is the current output of my documentCache from Solr's admin for 1 core:

documentCache (http://localhost:8983/solr/#/channels_shard1_replica2/plugins/cache?entry=documentCache)
- class: org.apache.solr.search.LRUCache
- version: 1.0
- description: LRU Cache(maxSize=512, initialSize=512, autowarmCount=150, regenerator=null)
- src: $URL: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_3/solr/core/src/java/org/apache/solr/search/LRUCache.java $
- stats:
  - lookups: 0
  - hits: 0
  - hitratio: 0.00
  - inserts: 0
  - evictions: 0
  - size: 0
  - warmupTime: 0
  - cumulative_lookups: 65198986
  - cumulative_hits: 63075669
  - cumulative_hitratio: 0.96
  - cumulative_inserts: 2123317
  - cumulative_evictions: 1010262

The
Re: dataconfig to index ZIP Files
Try setting dataSource=null for your top-level entity and use fileName=\.zip$ as the filename selector.

On 28.06.2013 23:14, ericrs22 wrote: Unfortunately not. I had tried that before, with the logs saying: Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: java.util.regex.PatternSyntaxException: Dangling meta character '*' near index 0 With .*zip I get this: WARN SimplePropertiesWriter Unable to read: dataimport.properties -- View this message in context: http://lucene.472066.n3.nabble.com/dataconfig-to-index-ZIP-Files-tp4073965p4074009.html Sent from the Solr - User mailing list archive at Nabble.com.
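For reference, a minimal data-config sketch along those lines (untested; the baseDir and the inner Tika entity are assumptions modelled on the PDF example elsewhere in this digest, not something posted in this thread):

<dataConfig>
  <dataSource type="BinFileDataSource" />
  <document>
    <!-- outer entity only lists files; dataSource="null" because it reads no content itself -->
    <entity name="files" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="/path/to/zips" fileName=".*\.zip$" recursive="true">
      <!-- inner entity lets Tika extract text from each matched file -->
      <entity name="tika" processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}" format="text">
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

The key detail for the original error is the regex: ".*\.zip$" (or simply "\.zip$") is a valid pattern, whereas "*.zip" is not, because a regex cannot start with the "*" quantifier.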
Index pdf files.
Hi, I'm new to Solr. I want to index pdf files using the Data Import Handler. I'm using Solr 4.3.0. I followed the steps given in this post http://lucene.472066.n3.nabble.com/indexing-with-DIH-and-with-problems-td3731129.html However, I get the following error - Full Import failed:java.lang.NoClassDefFoundError: org/apache/tika/parser/Parser Please help! Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Index-pdf-files-tp4074278.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Index pdf files.
The tika jars are not in your classpath. You need to add all the jars inside contrib/extraction/lib directory to your classpath. On Mon, Jul 1, 2013 at 2:00 PM, archit2112 archit2...@gmail.com wrote: Hi I'm new to Solr. I want to index pdf files usng the Data Import Handler. Im using Solr-4.3.0. I followed the steps given in this post http://lucene.472066.n3.nabble.com/indexing-with-DIH-and-with-problems-td3731129.html However, I get the following error - Full Import failed:java.lang.NoClassDefFoundError: org/apache/tika/parser/Parser Please help! Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Index-pdf-files-tp4074278.html Sent from the Solr - User mailing list archive at Nabble.com. -- Regards, Shalin Shekhar Mangar.
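One way to do that without touching the system classpath is with <lib> directives in solrconfig.xml (a sketch; the relative paths below assume the stock Solr 4.x example layout and may need adjusting for your install):

<lib dir="../../../contrib/extraction/lib" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-cell-\d.*\.jar" />
<lib dir="../../../dist/" regex="solr-dataimporthandler-.*\.jar" />

After adding these, restart Solr (or reload the core) so the Tika and DIH jars are picked up.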
Re: Stemming query in Solr
Hi Erick,

Thanks for the reply. Here is what the situation is:

Relevant portion of Solr Schema:

<field name="Content" type="text_general" indexed="false" stored="true" required="true"/>
<field name="ContentSearch" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="ContentSearchStemming" type="text_stem" indexed="true" stored="false" multiValued="true"/>
<copyField source="Content" dest="ContentSearch"/>
<copyField source="Content" dest="ContentSearchStemming"/>

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_stem" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"/>
  </analyzer>
</fieldType>

When I am indexing a document, the content gets stored as-is in the Content field and gets copied over to ContentSearch and ContentSearchStemming for text-based search and stemming search respectively. So the ContentSearchStemming field does store the stem/reduced form of the terms. I have checked this with Luke as well as the Admin Schema Browser -> Term Info. In the Admin Analysis screen, I have tested and found that if I index the text "burning", it gets reduced to and stored as "burn". So far so good.

Now in the UI:
- Let's say the user puts in the term "burn" and checks the stemming option. The expectation is that since the user has specified stemming, results should be returned for the term "burn" as well as for all terms which have "burn" as their stem, i.e. burning, burned, burns, etc.
- Let's say the user puts in the term "burning" and checks the stemming option. The expectation is that since the user has specified stemming, results should be returned for the term "burning" as well as for all terms which have "burn" as their stem, i.e. burn, burned, burns, etc.

The query that gets submitted to Solr: q=ContentSearchStemming:burning

From Debug Info:
<str name="rawquerystring">ContentSearchStemming:burning</str>
<str name="querystring">ContentSearchStemming:burning</str>
<str name="parsedquery">ContentSearchStemming:burn</str>
<str name="parsedquery_toString">ContentSearchStemming:burn</str>

So, when the results are returned, I am only getting the hits highlighted with the term "burn", though the same document contains terms like "burning" and "burns". I thought that the stemming should work like this:
- The stemming filter in the query analyzer chain would reduce the input word to its stem: burning -> burn.
- The query component should scan through the terms and match those terms for which it finds a match between the stem of the term and the stem of the input term: burns -> burn (matches), burning -> burn.

The first point is happening. But it looks like it is executing the search for an exact text-based match with the stem "burn". Hence, "burns" or "burned" are not getting returned.

Hope I was able to make myself clear.
On Fri, 28 Jun 2013 05:59:37 -0700, Erick Erickson [via Lucene] <ml-node+s472066n4073901...@n3.nabble.com> wrote:

First, this is for the Java version, I hope it extends to C#. But in your configuration, when you're indexing, the stemmer should be storing the reduced form in the index. Then, when searching, the search should be against the reduced term.

To check this, try:
1> Using the Admin/Analysis page to see what gets stored in your index and what your query is transformed to, to ensure that you're getting what you expect.

If you want to get in deeper to the details, try:
1> Use, say, the TermsComponent or Admin/Schema Browser or Luke to look in your index and see what's actually there.
2> Use &debug=query or Admin/Analysis to see what the query actually looks like.

Both your use-cases should work fine just with reduction _unless_ the particular word you look for doesn't happen to trip the stemmer. By that I mean that since it's algorithmically based, there may be some edge cases that seem like they should be reduced that aren't. I don't know whether "fisherman" would reduce to "fish", for instance.

So are you seeing things that really don't work as expected or are you just working from the docs? Because I really don't see why you wouldn't get what you want given your description.

Best Erick

On Fri, Jun 28, 2013 at 2:33 AM, snkar <[hidden email]> wrote:
> We have a search system based on
Set spellcheck field on query time?
Hello everyone, we are currently working on a multilanguage single-core setup. While doing that I stumbled upon the question of whether it is possible to define different source fields for the spellcheck. For now I only see the possibility to define different request handlers. Is it somehow possible to set the source field for the DirectSolrSpellChecker at query time? Cheers timo

Timo Schmidt Entwickler (Dipl. Inf. FH) AOE GmbH Borsigstr. 3 65205 Wiesbaden Germany Tel. +49 (0) 6122 70 70 7 - 234 Fax. +49 (0) 6122 70 70 7 -199 e-Mail: timo.schm...@aoemedia.de Web: http://www.aoemedia.de/ Pflichtangaben laut Handelsgesetz §37a / Aktiengesetz §35a USt-ID Nr.: DE250247455 Handelsregister: Wiesbaden B Handelsregister Nr.: 22567 Stammsitz: Wiesbaden Creditreform: 625.0209354 Geschäftsführer: Kian Toyouri Gould Diese E-Mail Nachricht enthält vertrauliche und/oder rechtlich geschützte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese E-Mail irrtümlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Mail. This e-mail message may contain confidential and/or privileged information. If you are not the intended recipient (or have received this e-mail in error) please notify the sender immediately and destroy this e-mail.
Sum as a Projection for Facet Queries
Hi, We have a need to find the sum of a field for each facet.query. We have looked at the StatsComponent (http://wiki.apache.org/solr/StatsComponent) but that supports only facet.field. Has anyone written a patch over StatsComponent that supports the same, along with some performance measures? Is there any way we can do this using the Function Query sum (http://wiki.apache.org/solr/FunctionQuery#sum)? -- Regards, Samarth
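One interim workaround (a sketch, not something from this thread; the field names are hypothetical): because the StatsComponent computes its numbers over the filtered result set, you can get a per-facet.query sum by issuing one stats request per facet query, applying each facet query as an fq:

/select?q=*:*&rows=0&stats=true&stats.field=price&fq=category:books
/select?q=*:*&rows=0&stats=true&stats.field=price&fq=price:[0 TO 100]

The "sum" entry in each response's stats section is then the sum for that facet query. The obvious cost is N requests for N facet queries, so it only makes sense for a small number of them.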
Multiple groups of boolean queries in a single query.
Hello friends, I have a schema which contains various types of records of three different categories, for ease of management and for making a single query to fetch all the data. The fields are grouped into three different types of records. For example:

fields type 1:
<field name="x_date" type="tdate" indexed="true" stored="true"/>
<field name="x_name" type="tdate" indexed="true" stored="true"/>
<field name="x_type" type="tdate" indexed="true" stored="true"/>

fields type 2:
<field name="y_date" type="tdate" indexed="true" stored="true"/>
<field name="y_name" type="string" indexed="true" stored="true"/>
<field name="y_phone" type="string" indexed="true" stored="true"/>

fields type 3:
<field name="z_date" type="tdate" indexed="true" stored="true"/>
<field name="z_type" type="string" indexed="true" stored="true"/>

common partition field which identifies the category of the data record:
<field name="xyz_category" type="string" indexed="true" stored="true"/>

What should I do to fetch all these records in the form:

(+x_date:[2011-01-01T00:00:00Z TO *] +x_type:(1 OR 2 OR 3 OR 4) +xyz_category:X) OR (+y_date:[2012-06-01T00:00:00Z TO *] +y_name:sam~ +xyz_category:Y) OR (+z_date:[2013-03-01T00:00:00Z TO *] +xyz_category:Z)

Can we construct a query like this? Or is it even possible? Sam -- View this message in context: http://lucene.472066.n3.nabble.com/Multiple-groups-of-boolean-queries-in-a-single-query-tp4074294.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Multiple groups of boolean queries in a single query.
My entire concern is to be able to make a single query to fetch all the types of records. If I had to create three different cores for these different types of data, I would have to make 3 calls to Solr to fetch the entire set of data. And I will have approximately 15 such types in reality. Also, for any given record, either the section 1 fields are filled, or section 2's, or section 3's. At no point will we have all these fields populated in a single record. The only field that will have data for all records is xyz_category, to allow us to partition the data set. Any suggestions on writing a single query to fetch all the data we need will be highly appreciated. Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/Multiple-groups-of-boolean-queries-in-a-single-query-tp4074294p4074296.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Index pdf files.
Hi Thanks a lot. I did what you said. Now I'm getting the following error. Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: java.util.regex.PatternSyntaxException: Dangling meta character '*' near index 0 -- View this message in context: http://lucene.472066.n3.nabble.com/Index-pdf-files-tp4074278p4074297.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Set spellcheck field on query time?
Check out http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.dictionary - you can define multiple dictionaries in the same handler, each with its own source field. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com 1. juli 2013 kl. 11:34 skrev Timo Schmidt timo.schm...@aoemedia.de: Hello together, we are currently working on a mutilanguage single core setup. During that I stumbled upon the question if it is possible to define different sources for the spellcheck. For now I only see the possibility to define different request handlers. Is it somehow possible to set the source field for the DirectSolrSpellChecker on querytime? Cheers timo Timo Schmidt Entwickler (Dipl. Inf. FH) AOE GmbH Borsigstr. 3 65205 Wiesbaden Germany Tel. +49 (0) 6122 70 70 7 - 234 Fax. +49 (0) 6122 70 70 7 -199 e-Mail: timo.schm...@aoemedia.de Web: http://www.aoemedia.de/ Pflichtangaben laut Handelsgesetz §37a / Aktiengesetz §35a USt-ID Nr.: DE250247455 Handelsregister: Wiesbaden B Handelsregister Nr.: 22567 Stammsitz: Wiesbaden Creditreform: 625.0209354 Geschäftsführer: Kian Toyouri Gould Diese E-Mail Nachricht enthält vertrauliche und/oder rechtlich geschützte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese E-Mail irrtümlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Mail. This e-mail message may contain confidential and/or privileged information. If you are not the intended recipient (or have received this e-mail in error) please notify the sender immediately and destroy this e-mail.
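As a rough sketch of what that can look like (the field and dictionary names here are invented, not from Timo's schema), you can register one spellchecker per language in the component and then choose one per request:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">spell_en</str>
    <str name="field">spell_text_en</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
  </lst>
  <lst name="spellchecker">
    <str name="name">spell_de</str>
    <str name="field">spell_text_de</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
  </lst>
</searchComponent>

At query time you select the dictionary with e.g. &spellcheck=true&spellcheck.dictionary=spell_de, so a single request handler can serve all languages.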
Re: documentCache not used in 4.3.1?
Daniel: Soft commits invalidate the top-level caches, which include things like filterCache, queryResultCache etc. Various segment-level caches are NOT invalidated, but you really don't have a lot of control from the Solr level over those anyway. But yeah, the tension between caching a bunch of stuff for query speedups and NRT is still with us. Soft commits are much less expensive than hard commits, but not being able to use the caches as much is the price.

You're right that with such frequent autocommits, autowarming probably is not worth the effort. The question I always ask is whether 1 second is really necessary. Or, more accurately, worth the price. Often it's not, and lengthening it out significantly may be an option, but that's a discussion for you to have with your product manager <G>...

I have seen configurations that have a more frequent hard commit (openSearcher=false) than soft commit. The mantra is: soft commits are about visibility, hard commits are about durability.

FWIW, Erick

On Mon, Jul 1, 2013 at 3:40 AM, Daniel Collins danwcoll...@gmail.com wrote:

We see similar results: we softCommit every 1s (trying to get as NRT as we can), and we very rarely get any hits in our caches. As an unscheduled test last week, we shut down indexing and noticed about an 80% hit rate in caches (and average query time dropped from ~1s to 100ms!), so I think we are in the same position as you. I appreciate that with such a frequent soft commit the caches get invalidated, but I was expecting cache warming to help, though it doesn't appear to. We *don't* currently run a warming query; my impression of NRT was that it was better not to do that, as otherwise you spend more time warming the searcher and caches, and by the time you've done all that, the searcher is invalidated anyway!

On 30 June 2013 01:58, Tim Vaillancourt t...@elementspace.com wrote: That's a good idea, I'll try that next week. Thanks! Tim

On 29/06/13 12:39 PM, Erick Erickson wrote: Tim: Yeah, this doesn't make much sense to me either since, as you say, you should be seeing some metrics upon occasion. But do note that the underlying cache only gets filled when getting documents to return in query results; since there's no autowarming going on, it may come and go. But you can test this pretty quickly by lengthening your autocommit interval or just not indexing anything for a while, then run a bunch of queries and look at your cache stats. That'll at least tell you whether it works at all. You'll have to have hard commits turned off (or openSearcher set to 'false') for that check too. Best Erick

On Sat, Jun 29, 2013 at 2:48 PM, Vaillancourt, Tim tvaillanco...@ea.com wrote: Yes, we are softCommit'ing every 1000ms, but that should be enough time to see metrics, right? For example, I still get non-cumulative metrics from the other caches (which are also thrown away). I've also curl/sampled enough that I probably should have seen a value by now. If anyone else can reproduce this on 4.3.1 I will feel less crazy :). Cheers, Tim

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: Saturday, June 29, 2013 10:13 AM
To: solr-user@lucene.apache.org
Subject: Re: documentCache not used in 4.3.1?

It's especially weird that the hit ratio is so high and you're not seeing anything in the cache. Are you perhaps soft committing frequently? Soft commits throw away all the top-level caches including documentCache I think... Erick

On Fri, Jun 28, 2013 at 7:23 PM, Tim Vaillancourt tim@elementspace.com wrote: Thanks Otis, Yeah I realized after sending my e-mail that doc cache does not warm, however I'm still lost on why there are no other metrics. Thanks! Tim

On 28 June 2013 16:22, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi Tim, Not sure about the zeros in 4.3.1, but in SPM we see all these numbers are non-0, though I haven't had the chance to confirm with Solr 4.3.1. Note that you can't really autowarm the document cache... Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm

On Fri, Jun 28, 2013 at 7:14 PM, Tim Vaillancourt t...@elementspace.com wrote: Hey guys, This has to be a stupid question/I must be doing something wrong, but after frequent load testing with documentCache enabled under Solr 4.3.1 with autoWarmCount=150, I'm noticing that my documentCache metrics are always zero for non-cumulative. At first I thought my commit rate is fast enough that I just never see the non-cumulative result, but after 100s of samples I still always get zero values. Here is the current output of my documentCache from Solr's admin for 1
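As an aside on Erick's "visibility vs. durability" point, here is a solrconfig.xml sketch of that shape of configuration (the intervals are made-up values for illustration, not recommendations):

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>15000</maxTime>            <!-- hard commit every 15s: flush to disk, keep the tlog small -->
    <openSearcher>false</openSearcher>  <!-- durability only; no new searcher, caches stay intact -->
  </autoCommit>
  <autoSoftCommit>
    <maxTime>60000</maxTime>            <!-- soft commit every 60s: opens a new searcher, docs become visible -->
  </autoSoftCommit>
</updateHandler>

Every soft commit still invalidates the top-level caches, so the longer the soft commit interval you can tolerate, the more useful documentCache, filterCache and queryResultCache become.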
Re: Index pdf files.
OK, have you done anything custom? You get this where? solr logs? Echoed back in the browser? In response to what command? You haven't provided enough info to help us help you. You might review: http://wiki.apache.org/solr/UsingMailingLists Best Erick On Mon, Jul 1, 2013 at 6:08 AM, archit2112 archit2...@gmail.com wrote: Hi Thanks a lot. I did what you said. Now I'm getting the following error. Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: java.util.regex.PatternSyntaxException: Dangling meta character '*' near index 0 -- View this message in context: http://lucene.472066.n3.nabble.com/Index-pdf-files-tp4074278p4074297.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Index pdf files.
I figured it out. It was a problem with the regular expression i used in data-config.xml . -- View this message in context: http://lucene.472066.n3.nabble.com/Index-pdf-files-tp4074278p4074304.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Stemming query in Solr
bq: But looks like it is executing the search for an exact text based match with the stem burn.

Right. You need to appreciate index-time as opposed to query-time stemming. Your field definition has both turned on. The admin/analysis page will help here <G>... At index time, the terms are stemmed, and _only_ the reduced term is put in the index. At query time, the same thing happens and _only_ the reduced term is searched for. By stemming at index time, you lose the original form of the word; it's just gone, and nothing about checking/unchecking the stem bits will recover it.

So the general solution is to index the field twice, once with stemming and once without, in order to have the ability to do both stemmed and exact matches. I think I saw a clever approach to doing this involving a custom filter but can't find it now. As I recall it indexed the un-stemmed version like a synonym with some kind of marker to indicate exact match when necessary...

Best Erick

On Mon, Jul 1, 2013 at 5:15 AM, snkar soumya@zoho.com wrote:

Hi Erick, Thanks for the reply. Here is what the situation is:

Relevant portion of Solr Schema:
<field name="Content" type="text_general" indexed="false" stored="true" required="true"/>
<field name="ContentSearch" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="ContentSearchStemming" type="text_stem" indexed="true" stored="false" multiValued="true"/>
<copyField source="Content" dest="ContentSearch"/>
<copyField source="Content" dest="ContentSearchStemming"/>

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_stem" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"/>
  </analyzer>
</fieldType>

When I am indexing a document, the content gets stored as-is in the Content field and gets copied over to ContentSearch and ContentSearchStemming for text-based search and stemming search respectively. So the ContentSearchStemming field does store the stem/reduced form of the terms. I have checked this with Luke as well as the Admin Schema Browser -> Term Info. In the Admin Analysis screen, I have tested and found that if I index the text "burning", it gets reduced to and stored as "burn". So far so good.

Now in the UI: let's say the user puts in the term "burn" and checks the stemming option. The expectation is that since the user has specified stemming, results should be returned for the term "burn" as well as for all terms which have "burn" as their stem, i.e. burning, burned, burns, etc. Let's say the user puts in the term "burning" and checks the stemming option. The expectation is that since the user has specified stemming, results should be returned for the term "burning" as well as for all terms which have "burn" as their stem, i.e. burn, burned, burns, etc.

The query that gets submitted to Solr: q=ContentSearchStemming:burning

From Debug Info:
<str name="rawquerystring">ContentSearchStemming:burning</str>
<str name="querystring">ContentSearchStemming:burning</str>
<str name="parsedquery">ContentSearchStemming:burn</str>
<str name="parsedquery_toString">ContentSearchStemming:burn</str>

So, when the results are returned, I am only getting the hits highlighted with the term "burn", though the same document contains terms like "burning" and "burns". I thought that the stemming should work like this: the stemming filter in the query analyzer chain would reduce the input word to its stem (burning -> burn); the query component should scan through the terms and match those terms for which it finds a match between the stem of the term and the stem of the input term (burns -> burn matches, burning -> burn). The first point is happening. But it looks like it is executing the search for an exact text-based match with the stem "burn". Hence, "burns" or "burned" are not getting returned. Hope I was able to make myself clear.

On Fri, 28 Jun 2013 05:59:37 -0700, Erick Erickson [via Lucene] <ml-node+s472066n4073901...@n3.nabble.com> wrote: First, this is for the Java version, I hope it extends to C#. But in your configuration, when you're indexing the stemmer should be storing the reduced form in the index. Then, when searching, the search should be against the reduced term. To check this, try 1> Using the
Re: Multiple groups of boolean queries in a single query.
Have you tried the query you indicated? Because it should just work barring syntax errors. The only other thing you might want is to turn on grouping by field type. That'll return separate sections by type, say the top 3 (default 1) documents in each type. If you don't group, you have the possibility that your entire results (i.e. the number of docs in the rows parameter) will be all one type. see: http://wiki.apache.org/solr/FieldCollapsing Best Erick On Mon, Jul 1, 2013 at 6:06 AM, samabhiK qed...@gmail.com wrote: My entire concern is to be able to make a single query to fetch all the types of records. If I had to create three different cores for this different types of data, I would have to make 3 calls to solr to fetch the entire set of data. And I will be having approx 15 such types in real. Also, at any given record, either the section 1 fields are filled up or section 2's or section 3's. At no point, will we have all these fields populated in a single record. Only field that will have data for all records is xyz_category to allow us to partition the data set. Any suggestions in writing a single query to fetch all the data we need will be highly appreciated. Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/Multiple-groups-of-boolean-queries-in-a-single-query-tp4074294p4074296.html Sent from the Solr - User mailing list archive at Nabble.com.
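To make that concrete, here is a sketch of what the request might look like (q is Sam's boolean expression from the first message, URL-encoded in practice; the group.limit of 3 is an arbitrary choice):

q=(+x_date:[2011-01-01T00:00:00Z TO *] +x_type:(1 OR 2 OR 3 OR 4) +xyz_category:X) OR (+y_date:[2012-06-01T00:00:00Z TO *] +y_name:sam~ +xyz_category:Y) OR (+z_date:[2013-03-01T00:00:00Z TO *] +xyz_category:Z)
&group=true&group.field=xyz_category&group.limit=3

With grouping on xyz_category, the response contains one group per category, each with up to group.limit documents, so a single call returns a slice of every record type instead of possibly filling the rows window with one type.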
Shard tolerant partial results
Hi, When doing distributed searches with shards.tolerant set while the hosts for a slice are down, and the response is therefore partial, how is that best inferred? We would like to not cache the results upstream and perhaps inform the end user in some way. I am aware that shards.info could be used; however, I am concerned this may have performance implications due to the cost of parsing the response from Solr, and perhaps some extra cost incurred by Solr to generate the response. Perhaps an HTTP header could be added, or another attribute added to the Solr result node. Phil __ brightsolid is used in this email to collectively mean brightsolid online innovation limited and its subsidiary companies brightsolid online publishing limited and brightsolid online technology limited. findmypast.co.uk is a brand of brightsolid online publishing limited. brightsolid online innovation limited, Gateway House, Luna Place, Dundee Technology Park, Dundee DD2 1TP. Registered in Scotland No. SC274983. brightsolid online publishing limited, The Glebe, 6 Chapel Place, Rivington Street, London EC2A 3DQ. Registered in England No. 04369607. brightsolid online technology limited, Gateway House, Luna Place, Dundee Technology Park, Dundee DD2 1TP. Registered in Scotland No. SC161678. Email Disclaimer This message is confidential and may contain privileged information. You should not disclose its contents to any other person. If you are not the intended recipient, please notify the sender named above immediately. It is expressly declared that this e-mail does not constitute nor form part of a contract or unilateral obligation. Opinions, conclusions and other information in this message that do not relate to the official business of brightsolid shall be understood as neither given nor endorsed by it. __ This email has been scanned by the brightsolid Email Security System. Powered by MessageLabs __
Unique key error while indexing pdf files
Hi, I'm trying to index pdf files in Solr 4.3.0 using the data import handler.

*My request handler:*
<requestHandler name="/dataimport1" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config1.xml</str>
  </lst>
</requestHandler>

*My data-config1.xml:*
<dataConfig>
  <dataSource type="BinFileDataSource" />
  <document>
    <entity name="f" dataSource="null" rootEntity="false" processor="FileListEntityProcessor"
            baseDir="C:\Users\aroraarc\Desktop\Impdo" fileName=".*pdf" recursive="true">
      <entity name="tika-test" processor="TikaEntityProcessor" url="${f.fileAbsolutePath}" format="text">
        <field column="Author" name="author" meta="true"/>
        <field column="title" name="title1" meta="true"/>
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

Now when I try to index the files I get the following error:

org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: id
  at org.apache.solr.update.AddUpdateCommand.getIndexedId(AddUpdateCommand.java:88)
  at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:517)
  at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:396)
  at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
  at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:70)
  at org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:235)
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:500)
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)
  at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319)
  at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227)
  at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)
  at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)
  at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:468)

This problem can be solved easily in the case of database indexing, but I don't know how to go about the unique key of a document. How do I define the id field (unique key) of a pdf file? How do I solve this problem? Thanks in advance -- View this message in context: http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Unique key error while indexing pdf files
It all depends on your data model - tell us more about your data model. For example, how will users or applications query these documents and what will they expect to be able to do with the ID/key for the documents? How are you expecting to identify documents in your data model? -- Jack Krupansky -Original Message- From: archit2112 Sent: Monday, July 01, 2013 7:17 AM To: solr-user@lucene.apache.org Subject: Unique key error while indexing pdf files Hi Im trying to index pdf files in solr 4.3.0 using the data import handler. *My request handler - * requestHandler name=/dataimport1 class=org.apache.solr.handler.dataimport.DataImportHandler lst name=defaults str name=configdata-config1.xml/str /lst /requestHandler *My data-config1.xml * dataConfig dataSource type=BinFileDataSource / document entity name=f dataSource=null rootEntity=false processor=FileListEntityProcessor baseDir=C:\Users\aroraarc\Desktop\Impdo fileName=.*pdf recursive=true entity name=tika-test processor=TikaEntityProcessor url=${f.fileAbsolutePath} format=text field column=Author name=author meta=true/ field column=title name=title1 meta=true/ field column=text name=text/ /entity /entity /document /dataConfig Now When i try and index the files i get the following error - org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: id at org.apache.solr.update.AddUpdateCommand.getIndexedId(AddUpdateCommand.java:88) at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:517) at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:396) at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100) at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:70) at org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:235) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:500) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404) at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:468) This problem can be solved easily in case of database indexing but i dont know how to go about the unique key of a document. how do i define the id field (unique key) of a pdf file. how do i solve this problem? Thanks in advance -- View this message in context: http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Unique key error while indexing pdf files
I'm new to Solr. I'm just trying to understand and explore the various features offered by Solr and their implementations. I would be very grateful if you could solve my problem with any example of your choice. I just want to learn how I can index pdf documents using the data import handler. -- View this message in context: http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314p4074327.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: RemoveDuplicatesTokenFilterFactory to avoid import duplicate values in multivalued field
Hey, I have tried to make use of the UniqFieldsUpdateProcessorFactory in order to achieve distinct values in multivalued fields. Example below:

<updateRequestProcessorChain name="uniq_fields">
  <processor class="org.apache.solr.update.processor.UniqFieldsUpdateProcessorFactory">
    <lst name="fields">
      <str>title</str>
      <str>tag_type</str>
    </lst>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">uniq_fields</str>
  </lst>
</requestHandler>

However, the data is indexed one document at a time, and a document may get an additional tag in a future update. To ensure we never end up with duplicate tags, I was hoping the UpdateProcessorFactory would do what I want. To actually add a tag I am sending an atomic update like "tag_type":{"add":"foo"}, which still adds the tag without checking whether it is already part of the field. How can I achieve distinct values on the Solr side? -- View this message in context: http://lucene.472066.n3.nabble.com/RemoveDuplicatesTokenFilterFactory-to-avoid-import-duplicate-values-in-multivalued-field-tp4029004p4074324.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Unique key error while indexing pdf files
It's really 100% up to you how you want to come up with the unique key values for your documents. What would you like them to be? Just use that. Anything (within reason) - anything goes. But it also comes back to your data model. You absolutely must come up with a data model for how you expect to index and query data in Solr before you just start throwing random data into Solr. 1. Design your data model. 2. Produce a Solr schema from that data model. 3. Map the raw data from your data sources (e.g., PDF files) to the fields in your Solr schema. That last step includes the ID/key field, but your data model will imply any requirements for what the ID/key should be. To be absolutely clear, it is 100% up to you to design the ID/key for every document; Solr does NOT do that for you. Even if you are just exploring, at least come up with an exploratory data model - which includes what expectations you have about the unique ID/key for each document. So, for that first PDF file, what expectation (according to your data model) do you have for what its ID/key should be? -- Jack Krupansky -Original Message- From: archit2112 Sent: Monday, July 01, 2013 8:30 AM To: solr-user@lucene.apache.org Subject: Re: Unique key error while indexing pdf files Im new to solr. Im just trying to understand and explore various features offered by solr and their implementations. I would be very grateful if you could solve my problem with any example of your choice. I just want to learn how i can index pdf documents using data import handler. -- View this message in context: http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314p4074327.html Sent from the Solr - User mailing list archive at Nabble.com.
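For the "just exploring" case, one common pattern (a sketch under the assumption that the schema's uniqueKey is id and that the absolute file path is an acceptable key) is to map the fileAbsolutePath that FileListEntityProcessor already exposes into the id field, on the outer entity of the data-config1.xml shown earlier in this thread:

<entity name="f" dataSource="null" rootEntity="false"
        processor="FileListEntityProcessor"
        baseDir="C:\Users\aroraarc\Desktop\Impdo" fileName=".*pdf" recursive="true">
  <!-- use the file's absolute path as the document's unique key -->
  <field column="fileAbsolutePath" name="id"/>
  <entity name="tika-test" processor="TikaEntityProcessor" url="${f.fileAbsolutePath}" format="text">
    <field column="Author" name="author" meta="true"/>
    <field column="title" name="title1" meta="true"/>
    <field column="text" name="text"/>
  </entity>
</entity>

Whether a path is the right key is exactly the data-model question raised above: if files can move or be replaced, a checksum or an externally assigned ID may be a better choice.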
Re: Stemming query in Solr
"So the general solution is to index the field twice, once with stemming and once without in order to have the ability to do both stemmed and exact matches" -- I am already indexing the text twice, using the ContentSearch and ContentSearchStemming fields. But what this allows me to do is return burning as well as burn if the user specifies burning as the input search term, burning being the exact match: ContentSearch:burning + ContentSearchStemming:burn (reduced from ContentSearchStemming:burning). What I cannot figure out is how this is going to help me instruct Solr to execute the query for the different grammatical variations of the input search term's stem, i.e. a stemming query for burning expands to a text-based query for burn, burns, burned, burning, etc.

You mentioned something about synonyms. This was also mentioned in the Solr Wiki: "A related technology to stemming is lemmatization, which allows for stemming by expansion, taking a root word and 'expanding' it to all of its various forms. Lemmatization can be used either at insertion time or at query time. Lucene/Solr does not have built-in support for lemmatization but it can be simulated by using your own dictionaries and the SynonymFilterFactory." I think what I need is exactly this point. But I am not sure how to go about it and exactly how synonyms can help me here, as I am not looking for synonyms, rather different expansions of the stemmed word.

On Mon, 01 Jul 2013 03:42:34 -0700, Erick Erickson [via Lucene] <ml-node+s472066n4074311...@n3.nabble.com> wrote:

bq: But looks like it is executing the search for an exact text based match with the stem burn.

Right. You need to appreciate index time as opposed to query time stemming. Your field definition has both turned on. The admin/analysis page will help here <G>... At index time, the terms are stemmed, and _only_ the reduced term is put in the index. At query time, the same thing happens and _only_ the reduced term is searched for. By stemming at index time, you lose the original form of the word, it's just gone and nothing about checking/unchecking the stem bits will recover it. So the general solution is to index the field twice, once with stemming and once without in order to have the ability to do both stemmed and exact matches. I think I saw a clever approach to doing this involving a custom filter but can't find it now. As I recall it indexed the un-stemmed version like a synonym with some kind of marker to indicate exact match when necessary... Best Erick

On Mon, Jul 1, 2013 at 5:15 AM, snkar <[hidden email]> wrote:
> Hi Erick,
> Thanks for the reply.
> Here is what the situation is:
> Relevant portion of Solr Schema:
> <field name="Content" type="text_general" indexed="false" stored="true" required="true"/>
> <field name="ContentSearch" type="text_general" indexed="true" stored="false" multiValued="true"/>
> <field name="ContentSearchStemming" type="text_stem" indexed="true" stored="false" multiValued="true"/>
> <copyField source="Content" dest="ContentSearch"/>
> <copyField source="Content" dest="ContentSearchStemming"/>
> <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index"> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer>
>   <analyzer type="query"> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer>
> </fieldType>
> <fieldType name="text_stem" class="solr.TextField">
>   <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.SnowballPorterFilterFactory"/> </analyzer>
> </fieldType>
> When I am indexing a document, the content gets stored as is in the Content field and gets copied over to ContentSearch and ContentSearchStemming for text based search and stemming search respectively. So, the ContentSearchStemming field does store the stem/reduced form of the terms. I have checked this with Luke as well as the Admin Schema Browser -> Term Info. In the Admin Analysis screen, I have tested and found that if I index the text burning, it gets reduced to and stored as burn. So far so good.
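A rough sketch of that simulated lemmatization (the field type name, synonym entries and file name here are all invented for illustration): keep the index side un-stemmed and expand the query term to its known forms from a hand-maintained dictionary.

# lemmas.txt (one expansion group per line)
burn,burns,burned,burnt,burning
fish,fishes,fished,fishing

<fieldType name="text_lemma" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="lemmas.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>

A query for burning on such a field is expanded to burn OR burns OR burned OR burnt OR burning, so the original surface forms stay in the index (and in the highlights), at the cost of maintaining the dictionary.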
Re: RemoveDuplicatesTokenFilterFactory to avoid import duplicate values in multivalued field
Your stated problem seems to have nothing to do with the message subject line relating to RemoveDuplicatesTokenFilterFactory. Please start a new message thread unless you really are concerned with an issue related to RemoveDuplicatesTokenFilterFactory. This kind of thread hijacking is inappropriate for this email list (or any email list.) -- Jack Krupansky -Original Message- From: tuedel Sent: Monday, July 01, 2013 8:15 AM To: solr-user@lucene.apache.org Subject: Re: RemoveDuplicatesTokenFilterFactory to avoid import duplicate values in multivalued field Hey, i have tried to make use of the UniqFieldsUpdateProcessorFactory in order to achieve distinct values in multivalued fields. Example below: updateRequestProcessorChain name=uniq_fields processor class=org.apache.solr.update.processor.UniqFieldsUpdateProcessorFactory lst name=fields strtitle/str strtag_type/str /lst /processor processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain requestHandler name=/update class=solr.UpdateRequestHandler lst name=defaults str name=update.chainuniq_fields/str /lst /requestHandler However the data being is indexed one by one. This may happen, since a document may will get an additional tag in a future update. Unfortunately in order to ensure not having any duplicate tags, i was hoping, the UpdateProcessorFactory is doing what i want to achieve. In order to actually add a tag, i am sending an tag_type :{add:foo}, which still adds the tag, without questioning if its already part of the field. How may i be able to achieve distinct values on solr side?! -- View this message in context: http://lucene.472066.n3.nabble.com/RemoveDuplicatesTokenFilterFactory-to-avoid-import-duplicate-values-in-multivalued-field-tp4029004p4074324.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Stemming query in Solr
I was just wondering if another solution might work. If we are able to extract the stem of the input search term (maybe using a C#-based stemmer, some open source implementation of the Porter algorithm) for cases where the stemming option is selected, and submit the query to Solr as a multiple-character wildcard query on the stem, it should return all the different variations of the stemmed word. Example: Search Term: burning. Stem: burn. Modified Query: burn*. Results: burn, burning, burns, burnt, etc. I am sure this is not the proper way of executing stemming by expansion, but it might just get the job done. What do you think? Trying to think of a test case where this will fail.

On Mon, 01 Jul 2013 03:42:34 -0700, Erick Erickson [via Lucene] <ml-node+s472066n4074311...@n3.nabble.com> wrote:

bq: But looks like it is executing the search for an exact text based match with the stem burn.

Right. You need to appreciate index time as opposed to query time stemming. Your field definition has both turned on. The admin/analysis page will help here <G>... At index time, the terms are stemmed, and _only_ the reduced term is put in the index. At query time, the same thing happens and _only_ the reduced term is searched for. By stemming at index time, you lose the original form of the word, it's just gone and nothing about checking/unchecking the stem bits will recover it. So the general solution is to index the field twice, once with stemming and once without in order to have the ability to do both stemmed and exact matches. I think I saw a clever approach to doing this involving a custom filter but can't find it now. As I recall it indexed the un-stemmed version like a synonym with some kind of marker to indicate exact match when necessary... Best Erick

On Mon, Jul 1, 2013 at 5:15 AM, snkar <[hidden email]> wrote:
> Hi Erick,
> Thanks for the reply.
> Here is what the situation is:
> Relevant portion of Solr Schema:
> <field name="Content" type="text_general" indexed="false" stored="true" required="true"/>
> <field name="ContentSearch" type="text_general" indexed="true" stored="false" multiValued="true"/>
> <field name="ContentSearchStemming" type="text_stem" indexed="true" stored="false" multiValued="true"/>
> <copyField source="Content" dest="ContentSearch"/>
> <copyField source="Content" dest="ContentSearchStemming"/>
> <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index"> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer>
>   <analyzer type="query"> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer>
> </fieldType>
> <fieldType name="text_stem" class="solr.TextField">
>   <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.SnowballPorterFilterFactory"/> </analyzer>
> </fieldType>
> When I am indexing a document, the content gets stored as is in the Content field and gets copied over to ContentSearch and ContentSearchStemming for text based search and stemming search respectively. So, the ContentSearchStemming field does store the stem/reduced form of the terms. I have checked this with Luke as well as the Admin Schema Browser -> Term Info. In the Admin Analysis screen, I have tested and found that if I index the text burning, it gets reduced to and stored as burn. So far so good.
>
> Now in the UI,
> lets say the user puts in the term burn and checks the stemming option. The expectation is that since the user has specified stemming, the results should be returned for the term burn as well as for all terms which have their stem as burn i.e. burning, burned, burns, etc.
> lets say the user puts in the term burning and checks the stemming option. The expectation is that since the user has specified stemming, the results should be returned for the term burning as well as for all terms which have their stem as burn i.e. burn, burned, burns, etc.
> The query that gets submitted to Solr: q=ContentSearchStemming:burning
> From Debug Info:
> <str name="rawquerystring">ContentSearchStemming:burning</str>
> <str
Re: Shard tolerant partial results
On Jul 1, 2013, at 6:56 AM, Phil Hoy p...@brightsolid.com wrote: Perhaps an http header could be added or another attribute added to the solr result node. I thought that was already done - I'm surprised that it's not. If that's really the case, please make a JIRA issue. - Mark
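In the meantime, one interim check (a sketch, with the caveat that I have not measured the overhead Phil is worried about) is to combine the two parameters so the failed slice at least shows up in the response:

/select?q=*:*&shards.tolerant=true&shards.info=true

With shards.info=true each shard gets its own entry in the shards.info section of the response; a slice that could not be reached reports an error entry there rather than its usual numFound/time values, so an upstream client can treat the presence of any error entry as "partial results, do not cache".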
Distinct values in multivalued fields
Hello everybody, I have tried to make use of the UniqFieldsUpdateProcessorFactory in order to achieve distinct values in multivalued fields. Example below:

<updateRequestProcessorChain name="uniq_fields">
  <processor class="org.apache.solr.update.processor.UniqFieldsUpdateProcessorFactory">
    <lst name="fields">
      <str>title</str>
      <str>tag_type</str>
    </lst>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">uniq_fields</str>
  </lst>
</requestHandler>

However, the data is indexed one document at a time, and a document may get an additional tag in a future update. To ensure we never end up with duplicate tags, I was hoping the UpdateProcessorFactory would do what I want. To actually add a tag I am sending an atomic update like "tag_type":{"add":"foo"}, which still adds the tag without checking whether it is already part of the field. How can I achieve distinct values on the Solr side?

To achieve this behaviour, I suspect writing my own processor might be a solution, but I am uncertain how to do it and whether it is the proper way. Imagine an incoming update, e.g. an update of an existing document with several multivalued fields, sent without specifying add or set: the existing document would be dropped and re-indexed without keeping any previously added values in the multivalued field. So if a field is being updated and the incoming value is not already part of the indexed document, it should be added; otherwise it should be ignored. The processor needs to decide whether a value gets added, based on what is already in the index. Is that achievable on the Solr side? Below is my current, pretty empty, processor class:

import java.io.IOException;
import java.util.Collection;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class ConditionalSolrUniqFieldValuesProcessorFactory extends UpdateRequestProcessorFactory {

    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest sqr, SolrQueryResponse sqr1, UpdateRequestProcessor urp) {
        return new ConditionalUniqFieldValuesProcessor(urp);
    }

    class ConditionalUniqFieldValuesProcessor extends UpdateRequestProcessor {

        public ConditionalUniqFieldValuesProcessor(UpdateRequestProcessor next) {
            super(next);
        }

        @Override
        public void processAdd(AddUpdateCommand cmd) throws IOException {
            SolrInputDocument doc = cmd.getSolrInputDocument();
            Collection<String> incomingFieldNames = doc.getFieldNames();
            for (String t : incomingFieldNames) {
                /* if the field is multivalued:
                   if the incoming value is already part of the indexed document, drop it;
                   otherwise add it to the multivalued field. */
            }
            // pass the (possibly modified) command down the chain
            super.processAdd(cmd);
        }
    }
}

-- View this message in context: http://lucene.472066.n3.nabble.com/Distinct-values-in-multivalued-fields-tp4074337.html Sent from the Solr - User mailing list archive at Nabble.com.
Converting nested data model to solr schema
Hi, I have the following data model:
1. Document (fields: doc_id, author, content)
2. Each Document has multiple attachment types. Each attachment type has multiple instances, and each attachment type may have different fields.

For example:

<doc>
  <doc_id>1</doc_id>
  <author>john</author>
  <content>some long long text...</content>
  <file_attachments>
    <file_attachment>
      <attach_id>458</attach_id>
      <attach_text>SomeText</attach_text>
      <attach_date>12/12/2012</attach_date>
    </file_attachment>
    <file_attachment>
      <attach_id>568</attach_id>
      <attach_text>SomeText2</attach_text>
      <attach_date>12/11/2012</attach_date>
    </file_attachment>
  </file_attachments>
  <reply_attachments>
    <reply_attachment>
      <reply_id>345</reply_id>
      <reply_text>SomeText</reply_text>
      <reply_author>Jack</reply_author>
      <reply_date>22-12-2012</reply_date>
    </reply_attachment>
    <reply_attachment>
      <reply_id>897</reply_id>
      <reply_text>SomeText2</reply_text>
      <reply_author>Bob</reply_author>
      <reply_date>23-12-2012</reply_date>
    </reply_attachment>
  </reply_attachments>
</doc>

I want to index all this data in SolrCloud. My current solution is to index the original document by itself and index each attachment as a separate Solr document with its parent_doc_id, and then use Solr's join capability. The problem with this solution is that I must index all the attachments of each document, and the document itself, in the same shard (a current Solr limitation). This requires me to override the Solr document distribution mechanism. I fear that with this solution I may lose some of SolrCloud's capabilities.

My questions are:
1. Are my concerns regarding the downside of overriding SolrCloud's out-of-the-box mechanism justified? Or should I proceed with this solution?
2. If I'm looking for another solution, can I somehow keep all attachments on the same document and still be able to query on a single attachment?

A query example: retrieve all documents where content contains abc AND reply_attachment.author = 'Bob' AND reply_attachment.date = '12-12-2012'.

Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/Converting-nested-data-model-to-solr-schema-tp4074351.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Distinct values in multivalued fields
Have a look at the DedupUpdateProcessorFactory, which may help you. Although, I'm not sure if it works with multivalued fields. Upayavira
Re: Converting nested data model to solr schema
Simply duplicate a subset of the fields that you want to query of the parent document on each child document and then you can directly query the child documents without any join. Yes, given the complexity of your data, a two-step query process may be necessary for some queries - do one query to get parent or child IDs and then do a second query filtered by those IDs. And, yes, this only approximates the full power of an SQL join - but at a tiny fraction of the cost. -- Jack Krupansky
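As a sketch of the two-step variant (field names taken from the original post; the parent IDs in step two are made up):

Step 1 - collect parent IDs from the matching attachment documents:
  /select?q=reply_author:Bob AND reply_date:"23-12-2012"&fl=parent_doc_id&rows=1000

Step 2 - query the parent documents, filtered by the IDs returned in step 1:
  /select?q=content:abc&fq=doc_id:(101 205 317)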
Re: Distinct values in multivalued fields
Unfortunately, update processors only see the new, fresh, incoming data, not any existing document data. This is a case where your best bet may be to read the document first and then merge your new value into the existing list of values. -- Jack Krupansky
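A client-side sketch of that read-then-merge approach in SolrJ follows (the core URL, uniqueKey "id", document id and tag value are placeholders; it also assumes tag_type is a stored field so the current values can be read back):

import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class MergeTagValues {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // 1) Read the document's current multivalued field.
        SolrDocument existing = solr.query(new SolrQuery("id:doc1")).getResults().get(0);
        LinkedHashSet<Object> tags = new LinkedHashSet<Object>();
        Collection<Object> current = existing.getFieldValues("tag_type");
        if (current != null) {
            tags.addAll(current);
        }
        tags.add("foo"); // the incoming value; the Set silently drops duplicates

        // 2) Write the merged list back with an atomic "set" on just that field.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc1");
        Map<String, Object> setOp = new HashMap<String, Object>();
        setOp.put("set", new ArrayList<Object>(tags));
        doc.addField("tag_type", setOp);
        solr.add(doc);
        solr.commit();
    }
}

The LinkedHashSet does the de-duplication before the merged list is written back; concurrent updates to the same document would still need coordination (for example optimistic concurrency via the _version_ field).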
How to re-index Solr get term frequency within documents
Hi, I am using Solr 4.3.0. If I change my Solr schema.xml, do I need to re-index my data? And if yes, how? My second question is that I need to find the frequency of a term per document across all documents in the search result. My field is:

<field name="CommentX" type="text_general" stored="true" indexed="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>

And I am trying this query:

http://localhost:8080/solr/select/?q=iphone&fl=AuthorX%2CTitleX%2CCommentX&df=CommentX&wt=xml&indent=true&qt=tvrh&tv=true&tv.tf=true&tv.df=true&tv.positions&tv.offsets=true

It's just returning me the result set, with no info on my searched term's (iphone) frequency in each document. How can I make Solr return the frequency of the searched term per document in the result set? Thanks, Tony.
Re: ConcurrentUpdateSolrServer hanging
Hi, blockUntilFinished() blocks indefinitely sometimes. But if I send a commit from another thread to the instance, the ConcurrentUpdateSolrServer unblocks, sends the rest of the documents, and commits. So the sequence looks like this: 1. adding documents as usual... 2. finish adding documents... 3. block until finished... blocks forever (I try to block before the commit; call this commit 1) 4. from another thread, send a commit (call this commit 2) 5. magically unblocked... and the rest of the documents are flushed out... 6. commit 1... 7. commit 2... The order of commits 6 and 7 is what is observed in the Solr log. Thanks, Qun

--
View this message in context: http://lucene.472066.n3.nabble.com/ConcurrentUpdateSolrServer-hanging-tp4073620p4074366.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to re-index Solr get term frequency within documents
You can write any function query in the field list of the fl parameter. Sounds like you want termfreq: termfreq(field_arg,term) fl=id,a,b,c,termfreq(a,xyz) -- Jack Krupansky
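Applied to the query from the original message, that would be something like the following (quote characters left unescaped for readability):

http://localhost:8080/solr/select/?q=iphone&df=CommentX&wt=xml&indent=true&fl=AuthorX,TitleX,CommentX,termfreq(CommentX,'iphone')

Each result document then carries a pseudo-field named termfreq(CommentX,'iphone') holding that term's frequency in the document.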
Concurrent Modification Exception
Hi, I have recently upgraded from Solr 3.5 to 4.2.1. Also we have added spellcheck feature to our search query. During our performance testing we have observed that for every 2000 request, 1 request fails. The exception we observe in solr log are ConcurrentModificationException. Below is the complete stack for exception. Any idea what could potentially be the reason. I did check JIRA list in Solr/Lucene to see if there is any issue files and that's fixed. Couldn't filnd thats directly associated to LRUCache. thanks Aditya 2013-06-28 20:32:57,265 SEVERE [org.apache.solr.core.SolrCore] (http-80-20) java.util.ConcurrentModificationException at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372) at java.util.AbstractList$Itr.next(AbstractList.java:343) at java.util.AbstractList.equals(AbstractList.java:506) at org.apache.solr.search.QueryResultKey.isEqual(QueryResultKey.java:96) at org.apache.solr.search.QueryResultKey.equals(QueryResultKey.java:81) at java.util.HashMap.getEntry(HashMap.java:349) at java.util.LinkedHashMap.get(LinkedHashMap.java:280) at org.apache.solr.search.LRUCache.get(LRUCache.java:130) at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1276) at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:457) at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:410) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1817) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:235) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:190) at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:92) at org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.process(SecurityContextEstablishmentValve.java:126) at org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.invoke(SecurityContextEstablishmentValve.java:70) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:158) at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:567) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:330) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:829) at 
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:598) at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447) at java.lang.Thread.run(Thread.java:662) -- View this message in context: http://lucene.472066.n3.nabble.com/Concurrent-Modification-Exception-tp4074371.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: documentCache not used in 4.3.1?
Regrettably, visibility is key for us :( Documents must be searchable as soon as they have been indexed (or as near as we can make it). Our old search system didn't do relevance sort, it was time-ordered (so it had a much simpler job) but it did have sub-second latency, and that is what is expected for its replacement (I know Solr doesn't like 1s currently, but we live in hope!). Tried explaining that by doing relevance sort we are searching 100% of the collection, instead of the ~10%-20% a time-ordered sort did (it effectively sharded by date and only searched as far back as it needed to fill a page of results), but that tends to get blank looks from business. :) One of life's little challenges. On 1 July 2013 11:10, Erick Erickson erickerick...@gmail.com wrote: Daniel: Soft commits invalidate the top level caches, which include things like filterCache, queryResultCache etc. Various segment-level caches are NOT invalidated, but you really don't have a lot of control from the Solr level over those anyway. But yeah, the tension between caching a bunch of stuff for query speedups and NRT is still with us. Soft commits are much less expensive than hard commits, but not being able to use the caches as much is the price. You're right that with such frequent autocommits, autowarming probably is not worth the effort. The question I always ask is whether 1 second is really necessary. Or, more accurately, worth the price. Often it's not and lengthening it out significantly may be an option, but that's a discussion for you to have with your product manager G I have seen configurations that have a more frequent hard commit (openSearcher=false) than soft commit. The mantra is soft commits are about visibility, hard commits are about durability. FWIW, Erick
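For reference, the knobs Erick is talking about live in solrconfig.xml; a sketch with placeholder intervals (soft commits for visibility, hard commits with openSearcher=false for durability) looks like:

<updateHandler class="solr.DirectUpdateHandler2">
  <autoSoftCommit>
    <maxTime>1000</maxTime>      <!-- visibility: new documents searchable within ~1s -->
  </autoSoftCommit>
  <autoCommit>
    <maxTime>15000</maxTime>     <!-- durability: flush and truncate the transaction log -->
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>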
Does solr cloud required passwordless ssh?
Hi Does solr cloud on a cluster of servers require passwordless ssh to be configured between the servers? -- View this message in context: http://lucene.472066.n3.nabble.com/Does-solr-cloud-required-passwordless-ssh-tp4074398.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: dataconfig to index ZIP Files
To answer the previous post: I was not sure what datasource=binaryFile was; I took it from a PDF sample thinking that would help. After setting dataSource="null" on the entity I'm still getting the same errors...

<dataConfig>
  <dataSource type="BinFileDataSource" user="svcSolr" password="SomePassword" />
  <document>
    <entity name="Archive" processor="FileListEntityProcessor"
            baseDir="E:\ArchiveRoot" fileName=".zip$" recursive="true"
            rootEntity="false" dataSource="null" onError="skip">
      <field column="fileSize" name="size"/>
      <field column="file" name="filename"/>
    </entity>
  </document>
</dataConfig>

The logs report this:

INFO - 2013-07-01 16:45:57.317; org.apache.solr.handler.dataimport.DataImporter; Starting Full Import
WARN - 2013-07-01 16:45:57.333; org.apache.solr.handler.dataimport.SimplePropertiesWriter; Unable to read: dataimport.properties

--
View this message in context: http://lucene.472066.n3.nabble.com/dataconfig-to-index-ZIP-Files-tp4073965p4074399.html
Sent from the Solr - User mailing list archive at Nabble.com.
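One likely reason nothing gets indexed is that the FileListEntityProcessor entity only lists the files; nothing actually opens and parses them. A common way to wire that up (not verified against this exact setup) is a nested TikaEntityProcessor entity that reads each listed file through the BinFileDataSource, for example:

<dataConfig>
  <dataSource type="BinFileDataSource" name="bin" />
  <document>
    <entity name="Archive" processor="FileListEntityProcessor"
            baseDir="E:\ArchiveRoot" fileName=".zip$" recursive="true"
            rootEntity="false" dataSource="null" onError="skip">
      <field column="fileSize" name="size"/>
      <field column="file" name="filename"/>
      <entity name="doc" processor="TikaEntityProcessor" dataSource="bin"
              url="${Archive.fileAbsolutePath}" format="text" onError="skip">
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>

The "content" field is assumed to exist in the schema, and whether Tika can usefully flatten the zipped files into text is a separate question (see the rest of this thread).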
Re: cores sharing an instance
as for the second option: If you look inside SolrResourceLoader, you will notice that before a CoreContainer is created, a new class loader is also created line:111 this.classLoader = createClassLoader(null, parent); however, this parent object is always null, because it is called from: public SolrResourceLoader( String instanceDir ) { this( instanceDir, null, null ); } but if you were able to replace the second null (parent class loader) with a classloader of your own choice - ie. one that loads your singleton (but only that singleton, you don't want to share other objects), your cores should be able to see/share that object so, as you can see, if you test it and it works, you may fill a JIRA ticket and help other folks out there (i was too lazy and worked around it in the past - but that wasn't a good solution). If there a well justified reason to share objects, it seems weird the core is using 'null' as a parent class loader HTH, roman On Sun, Jun 30, 2013 at 2:18 PM, Peyman Faratin pey...@robustlinks.comwrote: I see. If I wanted to try the second option (find a place inside solr before the core is created) then where would that place be in the flow of app waking up? Currently what I am doing is each core loads its app caches via a requesthandler (in solrconfig.xml) that initializes the java class that does the loading. For instance: requestHandler name=/cachedResources class=solr.SearchHandler startup=lazy arr name=last-components strAppCaches/str /arr /requestHandler searchComponent name=AppCaches class=com.name.Project.AppCaches/ So each core has its own so specific cachedResources handler. Where in SOLR would I need to place the AppCaches code to make it visible to all other cores then? thank you Roman On Jun 29, 2013, at 10:58 AM, Roman Chyla roman.ch...@gmail.com wrote: Cores can be reloaded, they are inside solrcore loader /I forgot the exact name/, and they will have different classloaders /that's servlet thing/, so if you want singletons you must load them outside of the core, using a parent classloader - in case of jetty, this means writing your own jetty initialization or config to force shared class loaders. or find a place inside the solr, before the core is created. Google for montysolr to see the example of the first approach. But, unless you really have no other choice, using singletons is IMHO a bad idea in this case Roman On 29 Jun 2013 10:18, Peyman Faratin pey...@robustlinks.com wrote: its the singleton pattern, where in my case i want an object (which is RAM expensive) to be a centralized coordinator of application logic. thank you On Jun 29, 2013, at 1:16 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: There is very little shared between multiple cores (instanceDir paths, logging config maybe?). Why are you trying to do this? On Sat, Jun 29, 2013 at 1:14 AM, Peyman Faratin pey...@robustlinks.com wrote: Hi I have a multicore setup (in 4.3.0). Is it possible for one core to share an instance of its class with other cores at run time? i.e. At run time core 1 makes an instance of object O_i core 1 -- object O_i core 2 --- core n then can core K access O_i? I know they can share properties but is it possible to share objects? thank you -- Regards, Shalin Shekhar Mangar.
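A minimal sketch of the container-level sharing Roman alludes to ("load them outside of the core ... shared class loaders"): a holder class packaged in a jar on the container's shared/ext lib path rather than in any core's lib directory, so every core's classloader resolves the same class and therefore the same static state. The cache type here is only a placeholder for the RAM-expensive object being discussed.

import java.util.concurrent.ConcurrentHashMap;

public final class SharedAppCacheHolder {
    // Loaded by the container's shared classloader (not a per-core lib), so every
    // core that references this class sees the same map instance.
    private static final ConcurrentHashMap<String, Object> CACHE =
            new ConcurrentHashMap<String, Object>();

    private SharedAppCacheHolder() {
    }

    public static ConcurrentHashMap<String, Object> get() {
        return CACHE;
    }
}

A core's custom SearchComponent can then call SharedAppCacheHolder.get() instead of holding its own copy; this sidesteps the SolrResourceLoader parent-classloader change discussed above rather than implementing it.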
Re: Does solr cloud required passwordless ssh?
No, SolrCloud does not currently use ssh. - Mark On Jul 1, 2013, at 12:58 PM, adfel70 adfe...@gmail.com wrote: Hi Does solr cloud on a cluster of servers require passwordless ssh to be configured between the servers? -- View this message in context: http://lucene.472066.n3.nabble.com/Does-solr-cloud-required-passwordless-ssh-tp4074398.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to re-index Solr get term frequency within documents
Thanks Jack, it worked. Could you please provide some info on how to re-index existing data in Solr, after changing the schema.xml? Thanks, Tony
Re: dataconfig to index ZIP Files
IIRC Zip files are not supported On Mon, Jul 1, 2013 at 10:30 PM, ericrs22 ericr...@yahoo.com wrote: To answer the previous Post: I was not sure what datasource=binaryFile I took it from a PDF sample thinking that would help. after setting datasource=null I'm still gett the same errors... dataConfig dataSource type=BinFileDataSource user=svcSolr password=SomePassword / document entity name=Archive processor=FileListEntityProcessor baseDir=E:\ArchiveRoot fileName=.zip$ recursive=true rootEntity=false dataSource=null onError=skip field column=fileSize name=size/ field column=file name=filename/ /entity /document /dataConfig the logs report this: INFO - 2013-07-01 16:45:57.317; org.apache.solr.handler.dataimport.DataImporter; Starting Full Import WARN - 2013-07-01 16:45:57.333; org.apache.solr.handler.dataimport.SimplePropertiesWriter; Unable to read: dataimport.properties -- View this message in context: http://lucene.472066.n3.nabble.com/dataconfig-to-index-ZIP-Files-tp4073965p4074399.html Sent from the Solr - User mailing list archive at Nabble.com. -- - Noble Paul
Re: dataconfig to index ZIP Files
I'm using the Tika plugin to do so and according to http://tika.apache.org/0.5/formats.html it does *ZIP archive (application/zip) Tika uses Java's built-in Zip classes to parse ZIP files. Support for ZIP was added in Tika 0.2.* -- View this message in context: http://lucene.472066.n3.nabble.com/dataconfig-to-index-ZIP-Files-tp4073965p4074421.html Sent from the Solr - User mailing list archive at Nabble.com.
are fields stored or unstored by default xml
In schema.xml I know you can label a field as stored=false or stored=true, but if you say neither, which is it by default? Thank you Katie
Re: are fields stored or unstored by default xml
Haven't tried it recently, but is that even legal? Just be explicit :) Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm On Mon, Jul 1, 2013 at 2:16 PM, Katie McCorkell katiemccork...@gmail.com wrote: In schema.xml I know you can label a field as stored=false or stored=true, but if you say neither, which is it by default? Thank you Katie
Re: How to re-index Solr get term frequency within documents
If all your fields are stored, you can do it with http://search-lucene.com/?q=solrentityprocessor Otherwise, just reindex the same way you indexed in the first place. *Always* be ready to reindex from scratch. Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm
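A sketch of the SolrEntityProcessor route Otis links to (core names and URL are placeholders; it only works if every field you need is stored in the source core):

<dataConfig>
  <document>
    <entity name="reindex" processor="SolrEntityProcessor"
            url="http://localhost:8080/solr/old_core"
            query="*:*" rows="500" fl="*"/>
  </document>
</dataConfig>

Running a full-import with this config on a core that uses the new schema pulls every stored document out of the old core and re-indexes it.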
Re: FileDataSource vs JdbcDataSouce (speed) Solr 3.5
Hey Ahmet / Solr User Group, I tried using the built-in UpdateCSV and it runs A LOT faster than a FileDataSource DIH, as illustrated below. However, I am a bit confused about the numDocs/maxDoc values when doing an import this way. Here's my GET command against a tab-delimited file (I removed server info and additional fields; everything else is the same):

http://server:port/appname/solrcore/update/csv?commit=true&header=false&separator=%09&escape=\&stream.file=/location/of/file/on/server/file.csv&fieldnames=id,otherfields

My response from Solr:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">591</int></lst>
</response>

I am experimenting with 2 csv files (1 with 10 records, the other with 1000) to see if I can get this to run correctly before loading my entire collection of data. I initially loaded the first 1000 records into an empty core and that seemed to work; however, when running the above with the csv file that has 10 records, I would like to see only 10 active records in my core. What I get instead, when looking at my stats page, is: numDocs 1000, maxDoc 1010. If I run the same URL above while appending 'optimize=true', I get: numDocs 1000, maxDoc 1000. Perhaps the commit=true is not doing what it's supposed to, or am I missing something? I also tried passing a commit afterward like this: http://server:port/appname/solrcore/update?stream.body=%3Ccommit/%3E (didn't seem to do anything either). From: Ahmet Arslan iori...@yahoo.com To: solr-user@lucene.apache.org; Mike L. javaone...@yahoo.com Sent: Saturday, June 29, 2013 7:20 AM Subject: Re: FileDataSource vs JdbcDataSouce (speed) Solr 3.5 Hi Mike, You could try http://wiki.apache.org/solr/UpdateCSV And make sure you commit at the very end. From: Mike L. javaone...@yahoo.com To: solr-user@lucene.apache.org Sent: Saturday, June 29, 2013 3:15 AM Subject: FileDataSource vs JdbcDataSouce (speed) Solr 3.5 I've been working on improving index time with a JdbcDataSource DIH based config and found it not to be as performant as I'd hoped for, for various reasons, not specifically due to Solr. With that said, I decided to switch gears a bit and test out a FileDataSource setup... I assumed that by eliminating network latency, I should see drastic improvements in terms of import time... but I'm a bit surprised that this process seems to run much slower, at least the way I've initially coded it (below). The below is a barebones file import that I wrote which consumes a tab-delimited file. Nothing fancy here. The regex just separates out the fields... Is there a faster approach to doing this? If so, what is it? Also, what is the recommended approach in terms of indexing/importing data? I know that may come across as a vague question, as there are various options available, but which one would be considered the standard approach within a production enterprise environment.
(below has been cleansed)

<dataConfig>
  <dataSource name="file" type="FileDataSource" />
  <document>
    <entity name="entity1" processor="LineEntityProcessor"
            url="[location_of_file]/file.csv" dataSource="file"
            transformer="RegexTransformer,TemplateTransformer">
      <field column="rawLine"
             regex="^(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)$"
             groupNames="field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16,field17,field18,field19,field10,field11,field12" />
    </entity>
  </document>
</dataConfig>

Thanks in advance, Mike
Re: Classic 4.2 master-slave replication not completing
is it conceivable that there's too much traffic, causing Solr to stall re-opening the searcher (thus releasing to the new index)? I'm grasping at straws, and this is beginning to bug me a lot. The traffic logs wouldn't seem to support this (apart from periodic health-check pings, the load is distributed fairly evenly across 3 slaves by a load-balancer tool). After 35+ minutes this morning, none of the three successfully unstuck, and had to be manually core-reloaded. Is there perhaps a configuration element I'm overlooking that might make solr a bit less friendly about it, and just dump the searchers/reopen when replication completes? As a side note, I'm getting really frustrated with trying to get log4j logging on 4.3.1 set up; my tomcat container persists in complaining that it cannot find log4j.properties, when I've put it in the WEB-INF/classes of the war file, have SLF4j libraries AND log4j at the shared container lib level, and log4j.debug turned on. I can't find any excuses why it cannot seem to locate the configuration. Any suggestions or pointers would be greatly appreciated. Thanks! On Thu, Jun 27, 2013 at 10:35 AM, Mark Miller markrmil...@gmail.com wrote: Odd - looks like it's stuck waiting to be notified that a new searcher is ready. - Mark On Jun 27, 2013, at 8:58 AM, Neal Ensor nen...@gmail.com wrote: Okay, I have done this (updated to 4.3.1 across master and four slaves; one of these is my own PC for experiments, it is not being accessed by clients). Just had a minor replication this morning, and all three slaves are stuck again. Replication supposedly started at 8:40, ended 30 seconds later or so (on my local PC, set up identically to the other three slaves). The three slaves will NOT complete the roll-over to the new index. All three index folders have a write.lock and latest files are dated 8:40am (now it is 8:54am, with no further activity in the index folders). There exists an index.2013062708461 (or some variation thereof) in all three slaves' data folder. 
The seemingly-relevant thread dump of a snappuller thread on each of these slaves: - sun.misc.Unsafe.park(Native Method) - java.util.concurrent.locks.LockSupport.park(LockSupport.java:156) - java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811) - java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969) - java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281) - java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218) - java.util.concurrent.FutureTask.get(FutureTask.java:83) - org.apache.solr.handler.SnapPuller.openNewWriterAndSearcher(SnapPuller.java:631) - org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:446) - org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:317) - org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:223) - java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) - java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317) - java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150) - java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98) - java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180) - java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204) - java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) - java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) - java.lang.Thread.run(Thread.java:662) Here they sit. My local PC slave replicated very quickly, switched over to the new generation (206) immediately. I am not sure why the three slaves are dragging on this. If there's any configuration elements or other details you need, please let me know. I can manually kick them by reloading the core from the admin pages, but obviously I would like this to be a hands-off process. Any help is greatly appreciated; this has been bugging me for some time now. On Mon, Jun 24, 2013 at 9:34 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: A bunch of replication related issues were fixed in 4.2.1 so you're better off upgrading to 4.2.1 or later (4.3.1 is the latest release). On Mon, Jun 24, 2013 at 6:55 PM, Neal Ensor nen...@gmail.com wrote: As a bit of background, we run a setup (coming from 3.6.1 to 4.2 relatively recently) with a single master receiving updates with three slaves pulling changes in. Our index is around 5 million documents, around 26GB in size total.
Perf. difference when the solr core is 'current' or not 'current'
in Solr's admin statistics page, there is a 'current' flag indicating whether the core index reader is 'current' or not. According to some discussions in this mailing list a few months back, it wouldn't affect anything. But my observation is completely different. When the current flag was not checked for some of the cores ( I have defined 15 cores in total), my median search latency over 48M records was over 190ms, but if every current flag was checked, the median dropped to only 87 ms. Another observation is, restarting solr instance may not necessarily make 'current' flags checked, have to reload cores even after starting solr. Could anybody explain the difference? I am using Datastax Enterprise 3.0.2 Thanks, -- View this message in context: http://lucene.472066.n3.nabble.com/Perf-difference-when-the-solr-core-is-current-or-not-current-tp4074438.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: FileDataSource vs JdbcDataSouce (speed) Solr 3.5
On 7/1/2013 12:56 PM, Mike L. wrote: Hey Ahmet / Solr User Group, I tried using the built in UpdateCSV and it runs A LOT faster than a FileDataSource DIH as illustrated below. However, I am a bit confused about the numDocs/maxDoc values when doing an import this way. Here's my Get command against a Tab delimted file: (I removed server info and additional fields.. everything else is the same) http://server:port/appname/solrcore/update/csv?commit=trueheader=falseseparator=%09escape=\stream.file=/location/of/file/on/server/file.csvfieldnames=id,otherfields My response from solr ?xml version=1.0 encoding=UTF-8? response lst name=responseHeaderint name=status0/intint name=QTime591/int/lst /response I am experimenting with 2 csv files (1 with 10 records, the other with 1000) to see If I can get this to run correctly before running my entire collection of data. I initially loaded the first 1000 records to an empty core and that seemed to work, however, but when running the above with a csv file that has 10 records, I would like to see only 10 active records in my core. What I get instead, when looking at my stats page: numDocs 1000 maxDoc 1010 If I run the same url above while appending an 'optimize=true', I get: numDocs 1000, maxDoc 1000. A discrepancy between numDocs and maxDoc indicates that there are deleted documents in your index. You might already know this, so here's an answer to what I think might be your actual question: If you want to delete the 1000 existing documents before adding the 10 documents, then you have to actually do that deletion. The CSV update handler works at a lower level than the DataImport handler, and doesn't have clean or full-import options, which defaults to clean=true. The DIH is like a full application embedded inside Solr, one that uses an update handler -- it is not itself an update handler. When clean=true or using full-import without a clean option, DIH itself sends a delete all documents update request. If you didn't already know the bit about the deleted documents, then read this: It can be normal for indexing new documents to cause deleted documents. This happens when you have the same value in your UniqueKey field as documents that are already in your index. Solr knows by the config you gave it that they are the same document, so it deletes the old one before adding the new one. Solr has no way to know whether the document it already had or the document you are adding is more current, so it assumes you know what you are doing and takes care of the deletion for you. When you optimize your index, deleted documents are purged, which is why the numbers match there. Thanks, Shawn
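Continuing the example from the earlier message: getting down to just the 10 documents would mean issuing an explicit delete before the CSV load (same placeholder server/core/file names as above):

http://server:port/appname/solrcore/update?stream.body=%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E&commit=false

http://server:port/appname/solrcore/update/csv?commit=true&header=false&separator=%09&escape=\&stream.file=/location/of/file/on/server/file.csv&fieldnames=id,otherfields

The trailing commit=true on the CSV request makes both the deletes and the newly added documents visible in a single searcher reopen.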
Re: are fields stored or unstored by default xml
stored and indexed both default to true. This is legal: field name=alpha type=string / This detail will be in Early Access Release #2 of my book on Friday. -- Jack Krupansky -Original Message- From: Otis Gospodnetic Sent: Monday, July 01, 2013 2:21 PM To: solr-user@lucene.apache.org Subject: Re: are fields stored or unstored by default xml Haven't tried it recently, but is that even legal? Just be explicit :) Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm On Mon, Jul 1, 2013 at 2:16 PM, Katie McCorkell katiemccork...@gmail.com wrote: In schema.xml I know you can label a field as stored=false or stored=true, but if you say neither, which is it by default? Thank you Katie
Re: are fields stored or unstored by default xml
On Mon, Jul 1, 2013 at 3:50 PM, Jack Krupansky j...@basetechnology.com wrote: stored and indexed both default to true. This is legal: field name=alpha type=string / Actually, for fields I believe the defaults come from the fieldType. The fieldType defaults to true for both indexed and stored if they are not specified there. -Yonik http://lucidworks.com
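In schema.xml terms (the field names are made up, but the inheritance is the point):

<fieldType name="string" class="solr.StrField"/>    <!-- no indexed/stored given: both default to true -->
<field name="alpha" type="string"/>                 <!-- inherits indexed="true" stored="true" from the type -->
<field name="beta"  type="string" stored="false"/>  <!-- an explicit attribute on the field overrides the type -->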
Re: Classic 4.2 master-slave replication not completing
On 7/1/2013 1:07 PM, Neal Ensor wrote: is it conceivable that there's too much traffic, causing Solr to stall re-opening the searcher (thus releasing to the new index)? I'm grasping at straws, and this is beginning to bug me a lot. The traffic logs wouldn't seem to support this (apart from periodic health-check pings, the load is distributed fairly evenly across 3 slaves by a load-balancer tool). After 35+ minutes this morning, none of the three successfully unstuck, and had to be manually core-reloaded. Is there perhaps a configuration element I'm overlooking that might make solr a bit less friendly about it, and just dump the searchers/reopen when replication completes? Can you share your solrconfig.xml file, someplace like http://apaste.info? Please be sure to choose the correct file type ... on that website it is (X)HTML for an XML file. As a side note, I'm getting really frustrated with trying to get log4j logging on 4.3.1 set up; my tomcat container persists in complaining that it cannot find log4j.properties, when I've put it in the WEB-INF/classes of the war file, have SLF4j libraries AND log4j at the shared container lib level, and log4j.debug turned on. I can't find any excuses why it cannot seem to locate the configuration. The wiki is still down for maintenance, so below is a relevant section of the SolrLogging wiki page extracted from Google Cache. When it comes back up, you can find it at this URL: http://wiki.apache.org/solr/SolrLogging#Switching_from_Log4J_back_to_JUL_.28java.util.logging.29 = The example logging setup takes over the configuration of Solr logging, which prevents the container from controlling where logs go. Users of containers other than the included Jetty (Tomcat in particular) may be accustomed to doing the logging configuration in the container. If you want to switch back to java.util.logging so this is once again possible, here's what to do. These steps apply to the example/lib/ext directory in the Solr example, or to your container's lib directory as mentioned in the previous section. These steps also assume that the slf4j version is 1.6.6, which comes with Solr4.3. Newer versions may use a different slf4j version. As of May 2013, you can use a newer SLF4J version with no trouble, but be aware that all slf4j components in your classpath must be the same version. Download slf4j version 1.6.6 (the version used in Solr4.3.x). http://www.slf4j.org/dist/slf4j-1.6.6.zip Unpack the slf4j archive. Delete these JARs from your lib folder: slf4j-log4j12-1.6.6.jar, jul-to-slf4j-1.6.6.jar, log4j-1.2.16.jar Add these JARs to your lib folder (from slf4j zip): slf4j-jdk14-1.6.6.jar, log4j-over-slf4j-1.6.6.jar Use your old logging.properties = Thanks, Shawn
Re: are fields stored or unstored by default xml
Correct - the field definitions inherit the attributes of the field type, and it is the field type that has the actual default values for indexed and stored (and other attributes.) -- Jack Krupansky -Original Message- From: Yonik Seeley Sent: Monday, July 01, 2013 3:56 PM To: solr-user@lucene.apache.org Subject: Re: are fields stored or unstored by default xml On Mon, Jul 1, 2013 at 3:50 PM, Jack Krupansky j...@basetechnology.com wrote: stored and indexed both default to true. This is legal: field name=alpha type=string / Actually, for fields I believe the defaults come from the fieldType. The fieldType defaults to true for both indexed and stored if they are not specified there. -Yonik http://lucidworks.com
Re: How to re-index Solr get term frequency within documents
Or, go with a commercial product that has a single-click Solr re-index capability, such as: 1. DataStax Enterprise - data is stored in Cassandra and reindexed into Solr from there. 2. LucidWorks Search - data sources are declared so that the package can automatically re-crawl the data sources. But, yeah, as Otis says, re-index is really just a euphemism for deleting your Solr data directory and indexing from scratch from the original data sources. -- Jack Krupansky
Using per-segment FieldCache or DocValues in custom component?
I have some custom code that uses the top-level FieldCache (e.g., FieldCache.DEFAULT.getLongs(reader, foobar, false)). I'd like to redesign this to use the per-segment FieldCaches so that re-opening a Searcher is fast(er). In most cases, I've got a docId and I want to get the value for a particular single-valued field for that doc. Is there a good place to look to see example code of per-segment FieldCache use? I've been looking at PerSegmentSingleValuedFaceting, but hoping there might be something less confusing :) Also thinking DocValues might be a better way to go for me... is there any documentation or example code for that? -Michael
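A sketch of the per-segment version of that lookup (assuming the Lucene 4.2+ API, where FieldCache.getLongs returns a Longs holder; the field name is kept from the message above):

import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.ReaderUtil;
import org.apache.lucene.search.FieldCache;

public class PerSegmentLongLookup {
    // Map a top-level docId onto its segment, then read the value from that
    // segment's FieldCache entry instead of a whole-index cache.
    public static long lookup(IndexReader topReader, int docId, String field) throws IOException {
        List<AtomicReaderContext> leaves = topReader.leaves();
        int idx = ReaderUtil.subIndex(docId, leaves);
        AtomicReaderContext leaf = leaves.get(idx);
        FieldCache.Longs values = FieldCache.DEFAULT.getLongs(leaf.reader(), field, false);
        return values.get(docId - leaf.docBase);
    }
}

Because the cache entries are keyed per segment, a reopened searcher only has to populate entries for segments that are actually new. With docValues="true" on the field, leaf.reader().getNumericDocValues(field) gives similar per-segment access without the FieldCache.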
Re: Improving performance to return 2000+ documents
Thanks Erick/Jagdish. Just to give some background on my queries. 1. All my queries are unique. A query can be: ipod and ipod 8gb (but these are unique). These are about 1.2M in total. So, I assume setting a high queryResultCache, queryResultWindowSize and queryResultMaxDocsCached won't help. 2. I have this cache settings: documentCache class=solr.LRUCache size=1 initialSize=1 autowarmCount=0 cleanupThread=true/ //My understanding is, documentCache will help me the most because solr will cache documents retrieved. //Stats for documentCache: http://apaste.info/hknh queryResultCache class=solr.LRUCache size=512 initialSize=512 autowarmCount=0 cleanupThread=true/ //Default, since my queries are unique. filterCache class=solr.FastLRUCache size=512 initialSize=512 autowarmCount=0/ //Now sure how can I use filterCache, so I am keeping it as the default enableLazyFieldLoadingtrue/enableLazyFieldLoading queryResultWindowSize100/queryResultWindowSize queryResultMaxDocsCached100/queryResultMaxDocsCached I think the question can also be framed as: How can I optimize solr response time for 50M product catalog for unique queries which retrieves 2000 documents in one go. I looked at a solr search component, I think writing a proxy around solr was easier, so I went ahead with this approach. Thanks, -Utkarsh On Sun, Jun 30, 2013 at 6:54 PM, Jagdish Nomula jagd...@simplyhired.comwrote: Solrconfig.xml has got entries which you can tweak for your use case. One of them is queryresultwindowsize. You can try using the value of 2000 and see if it helps improving performance. Please make sure you have enough memory allocated for queryresultcache. A combination of sharding and distribution of workload(requesting 2000/number of shards) with an aggregator would be a good way to maximize performance. Thanks, Jagdish On Sun, Jun 30, 2013 at 6:48 PM, Erick Erickson erickerick...@gmail.com wrote: 50M documents, depending on a bunch of things, may not be unreasonable for a single node, only testing will tell. But the question I have is whether you should be using standard Solr queries for this or building a custom component that goes at the base Lucene index and does the right thing. Or even re-indexing your entire corpus periodically to add this kind of data. FWIW, Erick On Sun, Jun 30, 2013 at 2:00 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Thanks Erick/Peter. This is an offline process, used by a relevancy engine implemented around solr. The engine computes boost scores for related keywords based on clickstream data. i.e.: say clickstream has: ipad=upc1,upc2,upc3 I query solr with keyword: ipad (to get 2000 documents) and then make 3 individual queries for upc1,upc2,upc3 (which are fast). The data is then used to compute related keywords to ipad with their boost values. So, I cannot really replace that, since I need full text search over my dataset to retrieve top 2000 documents. I tried paging: I retrieve 500 solr documents 4 times (0-500, 500-1000...), but don't see any improvements. Some questions: 1. Maybe the JVM size might help? This is what I see in the dashboard: Physical Memory 76.2% Swap Space NaN% (don't have any swap space, running on AWS EBS) File Descriptor Count 4.7% JVM-Memory 73.8% Screenshot: http://i.imgur.com/aegKzP6.png 2. Will reducing the shards from 3 to 1 improve performance? (maybe increase the RAM from 30 to 60GB) The problem I will face in that case will be fitting 50M documents on 1 machine. 
Thanks, -Utkarsh On Sat, Jun 29, 2013 at 3:58 PM, Peter Sturge peter.stu...@gmail.com wrote: Hello Utkarsh, This may or may not be relevant for your use-case, but the way we deal with this scenario is to retrieve the top N documents 5,10,20or100 at a time (user selectable). We can then page the results, changing the start parameter to return the next set. This allows us to 'retrieve' millions of documents - we just do it at the user's leisure, rather than make them wait for the whole lot in one go. This works well because users very rarely want to see ALL 2000 (or whatever number) documents at one time - it's simply too much to take in at one time. If your use-case involves an automated or offline procedure (e.g. running a report or some data-mining op), then presumably it doesn't matter so much it takes a bit longer (as long as it returns in some reasonble time). Have you looked at doing paging on the client-side - this will hugely speed-up your search time. HTH Peter On Sat, Jun 29, 2013 at 6:17 PM, Erick
Disable Document Id from being printed in the logs...
Hi all, I noticed that for Solr 4.2, when an internal call is made between two nodes Solr uses the list of matching document ids to fetch the document details. At this time, it prints out all matching document ids as a part of the query. Is there a way to suppress these log statements from being created? Thanks. Niran
Re: full-import failed after 5 hours with Exception: ORA-01555: snapshot too old: rollback segment number with name too small ORA-22924: snapshot too old
I would say definitely investigate the performance of the query, but also since you're using CachedSqlEntityProcessor, you might want to back off on the transaction isolation to READ_COMMITTED, which I think is the lowest one that Oracle supports: http://wiki.apache.org/solr/DataImportHandler#Configuring_JdbcDataSource Michael Della Bitta Applications Developer o: +1 646 532 3062 | c: +1 917 477 7906 appinions inc. “The Science of Influence Marketing” 18 East 41st Street New York, NY 10017 t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions w: appinions.com http://www.appinions.com/ On Fri, Jun 28, 2013 at 2:52 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, I'd go talk to the DBA. How long does this query take if you run it directly against Oracle? How long if you run it locally vs. from a remove server (like Solr is in relation to your Oracle server(s)). What happens if you increase batchSize? Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm On Thu, Jun 27, 2013 at 6:41 PM, srinalluri nallurisr...@yahoo.com wrote: Hello, I am using Solr 4.3.2 and Oracle DB. The sub entity is using CachedSqlEntityProcessor. The dataSource is having batchSize=500. The full-import is failed with 'ORA-01555: snapshot too old: rollback segment number with name too small ORA-22924: snapshot too old' Exception after 5 hours. We already increased the undo space 4 times at the database end. Number of records in the jan_story table is 800,000 only. Tomcat is with 4GB JVM memory. Following is the entity (there are other sub-entities, I didn't mention them here. As the import failed with article_details entity. article_details is the first sub-entity) entity name=par8-article-testingprod dataSource=par8_prod pk=VCMID preImportDeleteQuery=content_type:article AND repository:par8qatestingprod query=select ID as VCMID from jan_story entity name=article_details dataSource=par8_prod transformer=TemplateTransformer,ClobTransformer,RegexTransformer query=select bb.recordid, aa.ID as DID,aa.STORY_TITLE, aa.STORY_HEADLINE, aa.SOURCE, aa.DECK, regexp_replace(aa.body, '\p\\[(pullquote|summary)\]\/p\|\[video [0-9]+?\]|\[youtube .+?\]', '') as BODY, aa.PUBLISHED_DATE, aa.MODIFIED_DATE, aa.DATELINE, aa.REPORTER_NAME, aa.TICKER_CODES,aa.ADVERTORIAL_CONTENT from jan_story aa,mapp bb where aa.id=bb.keystring1 cacheKey=DID cacheLookup=par8-article-testingprod.VCMID processor=CachedSqlEntityProcessor field column=content_type template=article / field column=RECORDID name=native_id / field column=repository template=par8qatestingprod / field column=STORY_TITLE name=title / field column=DECK name=description clob=true / field column=PUBLISHED_DATE name=date / field column=MODIFIED_DATE name=last_modified_date / field column=BODY name=body clob=true / field column=SOURCE name=source / field column=DATELINE name=dateline / field column=STORY_HEADLINE name=export_headline / /entity /entity The full-import without CachedSqlEntityProcessor is taking 7 days. That is why I am doing all this. -- View this message in context: http://lucene.472066.n3.nabble.com/full-import-failed-after-5-hours-with-Exception-ORA-01555-snapshot-too-old-rollback-segment-number-wd-tp4073822.html Sent from the Solr - User mailing list archive at Nabble.com.
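In data-config terms, the isolation Michael suggests (and the batchSize already in use) are set as attributes on the JdbcDataSource element; a sketch with made-up Oracle connection details:

<dataSource name="par8_prod" type="JdbcDataSource"
            driver="oracle.jdbc.OracleDriver"
            url="jdbc:oracle:thin:@//dbhost:1521/ORCL"
            user="scott" password="tiger"
            batchSize="500"
            transactionIsolation="TRANSACTION_READ_COMMITTED"/>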
Re: Disable Document Id from being printed in the logs...
On 7/1/2013 3:24 PM, Niran Fajemisin wrote:

I noticed that for Solr 4.2, when an internal call is made between two nodes Solr uses the list of matching document ids to fetch the document details. At this time, it prints out all matching document ids as a part of the query. Is there a way to suppress these log statements from being created?

There's no way for Solr to distinguish between requests made by another Solr core and requests made by real clients. Paying attention to the IP address where the request originated won't work either - a lot of Solr installations run on the same hardware as the web server or other application that *uses* Solr. Debugging a problem becomes very difficult if you come up with *ANY* way to stop logging these requests.

That said, on newer versions the parameter 'distrib=false' should be included on those requests that you don't want to log, so an option to turn off logging of non-distributed requests might be a reasonable idea. I think you'll run into some resistance, but as long as it doesn't default to enabled, it might be something that could be added.

If you are worried about performance, update the logging configuration so that Solr only logs at WARN; that way no requests will be logged. If you then need to debug, you can change the logging to INFO using the admin UI, get your debugging done, and then turn it back down to WARN. This is the best logging approach from a performance perspective.

Thanks,
Shawn
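For illustration, Shawn's WARN suggestion amounts to a one-line change in the log4j.properties that ships with the Solr 4.x example. The appender names below assume the stock example config; adjust them to whatever your own configuration defines:

# Only warnings and errors are logged by default, so requests are not logged.
# Temporarily raise the root logger (or just org.apache.solr.core.SolrCore)
# back to INFO from the admin UI's Logging > Level screen when debugging.
log4j.rootLogger=WARN, file, CONSOLE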
Re: dataconfig to index ZIP Files
Not sure if this will help any. Here's the verbose log:

INFO - 2013-07-01 23:17:08.632; org.apache.solr.handler.dataimport.DataImporter; Loading DIH Configuration: tika-data-config.xml
INFO - 2013-07-01 23:17:08.648; org.apache.solr.handler.dataimport.DataImporter; Data Configuration loaded successfully
INFO - 2013-07-01 23:17:08.663; org.apache.solr.core.SolrCore; [tika] webapp=/solr path=/dataimport params={optimize=false&clean=false&indent=true&commit=false&verbose=true&entity=Archive&command=full-import&debug=false&wt=json} status=0 QTime=31
INFO - 2013-07-01 23:17:08.663; org.apache.solr.handler.dataimport.DataImporter; Starting Full Import
INFO - 2013-07-01 23:17:08.679; org.apache.solr.core.SolrCore; [tika] webapp=/solr path=/dataimport params={indent=true&command=status&_=1372720628679&wt=json} status=0 QTime=0
INFO - 2013-07-01 23:17:08.679; org.apache.solr.handler.dataimport.SimplePropertiesWriter; Read dataimport.properties
INFO - 2013-07-01 23:17:09.552; org.apache.solr.core.SolrCore; [tika] webapp=/solr path=/dataimport params={indent=true&command=status&_=1372720629552&wt=json} status=0 QTime=0
INFO - 2013-07-01 23:17:11.580; org.apache.solr.core.SolrCore; [tika] webapp=/solr path=/dataimport params={indent=true&command=status&_=1372720631577&wt=json} status=0 QTime=0
INFO - 2013-07-01 23:17:13.593; org.apache.solr.core.SolrCore; [tika] webapp=/solr path=/dataimport params={indent=true&command=status&_=1372720633593&wt=json} status=0 QTime=0
INFO - 2013-07-01 23:17:15.247; org.apache.solr.handler.dataimport.DocBuilder; Time taken = 0:0:6.553
INFO - 2013-07-01 23:17:15.247; org.apache.solr.update.processor.LogUpdateProcessor; [tika] webapp=/solr path=/dataimport params={optimize=false&clean=false&indent=true&commit=false&verbose=true&entity=Archive&command=full-import&debug=false&wt=json} status=0 QTime=31 {} 0 31
INFO - 2013-07-01 23:17:15.621; org.apache.solr.core.SolrCore; [tika] webapp=/solr path=/dataimport params={indent=true&command=status&_=1372720635621&wt=json} status=0 QTime=0
INFO - 2013-07-01 23:17:17.259; org.apache.solr.core.SolrCore; [tika] webapp=/solr path=/dataimport params={indent=true&command=status&_=1372720637256&wt=json} status=0 QTime=0
INFO - 2013-07-01 23:17:17.649; org.apache.solr.core.SolrCore; [tika] webapp=/solr path=/dataimport params={indent=true&command=status&_=1372720637645&wt=json} status=0 QTime=0

--
View this message in context: http://lucene.472066.n3.nabble.com/dataconfig-to-index-ZIP-Files-tp4073965p4074498.html
Sent from the Solr - User mailing list archive at Nabble.com.
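The log shows the full-import of the Archive entity completing in about 6.5 seconds without reporting any added documents, which usually means the entity never found any files or rows to process. The poster's tika-data-config.xml isn't shown in the thread, but a typical shape for picking up files from disk and feeding them to Tika looks roughly like the sketch below; baseDir, fileName, and the field mapping are illustrative assumptions, and whether Tika unpacks the individual entries inside a zip depends on the Tika parsers in use:

<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <entity name="Archive" processor="FileListEntityProcessor"
          baseDir="/data/zips" fileName=".*\.zip" recursive="true"
          rootEntity="false" dataSource="null">
    <entity name="doc" processor="TikaEntityProcessor"
            url="${Archive.fileAbsolutePath}" format="text" dataSource="bin">
      <field column="text" name="text"/>
    </entity>
  </entity>
</dataConfig>

If the baseDir or fileName pattern doesn't match anything, the import ends quickly and quietly exactly as in the log above, so that is worth checking first.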
Re: Converting nested data model to solr schema
On Mon, Jul 1, 2013 at 5:56 PM, adfel70 adfe...@gmail.com wrote:

This requires me to override the solr document distribution mechanism. I fear that with this solution I may lose some of solr cloud's capabilities.

It's not clear whether you are aware of http://searchhub.org/2013/06/13/solr-cloud-document-routing/, but what you did doesn't sound scary to me. If it works, it should be fine. I'm not aware of any capabilities that you are going to lose. Obviously SOLR-3076 provides astonishing query-time performance by offloading the actual join work to index time. Check it out if your current approach turns out to be slow.

--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
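For context, the document routing described in that post relies on SolrCloud's compositeId router (the default when a collection is created with numShards): if related documents share a key prefix separated by '!' in their uniqueKey, they hash to the same shard, so joins and grouping over them can be answered shard-locally. A hypothetical example with made-up ids and field names:

<add>
  <doc>
    <field name="id">order42!header</field>
    <field name="type">order</field>
  </doc>
  <doc>
    <field name="id">order42!line7</field>
    <field name="type">line_item</field>
    <field name="order_id">order42</field>
  </doc>
</add>

Both documents carry the 'order42!' prefix, so they are routed to the same shard without any custom distribution code.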
Re: Schema design for parent child field
From my experience, deeply nested scopes are almost exclusively a case for SOLR-3076.

On Sat, Jun 29, 2013 at 1:08 PM, Sperrink kevin.sperr...@lexisnexis.co.za wrote:

Good day,

I'm seeking some guidance on how best to represent the following data within a solr schema. I have a list of subjects which are detailed to n levels. Each document can contain many of these subject entities. As I see it, if this had been just 1 subject per document, dynamic fields would have been a good solution. Any suggestions on how best to create this structure in a denormalised fashion while maintaining the data integrity?

For example a document could have:
Subject level 1: contract
Subject level 2: claims
Subject level 1: patent
Subject level 2: counter claims

If I were to search for level 1 contract, I would only want the facet count for level 2 to contain claims and not counter claims. Any assistance in this would be much appreciated.

--
View this message in context: http://lucene.472066.n3.nabble.com/Schema-design-for-parent-child-field-tp4074084.html
Sent from the Solr - User mailing list archive at Nabble.com.

--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
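One common denormalised option for this kind of two-level faceting, offered here purely as an illustrative assumption rather than something proposed in the thread, is to index the full subject path into a single multiValued string field and constrain level 2 with facet.prefix. The field and path format below are made up for the example:

<!-- schema.xml: one path value per subject pair, e.g. "contract/claims" -->
<field name="subject_path" type="string" indexed="true" stored="true" multiValued="true"/>

The example document would hold subject_path values "contract/claims" and "patent/counter claims". Faceting level 2 under "contract" then becomes:

facet=true&facet.field=subject_path&facet.prefix=contract/

which returns a count for contract/claims but not for patent/counter claims, preserving the pairing between the levels.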