old searchers not closing after optimize or replication
Hello list, we have the problem that old searchers often do not close after an optimize (on the master) or replication (on the slaves), and we therefore end up with huge index volumes. The only solution so far is to stop and start Solr, which cleans everything up successfully, but that can only be a workaround. Is the parameter waitSearcher=false an option to solve this? Any hints on what to check or debug? We use Apache Solr 3.1.0 on Linux. Regards Bernd
How could each core share configuration files
Hi all, currently in my project most of the core configurations are the same (solrconfig.xml, dataimport.properties, ...), but they are duplicated in each core's own folder. I am wondering how I could put the common files in one folder that every core shares, while keeping the files that differ in each core's own folder. Thanks Kun
Re: How could each core share configuration files
Perhaps this could help : http://lucene.472066.n3.nabble.com/Shared-conf-td2787771.html#a2789447 Ludovic. 2011/4/20 kun xiong [via Lucene] ml-node+2841801-1701787156-383...@n3.nabble.com [...] - Jouve France. -- View this message in context: http://lucene.472066.n3.nabble.com/How-could-each-core-share-configuration-files-tp2841801p2841875.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Custom Sorting
OK, thank you for the discussion. As I suspected, it's not possible within performance limits. I think the way to go is to record some more stats at index time and use them in boost queries. :) Thanks Mike

Date: Tue, 19 Apr 2011 15:12:00 -0400 Subject: Re: Custom Sorting From: erickerick...@gmail.com To: solr-user@lucene.apache.org

As I understand it, sorting by field is what the caches are all about. You have a big list in memory of all of the terms for a field, indexed by Lucene doc ID, so fetching the term to compare by doc ID is fast. That is also why the caches need to be warmed, and why sort fields should be single-valued. If you try to do this yourself and fetch data from each document, you can incur a huge performance hit, since you'll be seeking all over your disk... Score is special, though, since it's transient. Internally, all Lucene has to do is keep track of the top N scores encountered, where N is something like start + queryResultWindowSize (the latter from solrconfig.xml), with no seeks to disk at all... Best Erick

On Tue, Apr 19, 2011 at 2:50 PM, Jonathan Rochkind rochk...@jhu.edu wrote: On 4/19/2011 1:43 PM, Jan Høydahl wrote: Hi, Not possible :) Lucene compares each matching document against the query and produces a score for each. Documents are not compared to each other like in a normal sort; that would be way too costly. That might be true for sorting by 'score' (although even if you have all the scores, it still seems like some kind of sort must be necessary to see which comes first), but when you sort by a field value, which is also possible, Lucene must be doing some kind of 'normal sort' algorithm, no? Ah, I guess it could just be using each term's position in the index, which is available in constant time, always kept track of in an index? Maybe, I don't know?
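A quick illustration of Erick's point about scores (not Lucene's actual code, just the idea): collecting the top N of a stream of scores needs only a bounded min-heap, never a full sort and never a disk seek:

```python
import heapq

def top_n(scores, n):
    """Keep only the n best scores seen so far, as Lucene does for start + rows hits."""
    heap = []  # min-heap holding at most n scores; heap[0] is the worst of the best
    for s in scores:
        if len(heap) < n:
            heapq.heappush(heap, s)
        elif s > heap[0]:
            heapq.heapreplace(heap, s)  # evict the current worst
    return sorted(heap, reverse=True)

print(top_n([0.3, 0.9, 0.1, 0.7, 0.5], 3))  # -> [0.9, 0.7, 0.5]
```

Each incoming score is compared only against the current worst of the kept set, which is why tracking the top N is cheap while a full field sort needs the warmed caches described above.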
Re: TikaEntityProcessor
Hi, I asked that :) I didn't get that; what dependencies? I am using Solr 1.4 and Tika 0.9. I replaced tika-core 0.9 and tika-parsers 0.9 at /contrib/extraction/lib and also replaced the old version of dataimporthandler-extras with apache-solr-dataimporthandler-extras-3.1.0.jar, but still the same problem. Someone pointed me to bug SOLR-2116, but I guess it is only for Solr 3.1. -- View this message in context: http://lucene.472066.n3.nabble.com/TikaEntityProcessor-tp2839188p2841936.html Sent from the Solr - User mailing list archive at Nabble.com.
Selecting (and sorting!) by the min/max value from multiple fields
Hello, the short question is this: is there a way for a search to return a field that is not defined in the schema but is the minimum/maximum value of several (int/float) fields in a SolrDocument? (And what would that search look like?)

Longer explanation: I have products, and each of them can have several prices (price for cash, price for credit card, coupon price and so on); not every product has all the price options. (Don't ask why - that's the use case :) )

  <field name="priceCash" type="tfloat" indexed="true" stored="true" />
  <field name="priceCreditCard" type="tfloat" indexed="true" stored="true" />
  <field name="priceCoupon" type="tfloat" indexed="true" stored="true" />
  (+2 more)

Is there a way to ask: give me the products containing, for example, 'sony', and in the results return the minimal price of all possible prices (for each product) and SORT the results by that (minimal) price? I know I can calculate the minimal price at import/index time and store it in one separate field, but the idea is that users will have checkboxes in which they could say: I'm only interested in products that have priceCreditCard and priceCoupon, show me the smaller of those two and sort by that value. My idea is something like this:

  ?q=sony&minPrice:min(priceCash,priceCreditCard,priceCoupon...)

(the field minPrice is not defined in the schema but should be returned in the results). For searching, this actually doesn't represent a problem, as I can easily compare the prices programmatically and present them to the user. The problem is sorting - I could also do that programmatically, but that would mean I'd have to pull out all the results the query returned (which can of course be quite big) and then sort them, so that's an option I would naturally like to avoid. I don't know if I'm asking too much of Solr :) but I can see the usefulness of something like this in examples other than mine. I hope the question is clear; if I'm going about things completely the wrong way, please advise in the right direction. (If there is a similar question asked somewhere else, please redirect me - I didn't find it.) Help much appreciated! Josip -- View this message in context: http://lucene.472066.n3.nabble.com/Selecting-and-sorting-by-the-min-max-value-from-multiple-fields-tp2841944p2841944.html Sent from the Solr - User mailing list archive at Nabble.com.
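For what it's worth, the index-time workaround mentioned above can be sketched like this (a client-side sketch; the field names follow the schema in the question, and min_price is a hypothetical helper, not a Solr feature). Its limitation is exactly the one described: a per-user choice of price fields would need one precomputed field per combination of fields.

```python
def min_price(doc, fields):
    """Smallest value among the price fields actually present in the document."""
    present = [doc[f] for f in fields if f in doc]
    return min(present) if present else None

# compute a derived minPrice field before sending the document for indexing
doc = {"name": "sony tv", "priceCash": 499.0, "priceCoupon": 450.0}
doc["minPrice"] = min_price(doc, ["priceCash", "priceCreditCard", "priceCoupon"])
print(doc["minPrice"])  # -> 450.0
```

The derived minPrice is then an ordinary stored/indexed tfloat that Solr can sort on with a plain sort=minPrice asc.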
Re: Selecting (and sorting!) by the min/max value from multiple fields
Hello, have you tried reading http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function ? From that page I would try something like:

  http://host:port/solr/select?q=sony&sort=min(min(priceCash,priceCreditCard),priceCoupon)+asc&rows=10&indent=on&debugQuery=on

Is that of any help? -- Tanguy

On 04/20/2011 09:41 AM, jmaslac wrote: [...]
Saravanan Chinnadurai/Actionimages is out of the office.
I will be out of the office starting 20/04/2011 and will not return until 21/04/2011. Please email itsta...@actionimages.com for any urgent issues.
RE: How could each core share configuration files
I just use soft-links... Ephraim Ofir

-----Original Message----- From: lboutros [mailto:boutr...@gmail.com] Sent: Wednesday, April 20, 2011 10:09 AM To: solr-user@lucene.apache.org Subject: Re: How could each core share configuration files [...]
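Ephraim's soft-link approach can be sketched as follows (the directories here are temporary stand-ins for a real shared conf/ and a core's conf/; on a real install you would link, e.g., cores/core1/conf/solrconfig.xml to the shared copy):

```python
import os
import tempfile

# stand-ins for the shared conf directory and one core's conf directory
shared = tempfile.mkdtemp()
core1_conf = tempfile.mkdtemp()

# one master copy of the common file...
with open(os.path.join(shared, "solrconfig.xml"), "w") as f:
    f.write("<config/>")

# ...and a soft link from the core's conf directory pointing at it
os.symlink(os.path.join(shared, "solrconfig.xml"),
           os.path.join(core1_conf, "solrconfig.xml"))

# the core reads the shared file through the link
with open(os.path.join(core1_conf, "solrconfig.xml")) as f:
    print(f.read())  # -> <config/>
```

Each core keeps its own folder for the files that differ, while every solrconfig.xml link resolves to the single shared copy, so an edit to the shared file is seen by all cores.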
Re: Selecting (and sorting!) by the min/max value from multiple fields
Tanguy, thanks for the answer. Yes, I have already tried that, but the problem is that the min() function is not yet available (it is slated for Solr 3.2). :( By the way, in my original post I asked whether the query could return a new field with this computed minimal value in the results - that part is redundant, I'm only interested in the sorting part of the question.

Tanguy Moal wrote: [...] -- View this message in context: http://lucene.472066.n3.nabble.com/Selecting-and-sorting-by-the-min-max-value-from-multiple-fields-tp2841944p2842232.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: KStemmer for Solr 3.x +
Seems like it isn't. In my installation (1.4.1) I used LucidKStemFilterFactory, and when switching the solr.war file to the 3.1 one I get:

14:42:31.664 ERROR [pool-1-thread-1]: java.lang.AbstractMethodError: org.apache.lucene.analysis.TokenStream.incrementToken()Z
  at org.apache.lucene.analysis.CachingTokenFilter.fillCache(CachingTokenFilter.java:78)
  at org.apache.lucene.analysis.CachingTokenFilter.incrementToken(CachingTokenFilter.java:50)
  at org.apache.lucene.queryParser.QueryParser.getFieldQuery(QueryParser.java:606)
  at org.apache.solr.search.SolrQueryParser.getFieldQuery(SolrQueryParser.java:151)
  at org.apache.lucene.queryParser.QueryParser.Term(QueryParser.java:1421)
  at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1309)
  at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1237)
  at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1226)
  at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:206)
  at org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:80)
  at org.apache.solr.search.QParser.getQuery(QParser.java:142)
  at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:84)
  at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:173)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
  at org.apache.solr.core.QuerySenderListener.newSearcher(QuerySenderListener.java:52)
  at org.apache.solr.core.SolrCore$3.call(SolrCore.java:1169)
  at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
  at java.util.concurrent.FutureTask.run(Unknown Source)
  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
  at java.lang.Thread.run(Unknown Source)

when the config is:

  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!--<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>-->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="old_stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" preserveOriginal="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnNumerics="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!--<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>-->
    <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory" protected="protwords.txt"/>
  </analyzer>

Anybody familiar with this issue?

On Sat, Apr 9, 2011 at 7:00 AM, David Smiley (@MITRE.org) dsmi...@mitre.org wrote: I see no reason why it would not be compatible. - Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
Re: old searchers not closing after optimize or replication
Does this persist? In other words, if you just watch it for some time, does the disk usage go back to normal? It's typical for your index size to spike temporarily after the operations you describe, as new searchers are warmed up; during that interval, both the old and new searchers are open. Look particularly at your warmup time on the Solr admin page: it should give you an indication of how long your warmup takes, and a clue about when you should expect the index size to drop again. How often do you optimize on the master and replicate on the slave? You may be getting into the runaway-warmup problem, where a new searcher is opened before the last one has finished autowarming, spiraling out of control. Hope that helps, Erick

On Wed, Apr 20, 2011 at 2:36 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: [...]
Re: old searchers not closing after optimize or replication
Hi Erick,

Am 20.04.2011 13:56, schrieb Erick Erickson:
> Does this persist? In other words, if you just watch it for some time, does the disk usage go back to normal?

Only after restarting the whole Solr does the disk usage go back to normal.

> Because it's typical that your index size will temporarily spike after the operations you describe as new searchers are warmed up. During that interval, both the old and new searchers are open.

Temporarily yes, but still a couple of hours after the optimize or replication?

> Look particularly at your warmup time in the Solr admin page, that should give you an indication of how long it takes your warmup to happen and give you a clue about when you should expect the index sizes to drop again.

We have newSearcher and firstSearcher (both with 2 simple queries) and
  <useColdSearcher>false</useColdSearcher>
  <maxWarmingSearchers>2</maxWarmingSearchers>
The QTime is less than 500 (0.5 seconds). warmupTime=0 for all autowarming searchers.

> How often do you optimize on the master and replicate on the slave? Because you may be getting into the runaway warmup problem where a new searcher is opened before the last one is autowarmed and spiraling out of control.

We commit new content about every hour and do an optimize once a day. So replication is also once a day, after the optimize has finished and the system has settled down. No commits during optimize and replication. Any further hints?
Re: old searchers not closing after optimize or replication
Hmmm, this isn't right. You've pretty much eliminated the obvious things. What does lsof show? I'm assuming it shows the files being held open by your Solr instance, but it's worth checking. I'm not getting the same behavior, admittedly on a Windows box. The only other thing I can think of is that you have a query that somehow never ends, but that's grasping at straws. Do your log files show anything interesting? Best Erick@NotMuchHelpIKnow

On Wed, Apr 20, 2011 at 8:37 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: [...]
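For reference, one way to answer Erick's lsof question on a Linux box (the pid is a placeholder for the Solr JVM's process id, and the path depends on your install):

```
# index files the Solr JVM still holds open
lsof -p <solr-pid> | grep data/index

# or list files that are deleted on disk but still held open by some process
lsof +L1
```

If old segment files show up under +L1, something still holds a reference to the old searcher; if they show up in neither listing but remain on disk (as reported below), the deletion policy never removed them in the first place.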
Solr - Multi Term highlighting issue
Hello, I am dealing with a highlighting issue in Solr; I will try to explain it. When I search for a single term, Solr wraps an <em> tag around the words I want to highlight, and all works well. But if I search for multiple terms, highlighting works for the most part, and then for some of the terms the highlighter returns multiple terms in a single <em> tag:

  <em>srchtrm1) <br><b><p> srchtrm2</em>

I expect Solr to return highlighted terms like:

  <em>srchtrm1</em>) <br><b><p>... <em>srchtrm2</em>

When I search for 'US mec chile', here is how my result appears:

  Corboba. (<em>MEC)</b></p><p></p><p><b>CHILE</em>/FOREST FIRES: We had ... with <em>US</em> and <em>Chile</em> ..., (<em>MEC)</b></p><p></p><p><b>US</em>

This is what I was expecting it to be:

  Corboba. (<em>MEC</em>)</b></p><p></p><p><b><em>CHILE</em>/FOREST FIRES: We had ... with <em>US</em> and <em>Chile</em> ..., (<em>MEC</em>)</b></p><p></p><p><b><em>US</em>

Here are my query params:

  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">26</int>
    <lst name="params">
      <str name="hl.fragsize">10</str>
      <str name="explainOther"/>
      <str name="indent">on</str>
      <str name="hl.fl">story, slug</str>
      <str name="wt">standard</str>
      <str name="hl">on</str>
      <str name="rows">10</str>
      <str name="version">2.2</str>
      <str name="hl.highlightMultiTerm">true</str>
      <str name="fl">*</str>
      <str name="start">0</str>
      <str name="q">mec us chile</str>
      <str name="qt">standard</str>
      <str name="hl.usePhraseHighlighter">true</str>
      <str name="fq">storyid= X</str>
    </lst>
  </lst>

Here are some other links I found in the forum, but with no real conclusion:
http://www.lucidimagination.com/search/document/ac64e4f0abb6e4fc/solr_highlighting_question#78163c42a67cb533
I am going to try this patch, which also had no conclusive results:
https://issues.apache.org/jira/browse/SOLR-1394
Has anyone come across this issue? Any suggestions on how to fix it are much appreciated. Thanks, regards, Rajesh Ramana
Re: old searchers not closing after optimize or replication
Hi Erick,

Am 20.04.2011 15:42, schrieb Erick Erickson:
> Hmmm, this isn't right. You've pretty much eliminated the obvious things. What does lsof show? I'm assuming it shows the files are being held open by your Solr instance, but it's worth checking.

Just committed new content 3 times and finally optimized. Again old index files are left over. Then I checked on my master: only the newest version of the index files is listed by lsof. There are no file handles to the old index files, but the old index files remain in data/index/. That's strange. This time replication worked fine and cleaned up the old index on the slaves.

> I'm not getting the same behavior, admittedly on a Windows box. The only other thing I can think of is that you have a query that's somehow never ending, but that's grasping at straws. Do your log files show anything interesting?

Let's see:
- it has the old generation (generation=12) and its files
- and recognizes that there have been several commits (generation=18)

20.04.2011 14:05:26 org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)
20.04.2011 14:05:26 org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=2
commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_c,version=1302159868435,generation=12,filenames=[_3xm.nrm, _3xm.fdx, segments_c, _3xm.fnm, _3xm.fdt, _3xm.tis, _3xm.tii, _3xm.prx, _3xm.frq]
commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_i,version=1302159868447,generation=18,filenames=[_3xm.nrm, _3xo.tis, _3xp.prx, _3xo.fnm, _3xp.fdx, _3xs.frq, _3xo.tii, _3xp.fdt, _3xn.tii, _3xm.fdx, _3xn.nrm, _3xm.fdt, _3xs.prx, _3xn.tis, _3xn.fdx, _3xr.nrm, _3xm.prx, _3xn.fdt, _3xp.tii, _3xs.nrm, _3xp.tis, _3xo.prx, segments_i, _3xm.tii, _3xq.tii, _3xs.fdx, _3xs.fdt, _3xo.frq, _3xn.prx, _3xm.tis, _3xr.prx, _3xq.tis, _3xo.fdt, _3xp.frq, _3xq.fnm, _3xo.fdx, _3xp.fnm, _3xr.tis, _3xr.fnm, _3xq.frq, _3xr.tii, _3xr.frq, _3xo.nrm, _3xs.tii, _3xq.fdx, _3xq.fdt, _3xp.nrm, _3xq.prx, _3xs.tis, _3xm.frq, _3xr.fdx, _3xm.fnm, _3xn.frq, _3xq.nrm, _3xs.fnm, _3xn.fnm, _3xr.fdt]
20.04.2011 14:05:26 org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1302159868447

- after 44 minutes of optimizing (over 140GB and 27.8 million docs) it gets the SolrDeletionPolicy onCommit and has the new generation 19 listed:

20.04.2011 14:49:25 org.apache.solr.core.SolrDeletionPolicy onCommit
INFO: SolrDeletionPolicy.onCommit: commits:num=3
commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_c,version=1302159868435,generation=12,filenames=[_3xm.nrm, _3xm.fdx, segments_c, _3xm.fnm, _3xm.fdt, _3xm.tis, _3xm.tii, _3xm.prx, _3xm.frq]
commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_i,version=1302159868447,generation=18,filenames=[_3xm.nrm, _3xo.tis, _3xp.prx, _3xo.fnm, _3xp.fdx, _3xs.frq, _3xo.tii, _3xp.fdt, _3xn.tii, _3xm.fdx, _3xn.nrm, _3xm.fdt, _3xs.prx, _3xn.tis, _3xn.fdx, _3xr.nrm, _3xm.prx, _3xn.fdt, _3xp.tii, _3xs.nrm, _3xp.tis, _3xo.prx, segments_i, _3xm.tii, _3xq.tii, _3xs.fdx, _3xs.fdt, _3xo.frq, _3xn.prx, _3xm.tis, _3xr.prx, _3xq.tis, _3xo.fdt, _3xp.frq, _3xq.fnm, _3xo.fdx, _3xp.fnm, _3xr.tis, _3xr.fnm, _3xq.frq, _3xr.tii, _3xr.frq, _3xo.nrm, _3xs.tii, _3xq.fdx, _3xq.fdt, _3xp.nrm, _3xq.prx, _3xs.tis, _3xm.frq, _3xr.fdx, _3xm.fnm, _3xn.frq, _3xq.nrm, _3xs.fnm, _3xn.fnm, _3xr.fdt]
commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_j,version=1302159868449,generation=19,filenames=[_3xt.fnm, _3xt.nrm, _3xt.frq, _3xt.fdt, _3xt.tis, _3xt.fdx, segments_j, _3xt.prx, _3xt.tii]
20.04.2011 14:49:25 org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1302159868449

- it starts a new searcher and warms it up
- it sends SolrIndexSearcher close

20.04.2011 14:49:29 org.apache.solr.search.SolrIndexSearcher init
INFO: Opening Searcher@2c37425f main
20.04.2011 14:49:29 org.apache.solr.update.DirectUpdateHandler2 commit
INFO: end_commit_flush
20.04.2011 14:49:29 org.apache.solr.search.SolrIndexSearcher warm
...
20.04.2011 14:49:29 org.apache.solr.core.QuerySenderListener newSearcher
INFO: QuerySenderListener sending requests to Searcher@2c37425f main
20.04.2011 14:49:29 org.apache.solr.core.SolrCore execute
INFO: [] webapp=null path=null params={facet=true&start=0&event=newSearcher&q=solr&facet.limit=100&facet.field=f_dcyear&rows=10} hits=96 status=0 QTime=816
20.04.2011 14:49:30 org.apache.solr.core.SolrCore execute
INFO: [] webapp=null path=null params={facet=true&start=0&event=newSearcher&q=*:*&facet.limit=100&facet.field=f_dcyear&rows=10} hits=27826100 status=0 QTime=633
20.04.2011 14:49:30 org.apache.solr.core.QuerySenderListener newSearcher
INFO: QuerySenderListener done.
20.04.2011 14:49:30 org.apache.solr.core.SolrCore registerSearcher
INFO: [] Registered new searcher Searcher@2c37425f main
20.04.2011
Re: TikaEntityProcessor
I went unsuccessfully down this path - too many incompatibilities among versions; some code changes and recompiling were required. See also the thread "Solr 1.4.1 and Tika 0.9 - some tests not passing" for the remaining issues. You'll have better luck with the newer Solr 3.1 release, which already uses Tika 0.8 - still recompiled from source (no changes as far as I remember). I never tried the library replacement; I don't think it's possible. Andreas

From: firdous_kind86 naturelov...@gmail.com To: solr-user@lucene.apache.org Sent: Wed, April 20, 2011 12:38:02 AM Subject: Re: TikaEntityProcessor [...]
Re: TikaEntityProcessor
After reading this post I hoped I could get it working, but I haven't had any success in almost a week: http://lucene.472066.n3.nabble.com/TikaEntityProcessor-not-working-td856965.html#a867572
Multiple Tags and Facets
Hello, I watched an online video with Chris Hostetter from Lucid Imagination. He showed the possibility of having some facets that exclude *all* filters, while also having some facets that take some of the set filters into account while ignoring other filters. Unfortunately the webinar did not explain how they did this, and I wasn't able to give a filter/facet more than one tag. Here is an example. Facets and filters: DocType, Author.

Facets:
- Author
-- George (10)
-- Brian (12)
-- Christian (78)
-- Julia (2)
- Doctype
-- PDF (70)
-- ODT (10)
-- Word (20)
-- JPEG (1)
-- PNG (1)

When clicking on Julia I would like to achieve the following:
- Author
-- George (10)
-- Brian (12)
-- Christian (78)
-- Julia (2)
Julia's Doctypes:
-- JPEG (1)
-- PNG (1)
- Doctype
-- PDF (70)
-- ODT (10)
-- Word (20)
-- JPEG (1)
-- PNG (1)

Another example, which adds special options to your GUI, could be the following. Imagine a fashion store. If you search for shirt you get a color facet:
colors:
- red (19)
- green (12)
- blue (4)
- black (2)
as well as a brand facet:
brands:
- puma (18)
- nike (19)
When I click on the red color facet, I would like to get the following back:
colors:
- red (19)
- green (12)*
- blue (4)*
- black (2)*
brands:
- puma (18)*
- nike (19)
All the filters marked with an * could be displayed half-transparent or so - they just show the user that those filter options exist for his/her search but aren't included in the result set, since he/she excluded them by clicking the red filter. This case is more interesting if not all red shirts were from Nike. This way you can show the user that e.g. 8 of 19 red shirts are from the brand you selected / you see 8 of 19 red shirts. I hope I explained what I want to achieve. Thank you!
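What is being described sounds like Solr's tagging and excluding of filters (LocalParams, available since Solr 1.4): you tag the fq, then exclude that tag when computing a facet. A minimal sketch for the shirt example, assuming field names `color` and `brand` (a filter query can also carry several comma-separated tags):

```
q=shirt&facet=true
  &fq={!tag=colorSel}color:red
  &facet.field={!ex=colorSel}color
  &facet.field=brand
```

With this, the color facet is computed as if the color:red filter were not applied (so green/blue/black keep their counts), while everything else still respects the filter.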
Re: old searchers not closing after optimize or replication
It looks OK, but still doesn't explain keeping the old files around. What does your deletionPolicy in your solrconfig.xml look like? It's possible that you're seeing Solr attempt to keep around several optimized copies of the index, but that still doesn't explain why restarting Solr removes them, unless the deletionPolicy gets invoked at some point and your index files are aging out (I don't know the internals of deletion well enough to say). About optimization: it's become less important with recent code. Once upon a time it made a substantial difference in search speed. More recently it has very little impact on search speed and is used much more sparingly. Its greatest benefit is reclaiming unused resources left over from deleted documents. So you might want to avoid the pain of optimizing (44 minutes!) and only optimize rarely, or if you have deleted a lot of documents. It might be worthwhile to try (with a smaller index!) a bunch of optimize cycles and see if the deletionPolicy idea has any merit. I'd expect your index to reach a maximum and stay there once the maximum number of saved copies of the index is reached... But otherwise I'm puzzled... Erick On Wed, Apr 20, 2011 at 10:30 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: Hi Erick, Am 20.04.2011 15:42, schrieb Erick Erickson: H, this isn't right. You've pretty much eliminated the obvious things. What does lsof show? I'm assuming it shows the files are being held open by your Solr instance, but it's worth checking. Just committed new content 3 times and finally optimized. Again having old index files left. Then checked on my master: only the newest version of index files are listed with lsof. No file handles to the old index files, but the old index files remain in data/index/. That's strange. This time replication worked fine and cleaned up the old index on the slaves. I'm not getting the same behavior, admittedly on a Windows box.
The only other thing I can think of is that you have a query that's somehow never ending, but that's grasping at straws. Do your log files show anything interesting? Let's see:
- it has the old generation (generation=12) and its files
- and recognizes that there have been several commits (generation=18)

20.04.2011 14:05:26 org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)
20.04.2011 14:05:26 org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=2
commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_c,version=1302159868435,generation=12,filenames=[_3xm.nrm, _3xm.fdx, segments_c, _3xm.fnm, _3xm.fdt, _3xm.tis, _3xm.tii, _3xm.prx, _3xm.frq]
commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_i,version=1302159868447,generation=18,filenames=[_3xm.nrm, _3xo.tis, _3xp.prx, _3xo.fnm, _3xp.fdx, _3xs.frq, _3xo.tii, _3xp.fdt, _3xn.tii, _3xm.fdx, _3xn.nrm, _3xm.fdt, _3xs.prx, _3xn.tis, _3xn.fdx, _3xr.nrm, _3xm.prx, _3xn.fdt, _3xp.tii, _3xs.nrm, _3xp.tis, _3xo.prx, segments_i, _3xm.tii, _3xq.tii, _3xs.fdx, _3xs.fdt, _3xo.frq, _3xn.prx, _3xm.tis, _3xr.prx, _3xq.tis, _3xo.fdt, _3xp.frq, _3xq.fnm, _3xo.fdx, _3xp.fnm, _3xr.tis, _3xr.fnm, _3xq.frq, _3xr.tii, _3xr.frq, _3xo.nrm, _3xs.tii, _3xq.fdx, _3xq.fdt, _3xp.nrm, _3xq.prx, _3xs.tis, _3xm.frq, _3xr.fdx, _3xm.fnm, _3xn.frq, _3xq.nrm, _3xs.fnm, _3xn.fnm, _3xr.fdt]
20.04.2011 14:05:26 org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1302159868447

- after 44 minutes of optimizing (over 140 GB and 27.8 million docs) it gets the SolrDeletionPolicy onCommit and has the new generation 19 listed.
20.04.2011 14:49:25 org.apache.solr.core.SolrDeletionPolicy onCommit
INFO: SolrDeletionPolicy.onCommit: commits:num=3
commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_c,version=1302159868435,generation=12,filenames=[_3xm.nrm, _3xm.fdx, segments_c, _3xm.fnm, _3xm.fdt, _3xm.tis, _3xm.tii, _3xm.prx, _3xm.frq]
commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_i,version=1302159868447,generation=18,filenames=[_3xm.nrm, _3xo.tis, _3xp.prx, _3xo.fnm, _3xp.fdx, _3xs.frq, _3xo.tii, _3xp.fdt, _3xn.tii, _3xm.fdx, _3xn.nrm, _3xm.fdt, _3xs.prx, _3xn.tis, _3xn.fdx, _3xr.nrm, _3xm.prx, _3xn.fdt, _3xp.tii, _3xs.nrm, _3xp.tis, _3xo.prx, segments_i, _3xm.tii, _3xq.tii, _3xs.fdx, _3xs.fdt, _3xo.frq, _3xn.prx, _3xm.tis, _3xr.prx, _3xq.tis, _3xo.fdt, _3xp.frq, _3xq.fnm, _3xo.fdx, _3xp.fnm, _3xr.tis, _3xr.fnm, _3xq.frq, _3xr.tii, _3xr.frq, _3xo.nrm, _3xs.tii, _3xq.fdx, _3xq.fdt, _3xp.nrm, _3xq.prx, _3xs.tis, _3xm.frq, _3xr.fdx, _3xm.fnm, _3xn.frq, _3xq.nrm, _3xs.fnm, _3xn.fnm, _3xr.fdt]
commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_j,version=1302159868449,generation=19,filenames=[_3xt.fnm, _3xt.nrm, _3xt.frq, _3xt.fdt, _3xt.tis, _3xt.fdx, segments_j, _3xt.prx, _3xt.tii]
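For reference (not from the original thread): the deletionPolicy Erick is asking about lives in solrconfig.xml. A typical configuration that keeps only a single commit point looks roughly like the sketch below; if maxCommitsToKeep or maxOptimizedCommitsToKeep has been raised, old index generations will legitimately stay on disk until they age out.

```xml
<deletionPolicy class="solr.SolrDeletionPolicy">
  <!-- number of commit points to keep around -->
  <str name="maxCommitsToKeep">1</str>
  <!-- number of optimized commit points to keep around -->
  <str name="maxOptimizedCommitsToKeep">0</str>
</deletionPolicy>
```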
Re: Solr - Multi Term highlighting issue
Does your configuration have hl.mergeContiguous set to true by any chance? And what happens if you explicitly set it to false on your query? Best Erick On Wed, Apr 20, 2011 at 9:43 AM, Ramanathapuram, Rajesh rajesh.ramanathapu...@turner.com wrote: Hello, I am dealing with a highlighting issue in SOLR; I will try to explain the issue. When I search for a single term in solr, it wraps an em tag around the words I want to highlight, and all works well. But if I search for multiple terms, for the most part highlighting works well, and then for some of the terms the highlighter returns multiple terms in a single em tag ...
<em>srchtrm1) <br><b><p> srchtrm2</em>
I expect solr to return highlight terms like ...
<em>srchtrm1</em>) <br><b><p>... <em>srchtrm2</em>
When I search for 'US mec chile', here is how my result appears ...
Corboba. (<em>MEC)</b></p><p></p><p><b>CHILE</em>/FOREST FIRES: We had ... with <em>US</em> and <em>Chile</em> ..., (<em>MEC)</b></p><p></p><p><b>US</em>
This is what I was expecting ...
Corboba. (<em>MEC</em>)</b></p><p></p><p><b><em>CHILE</em>/FOREST FIRES: We had ... with <em>US</em> and <em>Chile</em> ..., (<em>MEC</em>)</b></p><p></p><p><b><em>US</em>
Here are my query params:
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">26</int>
<lst name="params">
<str name="hl.fragsize">10</str>
<str name="explainOther"/>
<str name="indent">on</str>
<str name="hl.fl">story, slug</str>
<str name="wt">standard</str>
<str name="hl">on</str>
<str name="rows">10</str>
<str name="version">2.2</str>
<str name="hl.highlightMultiTerm">true</str>
<str name="fl">*</str>
<str name="start">0</str>
<str name="q">mec us chile</str>
<str name="qt">standard</str>
<str name="hl.usePhraseHighlighter">true</str>
<str name="fq">storyid= X</str>
</lst>
</lst>
Here are some other links I found in the forum, but no real conclusion: http://www.lucidimagination.com/search/document/ac64e4f0abb6e4fc/solr_highlighting_question#78163c42a67cb533 I am going to try this patch, which also had no conclusive results: https://issues.apache.org/jira/browse/SOLR-1394 Has anyone come across this issue? Any suggestions on how to fix this issue are much appreciated.
thanks regards, Rajesh Ramana
HTMLStripCharFilterFactory, highlighting and InvalidTokenOffsetsException
Hi all, I'm getting the following exception when using highlighting for a field that uses HTMLStripCharFilterFactory: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token ... exceeds length of provided text sized 21. It seems this is a known issue: https://issues.apache.org/jira/browse/LUCENE-2208 Does anyone know if there's a fix implemented yet in solr? thanks! -robert
Re: Creating a TrieDateField (and other Trie fields) from Lucene Java
On Tue, Apr 19, 2011 at 11:17 PM, Craig Stires craig.sti...@gmail.com wrote: The barrier I have is that I need to build this offline (without using a solr server, solrconfig.xml, or schema.xml) This is pretty unusual... can you share your use case? Solr can also be run in embedded mode if you can't run a stand-alone server for some reason. -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
Re: HTMLStripCharFilterFactory, highlighting and InvalidTokenOffsetsException
Hi, there is a proposed patch uploaded to the issue. Maybe you can help by reviewing/testing it? 2011/4/20 Robert Gründler rob...@dubture.com: Hi all, I'm getting the following exception when using highlighting for a field that uses HTMLStripCharFilterFactory: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token ... exceeds length of provided text sized 21. It seems this is a known issue: https://issues.apache.org/jira/browse/LUCENE-2208 Does anyone know if there's a fix implemented yet in solr? thanks! -robert
stemming filter analyzers, any favorites?
Stemming filter analyzers... anyone have any favorites for particular search domains? Just wondering what people are using. I'm using the Lucid KStemmer and having issues. It seems like it misses a lot of common stems. We went to it because of excessively loose matches with solr.PorterStemFilterFactory. I understand KStemmer is a dictionary-based stemmer. It seems to me like it is missing a lot of common stem reductions, i.e. Bags does not match Bag in our searches. Here is my analyzer stack:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- The LucidKStemmer currently requires a lowercase filter somewhere before it. -->
    <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="query_synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- The LucidKStemmer currently requires a lowercase filter somewhere before it. -->
    <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
Re: stemming filter analyzers, any favorites?
You can get a better sense of exactly what transformations occur when if you look at the analysis page (be sure to check the verbose checkbox). I'm surprised that bags doesn't match bag; what does the analysis page say? Best Erick On Wed, Apr 20, 2011 at 1:44 PM, Robert Petersen rober...@buy.com wrote: Stemming filter analyzers... anyone have any favorites for particular search domains? Just wondering what people are using. I'm using the Lucid KStemmer and having issues. It seems like it misses a lot of common stems. We went to it because of excessively loose matches with solr.PorterStemFilterFactory. I understand KStemmer is a dictionary-based stemmer. It seems to me like it is missing a lot of common stem reductions, i.e. Bags does not match Bag in our searches. Here is my analyzer stack:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- The LucidKStemmer currently requires a lowercase filter somewhere before it. -->
    <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="query_synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- The LucidKStemmer currently requires a lowercase filter somewhere before it. -->
    <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
Bug in solr.KeywordMarkerFilterFactory?
I've just started experimenting with the solr.KeywordMarkerFilterFactory in Solr 3.1, and I'm seeing some strange behavior. It seems that every word subsequent to a protected word is also treated as being protected. For testing purposes, I have put the word spelling in my protwords.txt. If I do a test for spelling bees in the analyze tool, the stemmer produces spelling bees - nothing is stemmed. But if I do a test for bees spelling, I get bee spelling, the expected result with bees stemmed but spelling left unstemmed. I have tried extended examples - in every case I tried, all of the words prior to spelling get stemmed, but none of the words after spelling get stemmed. When turning on the verbose mode of the analyze tool, I can see that the settings of the keyword attribute introduced by solr.KeywordMarkerFilterFactory correspond with the stemming behavior... so I think the solr.KeywordMarkerFilterFactory component is to blame, and not anything later in the analysis chain. Any idea what might be going wrong? Is this a known issue?
Here is my field type definition, in case it makes a difference:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

thanks, Demian
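As an illustration of the expected (non-sticky) semantics described above -- the keyword flag should be decided per token and must never carry over to the tokens that follow -- here is a toy model in plain Python. It is not the real Solr filter chain; the one-rule stemmer is a stand-in for SnowballPorter.

```python
# Toy model of KeywordMarkerFilter + stemmer semantics: every token is
# checked against the protected set independently, so a protected word
# cannot shield the words that come after it.
def analyze(tokens, protected, stem):
    out = []
    for tok in tokens:
        # protected tokens bypass the stemmer; everything else is stemmed
        out.append(tok if tok in protected else stem(tok))
    return out

# stand-in stemmer: just strips one trailing "s"
def strip_s(word):
    return word[:-1] if word.endswith("s") else word
```

With protected = {"spelling"}, both orderings should stem bees to bee; the reported bug was that spelling bees left bees untouched.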
Re: Bug in solr.KeywordMarkerFilterFactory?
On Wed, Apr 20, 2011 at 2:01 PM, Demian Katz demian.k...@villanova.edu wrote: I've just started experimenting with the solr.KeywordMarkerFilterFactory in Solr 3.1, and I'm seeing some strange behavior. It seems that every word subsequent to a protected word is also treated as being protected. You're right! This was broken by LUCENE-2901 back in Jan. I've opened this issue: https://issues.apache.org/jira/browse/LUCENE-3039 The easiest short-term workaround for you would probably be to create a custom filter that looks like KeywordMarkerFilter before the LUCENE-2901 change. -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
RE: Solr - Multi Term highlighting issue
Thanks Erick. I tried your suggestion; the issue still exists.

http://localhost:8983/searchsolr/mainCore/select?indent=on&version=2.2&q=mec+us+chile&fq=storyid%3DXXX%22&start=0&rows=10&fl=*&qt=standard&wt=standard&explainOther=&hl=on&hl.fl=story%2C+slug&hl.fragsize=10&hl.highlightMultiTerm=true&hl.usePhraseHighlighter=true&hl.mergeContiguous=false

<lst name="params">
<str name="hl.fragsize">10</str>
<str name="explainOther"/>
<str name="indent">on</str>
<str name="hl.mergeContiguous">false</str>
...
Corboba. (<em>MEC)</b></p><p></p><p><b>CHILE</em>/FOREST FIRES ...

thanks regards, Rajesh Ramana

-Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Wednesday, April 20, 2011 11:59 AM To: solr-user@lucene.apache.org Subject: Re: Solr - Multi Term highlighting issue Does your configuration have hl.mergeContiguous set to true by any chance? And what happens if you explicitly set it to false on your query? Best Erick On Wed, Apr 20, 2011 at 9:43 AM, Ramanathapuram, Rajesh rajesh.ramanathapu...@turner.com wrote: Hello, I am dealing with a highlighting issue in SOLR; I will try to explain the issue. When I search for a single term in solr, it wraps an em tag around the words I want to highlight, and all works well. But if I search for multiple terms, for the most part highlighting works well, and then for some of the terms the highlighter returns multiple terms in a single em tag ...
<em>srchtrm1) <br><b><p> srchtrm2</em>
I expect solr to return highlight terms like ...
<em>srchtrm1</em>) <br><b><p>... <em>srchtrm2</em>
When I search for 'US mec chile', here is how my result appears ...
Corboba. (<em>MEC)</b></p><p></p><p><b>CHILE</em>/FOREST FIRES: We had ... with <em>US</em> and <em>Chile</em> ..., (<em>MEC)</b></p><p></p><p><b>US</em>
This is what I was expecting ...
Corboba. (<em>MEC</em>)</b></p><p></p><p><b><em>CHILE</em>/FOREST FIRES: We had ... with <em>US</em> and <em>Chile</em> ..., (<em>MEC</em>)</b></p><p></p><p><b><em>US</em>
Here are my query params:
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">26</int>
<lst name="params">
<str name="hl.fragsize">10</str>
<str name="explainOther"/>
<str name="indent">on</str>
<str name="hl.fl">story, slug</str>
<str name="wt">standard</str>
<str name="hl">on</str>
<str name="rows">10</str>
<str name="version">2.2</str>
<str name="hl.highlightMultiTerm">true</str>
<str name="fl">*</str>
<str name="start">0</str>
<str name="q">mec us chile</str>
<str name="qt">standard</str>
<str name="hl.usePhraseHighlighter">true</str>
<str name="fq">storyid= X</str>
</lst>
</lst>
Here are some other links I found in the forum, but no real conclusion: http://www.lucidimagination.com/search/document/ac64e4f0abb6e4fc/solr_highlighting_question#78163c42a67cb533 I am going to try this patch, which also had no conclusive results: https://issues.apache.org/jira/browse/SOLR-1394 Has anyone come across this issue? Any suggestions on how to fix this issue are much appreciated. thanks regards, Rajesh Ramana
Re: Bug in solr.KeywordMarkerFilterFactory?
No, this is only a bug in analysis.jsp. You can see this by comparing analysis.jsp's dontstems bees to using the query debug interface:

<lst name="debug">
<str name="rawquerystring">dontstems bees</str>
<str name="querystring">dontstems bees</str>
<str name="parsedquery">PhraseQuery(text:"dontstems bee")</str>
<str name="parsedquery_toString">text:"dontstems bee"</str>
</lst>

On Wed, Apr 20, 2011 at 2:43 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Wed, Apr 20, 2011 at 2:01 PM, Demian Katz demian.k...@villanova.edu wrote: I've just started experimenting with the solr.KeywordMarkerFilterFactory in Solr 3.1, and I'm seeing some strange behavior. It seems that every word subsequent to a protected word is also treated as being protected. You're right! This was broken by LUCENE-2901 back in Jan. I've opened this issue: https://issues.apache.org/jira/browse/LUCENE-3039 The easiest short-term workaround for you would probably be to create a custom filter that looks like KeywordMarkerFilter before the LUCENE-2901 change. -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
RE: Bug in solr.KeywordMarkerFilterFactory?
That's good news -- thanks for the help (not to mention the reassurance that Solr itself is actually working right)! Hopefully 3.1.1 won't be too far off, though; when the analysis tool lies, life can get very confusing! :-) - Demian -Original Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Wednesday, April 20, 2011 2:54 PM To: solr-user@lucene.apache.org; yo...@lucidimagination.com Subject: Re: Bug in solr.KeywordMarkerFilterFactory? No, this is only a bug in analysis.jsp. You can see this by comparing analysis.jsp's dontstems bees to using the query debug interface:

<lst name="debug">
<str name="rawquerystring">dontstems bees</str>
<str name="querystring">dontstems bees</str>
<str name="parsedquery">PhraseQuery(text:"dontstems bee")</str>
<str name="parsedquery_toString">text:"dontstems bee"</str>
</lst>

On Wed, Apr 20, 2011 at 2:43 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Wed, Apr 20, 2011 at 2:01 PM, Demian Katz demian.k...@villanova.edu wrote: I've just started experimenting with the solr.KeywordMarkerFilterFactory in Solr 3.1, and I'm seeing some strange behavior. It seems that every word subsequent to a protected word is also treated as being protected. You're right! This was broken by LUCENE-2901 back in Jan. I've opened this issue: https://issues.apache.org/jira/browse/LUCENE-3039 The easiest short-term workaround for you would probably be to create a custom filter that looks like KeywordMarkerFilter before the LUCENE-2901 change. -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
Re: ConcurrentLRUCache$Stats error
: https://issues.apache.org/jira/browse/SOLR-1797 that issue doesn't seem to have anything to do with the stack trace reported... : SEVERE: java.util.concurrent.ExecutionException: : java.lang.NoSuchMethodError: : org.apache.solr.common.util.ConcurrentLRUCache$Stats.add(Lorg/apache/solr/common/util/ConcurrentLRUCache$Stats;)V NoSuchMethodError means that one compiled java class expects another compiled java class to have a method that it does not actually have -- this typically happens when you have inconsistent class files (or jars) in your classpath, i.e. you most likely have a mix of jars from two different versions of solr/lucene. -Hoss
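A quick way to spot the mixed classpath Hoss describes is to list every jar under the install directory and flag any artifact that appears in more than one version. A small sketch (the root path is whatever your deployment uses; the version-parsing regex is a simplifying assumption):

```python
import os
import re
from collections import defaultdict

def duplicate_artifacts(root):
    """Return {artifact: [versions]} for jars present in more than one version."""
    versions = defaultdict(set)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            # e.g. "lucene-core-2.9.1.jar" -> ("lucene-core", "2.9.1")
            m = re.match(r"(.+?)-(\d[\w.]*)\.jar$", name)
            if m:
                versions[m.group(1)].add(m.group(2))
    return {a: sorted(v) for a, v in versions.items() if len(v) > 1}
```

Any non-empty result (say, lucene-core in both 2.9.x and 3.1.0) is a likely source of NoSuchMethodError.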
RE: stemming filter analyzers, any favorites?
I have been doing that, and for the Bags example the trailing 's' is not being removed by the KStemmer, so if you index the word bags and search on bag you get no matches. Why wouldn't the trailing 's' get stemmed off? KStemmer is dictionary based, so bags isn't in the dictionary? That trailing 's' should always be dropped, no? That seems like it would be better; we don't want to make synonyms for basic use cases like this. I fear I will have to return to the Porter stemmer. Are there other better ones, is my main question. Off-topic secondary question: sometimes I am puzzled by the output of the analysis page. It seems like there should be a match, but I don't get the results during a search that I'd expect... Like in the case where the WordDelimiterFilterFactory splits up a term into a bunch of terms before the KStemmer is applied: sometimes if the matching term is in position two of the final analysis, but the searcher had the partial term alone and thereby in position 1 in the analysis stack, then when searching there wasn't a match. Am I reading this correctly? Is that right, or should that match and I am misreading my analysis output? Thanks! Robi PS I have a category named Bags and am catching flack for it not coming up in a search for bag. hah PPS the term is not in protwords.txt

com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory {protected=protwords.txt}
term position: 1
term text: bags
term type: word
source start,end: 0,4
payload:

-Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Wednesday, April 20, 2011 10:55 AM To: solr-user@lucene.apache.org Subject: Re: stemming filter analyzers, any favorites? You can get a better sense of exactly what transformations occur when if you look at the analysis page (be sure to check the verbose checkbox). I'm surprised that bags doesn't match bag; what does the analysis page say?
Best Erick On Wed, Apr 20, 2011 at 1:44 PM, Robert Petersen rober...@buy.com wrote: Stemming filter analyzers... anyone have any favorites for particular search domains? Just wondering what people are using. I'm using the Lucid KStemmer and having issues. It seems like it misses a lot of common stems. We went to it because of excessively loose matches with solr.PorterStemFilterFactory. I understand KStemmer is a dictionary-based stemmer. It seems to me like it is missing a lot of common stem reductions, i.e. Bags does not match Bag in our searches. Here is my analyzer stack:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- The LucidKStemmer currently requires a lowercase filter somewhere before it. -->
    <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="query_synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- The LucidKStemmer currently requires a lowercase filter somewhere before it. -->
    <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
entity name issue
Hi guys, I have encountered a problem with an entity name; see the data config code below. The variable '${ea.a_aid}' was always empty. I suspect it is a namespace issue. Anyone know how to bypass it? This is on an Oracle database. I had to use the prefix myschema., otherwise the table name was not recognized. A similar setup worked on another database without adding a prefix to the table names. Thanks in advance!

<entity name="e_a" query="select myschema.table_a.aid as id, myschema.table_a.aid as a_aid from myschema.table_a where '${dataimporter.request.clean}' != 'false' and myschema.table_a.aid${dataimporter.request.aid}">
  <entity name="e_b" query="select col as c_col from myschema.table_b where myschema.table_b.aid='${ea.a_aid}'"/>
</entity>
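For comparison (an observation, not from the thread): DataImportHandler resolves `${...}` variables in a child entity against the parent entity's name attribute. With the parent named e_a, the child query would normally reference `${e_a.a_aid}`; if the mismatch between the entity name (e_a) and the variable prefix (ea) above is not intentional, that alone would leave the variable empty. A minimal sketch of the usual shape, with the table and column names kept from the post:

```xml
<entity name="e_a" query="select myschema.table_a.aid as a_aid from myschema.table_a">
  <!-- variable prefix matches the parent entity's name attribute -->
  <entity name="e_b" query="select col as c_col from myschema.table_b
                            where myschema.table_b.aid = '${e_a.a_aid}'"/>
</entity>
```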
Highest frequency terms for a subset of documents
Hi, I am looking for the best way to find the terms with the highest frequency for a given subset of documents (terms in the text field). My first thought was to do a facet count search, where the query defines the subset of documents and the facet.field is the text field. This gives me the result, but it is very, very slow. These are my params:

<lst name="params">
<str name="facet">true</str>
<str name="facet.offset">0</str>
<str name="facet.mincount">3</str>
<str name="indent">on</str>
<str name="facet.limit">500</str>
<str name="facet.method">enum</str>
<str name="wt">xml</str>
<str name="rows">0</str>
<str name="version">2.2</str>
<str name="facet.sort">count</str>
<str name="q">in_subset:1</str>
<str name="facet.field">text</str>
</lst>

The index contains 7M documents; the subset is about 200K. A simple query for the subset takes around 100ms, but the facet search takes 40s. Am I doing something wrong? If facet search is not the correct approach, I thought about using something like org.apache.lucene.misc.HighFreqTerms, but I'm not sure how to do this in solr. Should I implement a request handler that executes this kind of code? thanks for any help
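For what it's worth, the facet request above computes a per-term document count over the matching subset, sorted descending. A toy model of that same contract in Python (facet.limit and facet.mincount analogues as parameters; this illustrates the semantics, not Solr's implementation):

```python
from collections import Counter

def top_terms(docs, subset_ids, limit=500, mincount=3):
    """docs: {doc_id: [tokens]}; returns [(term, doc_count)] like facet counts."""
    counts = Counter()
    for doc_id in subset_ids:
        # set() so each document contributes at most once per term
        counts.update(set(docs[doc_id]))
    return [(t, c) for t, c in counts.most_common() if c >= mincount][:limit]
```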
RE: Highest frequency terms for a subset of documents
I think faceting is probably the best way to do that, indeed. It might be slow, but it's kind of set up for exactly that case, I can't imagine any other technique being faster -- there's stuff that has to be done to look up the info you want. BUT, I see your problem: don't use facet.method=enum. Use facet.method=fc. Works a LOT better for very high arity fields (lots and lots of unique values) like you have. I bet you'll see significant speed-up if you use facet.method=fc instead, hopefully fast enough to be workable. With facet.method=enum, I would have indeed predicted it would be horribly slow, before solr 1.4 when facet.method=fc became available, it was nearly impossible to facet on very high arity fields, facet.method=fc is the magic. I think facet.method=fc is even the default in Solr 1.4+, if you hadn't explicitly set it to enum instead! Jonathan From: Ofer Fort [ofer...@gmail.com] Sent: Wednesday, April 20, 2011 6:49 PM To: solr-user@lucene.apache.org Subject: Highest frequency terms for a subset of documents Hi, I am looking for the best way to find the terms with the highest frequency for a given subset of documents. (terms in the text field) My first thought was to do a count facet search , where the query defines the subset of documents and the facet.field is the text field, this gives me the result but it is very very slow. These are my params: str name=facettrue/str str name=facet.offset0/str str name=facet.mincount3/str str name=indenton/str str name=facet.limit500/str str name=facet.methodenum/str str name=wtxml/str str name=rows0/str str name=version2.2/str str name=facet.sortcount/str str name=qin_subset:1/str str name=facet.fieldtext/str /lst The index contains 7M documents, the subset is about 200K. A simple query for the subset takes around 100ms, but the facet search takes 40s. Am i doing something wrong? 
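Jonathan's distinction between the two facet methods can be sketched in miniature. The following is a toy Python model, not Solr code, and all data in it is made up: facet.method=enum intersects every unique term's posting list with the query's doc set, while facet.method=fc walks only the matching documents through an uninverted view of the field -- which is why fc wins when the field has a huge number of unique terms but the subset is comparatively small.

```python
# Toy model of Solr's two faceting strategies. All data is invented.
from collections import Counter

postings = {               # term -> set of doc ids (inverted index)
    "apache": {1, 2, 3, 5},
    "solr":   {1, 3},
    "lucene": {2, 4, 5},
}

# "Uninverted" view: doc id -> terms, which is what facet.method=fc walks.
uninverted = {}
for term, docs in postings.items():
    for d in docs:
        uninverted.setdefault(d, []).append(term)

subset = {1, 2, 3}         # docs matching the query

def facet_enum(postings, subset):
    # One intersection per unique term: cost grows with the number of
    # unique terms in the field, regardless of subset size.
    return Counter({t: len(d & subset) for t, d in postings.items()})

def facet_fc(uninverted, subset):
    # One pass over the matching documents only.
    c = Counter()
    for doc in subset:
        c.update(uninverted.get(doc, []))
    return c

assert facet_enum(postings, subset) == facet_fc(uninverted, subset)
print(facet_fc(uninverted, subset).most_common(2))
# -> [('apache', 3), ('solr', 2)]
```

With 7M documents and a free-text field, the enum path pays per unique term (potentially millions of intersections), while the fc path's work is proportional to the 200K-document subset.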
Re: How to index MS SQL Server column with image type
: Subject: How to index MS SQL Server column with image type
:
: Hi all,
:
: When I index a column (image type) of a table via
: http://localhost:8080/solr/dataimport?command=full-import
: there is an error like this: String length must be a multiple of four.

For future reference: full error messages (with stack traces) are the best way to get people to help you diagnose problems. I think the crux of the issue is that DataImportHandler doesn't currently have any way of indexing raw binary data like images. Under the covers, Solr can deal with pure binary fields, but there aren't a lot of good use cases I can think of for it -- particularly if you want to *index* those bytes...

: <field name="bs_attachment" type="binary" indexed="true" stored="true"/>

...can you please explain what your goal is? What are you ultimately hoping to do with that field? -Hoss
Re: Highest frequency terms for a subset of documents
Thanks, but that's what I started with; it took an even longer time and threw this:

Approaching too many values for UnInvertedField faceting on field 'text' : bucket size=15560140
Approaching too many values for UnInvertedField faceting on field 'text' : bucket size=15619075
Exception during facet counts: org.apache.solr.common.SolrException: Too many values for UnInvertedField faceting on field text
Re: Highest frequency terms for a subset of documents
Seems like facet search is not all that suited for a full-text field (http://search.lucidimagination.com/search/document/178f1a82ff19070c/solr_severe_error_when_doing_a_faceted_search#16562790cda76197). Maybe I should go in another direction. I'm thinking of the HighFreqTerms approach, just not sure how to start.
Re: Highest frequency terms for a subset of documents
: thanks, but that's what i started with, but it took an even longer time
: and threw this:
: Approaching too many values for UnInvertedField faceting on field 'text' :
: bucket size=15560140
: Approaching too many values for UnInvertedField faceting on field 'text' :
: bucket size=15619075
: Exception during facet counts:org.apache.solr.common.SolrException: Too
: many values for UnInvertedField faceting on field text

Right ... facet.method=fc is a good default, but cases like full-text faceting can cause it to seriously blow up the memory ... I didn't even realize it was possible to get it to fail this way; I would have just expected an OutOfMemoryError. facet.method=enum is probably your best bet in this situation, precisely because it does a linear scan over the terms ... it's slower because it's safer. The one speed-up you might be able to get is to ensure you don't use the filterCache -- that way you don't waste time constantly caching/overwriting DocSets.

And FWIW...

: If facet search is not the correct approach, i thought about using
: something like org.apache.lucene.misc.HighFreqTerms, but i'm not sure how
: to do this in solr. Should i implement a request handler that executes
: this kind of

HighFreqTerms just looks at the raw docfreq for the terms, nearly identical to the TermsComponent -- there is no way to deal with your subset-of-documents requirement using an approach like that. If the number of subsets you have to deal with is fixed, finite, and non-overlapping, using distinct cores for each subset (which you can aggregate using distributed search when you don't want this type of query) can also be a wise choice in many situations (i.e.: if you have a books core and a movies core, you can search both using distributed search, or hit the terms component on just one of them to get the top terms for that core). -Hoss
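Hoss's point that HighFreqTerms and the TermsComponent cannot honor a document subset comes down to their ranking by index-wide document frequency. A tiny Python illustration with made-up data:

```python
# Index-wide docfreq (what HighFreqTerms / TermsComponent report) can
# rank terms very differently from counts restricted to a subset.
from collections import Counter

docs = {                       # doc id -> terms (invented corpus)
    1: ["common", "rare"],
    2: ["common", "rare"],
    3: ["common"],
    4: ["common"],
    5: ["common"],
}
subset = {1, 2}                # the documents matching our query

global_df = Counter(t for terms in docs.values() for t in set(terms))
subset_counts = Counter(t for d in subset for t in set(docs[d]))

# Globally, "common" dominates...
assert global_df.most_common(1) == [("common", 5)]
# ...but within the subset the two terms are tied, so the global
# ranking tells us nothing reliable about the subset.
assert subset_counts["rare"] == subset_counts["common"] == 2
```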
Re: Highest frequency terms for a subset of documents
On Wed, Apr 20, 2011 at 7:34 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
> the one speed-up you might be able to get is to ensure you don't use the
> filterCache -- that way you don't waste time constantly
> caching/overwriting DocSets

Right - or only using the filterCache for high-df terms via http://wiki.apache.org/solr/SimpleFacetParameters#facet.enum.cache.minDf

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
Re: Highest frequency terms for a subset of documents
Thanks, but I've disabled the cache already, since my concern is speed and I'm willing to pay the price (memory), and my subsets are not fixed. Does the facet search do any extra work that I don't need, that I might be able to disable (either by a flag or by a code change)? Somehow I feel, or rather hope, that counting the terms of 200K documents and finding the top 500 should take less than 30 seconds.
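Ofer's intuition -- that the raw tallying step is cheap -- is easy to check in isolation. A Python sketch with synthetic, scaled-down data (vocabulary and documents are randomly generated, purely illustrative): counting terms over a subset and taking the top N is a single pass plus a heap-based partial sort. In the enum faceting path, most of the cost is iterating the field's unique terms, not this tally.

```python
# Tally terms over a document subset and keep the top N.
# Synthetic data: 5,000 "documents" of 20 terms from a 10,000-term
# vocabulary (real numbers in the thread are 200K docs / top 500).
import random
from collections import Counter

random.seed(0)
vocab = [f"term{i}" for i in range(10_000)]              # invented vocabulary
docs = [random.sample(vocab, 20) for _ in range(5_000)]  # stand-in subset

counts = Counter()
for terms in docs:
    counts.update(terms)

top = counts.most_common(500)   # heap-based partial sort under the hood
assert len(top) == 500
# Results come back in non-increasing order of count.
assert all(top[i][1] >= top[i + 1][1] for i in range(len(top) - 1))
```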
Re: How to return score without using _val_
On Tue, Apr 19, 2011 at 11:41 PM, Bill Bell billnb...@gmail.com wrote:
> I would like to influence the score, but I would rather not mess with the
> q= field since I want the query to be dismax for q. Something like:
>
> fq={!type=dismax qf=$qqf v=$qspec}
> fq={!type=dismax qt=dismaxname v=$qname}
> q=_val_:{!type=dismax qf=$qqf v=$qspec} _val_:{!type=dismax qt=dismaxname v=$qname}
>
> Is there a way to do a filter and add the fq to the score by doing it
> another way? Also, does this do multiple queries? Is this the right way to
> do it?

I really don't understand what you're trying to do... Backing up: you say you want to influence the score, but I can't figure out how you would like to influence it. Do you want to:
- add the score of another query to the main dismax query (use bq)
- multiply the main dismax score by another query (use edismax along with boost, or the boost query type)
- something else?

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
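For reference, Yonik's two options might look roughly like the request fragments below. The field names (category, popularity) are invented for illustration, and availability varies by Solr version -- edismax's boost parameter and the {!boost} query parser should be checked against the release in use before relying on them.

```
# Additive: bq adds the score of an extra query to the main dismax score
q=ipod&defType=dismax&qf=name&bq=category:premium^2

# Multiplicative: wrap the main query with the boost query parser
q={!boost b=log(popularity)}ipod
```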
Re: Highest frequency terms for a subset of documents
BTW, I'm using Solr 1.4.1; does 3.1 or 4.0 contain any performance improvements that would make a difference for facet search? Thanks again, Ofer
Re: Highest frequency terms for a subset of documents
On Wed, Apr 20, 2011 at 7:45 PM, Ofer Fort o...@tra.cx wrote:
> Thanks but i've disabled the cache already, since my concern is speed and
> i'm willing to pay the price (memory)

Then you should not disable the cache.

> and my subsets are not fixed. Does the facet search do any extra work that
> i don't need, that i might be able to disable (either by a flag or by a
> code change)? Somehow i feel, or rather hope, that counting the terms of
> 200K documents and finding the top 500 should take less than 30 seconds.

Using facet.enum.cache.minDf should be a little faster than just disabling the cache - it's a different code path. Using the cache selectively will speed things up, so try setting that minDf to 1000 or so, for example. How many unique terms do you have in the index? Is this Solr 3.1? There were some optimizations for the case of many terms to iterate over. You could also try trunk, which has even more optimizations, or the bulkpostings branch if you really want to experiment.

-Yonik
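What facet.enum.cache.minDf buys can be sketched as follows. This is a toy Python model of the idea, not Solr's implementation: terms at or above the docfreq threshold go through a cached-DocSet intersection, while rarer terms take a cheap direct walk of their short posting lists.

```python
# Toy model of selective filter-cache use during enum-style faceting.
# All data is invented.
from collections import Counter

postings = {"a": {1, 2, 3, 4}, "b": {2}, "c": {1, 4}}  # term -> doc ids
subset = {1, 2, 3}                                      # query's doc set
MIN_DF = 3                                              # facet.enum.cache.minDf
cache = {}

def count(term):
    df = len(postings[term])
    if df >= MIN_DF:
        # High-df term: build (or reuse) a cached DocSet and intersect.
        docset = cache.setdefault(term, frozenset(postings[term]))
        return len(docset & subset)
    # Low-df term: walking its short postings directly is cheaper than
    # churning the cache.
    return sum(1 for d in postings[term] if d in subset)

counts = Counter({t: count(t) for t in postings})
assert counts == Counter({"a": 3, "b": 1, "c": 1})
assert set(cache) == {"a"}     # only the high-df term touched the cache
```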
Re: Highest frequency terms for a subset of documents
My documents are user entries, so I'm guessing they vary a lot. Tomorrow I'll try 3.1 and also 4.0 and see if they bring an improvement. Thanks guys!
Solr - upgrade from 1.4.1 to 3.1 - finding AbstractSolrTestCase binaries - help please?
Hi, all. I'm working on upgrading from 1.4.1 to 3.1, and I'm having some trouble with some of the unit test code for our custom Filters. We wrote the tests to extend AbstractSolrTestCase, and I've been reading the thread about the test-harness elements not being present in the 3.1 distributables. [1] So, I have checked out the 3.1 branch code and built it (ant generate-maven-artifacts), and I've found the lucene-test-framework-3.1-xxx.jar(s). However, these contain only the Lucene-level framework elements, and none of the Solr ones. Did the Solr test framework actually get built and embedded in one of the Solr jars somewhere? Or, if not, is there some way to build a jar that contains the Solr portion of the test harnesses?

[1] SOLR-2061 - Generate jar containing test classes: https://issues.apache.org/jira/browse/SOLR-2061

Thanks! Bob Sandiford | Lead Software Engineer | SirsiDynix P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com www.sirsidynix.com
RE: Creating a TrieDateField (and other Trie fields) from Lucene Java
Hi Yonik, The limitations I need to work within have to do with the index already being built as part of an existing process. Currently, the Solr server is in read-only mode and receives new indexes daily from a Java application. The Java app runs Lucene/Tika and indexes resources within the local network. It builds off of a different schema framework, then moves the finished indexes over to the Solr deployment path, and the Solr server swaps over at that point. The Solr server isn't the only consumer of the indexes; there are other Java apps which read/write to the Lucene index during the staging process. This was working without issues when using types that were part of Lucene core (String, Boolean, Integer, etc.), because they just resolved to Strings. But TrieDateField works off of byte data, so I needed to find a way to create those fields using the existing classes. Thanks, -Craig

-----Original Message----- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: Wednesday, 20 April 2011 11:19 PM To: solr-user@lucene.apache.org Subject: Re: Creating a TrieDateField (and other Trie fields) from Lucene Java

On Tue, Apr 19, 2011 at 11:17 PM, Craig Stires craig.sti...@gmail.com wrote:
> The barrier I have is that I need to build this offline (without using a
> solr server, solrconfig.xml, or schema.xml)

This is pretty unusual... can you share your use case? Solr can also be run in embedded mode if you can't run a stand-alone server for some reason. -Yonik
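For anyone puzzled by why TrieDateField terms are "byte data": trie fields index each numeric value at several precisions. The sketch below (Python, purely conceptual) shows only the multi-precision idea; the real Lucene 3.x encoding is NumericUtils.longToPrefixCoded, which adds a shift header and packs the bits 7 per character, so this is not the on-disk format.

```python
# Conceptual sketch of trie/numeric field indexing: one term per
# precision level, produced by shifting off low bits in increments of
# precisionStep. NOT Lucene's actual encoding.
PRECISION_STEP = 8

def trie_terms(value, bits=64, step=PRECISION_STEP):
    # Finest precision first (shift 0 = the raw value), then
    # progressively coarser prefixes.
    return [(shift, value >> shift) for shift in range(0, bits, step)]

millis = 1303344000000          # an epoch-millis timestamp (example value)
terms = trie_terms(millis)
assert terms[0] == (0, millis)              # finest precision = raw value
assert len(terms) == 64 // PRECISION_STEP   # one term per precision level
# Nearby values share their coarser terms, which is what makes numeric
# range queries cheap: a range is covered by a few coarse terms plus a
# handful of fine ones at the edges.
assert trie_terms(millis)[1:] == trie_terms(millis + 1)[1:]
```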
Issue importing data from a database using Solr DIH
Hi all, I am new to Solr. I am importing data from a database using DIH (Solr 1.4). One document is made up of two entities; each entity is a table in the database. For example: Table1 has 3 fields; Table2 has 4 fields. If it worked, the document would have 7 fields. But it has only 4 fields; it seems that Solr doesn't merge the fields, and Table2 overwrites Table1. The key is OS06Y. The configuration in db-data-config.xml is the following:

<document name="allperf">
  <entity name="PerformanceData1" dataSource="getTrailingTotalReturnForMonthEnd1"
          query="SELECT PerformanceId, Trailing1MonthReturn, Trailing2MonthReturn, Trailing3MonthReturn FROM Table1">
    <field column="PerformanceId" name="OS06Y"/>
    <field column="Trailing1MonthReturn" name="PM004"/>
    <field column="Trailing2MonthReturn" name="PM133"/>
    <field column="Trailing3MonthReturn" name="PM006"/>
  </entity>
  <entity name="PerformanceData2" dataSource="getTrailingTotalReturnForMonthEnd2"
          query="SELECT PerformanceId, Trailing10YearReturn, Trailing15YearReturn, TrailingYearToDateReturn, SinceInceptionReturn FROM Table2">
    <field column="PerformanceId" name="OS06Y"/>
    <field column="Trailing10YearReturn" name="PM00I"/>
    <field column="Trailing15YearReturn" name="PM00K"/>
    <field column="TrailingYearToDateReturn" name="PM00A"/>
    <field column="SinceInceptionReturn" name="PM00M"/>
  </entity>
</document>

Has anyone come across this issue? Any suggestions on how to fix it are much appreciated. Thanks.
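If this matches DIH's usual semantics, the problem is that two sibling entities directly under <document> each emit their own documents, and the second document with the same uniqueKey replaces the first. The usual fix is to nest the second entity inside the first and join on the key, roughly as sketched below (untested; adjust names, data sources, and the join column to your setup):

```xml
<document name="allperf">
  <entity name="PerformanceData1" dataSource="getTrailingTotalReturnForMonthEnd1"
          query="SELECT PerformanceId, Trailing1MonthReturn, Trailing2MonthReturn, Trailing3MonthReturn FROM Table1">
    <field column="PerformanceId" name="OS06Y"/>
    <field column="Trailing1MonthReturn" name="PM004"/>
    <field column="Trailing2MonthReturn" name="PM133"/>
    <field column="Trailing3MonthReturn" name="PM006"/>
    <!-- Child entity: runs once per Table1 row, joined on the key, so its
         fields land in the same document instead of a second one. -->
    <entity name="PerformanceData2" dataSource="getTrailingTotalReturnForMonthEnd2"
            query="SELECT Trailing10YearReturn, Trailing15YearReturn,
                          TrailingYearToDateReturn, SinceInceptionReturn
                   FROM Table2
                   WHERE PerformanceId = '${PerformanceData1.PerformanceId}'">
      <field column="Trailing10YearReturn" name="PM00I"/>
      <field column="Trailing15YearReturn" name="PM00K"/>
      <field column="TrailingYearToDateReturn" name="PM00A"/>
      <field column="SinceInceptionReturn" name="PM00M"/>
    </entity>
  </entity>
</document>
```

A single entity with a SQL JOIN across the two tables would work as well if both tables live in the same data source.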
Apache Spam Filter Blocking Messages
Hey (solr-user) mailing list admins, I've tried replying to a thread multiple times tonight, and keep getting a bounce-back with this response:

Technical details of permanent failure: Google tried to deliver your message, but it was rejected by the recipient domain. We recommend contacting the other email provider for further information about the cause of this error. The error that the other server returned was: 552 552 spam score (5.1) exceeded threshold (FREEMAIL_FROM,FS_REPLICA,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,RFC_ABUSE_POST,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL (state 18).

Apparently I sound like spam when I write perfectly good English and include some XML and a link to a JIRA ticket in my e-mail (I tried a couple of different variations). Does anyone know a way around this filter, or should I just respond to those involved in the e-mail chain directly and avoid the mailing list? Thanks, -Trey
Re: Apache Spam Filter Blocking Messages
On Thu, Apr 21, 2011 at 12:30:29AM -0400, Trey Grainger wrote: (FREEMAIL_FROM,FS_REPLICA,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,RFC_ABUSE_POST,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL Note the HTML_MESSAGE in the list of things SpamAssassin didn't like. Apparently I sound like spam when I write perfectly good English and include some xml and a link to a jira ticket in my e-mail (I tried a couple different variations). Anyone know a way around this filter, or should I just respond to those involved in the e-mail chain directly and avoid the mailing list? Send plain text email instead of HTML. That solves the problem 99% of the time. Marvin Humphrey
Need to create dynamic indexes based on different document workspaces
Hi, Is there a way to create different Solr indexes for different categories? We have different document workspaces and ideally want each workspace to have its own Solr index. Thanks, Gaurav