Re: Apache Lucene Eurocon 2012
Hi Chris, thanks for your response. Ok, we will wait :)

Best Regards
Vadim

2012/3/8 Chris Hostetter hossman_luc...@fucit.org :

: where and when is the next Eurocon scheduled?
: I read something about Denmark and autumn 2012 (i don't know where *g*).

I do not know where, but sometime in the fall is probably the correct time frame. I believe the details will be announced at Lucene Revolution...

http://lucenerevolution.org/

(that's what happened last year)

-Hoss
RE: Custom Sharding on solrcloud
Hi,

If I remove the DistributedUpdateProcessorFactory I will have to manage a master/slave setup myself, updating solely to the master and replicating to any slave. I wonder, is it possible to have distributed updates but confined to the sub-set of cores and replicas within a collection that share the same name?

Phil

-----Original Message-----
From: Mark Miller [mailto:markrmil...@gmail.com]
Sent: 08 March 2012 01:02
To: solr-user@lucene.apache.org
Subject: Re: Custom Sharding on solrcloud

Hi Phil - The default update chain now includes the distributed update processor by default - and if in solrcloud mode it will be active. Probably, what you want to do is define your own update chain (see the wiki). Then you can add that update chain as the default for your json update handler in solrconfig.xml.

<!-- referencing it in an update handler -->
<requestHandler name="/update/json" class="solr.JsonUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">mychain</str>
  </lst>
</requestHandler>

The default chain is:

new LogUpdateProcessorFactory(), new DistributedUpdateProcessorFactory(), new RunUpdateProcessorFactory()

So just use Log and Run instead to get your old behavior.

- Mark

On Mar 7, 2012, at 1:37 PM, Phil Hoy wrote:

Hi,

We have a large index and would like to shard by a particular field value, in our case surname. This way we can scale out to multiple machines, yet as most queries filter on surname we can use some application logic to hit just the one core to get the results we need. Furthermore, as we anticipate the index will grow over time, it makes sense (to us) to host a number of shards on a single machine until they get too big, at which point we can then move them to another machine.

We are using solrcloud, and it is set up using a solr core per shard; that way we can direct both queries and updates to the appropriate core/shard. To do this our solr.xml looks a bit like this:

<cores defaultCoreName="default" adminPath="/admin/cores" zkClientTimeout="1" hostPort="8983">
  <core shard="default" name="aaa-ava" instanceDir="/data/recordsets/shards/aaa-ava" collection="recordsets" />
  <core shard="aaa-ava" name="aaa-ava" instanceDir="/data/recordsets/shards/aaa-ava" collection="recordsets" />
  <core shard="avb-bel" name="avb-bel" instanceDir="/data/recordsets/shards/avb-bel" collection="recordsets" />
  ...

Directed updates via: http://server/solr/aaa-ava/update/json [{"surname":"adams"}]
Directed queries via: http://server/solr/select?q=surname:adams&shards=aaa-ava

This setup used to work in version apache-solr-4.0-2011-12-12_09-14-13, before the more recent solrcloud changes, but now the update is not directed to the appropriate core. Is there a better way to achieve our needs?

Phil

- Mark Miller lucidimagination.com
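A minimal sketch of such a chain in solrconfig.xml, following Mark's suggestion (the name "mychain" matches the handler default quoted above; this is just the default chain minus the distributed processor, not a complete configuration):

<updateRequestProcessorChain name="mychain">
  <!-- same as the default chain, but without DistributedUpdateProcessorFactory -->
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>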
Re: indexing cpu utilization
On 8 March 2012 15:39, gabriel shen xshco...@gmail.com wrote:
> Hi, I noticed that sequential indexing on 1 solr core is only using 40% of our 8 virtual core CPU power. Why isn't it using 100% of the power? Is there a way to increase the CPU utilization rate?
[...]

This is an open-ended question which could be due to a variety of things, and also depends on how you are indexing. Your indexing process might be I/O bound (quite possible), or memory bound, rather than CPU bound.

Regards,
Gora
Re: indexing cpu utilization
Our indexing process adds a bundle of solr documents (for example 5000) to solr each time, and we observed that before committing (which might be I/O bound) it constantly uses less than half the CPU capacity, which seems strange to us: why doesn't it use full CPU power? As for RAM, I don't know how much it will affect CPU utilization; we have assigned 14gb to the solr tomcat server on a 32 gb linux machine.

best regards,
shen

On Thu, Mar 8, 2012 at 11:27 AM, Gora Mohanty g...@mimirtech.com wrote:
> This is an open-ended question which could be due to a variety of things, and also depends on how you are indexing. Your indexing process might be I/O bound (quite possible), or memory bound, rather than CPU bound.
> Regards,
> Gora
Moving from Multiple webapps to Multi Cores -Solr 1.3
Hello All,

On prototyping the move from solr multiple webapps to solr multi cores [1.3 version both], I am running into the following issues and questions:

1) We are primarily moving to multicore because we saw the permgen memory increase each time we created a new solr webapp, so the assumption is that by moving to multicore and sharing the same war file we will not increase the permgen memory when we create a new core. However, I do see about a 190kb increase when a new core is created, as opposed to about 13mb per new webapp. Does the permgen memory get consumed/increased per core creation, with some benefit over webapp creation?

2) We have schemas for multiple languages, and I wanted to create a webapp per language and create cores for each client with the same lang requirement, with a shared schema. Would that be affected if we want to add some dynamic fields to some cores [of course the indexes are separate]? Does this approach make sense, or can we just create n cores in a single webapp with different schemas?

3) In terms of query time, when I query a webapp for a particular core, should I expect the QTime to come down or remain the same?

4) On using the create command as:

multi_core_prototype/admin/cores?action=CREATE&name=coreX&instanceDir=/searchinstances/multi_core_prototype/solr/coreX&config=/searchinstances/multi_core_prototype/solr/coreX&schema=/searchinstances/multi_core_prototype/solr/core0/conf/schema.xml&dataDir=/searchinstances/multi_core_prototype/solr/coreX/data

My directory structure is:

tomcat5.5
  Searchinstances
    multi_core_prototype
      solr.war
      solr
        solr.xml
        core0
          data
          conf
        core1
          conf
          data

On the above command, with that instanceDir, coreX is created under solr with a data directory under coreX; however, I don't see a conf directory with schema and solrconfig under coreX. I am assuming that with the above command it copies them from the existing core0 conf folder. Let me know if I am missing anything here.

Thanks,
Sujatha
Re: How to exactly match fields which are multi-valued?
You haven't really given us much to go on here. Matches are just like a single-valued field, with the exception of the increment gap.

Say one entry were "large cat big dog" in a multi-valued field. Say the next document indexed two values, "large cat" and "big dog". And say the increment gap were 100. The token offsets for doc 1 would be 0, 1, 2, 3 and for doc 2 would be 0, 1, 101, 102.

The only effective difference is that phrase queries with slop less than 100 would NEVER match across multi-values. I.e. "cat big"~10 would match doc 1 but not doc 2.

Best
Erick

2012/3/7 SuoNayi suonayi2...@163.com:
> Hi all, how to offer exact-match capabilities on multi-valued fields? Any help is appreciated!
> SuoNayi
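For reference, the gap Erick describes is controlled by the positionIncrementGap attribute on the field type in schema.xml (a minimal sketch; the type and field names are illustrative):

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="tags" type="text_general" indexed="true" stored="true" multiValued="true"/>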
Re: Question about Streaming Update Solr Server
Could anyone reply to these questions? Thanks

2012/3/5 Anderson vasconcelos anderson.v...@gmail.com

Hi, I have some questions about StreamingUpdateSolrServer.

1) What's the queue size parameter? Is it the number of documents in each thread?
2) When I configured it like StreamingUpdateSolrServer(URL, 1000, 5), indexing runs ok. But when I raise the number of threads, like new StreamingUpdateSolrServer(URL, 1000, 15), I receive a java.net.SocketException: Broken pipe. Why?
3) When I index using the addBean method, it opens the maximum number of threads that I configured. But when I use addBeans, it opens only one thread. Is this correct?

Thanks
Re: solr geospatial / spatial4j
Yes, there are trunk nightly builds, see:
https://builds.apache.org//view/S-Z/view/Solr/job/Solr-trunk/

But I don't think LSP is in trunk at this point, so that's not useful. The code branch is on (I think)
http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_3795_ls_spatial_playground
but I confess I haven't tried to get and build it all; I'm not quite sure what's needed.

Best
Erick

On Wed, Mar 7, 2012 at 10:25 AM, Matt Mitchell goodie...@gmail.com wrote:
> Hi, I'm researching options for handling a better geospatial solution. I'm currently using Solr 3.5 for a read-only database, and the point/radius searches work great. But I'd like to start doing point-in-polygon searches as well. I've skimmed through some of the geospatial jira issues, and read about spatial4j, which is very interesting. I see on the github page that this will soon be part of lucene; can anyone confirm this?
>
> I attempted to build the spatial4j demo but no luck. It had problems finding lucene 4.0-SNAPSHOT, which I guess is because there are no 4.0-SNAPSHOT nightly builds? If anyone knows how I can get around this, please let me know!
>
> Other than spatial4j, is there a way to do point-in-polygon searches with solr 3.5.0 right now? Is there some tricky indexing/querying strategy that would allow this?
>
> Thanks!
> - Matt
Re: How to limit the number of open searchers?
Ah, you're right. If your queries run across several commits you'll get multiple searchers open. I don't know of any good way to do what you want. I'm curious, why can't you do a master/slave setup? The other thing to think about would be the NRT stuff, if you can run trunk.

Best
Erick

On Wed, Mar 7, 2012 at 2:30 PM, Michael Ryan mr...@moreover.com wrote:
> > Unless you have warming happening, there should only be a single searcher open at any given time. So it seems to me that maxWarmingSearchers should give you what you need.
>
> What I'm seeing is that if a query takes a very long time to run, and runs across the duration of multiple commits (I know, that itself sounds bad!), I can get into a situation where I have 2 searchers in use and 1 searcher warming, rather than 1 searcher in use and 1 searcher warming. Due to all the memory-intensive features I use, having 3 or more searchers open can cause an OutOfMemoryError.
>
> I'm not using master/slave for this application, so can't go that route. I'd like a way to see how many searchers are currently open that is external to Solr. This would allow me to block my commits until I see that there is only 1 searcher currently open. I could use JMX, but that feels like overkill - wondering if there is something simpler.
>
> -Michael
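For what it's worth, the JMX route Michael mentions could be sketched like this (assuming <jmx/> is enabled in solrconfig.xml and the container exposes a remote JMX port; the "solr:type=searcher,*" ObjectName pattern is an assumption that varies by version and setup, so verify the real names in JConsole first):

import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class SearcherCount {
    public static void main(String[] args) throws Exception {
        // standard JMX remote URL; host/port are whatever the container exposes
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        MBeanServerConnection conn =
            JMXConnectorFactory.connect(url).getMBeanServerConnection();
        // hypothetical pattern -- one MBean per live searcher; the exact
        // domain and keys depend on your Solr version and configuration
        Set<ObjectName> searchers =
            conn.queryNames(new ObjectName("solr:type=searcher,*"), null);
        System.out.println("Open searchers: " + searchers.size());
    }
}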
Re: indexing cpu utilization
How are you sending documents to solr?

If you push solr input documents via HTTP (which is what SolrJ does), you could increase CPU consumption (and therefore reduce indexing time) by sending your update requests asynchronously, using multiple updating threads, to your single solr core. Somebody more familiar than me with the update chain could probably tell you more, but I think each update request is handled inside a single thread on the server side. If that's correct, then you can increase CPU consumption on your indexing host by adding more updating threads (to the client pushing documents to your solr core).

Also make sure you don't ask solr to commit your pending changes to the solr index too frequently (on each add), but only when you want changes to be taken into account on the searching side. I personally like to let solr do autoCommits, using a combo of max-added-documents and elapsed-time conditions for the auto commit policy.

Considering indexing bottlenecks more generally, my experience in that field is that indexing speed is usually bound to, in frequency order:
- source enumeration speed (especially if solr input documents are made out of complex joins on a remote DB)
- network I/O, if performing remote indexing and the network link isn't adapted to the amount of data running through it
- disk I/O, if you commit very often and rely on commodity SATA HDDs, or if another process is stressing the poor little device (keep the 150 IOPS limit in mind for sata devices)
- CPU, if you were able to get rid of the previous bottlenecks
- memory doesn't play the same role in indexing speed as the other factors; from my point of view it would only be a limit if you perform complex analysis on many, many fields, and if that becomes a problem it is easy to spot with JMX and JConsole, because your JVM would then be performing many GCs and the process's resident RAM usage would be close to whatever was set as -Xmx.

I don't know if I was really clear; all I can say is that increasing the number of clients pushing updates to solr in parallel was the easiest way for me to reduce the indexing time for large update batches.

Hope this helps,
-- Tanguy

On 08/03/2012 11:48, gabriel shen wrote:
> Our indexing process adds a bundle of solr documents (for example 5000) to solr each time, and we observed that before committing (which might be I/O bound) it constantly uses less than half the CPU capacity, which seems strange to us: why doesn't it use full CPU power? As for RAM, I don't know how much it will affect CPU utilization; we have assigned 14gb to the solr tomcat server on a 32 gb linux machine.
> [...]
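To illustrate the multiple-updating-threads point, here is a minimal SolrJ sketch against the 3.x-era API (the URL, batch size, and thread count are made-up example values, not recommendations):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
    public static void main(String[] args) throws Exception {
        // CommonsHttpSolrServer is thread-safe, so one instance can be shared
        final SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        ExecutorService pool = Executors.newFixedThreadPool(4); // 4 concurrent updating threads
        for (int batch = 0; batch < 100; batch++) {
            final int b = batch;
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
                        for (int i = 0; i < 5000; i++) {
                            SolrInputDocument doc = new SolrInputDocument();
                            doc.addField("id", b + "-" + i);
                            docs.add(doc);
                        }
                        solr.add(docs); // no explicit commit: let autoCommit handle it
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
    }
}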
Re: Stemmer Question
> I was previously using the PorterStemmer to do stemming and ran into an issue where it was overly aggressive with some words or abbreviations, which I needed to stop. I have recently switched to KStem and I believe the issue is lessened, but I was still wondering if there was a way to set a number of stop words for which you didn't want stemming to occur, or if there was a way to tell the stemmer to store the unstemmed version as well.
>
> So for instance, if a query came in for Ahmed, the PorterStemmer would turn that into Ahm, while in this case Ahmed is a name and I want to search for it unstemmed. If there were a stop word list, I could attempt to compile a list of words I didn't want stemmed; or if there were a way to also create a token for the unstemmed word, then what went into the index for Ahmed would be "ahmed ahm", so we'd cover both cases. What are the drawbacks of providing both?

StemmerOverrideFilterFactory and KeywordMarkerFilterFactory are used for these kinds of purposes.

http://wiki.apache.org/solr/LanguageAnalysis#Customizing_Stemming
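For instance, a field type that protects listed words from stemming might look like this (a sketch: protwords.txt is an assumed file name containing one word per line, and solr.KStemFilterFactory assumes a Solr version that bundles it, i.e. 3.3+):

<fieldType name="text_stem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- tokens matching protwords.txt are flagged as keywords and skipped by the stemmer -->
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.KStemFilterFactory"/>
  </analyzer>
</fieldType>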
Re: indexing cpu utilization
On 8 March 2012 16:18, gabriel shen xshco...@gmail.com wrote:
> Our indexing process adds a bundle of solr documents (for example 5000) to solr each time, and we observed that before committing (which might be I/O bound) it constantly uses less than half the CPU capacity, which seems strange to us: why doesn't it use full CPU power? As for RAM, I don't know how much it will affect CPU utilization; we have assigned 14gb to the solr tomcat server on a 32 gb linux machine.
[...]

Are you hitting memory limits? As Tanguy has already pointed out in nice detail, it probably also matters how you push documents to Solr, and how often you commit.

In an apples-to-oranges comparison, we used to run a large indexing task, but with only a single commit at the end, while it sounds as if you are using smaller batches with more frequent commits. In our case, we could max out CPU usage (well, we backed off at ~85% utilisation on each core). Though we were fetching data over the network, it was a relatively high-bandwidth internal connection, and we were using DIH with multiple Solr cores.

Regards,
Gora
Understanding update handler statistics
Hi,

Trying to understand the update handler statistics, I have this:

commits : 2824
autocommit maxDocs : 1
autocommit maxTime : 1000ms
autocommits : 41
optimizes : 822
rollbacks : 0
expungeDeletes : 0
docsPending : 0
adds : 0
deletesById : 0
deletesByQuery : 0
errors : 0
cumulative_adds : 17457
cumulative_deletesById : 1959
cumulative_deletesByQuery : 0
cumulative_errors : 0

My problem is with the cumulative part. If, for instance, I am doing a commit after each add and delete operation, then the sum of cumulative_adds plus cumulative_deletes plus cumulative_errors should match the commit number. Is that right?

And another question: are these stats since SOLR instance startup or since update handler startup? These can differ, as far as I understand. And from this part:

docsPending : 0
adds : 0
deletesById : 0
deletesByQuery : 0
errors : 0

I understand that if I had docsPending I should have adds (pending) and deletes* (pending), but how could I have errors...

thanks
stelios
Re: wildcard queries with edismax and lucene query parsers
Any help on this? I am really stuck on a client project. I need to know how scoring works with wildcard queries under SOLR 3.2.

Thanks
Bob

On Mon, Mar 5, 2012 at 4:22 PM, Robert Stewart bstewart...@gmail.com wrote:

How is scoring affected by wildcard queries? It seems when I use a wildcard query I get all constant scores in the response (all scores = 1.0). That occurs with both edismax and the lucene query parser. I am trying to implement an auto-suggest feature, so I need to use a wildcard to return all results that match the prefix entered by a user. But I want the results sorted according to the score defined by the qf parameter in my search handler.

?defType=edismax&q=grow*&fl=title,score

<result name="response" numFound="11" start="0" maxScore="1.0">
  <doc>
    <float name="score">1.0</float>
    <arr name="title"><str>SP 1000 Growth</str></arr>
  </doc>
  <doc>
    <float name="score">1.0</float>
    <arr name="title"><str>SP 1000 Pure Growth</str></arr>
  </doc>

?defType=lucene&q=grow*&fl=title,score

<result name="response" numFound="11" start="0" maxScore="1.0">
  <doc>
    <float name="score">1.0</float>
    <arr name="title"><str>SP 1000 Growth</str></arr>
  </doc>
  <doc>
    <float name="score">1.0</float>
    <arr name="title"><str>SP 1000 Pure Growth</str></arr>
  </doc>

If I use a query with no wildcard, scoring appears correct:

?defType=edismax&q=growth&fl=title,score

<result name="response" numFound="11" start="0" maxScore="0.7500377">
  <doc>
    <float name="score">0.7500377</float>
    <arr name="title"><str>SP 1000 Growth</str></arr>
  </doc>
  <doc>
    <float name="score">0.7500377</float>
    <arr name="title"><str>SP 500 Growth</str></arr>
  </doc>
  <doc>
    <float name="score">0.656283</float>
    <arr name="title"><str>SP 1000 Pure Growth</str></arr>
  </doc>

I am using SOLR version 3.2 and a request handler defined like this:

<requestHandler name="/idxsuggest" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="defType">edismax</str>
    <str name="q">*:*</str>
    <str name="qf">ticker^10.0 indexCode^10.0 indexKey^10.0 title^5.0 indexName^5.0</str>
    <str name="fl">indexId,indexName,indexCode,indexKey,title,ticker,urlTitle</str>
  </lst>
  <lst name="appends">
    <!-- Filter out documents that are not published yet and that are not yet expired -->
    <str name="fq">+contentType:IndexProfile</str>
  </lst>
</requestHandler>
Re: wildcard queries with edismax and lucene query parsers
WildcardQueries are wrapped into ConstantScoreQuery. I would create a copy field of these fields using the following field type. Then you can search on these copyFields (qf). With this approach you don't need to use the star operator:

defType=edismax&q=grow&fl=title,score

<fieldType name="prefix_token" class="solr.TextField" positionIncrementGap="1">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mappings.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_index.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mappings.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

--- On Thu, 3/8/12, Robert Stewart bstewart...@gmail.com wrote:

> From: Robert Stewart bstewart...@gmail.com
> Subject: Re: wildcard queries with edismax and lucene query parsers
> To: solr-user@lucene.apache.org
> Date: Thursday, March 8, 2012, 4:21 PM
>
> Any help on this? I am really stuck on a client project. I need to know how scoring works with wildcard queries under SOLR 3.2.
> [...]
Re: Understanding update handler statistics
On 3/8/2012 7:02 AM, stetogias wrote:
> Trying to understand the update handler statistics...
> [...]
> My problem is with the cumulative part. If, for instance, I am doing a commit after each add and delete operation, then the sum of cumulative_adds plus cumulative_deletes plus cumulative_errors should match the commit number. Is that right?
> [...]
> I understand that if I had docsPending I should have adds (pending) and deletes* (pending), but how could I have errors...

I'm fairly sure that adds and deletes refer to the number of documents added or deleted. You can have many documents added and/or deleted for each commit. I would not expect the sums to match, unless you are adding or deleting only one document at a time and doing a commit after every one. I hope you're not doing that, unless you're using trunk with the near-realtime feature and doing soft commits, with which I have no experience. Normally, doing a commit after every document would be too much of a load for good performance, unless there is a relatively long time period between each add or delete.

Your question about errors - that probably tracks the number of times that the update handler returned an error response, though I don't really know. If I'm right, then that number, like commits, has little to do with the number of documents.

Thanks,
Shawn
Re: wildcard queries with edismax and lucene query parsers
Ahmet,

That is a great idea. I will try it. Thank you.

On Thu, Mar 8, 2012 at 9:34 AM, Ahmet Arslan iori...@yahoo.com wrote:
> WildcardQueries are wrapped into ConstantScoreQuery. I would create a copy field of these fields using the following field type. Then you can search on these copyFields (qf). With this approach you don't need to use the star operator.
> [...]
Re: Stemmer Question
Thanks, the KeywordMarkerFilterFactory seems to be what I was looking for.

I'm still wondering about keeping the unstemmed word as a token, though. While I know that this would increase the index size slightly, I wonder what the negative of doing such a thing would be? It just seems less destructive, since I would always store both the unstemmed version and the stemmed version; by not storing the unstemmed version, there is no way to go back without reindexing. If I wanted to implement this, I'm assuming a custom tokenizer would be most appropriate? Does something like this already exist?

On Thu, Mar 8, 2012 at 8:36 AM, Ahmet Arslan iori...@yahoo.com wrote:
> StemmerOverrideFilterFactory and KeywordMarkerFilterFactory are used for these kinds of purposes.
> http://wiki.apache.org/solr/LanguageAnalysis#Customizing_Stemming
Importing dynamicField data on the fly
Hello Everyone,

I'm trying to work out how, if at all possible, dynamicFields can be imported from a dynamic data source through the DataImportHandler configuration. Currently the DataImportHandler configuration file requires me to name every single field I want to map in advance, but I do not necessarily know the dynamicField set at this stage.

Here's my example schema.xml dynamic field definition:

<dynamicField name="*_sortable" type="alphaOnlySort" indexed="true" stored="true"/>

My DataImportHandler import configuration file looks like this:

<dataConfig>
  <dataSource name="Gateway1Source" type="HttpDataSource" baseUrl="http://acproplatforms.internal/feeds.xml" encoding="UTF-8" connectionTimeout="15000" readTimeout="15000"/>
  <document name="feeds">
    <entity name="feed" processor="XPathEntityProcessor" stream="true" forEach="/gateway/feedItem" url="">
      <field column="type" xpath="/gateway/feedItem/type"/>
      ...
    </entity>
  </document>
</dataConfig>

I have looked, very optimistically, at script transformers (transformer="script:importDynamics"), specifically hoping the row in the transformer function would hold the dynamic field content, but this was silly thinking obviously, as the fields would already have fallen through had they made it into here.

Has anyone managed to import into dynamic fields without knowing in advance what they were going to be in the data source? To give you an idea of why I want this: there's an application aggregating web services from many sources, some of which contain patterns of fields I know we'll want, and the nature of their data types, but which are added to quite frequently. It seems that, aside from the field mappings here, the hard work has been done in Solr to achieve this!

Kindest Regards,
Mark
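For what it's worth, a script transformer can populate dynamic fields, but only from columns already mapped into the row, which is exactly the fall-through limitation described above. A sketch (the function body and the copied column are illustrative only):

<script><![CDATA[
  function importDynamics(row) {
    // copy an already-mapped column into a *_sortable dynamic field;
    // unmapped source columns never reach the row, so this cannot
    // discover fields that were not declared in the entity
    var type = row.get('type');
    if (type != null) {
      row.put('type_sortable', type);
    }
    return row;
  }
]]></script>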
Re: Stemmer Question
> Thanks, the KeywordMarkerFilterFactory seems to be what I was looking for. I'm still wondering about keeping the unstemmed word as a token, though.
> [...]
> If I wanted to implement this, I'm assuming a custom tokenizer would be most appropriate? Does something like this already exist?

Not out-of-the-box. Actually, I was using your idea: I implemented such a custom token filter by mixing the synonym filter and a stem filter. This is useful for wildcard queries. And for normal queries, it can rank exact matches higher.
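A rough sketch of such a filter against the Lucene 3.x analysis API (the stem() method here is a placeholder for whatever stemmer the filter delegates to, not a real Lucene call):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public final class PreserveOriginalStemFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posAtt = addAttribute(PositionIncrementAttribute.class);
    private State pending;   // original-token state, queued while the unstemmed form is emitted
    private String stemmed;

    public PreserveOriginalStemFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (pending != null) {
            // emit the stemmed variant at the same position as the original
            restoreState(pending);
            pending = null;
            termAtt.setEmpty().append(stemmed);
            posAtt.setPositionIncrement(0);
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        String original = termAtt.toString();
        String stem = stem(original);
        if (!stem.equals(original)) {
            pending = captureState();   // queue a second emission for this position
            stemmed = stem;
        }
        return true;                    // the original token passes through unchanged
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pending = null;
    }

    private String stem(String term) {
        return term; // stub -- wire in a real stemmer (KStem, Porter, ...) here
    }
}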
Re: solr geospatial / spatial4j
On Wed, Mar 7, 2012 at 7:25 AM, Matt Mitchell goodie...@gmail.com wrote:
> Hi, I'm researching options for handling a better geospatial solution. I'm currently using Solr 3.5 for a read-only database, and the point/radius searches work great. But I'd like to start doing point-in-polygon searches as well. I've skimmed through some of the geospatial jira issues, and read about spatial4j, which is very interesting. I see on the github page that this will soon be part of lucene; can anyone confirm this?

perhaps -- see the discussion on:
https://issues.apache.org/jira/browse/LUCENE-3795

This will involve a few steps before it is actually integrated with the lucene project -- and then a few more to be usable from solr.

> I attempted to build the spatial4j demo but no luck. It had problems finding lucene 4.0-SNAPSHOT, which I guess is because there are no 4.0-SNAPSHOT nightly builds? If anyone knows how I can get around this, please let me know!

ya they are published -- you just have to specify where you want to pull them from. If you use the 'updateLucene' profile, it will pull them from:
https://repository.apache.org/content/groups/snapshots/

use: mvn clean install -P updateLucene

> Other than spatial4j, is there a way to do point-in-polygon searches with solr 3.5.0 right now? Is there some tricky indexing/querying strategy that would allow this?

I don't know of anything else -- and note that the polygon stuff has a ways to go before it is generally ready for prime-time.

ryan
Re: Solr-Lucene compatibility
: I have an app that writes lucene indexes and is based on lucene 2.3.0.
:
: Can I read those indexes using solr 3.5.0 and perform a distributed search?
: Or should I use a lower version of solr, so that the index reader is
: compatible with the index writer.

a) Lucene 2.3.0 is pretty damn ancient ... i would strongly recommend you upgrade to get a lot of bug fixes and performance improvements.

b) in general, writing indexes with Lucene and searching them with (a compatible version of) Solr should work fine -- provided the schema.xml you configure Solr with matches up with how you've built your index.

-Hoss
Re: How to exactly match fields which are multi-valued?
Well, if you really want EXACT exact, just use a KeywordTokenizer (i.e., don't tokenize at all). But then matches will really have to be EXACT, including punctuation, whitespace, diacritics, etc., and a query will only match if it 'exactly' matches one value in your multi-valued field. You could try a KeywordTokenizer with some normalization too.

Either way, though, if you're issuing a query against a field tokenized with KeywordTokenizer whose values can include whitespace, you really need to issue it as a _phrase query_, to avoid being messed up by the lucene or dismax query parser's pre-tokenization. Which is potentially fine; that's what you want to do anyway for 'exact match'. Except if you wanted to use multiple dismax qf's with just a BOOST on the 'exact match', but _not_ a phrase query for the other fields... well, I can't figure out any way to do that with this technique.

It gets tricky; I haven't found a great solution.

On 3/8/2012 7:44 AM, Erick Erickson wrote:
> You haven't really given us much to go on here. Matches are just like a single-valued field, with the exception of the increment gap.
> [...]
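Concretely, a normalized exact-match type along the lines described above might look like this (a sketch; the type name and filter choices are illustrative), with queries issued as phrases, e.g. q=myfield:"large cat":

<fieldType name="exactish" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- keep each value as a single token, then normalize lightly -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>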
Re: Filter facet_fields with Solr similar to stopwords
: I am using a solr.StopFilterFactory in a query filter for a text_general
: field (here: content). It works fine; when I query the field for the
: stopword, I get no results.
...
: used in the text. What I am trying to achieve is to also filter the
: stopwords from the facet_fields, but it's not working. It would only work
: if the stopwords are also filtered during the indexing of the text_general
: field, right?
...
: My current solution is to 'filter' with code after retrieving the
: facet_fields from Solr. But is there a Solr-based way to do this niftier?

Not really. facet.field works based on the terms in the index -- if the term is in the index, and it's in the documents matching your query, you are going to get counts back for it.

-Hoss
Re: Retrieving multiple levels with hierarchical faceting in Solr
: I've found a couple of discussions online that suggest I ought to be
: able to set the prefix using local params:
:
: facet.field={!prefix=0;}foo
: facet.field={!prefix=1_foovalue; key=bar}foo

citation please? as far as i know that has never been implemented, but the idea was floated around as a hypothetical.

There is an open feature request for this type of logic, and it has a patch, but that patch doesn't work against any recent version (contributions to get it up to snuff would certainly be welcome)...

https://issues.apache.org/jira/browse/SOLR-1351
https://issues.apache.org/jira/browse/SOLR-2251

-Hoss
Re: maxClauseCount Exception
: I am suddenly getting a maxClauseCount exception for no reason. I am
: using Solr 3.5. I have only 206 documents in my index.

Unless things have changed, the reason you are seeing this is because _highlighting_ a query (clause) like type_s:[* TO *] requires rewriting it into a giant boolean query of all the terms in that field -- so even if you only have 206 docs, if you have more than 1024 values in that field in your index, you're going to go over the 1024-term limit. (You don't get this problem in a basic query, because it doesn't need to enumerate all the terms; it rewrites it to a ConstantScoreQuery.)

What you most likely want to do is move some of those clauses, like type_s:[* TO *] and usergroup_sm:admin, out of your main q query and into fq filters ... so they can be cached independently, won't contribute to scoring (just matching), and won't be used in highlighting.

: params={hl=true&hl.snippets=4&hl.simple.pre=<b></b>&fl=*,score&hl.mergeContiguous=true&hl.usePhraseHighlighter=true&hl.requireFieldMatch=true&echoParams=all&hl.fl=text_t&q={!lucene+q.op%3DOR+df%3Dtext_t}+(+kind_s:doc+OR+kind_s:xml)+AND+(type_s:[*+TO+*])+AND+(usergroup_sm:admin)&rows=20&start=0&wt=javabin&version=2} hits=204 status=500 QTime=166 |#]
: [#|2012-02-22T13:40:13.131-0500|SEVERE|glassfish3.1.1|
: org.apache.solr.servlet.SolrDispatchFilter|
: _ThreadID=22;_ThreadName=Thread-2;|org.apache.lucene.search.BooleanQuery
: $TooManyClauses: maxClauseCount is set to 1024
:   at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:136)
:   ...
:   at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:304)
:   at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:158)

-Hoss
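Applied to the request quoted above, that restructuring would look roughly like this (a sketch, shown unencoded for readability):

q={!lucene q.op=OR df=text_t}(kind_s:doc OR kind_s:xml)
&fq=type_s:[* TO *]
&fq=usergroup_sm:admin
&hl=true&hl.fl=text_t&hl.snippets=4&rows=20

Only the q clauses are highlighted; the fq filters just restrict the result set and are cached independently.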
Re: two solr instances using one index
the two solr instances are used to provide failover. can i define the priority of the two instances?

------------------ Original ------------------
From: 我自己的邮箱 345804...@qq.com
Date: Thu, Mar 8, 2012 02:05 PM
To: solr-user solr-user@lucene.apache.org
Subject: two solr instances using one index

Hi, everyone

2 solr server nodes point to the same data directory (same index). Do the two solr instances work independently? I found it strange: one node (node0) can do a complex search (for example: q=disease&sort=dateCreated), but the other (node1), using the same search, reported out of memory (the java -Xmx4G is enough). And when I tried to start node1 first after killing node0, any complex search completed well (if I kept node0 running, I could never start node1 without a heap size error, which impacts the ability of node1 to perform complex searches). Has anybody met this problem before, and any idea about it?

ps: my solr version is 1.3
Re: indexing bigdata
Your question is really unanswerable; there are about a zillion factors that could influence the answer. I can index 5-7K docs/second, so for me it's efficient. Others can index only a fraction of that. It all depends... "Try it and see" is about the only way to answer.

Best
Erick

On Thu, Mar 8, 2012 at 1:35 PM, Sharath Jagannath shotsonclo...@gmail.com wrote:
> Is indexing around 30 million documents in a single solr instance efficient? Has somebody experimented with it? I am planning to use it for an autosuggest feature I am implementing, so I expect responses within a few milliseconds. Should I be looking at sharding?
>
> Thanks,
> Sharath
Re: How to index doc file in solr?
Have you looked at ExtractingRequestHandler (aka Solr Cell)? SolrJ? Tika? Perhaps if you defined the problem a bit more, we'd be able to give you more comprehensive answers.

Best
Erick

On Wed, Mar 7, 2012 at 12:14 AM, Rohan Ashok Kumbhar rohan_kumb...@infosys.com wrote:
> Hi, I would like to know how to index any document other than xml in SOLR? Any comments would be appreciated!
>
> Thanks,
> Rohan
Reporting tools
Are there any reporting tools out there, so I can analyze search term frequency, filter frequency, etc.?
Re: Inconsistent Results with ZooKeeper Ensemble and Four SOLR Cloud Nodes
All,

I recreated the cluster on my machine at home (Windows 7, Java 1.6.0.23, apache-solr-4.0-2012-02-29_09-07-30), sent some documents through Manifold using its crawler, and it looks like it's replicating fine once the documents are committed. This must be related to my environment somehow. Thanks for your help.

Regards,
Matt

On Fri, Mar 2, 2012 at 9:06 AM, Erick Erickson erickerick...@gmail.com wrote:
> Matt: Just for paranoia's sake, when I was playing around with this (the _version_ thing was one of my problems too) I removed the entire data directory as well as the zoo_data directory between experiments (and recreated just the data dir). This included various index.2012 files and the tlog directory, on the theory that *maybe* there was some confusion happening on startup with an already-wonky index. If you have the energy and tried that, it might be helpful information, but it may also be a total red herring.
>
> FWIW
> Erick
>
> On Thu, Mar 1, 2012 at 8:28 PM, Mark Miller markrmil...@gmail.com wrote:
>> I'm assuming the windows configuration looked correct? Yeah, so far I can not spot any smoking gun... I'm confounded at the moment. I'll re-read through everything once more...
>> - Mark
Re: Stemmer Question
I'd be very interested to see how you did this, if it is available. Does this seem like something useful to the community at large?

On Thursday, March 8, 2012, Ahmet Arslan iori...@yahoo.com wrote:
> Not out-of-the-box. Actually, I was using your idea: I implemented such a custom token filter by mixing the synonym filter and a stem filter. This is useful for wildcard queries. And for normal queries, it can rank exact matches higher.
> [...]
Re: indexing bigdata
Ok, my bad. I should have put it in a better way: is it a good idea to have all 30M docs on a single instance, or should I consider a distributed set-up?

I have synthesized the data, configured the schema, and made suitable changes to the config. I have tested it with a smaller data-set on my laptop and have a good workflow set up. I do not have a big machine to test it on, and I wanted to make sure I have insight into either option before I decide to spin up an amazon instance.

Thanks,
Sharath

On Thu, Mar 8, 2012 at 6:18 PM, Erick Erickson erickerick...@gmail.com wrote:
> Your question is really unanswerable; there are about a zillion factors that could influence the answer. I can index 5-7K docs/second, so for me it's efficient. Others can index only a fraction of that. It all depends... "Try it and see" is about the only way to answer.
> [...]
Re: addBean method inserting multivalued values
I have not specified the multiValued attribute:

<dynamicField name="*_i" type="integer" indexed="true" stored="true"/>

I have different integer properties in my java class; some are single integer values, some are integer arrays. What I want is: if the setter method expects a single integer, then the stored field must be single-valued. But all integer dynamic fields are being indexed as multivalued. Please note that this happens only when I use the addBeans method. If I construct a SolrDocument, then indexing works as expected.

On Wed, Feb 1, 2012 at 3:43 PM, darul daru...@gmail.com wrote:
> remove multiValued="true" in your schema.xml file?
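For comparison, a bean along these lines keeps single- and multi-valued integers apart by mapping them to different dynamic-field patterns (a sketch; the *_is suffix is an assumed pattern that would need its own multiValued="true" dynamicField declared in schema.xml):

import org.apache.solr.client.solrj.beans.Field;

public class Item {
    @Field("id")
    String id;

    @Field("rank_i")      // maps to the single-valued *_i dynamic field
    int rank;

    @Field("scores_is")   // assumed multiValued dynamic field pattern, e.g. *_is
    int[] scores;
}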