Request to be added to the ContributorsGroup
Hi, My username is KumarLImbu and I would like to be added to the Contributors Group. Could somebody please help me? Best Regards, Kumar
Re: Usage of CloudSolrServer?
CloudSolrServer uses LBHttpSolrServer under the hood. CloudSolrServer connects to Zookeeper and passes the list of live nodes to LBHttpSolrServer, which then contacts the nodes in round-robin order. By the way, do you mean leader instead of master? 2013/7/12 sathish_ix skandhasw...@inautix.co.in Hi, I am using CloudSolrServer to connect to SolrCloud, and I am indexing documents through the SolrJ API using a CloudSolrServer object. Indexing is triggered on the master node of a collection, but when I need to find the status of the loading, the message is returned from a replica where the status is null. How can I find which instance CloudSolrServer is connecting to? -- View this message in context: http://lucene.472066.n3.nabble.com/Usage-of-CloudSolrServer-tp4056052p4077471.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Leader Election, when?
If you plan to have 2 shards: when you start the first node it becomes the leader of the first shard, and when you start the second node it becomes the leader of the second shard. The third node becomes a replica of the first shard, the fourth a replica of the second shard, the fifth a replica of the first shard... and this continues like that, round robin. 2013/7/11 aabreur alexandre.ab...@vtex.com.br I have a working Zookeeper ensemble running with 3 instances and also a solrcloud cluster with some solr instances. I've created a collection with settings for 2 shards. Then I: create 1 core on instance1, create 1 core on instance2, create 1 core on instance1, create 1 core on instance2, just to have this configuration: instance1: shard1_leader, shard2_replica; instance2: shard1_replica, shard2_leader. If I add 2 cores to instance1 and then 2 cores to instance2, both leaders will be on instance1 and no re-election is done: instance1: shard1_leader, shard2_leader; instance2: shard1_replica, shard2_replica. Back to my ideal scenario (detached leaders): when I add a third instance with 2 replicas and kill one of my instances running a leader, the election picks the instance that already has a leader. My question is why Zookeeper behaves this way. Shouldn't it distribute leaders? If I deliver some stress to a double-leader instance, is Zookeeper going to run an election? -- View this message in context: http://lucene.472066.n3.nabble.com/Leader-Election-when-tp4077381.html Sent from the Solr - User mailing list archive at Nabble.com.
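The round-robin placement described in the reply above can be sketched in plain Java (a hypothetical helper for illustration, not Solr's actual assignment code, which lives in the Overseer):

```java
public class RoundRobinAssignment {
    // The n-th node to start (0-based) lands on shard (n % numShards) + 1;
    // the first node to arrive on each shard becomes its leader, and later
    // arrivals become replicas.
    static String assign(int nodeIndex, int numShards) {
        int shard = (nodeIndex % numShards) + 1;
        String role = nodeIndex < numShards ? "leader" : "replica";
        return "shard" + shard + "_" + role;
    }

    public static void main(String[] args) {
        // Replays the five-node startup sequence from the reply above.
        for (int n = 0; n < 5; n++) {
            System.out.println("node" + (n + 1) + " -> " + assign(n, 2));
        }
    }
}
```

With 2 shards this prints shard1_leader, shard2_leader, shard1_replica, shard2_replica, shard1_replica for nodes 1 through 5, matching the sequence in the reply.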
Re: solr 4.3.0 cloud in Tomcat, link many collections to Zookeeper
If you have one collection you just need to define the hostnames of the Zookeeper ensemble and run that command once. 2013/7/11 Zhang, Lisheng lisheng.zh...@broadvision.com Hi, We are testing solr 4.3.0 in Tomcat (considering upgrading solr 3.6.1 to 4.3.0). In the WIKI page for SolrCloud in Tomcat: http://wiki.apache.org/solr/SolrCloudTomcat we need to link each collection explicitly: /// 8) Link uploaded config with target collection java -classpath .:/home/myuser/solr-war-lib/* org.apache.solr.cloud.ZkCLI -cmd linkconfig -collection mycollection -confname ... /// But our application has many cores (a few thousand, which all share the same schema/config); is there a more convenient way? Thanks very much for helps, Lisheng
Re: Performance of cross join vs block join
Hi Mikhail, I used the term block join incorrectly: when I said block join I was referring to a join performed on a single core, versus a cross join performed across multiple cores. But I saw your benchmark (from cache) and it seems that block join has better performance. Is this functionality available in Solr 4.3.1? I did not find such examples on Solr's wiki page. Does this functionality require a special schema, or special indexing? How would I need to index the data from my tables? In my case all the indices share a common schema anyway, since I am using dynamic fields, so I can easily add the documents from all tables into one Solr core, adding a discriminator field to each document. Could you point me to some more documentation? Thanks in advance, Mihaela From: Mikhail Khludnev mkhlud...@griddynamics.com To: solr-user solr-user@lucene.apache.org; mihaela olteanu mihaela...@yahoo.com Sent: Thursday, July 11, 2013 2:25 PM Subject: Re: Performance of cross join vs block join Mihaela, To me it's reasonable that a single-core join takes the same time as a cross-core one; I can't see what gain could be had in the former case. I can hardly comment on the join code; I've looked into it, and it's not trivial, to say the least. With block join there is no need to obtain parentId term values/numbers and look up parents by them; both of those actions are expensive. Block join also works as an iterator, whereas join needs to allocate memory for the parents bitset and populate it out of order, which hurts scalability. Also, in None scoring mode BJQ doesn't need to walk through all children, only hit the first one. Another nice feature is 'both-side leapfrog': if you have a highly restrictive filter/query intersecting with BJQ, it can skip many parents and children as well, which is not possible in join, which has a fairly 'full-scan' nature. The main performance factor for join is the number of child docs.
I'm not sure I got all your questions; please specify them in more detail if something is still unclear. Have you seen my benchmark http://blog.griddynamics.com/2012/08/block-join-query-performs.html ? On Thu, Jul 11, 2013 at 1:52 PM, mihaela olteanu mihaela...@yahoo.com wrote: Hello, Does anyone know of any performance measurements for cross joins compared to joins inside a single index? Is the join inside a single index that stores documents of all types (from the parent table and the child tables) with a discriminator field faster than the cross join (where each document type resides in its own index)? I have performed some tests, but it seems to me that a join in a single (bigger) index does not add much speed compared to cross joins. Why would a block join be faster than a cross join, if that is the case? What are the variables that count when trying to improve query execution time? Thanks! Mihaela -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Solr 4.3 Shard distributed request check probably incorrect?
Hi, we are using Solr 4.3 with regular sharding without ZooKeeper. I see the following errors inside our logs: 14995742 [qtp427093680-2249] INFO org.apache.solr.core.SolrCore - [DE1] webapp=/solr path=/select params={mm=266%25tie=0.1ids=1060691781qf=Title^1.2+Description^0.01+Keywords^0.4+ArtikelNumber^0.1distrib=falseq.alt=*:*wt=javabinversion=2rows=10defType=edismaxpf=%0aTitle^1.5+Description^0.3%0a+NOW=1373459092416shard.url= 172.31.4.63:8080/solr/DE1fl=%0aPID,updated,score%0a+start=0q=9783426647240bf=%0a%0a+partialResults=truetimeAllowed=5000isShard=truefq=Price:[*+TO+9]fq=ShopId1+8+10+12+2975)ps=100} status=0 QTime=2 14995742 [qtp427093680-2255] ERROR org.apache.solr.servlet.SolrDispatchFilter - null:java.lang.NullPointerException at org.apache.solr.handler.component.QueryComponent.createMainQuery(QueryComponent.java:727) at org.apache.solr.handler.component.QueryComponent.regularDistributedProcess(QueryComponent.java:588) at org.apache.solr.handler.component.QueryComponent.distributedProcess(QueryComponent.java:541) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:244) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) Because this is obviously a distributed request isShard=truedistrib=false shouldn't this request evaluate to a non-distributed request? It seems that the ResponseBuilder is marked as isDistrib == true, because otherwise it wouldn't execute the distributedProcess or am I wrong? Best regards, Hans
About Suggestions
Hi Solr people! We need to suggest part numbers in alphabetically order adding up to four characters to the already entered part number prefix. That works quite well with terms component acting on a multivalued field with keyword tokenizer and edge nGram filter. I am mentioning part numbers to indicate that each item in the multivalued field is a string without whitespace and where special characters like dashes cannot be seen as separators. Is there a way to know if the term (the suggestion) represents such a complete part number (without doing another query for each suggestion)? Since we are using SolJ, what we would need is something like boolean Term.isRepresentingCompleteFieldValue() Thanks, Alexander
Re: Performance of cross join vs block join
On Fri, Jul 12, 2013 at 12:19 PM, mihaela olteanu mihaela...@yahoo.com wrote: Hi Mikhail, I used the term block join incorrectly: when I said block join I was referring to a join performed on a single core, versus a cross join performed across multiple cores. But I saw your benchmark (from cache) and it seems that block join has better performance. Is this functionality available in Solr 4.3.1? Nope, SOLR-3076 has been waiting for ages. I did not find such examples on Solr's wiki page. Does this functionality require a special schema, or special indexing? Special indexing - yes. How would I need to index the data from my tables? In my case all the indices share a common schema anyway, since I am using dynamic fields, so I can easily add the documents from all tables into one Solr core, adding a discriminator field to each document. Correct, but the notion of a 'discriminator field' is a little different for block join. Could you point me to some more documentation? I can recommend only these: http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html http://www.youtube.com/watch?v=-OiIlIijWH0 Thanks in advance, Mihaela From: Mikhail Khludnev mkhlud...@griddynamics.com To: solr-user solr-user@lucene.apache.org; mihaela olteanu mihaela...@yahoo.com Sent: Thursday, July 11, 2013 2:25 PM Subject: Re: Performance of cross join vs block join Mihaela, To me it's reasonable that a single-core join takes the same time as a cross-core one; I can't see what gain could be had in the former case. I can hardly comment on the join code; I've looked into it, and it's not trivial, to say the least. With block join there is no need to obtain parentId term values/numbers and look up parents by them; both of those actions are expensive. Block join also works as an iterator, whereas join needs to allocate memory for the parents bitset and populate it out of order, which hurts scalability. Also, in None scoring mode BJQ doesn't need to walk through all children, only hit the first one.
Another nice feature is 'both-side leapfrog': if you have a highly restrictive filter/query intersecting with BJQ, it can skip many parents and children as well, which is not possible in join, which has a fairly 'full-scan' nature. The main performance factor for join is the number of child docs. I'm not sure I got all your questions; please specify them in more detail if something is still unclear. Have you seen my benchmark http://blog.griddynamics.com/2012/08/block-join-query-performs.html ? On Thu, Jul 11, 2013 at 1:52 PM, mihaela olteanu mihaela...@yahoo.com wrote: Hello, Does anyone know of any performance measurements for cross joins compared to joins inside a single index? Is the join inside a single index that stores documents of all types (from the parent table and the child tables) with a discriminator field faster than the cross join (where each document type resides in its own index)? I have performed some tests, but it seems to me that a join in a single (bigger) index does not add much speed compared to cross joins. Why would a block join be faster than a cross join, if that is the case? What are the variables that count when trying to improve query execution time? Thanks! Mihaela -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Search with punctuations
Hi, Scenario: users who perform a search forget to put a punctuation mark (apostrophe); for example, when a user wants to search for a value like INT'L, they just key in INTL (with no punctuation). In this scenario, I wish to return both values INTL and INT'L that are currently indexed on the SOLR instance. Currently, if I search for INTL it won't return the row having the value INT'L. Schema configuration entry for the field type:

<fieldType name="customStr" class="solr.TextField" positionIncrementGap="100" sortMissingLast="true">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="\s*[,.]\s*" replacement="" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement="" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="[';]" replacement="" replace="all"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="\s*[,.]\s*" replacement="" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement="" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="[';]" replacement="" replace="all"/>
  </analyzer>
</fieldType>

Please suggest what mechanism I should use to fetch both values, INTL and INT'L, when the search is performed for INTL. Also, do the regexes look correct for the analyzers? What different filters/tokenizers could be used to overcome this issue? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Search-with-punctuations-tp4077510.html Sent from the Solr - User mailing list archive at Nabble.com.
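As a sanity check, the filter chain above can be replayed with plain java.util.regex (a sketch that mirrors, rather than uses, Solr's PatternReplaceFilterFactory): both INT'L and INTL normalize to the same token, so with this field type in effect on both sides a query for INTL should match INT'L.

```java
public class PunctuationNormalizer {
    // Replays the analyzer chain from the schema above: keyword tokenizer
    // (a no-op on a single value), lowercase, trim, strip "," and ".",
    // collapse whitespace, then strip "'" and ";".
    static String normalize(String value) {
        return value.toLowerCase().trim()
                .replaceAll("\\s*[,.]\\s*", "")
                .replaceAll("\\s+", "")
                .replaceAll("[';]", "");
    }

    public static void main(String[] args) {
        System.out.println(normalize("INT'L")); // intl
        System.out.println(normalize("INTL"));  // intl
    }
}
```

Since both the index-side and query-side chains produce `intl`, a mismatch in practice often means the existing documents were indexed before the field type was changed; reindexing, or checking the chain on the admin Analysis page, is worth trying first.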
Re: How to set a condition on the number of docs found
Do you want to modify the Solr source code? Did you check this line in XMLWriter.java: writeAttr("numFound", Long.toString(numFound)); 2013/7/12 Matt Lieber mlie...@impetus.com Hello there, I would like to be able to know whether I got over a certain threshold of doc results, i.e. test (result.numFound > 10) -> true. Is there a way to do this? I can't seem to find how to do this (other than doing the test in the client app, which is not great). Thanks, Matt NOTE: This message may contain information that is confidential, proprietary, privileged or otherwise protected by law. The message is intended solely for the named addressee. If received in error, please destroy and notify the sender. Any use of this email is prohibited when received in error. Impetus does not represent, warrant and/or guarantee, that the integrity of this communication has been maintained nor that the communication is free of errors, virus, interception or interference.
Re: Solr Live Nodes not updating immediately
Hi, the tickTime in zookeeper was high. When I reduced it to 2000 ms, the solr node status got updated within 20 s, which resolved my issue. Thanks for helping me. I have one more question: 1. Is it advisable to reduce the tickTime further? 2. What is the most appropriate tickTime that gives maximum performance while still updating the solr node status quickly? I have included my zoo.cfg configuration: tickTime=2000 dataDir=/home/local/ranjith-1785/sources/solrcloud/zookeeper-3.4.5_Server1/zoodata clientPort = 2181 initLimit=5 syncLimit=2 maxClientCnxns=180 server.1=localhost:2888:3888 server.2=localhost:3000:4000 server.3=localhost:2500:3500 -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Live-Nodes-not-updating-immediately-tp4076560p4077467.html Sent from the Solr - User mailing list archive at Nabble.com.
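For context on why a large tickTime slows down live-node updates: by default ZooKeeper negotiates client session timeouts to between 2x and 20x the tickTime, so a dead node is only noticed after a multiple of the tick. A small calculation, assuming those default bounds (both are overridable in zoo.cfg):

```java
public class TickTimeBounds {
    // Default ZooKeeper bounds: minSessionTimeout = 2 * tickTime,
    // maxSessionTimeout = 20 * tickTime.
    static int minSessionTimeoutMs(int tickTimeMs) { return 2 * tickTimeMs; }
    static int maxSessionTimeoutMs(int tickTimeMs) { return 20 * tickTimeMs; }

    public static void main(String[] args) {
        int tick = 2000; // the value from the zoo.cfg above
        System.out.println(minSessionTimeoutMs(tick) + " ms .. " + maxSessionTimeoutMs(tick) + " ms");
    }
}
```

So with tickTime=2000, a client that negotiated the maximum can take up to 40 s to be declared dead, which lines up with the ~20 s update latency reported above; shrinking tickTime further trades faster detection for more heartbeat traffic and more spurious session expirations.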
Custom processing in Solr Request Handler plugin and its debugging ?
Hi, I have defined my new Solr RequestHandler plugin like this in solrconfig.xml: <requestHandler name="/myendpoint" class="com.abc.MyRequestPlugin"/> And it's working fine. Now I want to do some custom processing from this plugin by making a search query to the regular '/select' handler: <requestHandler name="/select" class="solr.SearchHandler"/> and then receive the results back from the '/select' handler, perform some custom processing on those results, and send the response back from my custom /myendpoint handler. For this I need help on how to make a call to the '/select' handler from within the MyRequestPlugin class and perform some calculation on the results. I also need some help on how to debug my plugin. Since its .jar is deployed to solr_home/lib, how can I attach my plugin's code in Eclipse to the Solr process so I can debug it when a user sends a request to my plugin? Thanks, Tony
SolrCloud group.query error shard X did not set sort field values or how i can set fillFields=true on IndexSearcher.search
Hi! To reproduce the problem, do the following: 1. Start node1 of SolrCloud (4.3.1, default configs) (java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkRun -jar start.jar) 2. Import some data into collection1 - shard1 3. Try a group.query, e.g. http://node1:8983/solr/collection1/select?q=*:*&group=true&group.query=someFiled:someValue — it is important to have a hit on the indexed data. 4. The result is fine; there is no error. 5. Start node2 of SolrCloud (java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar) 6. On node2 add a new core for collection1 - shard2, and unload the default core collection1. We now have one collection over two shards: shard1 has data, shard2 has none. 7. Again try the group.query http://node1:8983/solr/collection1/select?q=*:*&group=true&group.query=someFiled:someValue 8. Error: shard 0 did not set sort field values (FieldDoc.fields is null); you must pass fillFields=true to IndexSearcher.search on each shard. How can I set fillFields=true on IndexSearcher.search? Thanks in advance, Evgeny
How to optimize a search?
Hello folks, I'm doing a search for a specific phrase (Rocket Banana) in a specific field, and the document with the title Rocket Banana (Single) never comes first... and this is the result that should appear in first position. I've tried many ways to perform this search: title:"Rocket Banana" title:(Rocket AND Banana) title:(Rocket OR Banana) title:(Rocket^0.175 AND Banana^0.175) title:(Rocket^0.175 OR Banana^0.175) The order returned is basically:

<doc><float name="score">12.106901</float><str name="title">Rocket Rocket</str></doc>
<doc><float name="score">12.007204</float><str name="title">Rocket</str></doc>
<doc><float name="score">12.007203</float><str name="title">Banana Banana Banana</str></doc>
... a lot of results ...
<doc><float name="score">10.398543</float><str name="title">Rocket Banana (Single)</str></doc>

How can I optimize my search so that the document containing the full phrase I searched for gets a higher score than the others? -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-optimize-a-search-tp4077531.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to optimize a search?
_Why_ should Rocket Banana (Single) come first? Essentially you have some ordering in mind and unless you can express it clearly you'll _never_ get ideal ranking. Really. But your particular issue can probably be solved by adding a clause like OR rocket banana^5 And I suspect you haven't given us the entire query, or you're running through edismax or whatever. In future, please paste the result of adding debug=all to the e-mail. Best Erick On Fri, Jul 12, 2013 at 7:32 AM, padcoe davidpadi...@gmail.com wrote: Hello folks, I'm doing a search for a specific word (Rocket Banana) in a specific field and the document with the result Rocket Banana (Single) never comes first..and this is the result that should appear in first position...i've tried to many ways to perform this search: title:Rocket Banana title:(Rocket AND Banana) title:(Rocket OR Banana) title:(Rocket^0.175 AND Banana^0.175) title:(Rocket^0.175 ORBanana^0.175) The order returned is basically like: docfloat name=score12.106901/floatstr name=titleRocket Rocket/str/doc docfloat name=score12.007204/floatstr name=titleRocket/str/doc docfloat name=score12.007203/floatstr name=titleBanana Banana Banana/str/doc a lot of results docfloat name=score10.398543/floatstr name=titleRocket Banana (Single)/str/doc How can i optimize my search and return the document that have the full word that i've searched with a higher scores then others? -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-optimize-a-search-tp4077531.html Sent from the Solr - User mailing list archive at Nabble.com.
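The suggestion in the reply above — keep the boolean query but add a boosted phrase clause — can be sketched as simple query-string assembly (the field name `title` and the boost of 5 are taken from the thread; adjust to taste):

```java
public class BoostedQuery {
    // Combines a conjunction over the individual words with a phrase
    // clause boosted high enough that exact-phrase matches like
    // "Rocket Banana (Single)" outrank titles that merely repeat one word.
    static String withPhraseBoost(String field, String w1, String w2, int boost) {
        return field + ":(" + w1 + " AND " + w2 + ")"
                + " OR " + field + ":\"" + w1 + " " + w2 + "\"^" + boost;
    }

    public static void main(String[] args) {
        System.out.println(withPhraseBoost("title", "Rocket", "Banana", 5));
        // title:(Rocket AND Banana) OR title:"Rocket Banana"^5
    }
}
```

With edismax the same effect is usually achieved declaratively via the pf (phrase fields) parameter instead of hand-building the clause.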
Patch review request: SOLR-5001 (adding book links to the website)
Hello, As per earlier email thread, I have created a patch for Solr website to incorporate links to my new book. It would be nice if somebody with commit rights for the (markdown) website could look at it before the book's Solr version (4.3.1) stops being the latest :-) I promise to help with the new Wiki/Guide later in return. https://issues.apache.org/jira/browse/SOLR-5001 Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
Does Solrj Batch Processing Querying May Confuse?
I've crawled some webpages and indexed them in Solr. I query the data from Solr via SolrJ. url is my unique field and I've defined my query like this: ModifiableSolrParams params = new ModifiableSolrParams(); params.set("q", "lang:tr"); params.set("fl", "url"); params.set("sort", "url desc"); I run my program to query 1000 rows per request and write them to a file. However, I realized that some documents that are indexed in Solr (I can query them from the admin page, just not via SolrJ as part of the 1000-row batch process) are not in my file. What may be the problem?
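One plausible cause (an assumption, since the message doesn't say whether indexing continued during the export): start/rows paging over an index that changes between requests shifts the page offsets, so documents can be silently skipped or duplicated. A plain-Java simulation of the skip case:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

public class PagingPitfall {
    // Pages through a descending-sorted "index" with start/rows while a
    // document disappears between page fetches; the offsets shift and
    // "url-d" is never returned, even though it was always in the index.
    static List<String> pageAll(TreeSet<String> index, int rows) {
        List<String> seen = new ArrayList<>();
        for (int start = 0; start < 6; start += rows) {
            List<String> snapshot = new ArrayList<>(index);
            for (int i = start; i < Math.min(start + rows, snapshot.size()); i++) {
                seen.add(snapshot.get(i));
            }
            index.remove("url-f"); // a doc is deleted (or re-sorted) mid-scan
        }
        return seen;
    }

    public static void main(String[] args) {
        TreeSet<String> index = new TreeSet<>(Comparator.reverseOrder()); // url desc
        for (char c = 'a'; c <= 'f'; c++) index.add("url-" + c);
        List<String> seen = pageAll(index, 2);
        System.out.println(seen);                   // url-d is missing
        System.out.println(seen.contains("url-d")); // false
    }
}
```

Pausing commits for the duration of the export, or using cursor-based deep paging where the Solr version supports it, avoids this class of problem.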
Re: Problem using Term Component in solr
bq: Note:Term Component works only on string dataType field. :( Not true. Term Component will work on any indexed field. It'll bring back the _tokens_ that have been indexed though, which are often individual words so your examples medical physics would be two separate tokens so it may be puzzling. A general request, please don't put bold text. I know it's an attempt to help direct attention to the important bits, but (at least in gmail in my browser) bolds are replaced by * before and after, which especially when looking at wildcard questions is really confusing G. But I have to ask you to back up a bit. _Why_ are you using TermsComponent to search titles? Why not use Solr for what it's good for and just search a _tokenized_ title field? This feels like an XY problem. Best Erick On Thu, Jul 11, 2013 at 2:55 AM, Parul Gupta(Knimbus) parulgp...@gmail.com wrote: Hi All I am using *Term component* in solr for searching titles with short form using wild card characters(.*) and [a-z0-9]*. I am using *Term Component* specifically as wild card characters are not working on *select?q=* query search. Examples of some *title *are: 1)Medicine, Health Care and Philosophy 2)Medical Physics 3)Physics of fluids 4)Medical Engineering and Physics ***When i do *solr query*: localhost:8080/solr3.6/OA/terms?terms.fl=titleterms.regex=phy.* fluidsterms.regex.flag=case_insensitiveterms.limit=10 *Output* is 3rd title: *Physics of fluids* This is relevant output. ***But when i do *solr query*: localhost:8080/solr3.6/OA/terms?terms.fl=titleterms.regex=med.* phy.*terms.regex.flag=case_insensitiveterms.limit=10 *Output* are 2nd and 4th title: *Medical Engineering and Physics* *Medical Physics* This is irrelevant.I want only one result for this query i.e. *Medical Physics* *Although i have changed my wild card characters to *[a-z0-9]** instead of *.** ,but than first query doesn't work as '*of*' is included in '*Physics of fluids*'.However Second query works fine . 
An example query is: localhost:8080/solr3.6/OA/terms?terms.fl=title&terms.regex=med[a-z0-9]* phy[a-z0-9]*&terms.regex.flag=case_insensitive&terms.limit=10 This works fine and gives one output, *Medical Physics*. If there is another way of searching, with or without the *Term Component*, please suggest how to ignore such stop words. Note: Term Component works only on string dataType fields. :( -- View this message in context: http://lucene.472066.n3.nabble.com/Problem-using-Term-Component-in-solr-tp4077200.html Sent from the Solr - User mailing list archive at Nabble.com.
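The difference between the two patterns in this thread comes down to `.*` matching across spaces while `[a-z0-9]*` stops at the first space. This is easy to check with plain java.util.regex, the same regex dialect terms.regex uses:

```java
import java.util.regex.Pattern;

public class TermsRegexDemo {
    public static void main(String[] args) {
        Pattern dotStar = Pattern.compile("med.* phy.*", Pattern.CASE_INSENSITIVE);
        Pattern bounded = Pattern.compile("med[a-z0-9]* phy[a-z0-9]*", Pattern.CASE_INSENSITIVE);

        String wanted   = "Medical Physics";
        String unwanted = "Medical Engineering and Physics";

        System.out.println(dotStar.matcher(wanted).matches());   // true
        System.out.println(dotStar.matcher(unwanted).matches()); // true: ".*" swallows " Engineering and"
        System.out.println(bounded.matcher(wanted).matches());   // true
        System.out.println(bounded.matcher(unwanted).matches()); // false: "[a-z0-9]*" cannot cross a space
    }
}
```

So `[a-z0-9]*` gives the precise word-prefix behavior wanted here; the residual problem with titles like "Physics of fluids" is the intervening stop word, which a regex alternative such as an optional `(of )?` group could absorb, at the cost of hand-maintaining the stop-word list.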
Re: Solr caching clarifications
Inline On Thu, Jul 11, 2013 at 8:36 AM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Hello, As a result of frequent java OOM exceptions, I try to investigate more into the solr jvm memory heap usage. Please correct me if I am mistaking, this is my understanding of usages for the heap (per replica on a solr instance): 1. Buffers for indexing - bounded by ramBufferSize 2. Solr caches 3. Segment merge 4. Miscellaneous- buffers for Tlogs, servlet overhead etc. Particularly I'm concerned by Solr caches and segment merges. 1. How much memory consuming (bytes per doc) are FilterCaches (bitDocSet) and queryResultCaches (DocList)? I understand it is related to the skip spaces between doc id's that match (so it's not saved as a bitmap). But basically, is every id saved as a java int? Different beasts. filterCache consumes, essentially, maxDoc/8 bytes (you can get the maxDoc number from your Solr admin page). Plus some overhead for storing the fq text, but that's usually not much. This is for each entry up to Size. queryResultCache is usually trivial unless you've configured it extravagantly. It's the query string length + queryResultWindowSize integers per entry (queryResultWindowSize is from solrconfig.xml). 2. QueryResultMaxDocsCached - (for example = 100) means that any query resulting in more than 100 docs will not be cached (at all) in the queryResultCache? Or does it have to do with the documentCache? It's just a limit on the queryResultCache entry size as far as I can tell. But again this cache is relatively small, I'd be surprised if it used significant resources. 3. DocumentCache - written on the wiki it should be greater than max_results*concurrent_queries. Max result is just the num of rows displayed (rows-start) param, right? Not the queryResultWindow. Yes. This a cache (I think) for the _contents_ of the documents you'll be returning to be manipulated by various components during the life of the query. 4. 
LazyFieldLoading=true - when querying for ids only (fl=id), will this cache be used? (at the expense of evicting docs that were already loaded with stored fields) Not sure, but I don't think this will contribute much to memory pressure. This is about how many fields are loaded to get a single value from a doc in the results list, and since one is usually working with 20 or so docs this is usually a small amount of memory. 5. How large is the heap used by merges? Assuming we have a merge of 10 segments of 500MB each (half inverted files - *.pos, *.doc etc., half non-inverted files - *.fdt, *.tvd), how much heap should be left unused for this merge? Again, I don't think this is much of a memory consumer, although I confess I don't know the internals. Merging is mostly about I/O. Thanks in advance, Manu But take a look at the admin page; you can see how much memory the various caches are using by looking at the plugins/stats section. Best Erick
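Erick's filterCache estimate above turns into a quick back-of-the-envelope calculation (a sketch; real entries also carry some overhead for the fq text, which he notes is usually small):

```java
public class FilterCacheEstimate {
    // Each filterCache entry is essentially a bitset over all documents:
    // maxDoc / 8 bytes. The cache total is that times the configured size.
    static long bytesPerEntry(long maxDoc) { return maxDoc / 8; }

    static long totalBytes(long maxDoc, int cacheSize) {
        return bytesPerEntry(maxDoc) * cacheSize;
    }

    public static void main(String[] args) {
        long maxDoc = 10_000_000L; // hypothetical index size; read yours off the admin page
        System.out.println(bytesPerEntry(maxDoc));   // 1250000 (~1.2 MB per entry)
        System.out.println(totalBytes(maxDoc, 512)); // 640000000 (~640 MB at size=512)
    }
}
```

This is why a generously sized filterCache on a large index is one of the first places to look when chasing OOMs like the ones described above.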
RE: What happens in indexing request in solr cloud if Zookeepers are all dead?
Thanks very much for your clear explanation! -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Thursday, July 11, 2013 1:55 PM To: solr-user@lucene.apache.org Subject: Re: What happens in indexing request in solr cloud if Zookeepers are all dead? Sorry, no updates if no Zookeepers. There would be no way to assure that any node knows the proper configuration. Queries are a little safer using most recent configuration without zookeeper, but update consistency requires accurate configuration information. -- Jack Krupansky -Original Message- From: Zhang, Lisheng Sent: Thursday, July 11, 2013 2:59 PM To: solr-user@lucene.apache.org Subject: RE: What happens in indexing request in solr cloud if Zookeepers are all dead? Yes, I should not have used word master/slave for solr cloud! So if all Zookeepers are dead, could indexing requests be handled properly (could solr remember the setting for indexing)? Thanks very much for helps, Lisheng -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Thursday, July 11, 2013 10:46 AM To: solr-user@lucene.apache.org Subject: Re: What happens in indexing request in solr cloud if Zookeepers are all dead? There are no masters or slaves in SolrCloud - it is fully distributed and master-free. Leaders are temporary and can vary over time. The basic idea for quorum is to prevent split brain - two (or more) distinct sets of nodes (zookeeper nodes, that is) each thinking they constitute the authoritative source for access to configuration information. The trick is to require (N/2)+1 nodes for quorum. For n=3, quorum would be (3/2)+1 = 1+1 = 2, so one node can be down. For n=1, quorum = (1/2)+1 = 0 + 1 = 1. For n=2, quorum would be (2/2)+1 = 1 + 1 = 2, so no nodes can be down. IOW, for n=2 no nodes can be down for the cluster to do updates. 
-- Jack Krupansky -Original Message- From: Zhang, Lisheng Sent: Thursday, July 11, 2013 9:28 AM To: solr-user@lucene.apache.org Subject: What happens in indexing request in solr cloud if Zookeepers are all dead? Hi, In solr cloud latest doc, it mentioned that if all Zookeepers are dead, distributed query still works because solr remembers the cluster state. How about the indexing request handling if all Zookeepers are dead, does solr needs Zookeeper to know which box is master and which is slave for indexing to work? Could solr remember master/slave relations without Zookeeper? Also doc said Zookeeper quorum needs to have a majority rule so that we must have 3 Zookeepers to handle the case one instance is crashed, what would happen if we have two instances in quorum and one instance is crashed (or quorum having 3 instances but two of them are crashed)? I felt the last one should take over? Thanks very much for helps, Lisheng
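The quorum arithmetic in the reply above is easy to mechanize:

```java
public class ZkQuorum {
    // Quorum for an ensemble of n ZooKeeper nodes: (n/2) + 1 with integer
    // division, exactly as derived in the message above.
    static int quorum(int n) { return n / 2 + 1; }

    // How many nodes may fail while updates remain possible.
    static int tolerableFailures(int n) { return n - quorum(n); }

    public static void main(String[] args) {
        for (int n = 1; n <= 5; n++) {
            System.out.println("n=" + n + " quorum=" + quorum(n)
                    + " canLose=" + tolerableFailures(n));
        }
        // Note that n=2 tolerates no failures: for writes it is no more
        // resilient than a single node, which is why odd sizes are used.
    }
}
```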
Re: How to boost relevance based on distance and age..
the first thing I'd try would be FunctionQueries, see: http://wiki.apache.org/solr/FunctionQuery. Be a little careful. You have disjoint conditions, i.e. one or the other should be used, so you'll have two function queries, basically expressing if (age < 20 years) and if (age >= 20 years). The one that _doesn't_ apply should return 1, not 0, since it'll be multiplied by the score. Best Erick On Thu, Jul 11, 2013 at 11:03 AM, Vineel vine...@visionsoft-inc.com wrote: Here is the structure of the solr document: <doc><str name="latlong">52.401790,4.936660</str><date name="dateOfBirth">1993-12-09T00:00:00Z</date></doc> I would like to search for documents based on the following weighted criteria: - distance 0-10 miles: weight 40 - distance 10 miles and above: weight 20 - age 0-20 years: weight 20 - age 20 years and above: weight 10 Wondering what the recommended approaches are to build SOLR queries for this? Thanks -Vineel -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-boost-relevance-based-on-distance-and-age-tp4077330.html Sent from the Solr - User mailing list archive at Nabble.com.
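The disjoint weighting from the question can be modeled as pairs of piecewise functions where the branch that does not apply returns 1, as Erick advises, so multiplying them into the score is safe. A plain-Java sketch of the intended weights (thresholds and weights are the question's own; the translation into actual bf/boost function-query syntax is left to the reader):

```java
public class AgeDistanceBoost {
    // Weights from the question: distance < 10 miles -> 40, else 20;
    // age < 20 years -> 20, else 10. Each pair of functions covers one
    // criterion; the inactive branch contributes 1 so the product with
    // the relevance score is unchanged by it.
    static double nearBoost(double miles)  { return miles < 10 ? 40 : 1; }
    static double farBoost(double miles)   { return miles >= 10 ? 20 : 1; }
    static double youngBoost(double years) { return years < 20 ? 20 : 1; }
    static double oldBoost(double years)   { return years >= 20 ? 10 : 1; }

    static double totalBoost(double miles, double years) {
        return nearBoost(miles) * farBoost(miles) * youngBoost(years) * oldBoost(years);
    }

    public static void main(String[] args) {
        System.out.println(totalBoost(5, 18));  // 800.0 (near and young: 40 * 20)
        System.out.println(totalBoost(5, 25));  // 400.0 (near, older:   40 * 10)
        System.out.println(totalBoost(15, 25)); // 200.0 (far, older:    20 * 10)
    }
}
```

Had the inactive branches returned 0 instead of 1, every product would collapse to 0, which is exactly the pitfall the reply warns about.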
RE: solr 4.3.0 cloud in Tomcat, link many collections to Zookeeper
Sorry, I might not have asked clearly: our issue is that we have a few thousand collections (it could be many more), so running that command per collection is rather tedious. Is there a simpler way (all collections share the same schema/config)? Thanks very much for helps, Lisheng -Original Message- From: Furkan KAMACI [mailto:furkankam...@gmail.com] Sent: Friday, July 12, 2013 1:17 AM To: solr-user@lucene.apache.org Subject: Re: solr 4.3.0 cloud in Tomcat, link many collections to Zookeeper If you have one collection you just need to define the hostnames of the Zookeeper ensemble and run that command once. 2013/7/11 Zhang, Lisheng lisheng.zh...@broadvision.com Hi, We are testing solr 4.3.0 in Tomcat (considering upgrading solr 3.6.1 to 4.3.0). In the WIKI page for SolrCloud in Tomcat: http://wiki.apache.org/solr/SolrCloudTomcat we need to link each collection explicitly: /// 8) Link uploaded config with target collection java -classpath .:/home/myuser/solr-war-lib/* org.apache.solr.cloud.ZkCLI -cmd linkconfig -collection mycollection -confname ... /// But our application has many cores (a few thousand, which all share the same schema/config); is there a more convenient way? Thanks very much for helps, Lisheng
Re: Request to be added to the ContributorsGroup
Done, at least for the Solr contributors group; if you want Lucene too, let me know. Added exactly as KumarLImbu; I don't know whether (1) both the L and the I should be capitalized, or (2) the rights-checking cares. Thanks! Erick On Fri, Jul 12, 2013 at 2:51 AM, Kumar Limbu kumarli...@gmail.com wrote: Hi, My username is KumarLImbu and I would like to be added to the Contributors Group. Could somebody please help me? Best Regards, Kumar
Re: Leader Election, when?
This is probably not all that important to worry about. The additional duties of a leader are pretty minimal, and the leaders will shift around anyway as you restart servers etc. It really feels like a premature optimization. Best, Erick On Thu, Jul 11, 2013 at 3:53 PM, aabreur alexandre.ab...@vtex.com.br wrote: I have a working Zookeeper ensemble running with 3 instances and also a SolrCloud cluster with some Solr instances. I've created a collection with settings for 2 shards. Then I: create 1 core on instance1, create 1 core on instance2, create 1 core on instance1, create 1 core on instance2, just to have this configuration: instance1: shard1_leader, shard2_replica; instance2: shard1_replica, shard2_leader. If I add 2 cores to instance1 then 2 cores to instance2, both leaders will be on instance1 and no re-election is done: instance1: shard1_leader, shard2_leader; instance2: shard1_replica, shard2_replica. Back to my ideal scenario (detached leaders): also, when I add a third instance with 2 replicas and kill one of my instances running a leader, the election picks the instance that already has a leader. My question is why Zookeeper behaves this way. Shouldn't it distribute leaders? If I deliver some stress to a double-leader instance, is Zookeeper going to run an election? -- View this message in context: http://lucene.472066.n3.nabble.com/Leader-Election-when-tp4077381.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr Live Nodes not updating immediately
On 7/11/2013 11:11 PM, Ranjith Venkatesan wrote: The tickTime in zookeeper was high. When I reduced it to 2000 ms, Solr node status gets updated in 20 s. That resolved my issue; thanks for helping me. I have one more question: 1. Is it advisable to reduce the tickTime further? 2. What is the most appropriate tickTime that gives maximum performance while still updating Solr node status quickly? I have included my zoo.cfg configuration:
tickTime=2000
dataDir=/home/local/ranjith-1785/sources/solrcloud/zookeeper-3.4.5_Server1/zoodata
clientPort=2181
initLimit=5
syncLimit=2
maxClientCnxns=180
server.1=localhost:2888:3888
server.2=localhost:3000:4000
server.3=localhost:2500:3500
Here's mine, comments removed. Except for dataDir, these are all default values found in the zookeeper download and on the zookeeper website:
tickTime=2000
initLimit=10
syncLimit=5
dataDir=zoodata
clientPort=2181
server.1=zoo1.REDACTED.com:2888:3888
server.2=zoo2.REDACTED.com:2888:3888
server.3=zoo3.REDACTED.com:2888:3888
http://zookeeper.apache.org/doc/r3.4.5/zookeeperStarted.html#sc_RunningReplicatedZooKeeper I hope your config is a dev install, because if all your zookeepers are running on the same server, you have no redundancy in the face of a server failure. Servers do fail, even if they have all the redundancy features you can buy. Thanks, Shawn
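A likely explanation for why tickTime changed detection speed (a sketch; the ZooKeeper server clamps every client's session timeout into the range [2 x tickTime, 20 x tickTime], and the 15000 ms zkClientTimeout default for Solr 4.x is an assumption here):

```java
public class ZkSessionBounds {
    // ZooKeeper negotiates each client's requested session timeout into
    // [2 * tickTime, 20 * tickTime] (server-side defaults). Solr's ephemeral
    // live-node entries only disappear after the session expires, so a large
    // tickTime silently inflates the timeout and delays node-status updates.
    public static long negotiate(long requestedMs, long tickTimeMs) {
        long min = 2 * tickTimeMs;
        long max = 20 * tickTimeMs;
        return Math.max(min, Math.min(max, requestedMs));
    }

    public static void main(String[] args) {
        // Assuming Solr requests a 15000 ms session timeout:
        System.out.println(negotiate(15000, 2000));  // honored: 15000
        System.out.println(negotiate(15000, 30000)); // clamped up to 60000
    }
}
```

This is why reducing tickTime further buys little once the requested timeout falls inside the negotiable range: the session timeout, not tickTime itself, bounds detection latency.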
Re: solr 4.3.0 cloud in Tomcat, link many collections to Zookeeper
On 7/12/2013 7:29 AM, Zhang, Lisheng wrote: Sorry I might not have asked clearly, our issue is that we have a few thousand collections (can be much more), so running that command is rather tedius, is there a simpler way (all collections share same schema/config)? When you create each collection with the Collections API (http calls), you tell it the name of a config set stored in zookeeper. You can give all your collections the same config set if you like. If you manually create collections with the CoreAdmin API instead, you must use the zkcli script included in Solr to link the collection to the config set, which can be done either before or after the collection is created. The zkcli script provides some automation for the java command that you were given by Furkan. Thanks, Shawn
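Shawn's first option is easy to script for thousands of collections (a plain-Java sketch; the host and names are hypothetical, and collection.configName is the Collections API parameter that points a new collection at an already-uploaded config set, so no per-collection linkconfig run is needed):

```java
public class CreateCollections {
    // Build a Collections API CREATE call that reuses one shared config set.
    public static String createUrl(String host, String collection, String configSet) {
        return "http://" + host + "/solr/admin/collections?action=CREATE"
                + "&name=" + collection
                + "&numShards=1"
                + "&collection.configName=" + configSet;
    }

    public static void main(String[] args) {
        // Hypothetical: three collections, all sharing the "sharedconf" set.
        for (int i = 1; i <= 3; i++) {
            System.out.println(createUrl("localhost:8983", "coll" + i, "sharedconf"));
        }
    }
}
```

Issuing each URL (e.g. with any HTTP client) creates the collection already linked, replacing the per-collection zkcli linkconfig step.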
Re: How to set a condition on the number of docs found
Hmmm. One way is: http://localhost:8983/solr/core/select/?q=*%3A*&facet=true&facet.field=id&facet.offset=10&rows=0&facet.limit=1 If you get a facet result back, you have more than 10 results. Another way is to just look at it with a facet.query and have your app deal with it: http://localhost:8983/solr/core/select/?q=*%3A*&facet=true&facet.query={!lucene%20key=numberofresults}state:CO&rows=0 On Thu, Jul 11, 2013 at 11:45 PM, Matt Lieber mlie...@impetus.com wrote: Hello there, I would like to be able to know whether I got over a certain threshold of doc results, i.e. test (Result.numFound > 10) -> true. Is there a way to do this? I can't seem to find how (other than doing this test in the client app, which is not great). Thanks, Matt NOTE: This message may contain information that is confidential, proprietary, privileged or otherwise protected by law. The message is intended solely for the named addressee. If received in error, please destroy and notify the sender. Any use of this email is prohibited when received in error. Impetus does not represent, warrant and/or guarantee, that the integrity of this communication has been maintained nor that the communication is free of errors, virus, interception or interference. -- Bill Bell billnb...@gmail.com cell 720-256-8076
Re: How to set a condition on the number of docs found
Test where? I mean, numFound is right there at the top of the query results, right? Unfortunately there is no function query value source equivalent to numFound. There is numdocs, but that is the total number of documents in the index. There is also docfreq(term), which could be used in a function query (including the fl parameter) if you know a term that has a 1-to-1 relationship to your query results. It is worth filing a Jira to add numfound() as a function query value source. -- Jack Krupansky -----Original Message----- From: Matt Lieber Sent: Friday, July 12, 2013 1:45 AM To: solr-user@lucene.apache.org Subject: How to set a condition on the number of docs found Hello there, I would like to be able to know whether I got over a certain threshold of doc results, i.e. test (Result.numFound > 10) -> true. Is there a way to do this? I can't seem to find how (other than doing this test in the client app, which is not great). Thanks, Matt
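As Jack notes, numFound sits in every response header, so the cheapest threshold test is a rows=0 query plus a client-side comparison. A minimal sketch (the JSON snippet is hand-written for illustration, not captured from a real server):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumFoundCheck {
    // Pull numFound out of a Solr JSON response body. A real client would
    // use SolrJ's QueryResponse.getResults().getNumFound() instead of regex;
    // this keeps the sketch dependency-free.
    public static long numFound(String jsonResponse) {
        Matcher m = Pattern.compile("\"numFound\"\\s*:\\s*(\\d+)").matcher(jsonResponse);
        if (!m.find()) throw new IllegalArgumentException("no numFound in response");
        return Long.parseLong(m.group(1));
    }

    public static void main(String[] args) {
        // Hand-written example response for a rows=0 query:
        String response = "{\"response\":{\"numFound\":42,\"start\":0,\"docs\":[]}}";
        System.out.println(numFound(response) > 10); // true
    }
}
```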
Re: Norms
Thanks. Yeah, I don't really want the queryNorm. On Wed, Jul 10, 2013 at 2:39 AM, Daniel Collins danwcoll...@gmail.com wrote: I don't know the full answer to your question, but here's what I can offer. Solr offers 2 types of normalisation, fieldNorm and queryNorm. FieldNorm is, as the name suggests, field-level normalisation based on the length of the field, and can be controlled by the omitNorms parameter on the field. In your example, fieldNorm is always 1.0 (see below), so that suggests you have correctly turned off field normalisation on the name_edgy field. 1.0 = fieldNorm(field=name_edgy, doc=231378) QueryNorm is what I'm still trying to get to the bottom of exactly :) But it's something that tries to normalise the results of different term queries so they are broadly comparable. You haven't supplied the query you ran, but based on the qf and bf, I'm assuming it breaks down into a DisMax query on 3 fields (name_edgy, name_edge, name_word), so queryNorm is trying to ensure that the results of those 3 queries can be compared. The exact details I'm still trying to get to the bottom of (any volunteers with more info, chip in!). From earlier answers to the list, queryNorm is calculated in the Similarity object; I need to dig further, but that's probably a good place to start. On 10 July 2013 04:57, William Bell billnb...@gmail.com wrote: I have a field that has omitNorms=true, but when I look at debugQuery I see that the field is being normalized for the score. What can I do to turn off normalization in the score? I want a simple way to do 2 things: boost geodist() highest at 1 mile and lowest at 100 miles, plus add a boost for query=edgefield^5. I only want tf() and no queryNorm. I am not even sure I want idf(), but I can probably live with rare names being boosted. The results are being normalized. See below. I tried dismax and edismax - bf, bq and boost.
<requestHandler name="autoproviderdist" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">none</str>
    <str name="defType">edismax</str>
    <float name="tie">0.01</float>
    <str name="fl">display_name,city_state,prov_url,pwid,city_state_alternative</str>
    <!-- <str name="bq">_val_:sum(recip(geodist(store_geohash), .5, 6, 6), 0.1)^10</str> -->
    <str name="boost">sum(recip(geodist(store_geohash), .5, 6, 6), 0.1)</str>
    <int name="rows">5</int>
    <str name="q.alt">*:*</str>
    <str name="qf">name_edgy^.9 name_edge^.9 name_word</str>
    <str name="group">true</str>
    <str name="group.field">pwid</str>
    <str name="group.main">true</str>
    <!-- <str name="pf">name_edgy</str> do not turn on -->
    <str name="sort">score desc, last_name asc</str>
    <str name="d">100</str>
    <str name="pt">39.740112,-104.984856</str>
    <str name="sfield">store_geohash</str>
    <str name="hl">false</str>
    <str name="hl.fl">name_edgy</str>
    <str name="mm">2&lt;-1 4&lt;-2 6&lt;-3</str>
  </lst>
</requestHandler>

0.058555886 = queryNorm

product of:
  10.854807 = (MATCH) sum of:
    1.8391232 = (MATCH) max plus 0.01 times others of:
      1.8214592 = (MATCH) weight(name_edge:paul^0.9 in 231378), product of:
        0.30982485 = queryWeight(name_edge:paul^0.9), product of:
          0.9 = boost
          5.8789964 = idf(docFreq=26567, maxDocs=3493655)
          *0.058555886 = queryNorm*
        5.8789964 = (MATCH) fieldWeight(name_edge:paul in 231378), product of:
          1.0 = tf(termFreq(name_edge:paul)=1)
          5.8789964 = idf(docFreq=26567, maxDocs=3493655)
          1.0 = fieldNorm(field=name_edge, doc=231378)
      1.7664119 = (MATCH) weight(name_edgy:paul^0.9 in 231378), product of:
        0.30510724 = queryWeight(name_edgy:paul^0.9), product of:
          0.9 = boost
          5.789479 = idf(docFreq=29055, maxDocs=3493655)
          *0.058555886 = queryNorm*
        5.789479 = (MATCH) fieldWeight(name_edgy:paul in 231378), product of:
          1.0 = tf(termFreq(name_edgy:paul)=1)
          5.789479 = idf(docFreq=29055, maxDocs=3493655)
          1.0 = fieldNorm(field=name_edgy, doc=231378)
    9.015684 = (MATCH) max plus 0.01 times others of:
      8.9352665 = (MATCH) weight(name_word:nutting in 231378), product of:
        0.72333425 = queryWeight(name_word:nutting), product of:
          12.352887 = idf(docFreq=40, maxDocs=3493655)
          0.058555886 = queryNorm
        12.352887 = (MATCH) fieldWeight(name_word:nutting in 231378), product of:
          1.0 = tf(termFreq(name_word:nutting)=1)
          12.352887 = idf(docFreq=40, maxDocs=3493655)
          1.0 = fieldNorm(field=name_word, doc=231378)
      8.04174 = (MATCH) weight(name_edgy:nutting^0.9 in 231378), product of:
        0.65100086 = queryWeight(name_edgy:nutting^0.9), product of:
          0.9 = boost
          12.352887 = idf(docFreq=40, maxDocs=3493655)
          *0.058555886 = queryNorm*
        12.352887 = (MATCH) fieldWeight(name_edgy:nutting in 231378), product of:
          1.0 = tf(termFreq(name_edgy:nutting)=1)
          12.352887 = idf(docFreq=40, maxDocs=3493655)
          1.0 = fieldNorm(field=name_edgy, doc=231378)
  1.0855998 = sum(6.0/(0.5*float(geodist(39.74168747663498,-104.9849385023117,39.740112,-104.984856))+6.0),const(0.1))
-- Bill Bell billnb...@gmail.com cell 720-256-8076
-- Bill Bell billnb...@gmail.com cell 720-256-8076
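The explain numbers above can be checked by hand with the classic Lucene TF-IDF products; a small sketch (the geodist value is back-solved from the explain output and therefore hypothetical):

```java
public class ExplainArithmetic {
    // Lucene 3.x/4.x TF-IDF: queryWeight = boost * idf * queryNorm, and
    // fieldWeight = tf * idf * fieldNorm. With omitNorms, fieldNorm is 1.0,
    // but queryNorm still scales every clause - which is Bill's complaint.
    public static double queryWeight(double boost, double idf, double queryNorm) {
        return boost * idf * queryNorm;
    }

    public static double fieldWeight(double tf, double idf, double fieldNorm) {
        return tf * idf * fieldNorm;
    }

    public static void main(String[] args) {
        double queryNorm = 0.058555886;
        // name_edge:paul^0.9 from the explain above:
        System.out.println(queryWeight(0.9, 5.8789964, queryNorm)); // ~0.30982485
        System.out.println(fieldWeight(1.0, 5.8789964, 1.0));       // 5.8789964
        // The boost function: sum(recip(geodist(...), .5, 6, 6), 0.1), where
        // recip(x, m, a, b) = a / (m * x + b). A distance of ~0.1753 km
        // (back-solved, hypothetical) reproduces the explain's 1.0855998:
        double dist = 0.1753;
        System.out.println(6.0 / (0.5 * dist + 6.0) + 0.1); // ~1.0856
    }
}
```

Dropping queryNorm (as Bill wants) removes only a constant factor per query, so it changes absolute scores but not the ranking within a single query.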
Re: Is it possible to find a leader from a list of cores in solr via java code
Hi, As per the suggestions above I shifted my focus to using CloudSolrServer. In terms of sending updates to the leaders and reducing network traffic it works great. But one problem I faced with CloudSolrServer is that it opens too many connections - as many as five thousand. My code is as follows:

ModifiableSolrParams params = new ModifiableSolrParams();
params.set(HttpClientUtil.PROP_MAX_CONNECTIONS, 3);
params.set(HttpClientUtil.PROP_MAX_CONNECTIONS_PER_HOST, 2);
HttpClient client = HttpClientUtil.createClient(params);
LBHttpSolrServer lbServer = new LBHttpSolrServer(client);
server = new CloudSolrServer(zkHost, lbServer);
server.setDefaultCollection(defaultColllection);

If there is only one instance of Solr up then this works great, but in a 1-shard, 1-replica system it opens too many connections in waiting state. Am I doing something incorrect? Any help would be highly appreciated. -- View this message in context: http://lucene.472066.n3.nabble.com/Is-it-possible-to-find-a-leader-from-a-list-of-cores-in-solr-via-java-code-tp4074994p4077587.html Sent from the Solr - User mailing list archive at Nabble.com.
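One common cause of this kind of connection growth (a guess, not a diagnosis of the code above) is constructing a new CloudSolrServer - and with it a new HttpClient pool - per request instead of reusing a single instance. A minimal sketch of the reuse pattern, with a stand-in class so it runs without SolrJ on the classpath:

```java
public class SolrClientHolder {
    // FakeClient stands in for CloudSolrServer so the sketch is
    // self-contained; the instance counter shows that only one client
    // (and so only one connection pool) is ever created.
    static class FakeClient {
        static int instances = 0;
        FakeClient(String zkHost) { instances++; }
    }

    private static volatile FakeClient client;

    // Lazily create one shared client per JVM (double-checked locking).
    public static FakeClient get(String zkHost) {
        if (client == null) {
            synchronized (SolrClientHolder.class) {
                if (client == null) client = new FakeClient(zkHost);
            }
        }
        return client;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) get("zk1:2181,zk2:2181"); // hypothetical ensemble
        System.out.println(FakeClient.instances); // 1
    }
}
```

CloudSolrServer is thread-safe for this kind of sharing, so one instance can serve all indexing threads.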
Re: What does too many merges...stalling in indexwriter log mean?
Thanks Shawn, Do you have any feeling for what gets traded off if we increase maxMergeCount? This is completely new for us because we are experimenting with indexing pages instead of whole documents. Since our average document is about 370 pages, this means that we have increased the number of documents we are asking Solr to index by a couple of orders of magnitude (on the other hand, the size of each document decreases by a couple of orders of magnitude). I'm not sure why increasing the number of documents (and reducing their size) is causing more merges. I'll have to investigate. Tom On Thu, Jul 11, 2013 at 5:29 PM, Shawn Heisey s...@elyograg.org wrote: On 7/11/2013 1:47 PM, Tom Burton-West wrote: We are seeing the message too many merges...stalling in our indexwriter log. Is this something to be concerned about? Does it mean we need to tune something in our indexing configuration? It sounds like you've run into the maximum number of simultaneous merges, which I believe defaults to two, or maybe three. The following config section in indexConfig will likely take care of the issue. This assumes 3.6 or later; I believe that on older versions this goes in indexDefaults.

<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
  <int name="maxThreadCount">1</int>
  <int name="maxMergeCount">6</int>
</mergeScheduler>

Looking through the source code to confirm, this definitely seems like the case. Increasing maxMergeCount is likely going to speed up your indexing, at least by a little bit. A value of 6 is probably high enough for mere mortals, but you guys don't do anything small, so I won't begin to speculate what you'll need. If you are using spinning disks, you'll want maxThreadCount at 1. If you're using SSD, then you can likely increase that value. Thanks, Shawn
Multiple queries or Filtering Queries in Solr
My problem is that I have n fields (say around 10) in Solr that are searchable; they are all indexed and stored. I would like to run a query first on my whole index of, say, 5000 docs, which will hit around an average of 500 docs. Next I would like to query using a different set of keywords on these 500 docs and NOT on the whole index. So the first time I send a query a score will be generated; the second time I run a query, the new score generated should be based on the 500 documents of the previous query - in other words, Solr should consider only these 500 docs as the whole index. To summarise, an index of 5000 will be filtered to 500 and then to 50 (5000 -> 500 -> 50). It's basically filtering, but I would like to do this in Solr. I have reasonable basic knowledge and am still learning. Update: represented mathematically it would look like this: results1 = f(query1); results2 = f(query2, results1); final_results = f(query3, results2). I would like to accomplish this in a program; the end-user will only see the 50 results, so faceting is not an option. -- View this message in context: http://lucene.472066.n3.nabble.com/Multiple-queries-or-Filtering-Queries-in-Solr-tp4077574.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to set a condition over stats result
sum(x, y, z) = x + y + z (sums those specific field values for the current document); sum(x, y) = x + y (the sum of those two specific field values for the current document); sum(x) = field(x) = x (the specific field value for the current document). The sum function in function queries is not an aggregate function. Ditto for min and max. -- Jack Krupansky -----Original Message----- From: mihaela olteanu Sent: Friday, July 12, 2013 1:44 AM To: solr-user@lucene.apache.org Subject: Re: How to set a condition over stats result What if you perform sub(sum(myfieldvalue),100) > 0 using frange? From: Jack Krupansky j...@basetechnology.com To: solr-user@lucene.apache.org Sent: Friday, July 12, 2013 7:44 AM Subject: Re: How to set a condition over stats result None that I know of, short of writing a custom search component. Seriously, you could hack up a copy of the stats component with your own logic. Actually... this may be a case for the new, proposed Script Request Handler, which would let you execute a query and then do any custom JavaScript logic you wanted. When we get that feature, it might be interesting to implement a variation of the standard stats component as a JavaScript script, and then people could easily hack it such as in your request. Fascinating. -- Jack Krupansky -----Original Message----- From: Matt Lieber Sent: Thursday, July 11, 2013 6:08 PM To: solr-user@lucene.apache.org Subject: How to set a condition over stats result Hello, I am trying to see how I can test the sum of values of an attribute across docs, i.e. whether sum(myfieldvalue) > 100. I know I can use the stats module, which computes the sum of my attribute on a certain facet, but how can I test this result (i.e. is sum > 100) within my stats query? From what I read, it's not supported yet to perform a function on the stats result. Any other way to do this? Cheers, Matt NOTE: This message may contain information that is confidential, proprietary, privileged or otherwise protected by law.
Re: What does too many merges...stalling in indexwriter log mean?
On 7/12/2013 9:23 AM, Tom Burton-West wrote: Do you have any feeling for what gets traded off if we increase the maxMergeCount? This is completely new for us because we are experimenting with indexing pages instead of whole documents. Since our average document is about 370 pages, this means that we have increased the number of documents we are asking Solr to index by a couple of orders of magnitude. (on the other hand the size of the document decreases by a couple of orders of magnitude). I'm not sure why increasing the number of documents (and reducing their size) is causing more merges. I'll have to investigate. I'm not sure that you lose anything, really. If everything is proceeding normally before the stalling message is logged, I would not expect it to cause ANY problems. The reason that I increased this value was because when I did a full-import of millions of documents from mysql, I would reach the point where there were three different levels of merges going on at once. Because the default thread count is one, only the largest merge was actually occurring, the others were queued and waiting. With three merges stacked up at once, I had passed the maxMergeCount threshold, so *indexing* stopped. It can take several minutes for a very large merge to finish, so indexing stopped long enough that the MySQL server would drop the connection established by the JDBC driver. Once the merge finished and DIH tried to resume indexing, the connection was gone and it would fail the entire import. I have never seen more than three merge levels happening at once, so a value of 6 is probably overkill, but shouldn't be a problem. The true goal is to make sure that indexing never stops, not to push the system limits. The maxThreadCount parameter should prevent I/O from becoming a problem. Thanks, Shawn
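The level-stacking Shawn describes can be simulated with a toy model (this is not TieredMergePolicy's real logic, just the cascade effect that makes several merges pend at once when one flush tips over every level):

```java
import java.util.ArrayList;
import java.util.List;

public class MergeCascade {
    // Toy model: every flush adds one level-0 segment; whenever a level
    // reaches mergeFactor segments, they merge into one segment a level up.
    // The flush that creates the 1000th tiny segment (mergeFactor 10)
    // triggers a level-0 merge whose result is the 10th level-1 segment,
    // which triggers a level-1 merge, and so on - three merges at once.
    public static int maxSimultaneousMerges(int flushes, int mergeFactor) {
        List<Integer> segsPerLevel = new ArrayList<>();
        int maxPending = 0;
        for (int f = 0; f < flushes; f++) {
            int level = 0, pendingThisFlush = 0;
            while (true) {
                while (segsPerLevel.size() <= level) segsPerLevel.add(0);
                segsPerLevel.set(level, segsPerLevel.get(level) + 1);
                if (segsPerLevel.get(level) < mergeFactor) break;
                segsPerLevel.set(level, 0); // merge this whole level...
                pendingThisFlush++;         // ...into one segment a level up
                level++;
            }
            maxPending = Math.max(maxPending, pendingThisFlush);
        }
        return maxPending;
    }

    public static void main(String[] args) {
        System.out.println(maxSimultaneousMerges(100, 10));  // 2
        System.out.println(maxSimultaneousMerges(1000, 10)); // 3
    }
}
```

This also shows why many small documents flush more segments and therefore stack more merges than few large ones, and why a maxMergeCount above the deepest cascade keeps indexing from stalling.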
RE: solr 4.3.0 cloud in Tomcat, link many collections to Zookeeper
Thanks very much for all the helps! -Original Message- From: Shawn Heisey [mailto:s...@elyograg.org] Sent: Friday, July 12, 2013 7:31 AM To: solr-user@lucene.apache.org Subject: Re: solr 4.3.0 cloud in Tomcat, link many collections to Zookeeper On 7/12/2013 7:29 AM, Zhang, Lisheng wrote: Sorry I might not have asked clearly, our issue is that we have a few thousand collections (can be much more), so running that command is rather tedius, is there a simpler way (all collections share same schema/config)? When you create each collection with the Collections API (http calls), you tell it the name of a config set stored in zookeeper. You can give all your collections the same config set if you like. If you manually create collections with the CoreAdmin API instead, you must use the zkcli script included in Solr to link the collection to the config set, which can be done either before or after the collection is created. The zkcli script provides some automation for the java command that you were given by Furkan. Thanks, Shawn
Re: Patch review request: SOLR-5001 (adding book links to the website)
Hi Alexandre, I'll work on this today. Steve On Jul 12, 2013, at 8:26 AM, Alexandre Rafalovitch arafa...@gmail.com wrote: Hello, As per earlier email thread, I have created a patch for Solr website to incorporate links to my new book. It would be nice if somebody with commit rights for the (markdown) website could look at it before the book's Solr version (4.3.1) stops being the latest :-) I promise to help with the new Wiki/Guide later in return. https://issues.apache.org/jira/browse/SOLR-5001 Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
Re: Performance of cross join vs block join
Hi Mikhail, I have commented on your blog, but it seems I did something wrong, as the comment is not there. Would it be possible to share the test setup (script)? I have found that the crucial thing with joins is the number of 'joins' [hits returned], and it seems that the experiments I have seen so far were geared towards small collections - even if Erick's index was 26M, the number of hits was probably small - you can see a very different story if you face some [other] real data. Here is a citation network, and I was comparing lucene joins [i.e. not the block joins, because these cannot be used for citation data - we cannot reasonably index them into one segment]: https://github.com/romanchyla/r-ranking-fun/blob/master/plots/raw/comparison-join-2nd.png Notice the y axis is sqrt, so the running time for the lucene join is growing and growing very fast! It takes lucene 30s to do the search that selects 1M hits. The comparison is against our own implementation of a similar search - but the main point I am making is that join benchmarks should show the number of hits selected by the join operation. Otherwise, a very important detail is hidden. Best, roman On Fri, Jul 12, 2013 at 4:57 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: On Fri, Jul 12, 2013 at 12:19 PM, mihaela olteanu mihaela...@yahoo.com wrote: Hi Mikhail, I have used the term block join wrongly. When I said block join I was referring to a join performed on a single core, versus a cross join performed on multiple cores. But I saw your benchmark (from cache) and it seems that block join has better performance. Is this functionality available in Solr 4.3.1? nope, SOLR-3076 awaits for ages. I did not find such examples on Solr's wiki page. Does this functionality require a special schema, or special indexing? Special indexing - yes. How would I need to index the data from my tables?
In my case anyway all the indices have a common schema since I am using dynamic fields, thus I can easily add all documents from all tables in one Solr core, but for each document to add a discriminator field. correct. but notion of ' discriminator field' is a little bit different for blockjoin. Could you point me to some more documentation? I can recommend only those http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html http://www.youtube.com/watch?v=-OiIlIijWH0 Thanks in advance, Mihaela From: Mikhail Khludnev mkhlud...@griddynamics.com To: solr-user solr-user@lucene.apache.org; mihaela olteanu mihaela...@yahoo.com Sent: Thursday, July 11, 2013 2:25 PM Subject: Re: Performance of cross join vs block join Mihaela, For me it's reasonable that single core join takes the same time as cross core one. I just can't see which gain can be obtained from in the former case. I hardly able to comment join code, I looked into, it's not trivial, at least. With block join it doesn't need to obtain parentId term values/numbers and lookup parents by them. Both of these actions are expensive. Also blockjoin works as an iterator, but join need to allocate memory for parents bitset and populate it out of order that impacts scalability. Also in None scoring mode BJQ don't need to walk through all children, but only hits first. Also, nice feature is 'both side leapfrog' if you have a highly restrictive filter/query intersects with BJQ, it allows to skip many parents and children as well, that's not possible in Join, which has fairly 'full-scan' nature. Main performance factor for Join is number of child docs. I'm not sure I got all your questions, please specify them in more details, if something is still unclear. have you saw my benchmark http://blog.griddynamics.com/2012/08/block-join-query-performs.html ? 
On Thu, Jul 11, 2013 at 1:52 PM, mihaela olteanu mihaela...@yahoo.com wrote: Hello, Does anyone know about some measurements in terms of performance for cross joins compared to joins inside a single index? Is it faster the join inside a single index that stores all documents of various types (from parent table or from children tables)with a discriminator field compared to the cross join (basically in this case each document type resides in its own index)? I have performed some tests but to me it seems that having a join in a single index (bigger index) does not add too much speed improvements compared to cross joins. Why a block join would be faster than a cross join if this is the case? What are the variables that count when trying to improve the query execution time? Thanks! Mihaela -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com -- Sincerely yours Mikhail Khludnev Principal Engineer,
Re: Problem using Term Component in solr
Hi, OK, I will not use bold text in my queries. I guess my question was not clear to you. What I am doing is: I have a live source, say 'A', and a stored database, say 'B'. Both A and B have title fields in them. Consider A as non-persistent Solr and B as persistent Solr. I have to match titles coming from A against the database B. Some titles from live source A come in short form, e.g. 'med. phys.' and 'phys. fluids', but corresponding to these titles my database B has the titles 'medical physics' and 'physics of fluids'. Because of these differences, A is not able to find the corresponding titles in B using the 'tokenized' field 'title' with wildcards, hence I used the Terms component first, which gives me the corresponding matched title in B. When I get the full title like 'medical physics', I fetch it from the HTML and then search it again in the tokenized copy of 'title', say 'titlenew' (a copyField of title), which brings me the result 'medical physics'. But I am failing to match 'phys. fluids' with 'physics of fluids', as it has a stop word in it, using [a-z0-9]*. Hope now you get my issue and can help. Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/Problem-using-Term-Component-in-solr-tp4077200p4077628.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: expunging deletes
OK, thanks Shawn. I went with this because 10 wasn't working for us, and it looks like my index is staying under 20 GB now with numDocs: 16897524 and maxDoc: 19048053.

<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">5</int>
  <int name="segmentsPerTier">5</int>
  <int name="maxMergeAtOnceExplicit">15</int>
  <double name="maxMergedSegmentMB">6144.0</double>
  <double name="reclaimDeletesWeight">6.0</double>
</mergePolicy>

-----Original Message----- From: Shawn Heisey [mailto:s...@elyograg.org] Sent: Wednesday, July 10, 2013 5:34 PM To: solr-user@lucene.apache.org Subject: Re: expunging deletes On 7/10/2013 5:58 PM, Petersen, Robert wrote: Using solr 3.6.1 and the following settings, I am trying to run without optimizes. I used to optimize nightly, but sometimes the optimize took a very long time to complete and slowed down our indexing. We are continuously indexing our new or changed data all day and night. After a few days running without an optimize, the index size has nearly doubled and maxDoc is nearly twice the size of numDocs. I understand deletes should be expunged on merges, but even after trying lots of different settings for our merge policy it seems this growth is somewhat unbounded. I have tried sending an optimize with numSegments=2, which is a lot lighter weight than a regular optimize, and that does bring the number down, but not by much. Does anyone have any ideas for better settings for my merge policy that would help? Here is my current index snapshot too: Your merge settings are the equivalent of the old mergeFactor set to 35, and based on the fact that you have the Explicit set to 105, I'm guessing your settings originally came from something I posted - these are the numbers that I use. These settings can result in a very large number of segments on your disk. Because you index a lot (and probably reindex existing documents often), I can understand why you have high merge settings, but if you want to eliminate optimizes, you'll need to go lower.
The default merge setting of 10 (with an Explicit value of 30) is probably a good starting point, but you might need to go even smaller. On Solr 3.6, an optimize probably cannot take place at the same time as index updates -- the optimize would probably delay updates until after it's finished. I remember running into problems on Solr 3.x, so I set up my indexing program to stop updates while the index was optimizing. Solr 4.x should lift any restriction where optimizes and updates can't happen at the same time. With an index size of 25GB, a six-drive RAID10 should be able to optimize in 10-15 minutes, but if your I/O system is single disk, RAID1, RAID5, or RAID6, the write performance may cause this to take longer. If you went with SSD, optimizes would happen VERY fast. Thanks, Shawn
Re: How to set a condition on the number of docs found
Thanks William, I'll do that. Matt On 7/12/13 7:38 AM, William Bell billnb...@gmail.com wrote: Hmmm. One way is: http://localhost:8983/solr/core/select/?q=*%3A*&facet=true&facet.field=id&facet.offset=10&rows=0&facet.limit=1 If you get a facet result back, you have more than 10 results. Another way is to just look at it with a facet.query and have your app deal with it: http://localhost:8983/solr/core/select/?q=*%3A*&facet=true&facet.query={!lucene%20key=numberofresults}state:CO&rows=0 On Thu, Jul 11, 2013 at 11:45 PM, Matt Lieber mlie...@impetus.com wrote: Hello there, I would like to be able to know whether I got over a certain threshold of doc results, i.e. test (Result.numFound > 10) -> true. Is there a way to do this? I can't seem to find how (other than doing this test in the client app, which is not great). Thanks, Matt -- Bill Bell billnb...@gmail.com cell 720-256-8076
Impetus does not represent, warrant and/or guarantee, that the integrity of this communication has been maintained nor that the communication is free of errors, virus, interception or interference.
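William's facet.offset trick can be sanity-checked offline: faceting on the unique id field gives one facet value per matching document, so a non-empty facet list at facet.offset=10, facet.limit=1 means numFound exceeded 10. A minimal simulation of that offset/limit slicing (plain Java, not Solr code; the class and method names are invented for illustration):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class FacetOffsetSketch {
    // Simulate Solr's facet.offset / facet.limit paging over an ordered list
    // of facet values: skip `offset` entries, then return up to `limit`.
    public static <T> List<T> facetSlice(List<T> values, int offset, int limit) {
        if (offset >= values.size()) {
            return Collections.emptyList();
        }
        return values.subList(offset, Math.min(offset + limit, values.size()));
    }

    public static void main(String[] args) {
        // Faceting on a unique id field yields one facet value per matching
        // doc, so a non-empty slice at offset=10, limit=1 means numFound > 10.
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 12; i++) {
            ids.add("doc" + i);
        }
        System.out.println(!facetSlice(ids, 10, 1).isEmpty()); // true: more than 10 docs
    }
}
```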
Save Solr index in database
hi I wanted to understand if it is possible to store/save Solr indexes in a database instead of on the filesystem. I checked out some articles where Lucene can do it, hence I assume Solr can too, but it's not clear to me how to configure Solr to save the indexes in the database instead of in the /index directory. Any help is really appreciated as I think I have hit a wall with this. -- View this message in context: http://lucene.472066.n3.nabble.com/Save-Solr-index-in-database-tp4077649.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Performance of cross join vs block join
Hello Roman, Thanks for your interest. I briefly looked at your approach, and I'm really interested in your numbers. Here is the trivial code; I'd rather rely on your testing framework, and can provide you a version of Solr 4.2 with SOLR-3076 applied. Do you need it? https://github.com/m-khl/join-tester What you are saying about benchmark representativeness definitely makes sense. I didn't try to establish a completely representative benchmark, just wanted to have rough numbers related to my use case. I'm from eCommerce, and that volume was enough for me. What I didn't get is 'not the block joins, because these cannot be used for citation data - we cannot reasonably index them into one segment'. Usually there is no problem with blocks in a multi-segment index; a block definitely can't span across segments. Anyway, please elaborate. One of the block join benefits is the ability to hit only the first matched child in a group and jump over the following ones. It isn't applicable in general, but sometimes gives a huge gain. On Fri, Jul 12, 2013 at 8:29 PM, Roman Chyla roman.ch...@gmail.com wrote: Hi Mikhail, I have commented on your blog, but it seems I have done something wrong, as the comment is not there. Would it be possible to share the test setup (script)? I have found out that the crucial thing with joins is the number of 'joins' [hits returned] and it seems that the experiments I have seen so far were geared towards small collections - even if Erick's index was 26M, the number of hits was probably small - you can see a very different story if you face some [other] real data. Here is a citation network and I was comparing lucene joins [ie not the block joins, because these cannot be used for citation data - we cannot reasonably index them into one segment]) https://github.com/romanchyla/r-ranking-fun/blob/master/plots/raw/comparison-join-2nd.png Notice, the y axis is sqrt-scaled, so the running time for the lucene join is growing very fast!
It takes Lucene 30s to do the search that selects 1M hits. The comparison is against our own implementation of a similar search - but the main point I am making is that the join benchmarks should be showing the number of hits selected by the join operation. Otherwise, a very important detail is hidden. Best, roman On Fri, Jul 12, 2013 at 4:57 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: On Fri, Jul 12, 2013 at 12:19 PM, mihaela olteanu mihaela...@yahoo.com wrote: Hi Mikhail, I have used the term block join wrongly. When I said block join I was referring to a join performed on a single core versus a cross join performed on multiple cores. But I saw your benchmark (from cache) and it seems that block join has better performance. Is this functionality available in Solr 4.3.1? Nope, SOLR-3076 has been awaiting for ages. I did not find such examples on Solr's wiki page. Does this functionality require a special schema, or special indexing? Special indexing - yes. How would I need to index the data from my tables? In my case anyway all the indices have a common schema since I am using dynamic fields, thus I can easily add all documents from all tables in one Solr core, but for each document to add a discriminator field. Correct, but the notion of 'discriminator field' is a little bit different for blockjoin. Could you point me to some more documentation? I can recommend only these: http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html http://www.youtube.com/watch?v=-OiIlIijWH0 Thanks in advance, Mihaela From: Mikhail Khludnev mkhlud...@griddynamics.com To: solr-user solr-user@lucene.apache.org; mihaela olteanu mihaela...@yahoo.com Sent: Thursday, July 11, 2013 2:25 PM Subject: Re: Performance of cross join vs block join Mihaela, For me it's reasonable that single core join takes the same time as cross core one. I just can't see what gain could be obtained in the former case.
I'm hardly able to comment on the join code; I looked into it, and it's not trivial, at least. With block join it doesn't need to obtain parentId term values/numbers and look up parents by them. Both of these actions are expensive. Also blockjoin works as an iterator, but join needs to allocate memory for the parents bitset and populate it out of order, which impacts scalability. Also, in None scoring mode BJQ doesn't need to walk through all children, but only hits the first. Another nice feature is 'both side leapfrog': if you have a highly restrictive filter/query intersecting with BJQ, it allows skipping many parents and children as well, which is not possible in Join, which has a fairly 'full-scan' nature. The main performance factor for Join is the number of child docs. I'm not sure I got all your questions, please specify them in more detail.
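For readers wondering what "indexing into one segment"/block indexing means here: with the block-join approach (the SOLR-3076 patch, which later shipped in Solr), a parent and its children must be added contiguously as one block. In Solr releases that support it, the update XML and query look roughly like this sketch (field names and values are invented for illustration):

```xml
<add>
  <doc>
    <field name="id">parent-1</field>
    <field name="doc_type">parent</field>
    <!-- children are nested inside the parent and indexed in the same block,
         immediately before the parent document -->
    <doc>
      <field name="id">child-1</field>
      <field name="color">red</field>
    </doc>
    <doc>
      <field name="id">child-2</field>
      <field name="color">blue</field>
    </doc>
  </doc>
</add>
```

A query then selects parents by matching children, e.g. q={!parent which="doc_type:parent"}color:red.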
Re: Save Solr index in database
And why would you want to do that? Seems rather wrong direction to march in. I am assuming relational database. There is a commercial solution that integrates Solr into Cassandra, if I understood it correctly: http://www.datastax.com/what-we-offer/products-services/datastax-enterprise/apache-solr Even then, there might be some stuff on the filesystem. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Fri, Jul 12, 2013 at 2:30 PM, sagarmj76 sagarm_jad...@yahoo.com wrote: hi I wanted to understand if it is possible to store/save Solr indexes to the database instead of the filesystem. I checked out some articles where lucene can do it. Hence I assume Solr can too but its not clear to me how to configure Solr to save the indexes in the database instead in the /index directory. Any help is really appreciated as I think I have hit a wall with this. -- View this message in context: http://lucene.472066.n3.nabble.com/Save-Solr-index-in-database-tp4077649.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Save Solr index in database
On 7/12/2013 12:30 PM, sagarmj76 wrote: hi I wanted to understand if it is possible to store/save Solr indexes to the database instead of the filesystem. I checked out some articles where lucene can do it. Hence I assume Solr can too but its not clear to me how to configure Solr to save the indexes in the database instead in the /index directory. Any help is really appreciated as I think I have hit a wall with this. If Lucene can do it, then theoretically Solr can do so as well. You could very likely add jars to your classpath (to add a Directory and DirectoryFactory implementation that uses a database) and reference that class in the Solr config, but unless the class provided a way to configure itself, you probably wouldn't be able to specify its config within Solr's config without custom plugin code. A burning question ... why would you want to do this? Lucene and Solr are highly optimized to work well with a local filesystem. That is the path that will give you the best performance. Thanks, Shawn
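For completeness, the hook Shawn describes is the directoryFactory element in solrconfig.xml. The class below is hypothetical: you would have to write or obtain a database-backed DirectoryFactory implementation yourself, since nothing like it ships with Solr:

```xml
<!-- com.example.JdbcDirectoryFactory is a made-up class name; its jar would
     have to be on Solr's classpath (e.g. via a <lib/> directive) -->
<directoryFactory name="DirectoryFactory" class="com.example.JdbcDirectoryFactory"/>
```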
Re: Save Solr index in database
The reason for going that route is that our application is clustered, and if the indexing information is on the filesystem, I am not sure whether it would be replicated. At the same time, since it's a product it needs to be packaged with the product, and also for proprietary reasons we are not allowed to use the filesystem. -- View this message in context: http://lucene.472066.n3.nabble.com/Save-Solr-index-in-database-tp4077649p4077662.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Save Solr index in database
On 13 July 2013 00:19, Shawn Heisey s...@elyograg.org wrote: On 7/12/2013 12:30 PM, sagarmj76 wrote: hi I wanted to understand if it is possible to store/save Solr indexes to the database instead of the filesystem. I checked out some articles where lucene can do it. Hence I assume Solr can too but its not clear to me how to configure Solr to save the indexes in the database instead in the /index directory. Any help is really appreciated as I think I have hit a wall with this. [...] As others have noted, think twice about why you would want to do this. Lucene does it through JdbcDirectory but as far as I know this is only an interface without a concrete implementation, though apparently third-party libraries are available that implement JdbcDirectory. The Lucene FAQ notes that this is slow: http://wiki.apache.org/lucene-java/LuceneFAQ#Can_I_store_the_Lucene_index_in_a_relational_database.3F Regards, Gora
Re: Save Solr index in database
On 7/12/2013 12:51 PM, Sagar Jadhav wrote: The reason for going that route is because our application is clustered and if the indexing information is on the filesystem, I am not sure whether that would be replicated. At the same time since its a product it needs to be packaged with the product and also from a proprietary reason we are not allowed to use the filesystem. Solr can do replication from a master server to slaves. If you implement as SolrCloud, then you would have a clustered solution with no master/slave designations. SolrCloud requires a three server minimum for a robust deployment. The third server can be a wimpy thing that only runs zookeeper. Putting your index in a DB is just a bad idea. It would be hard to find help with it, and performance would not be good. Thanks, Shawn
Re: Save Solr index in database
I think that makes a lot of sense as I was reading up on the SolrCloud technique. Thanks a lot Shawn for the validation. Thanks a lot everyone for helping me go in the right direction. I really appreciate all the inputs. I will now go back and get the exception for access to the filesystem. -- View this message in context: http://lucene.472066.n3.nabble.com/Save-Solr-index-in-database-tp4077649p4077673.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Save Solr index in database
If they ask, tell them that Solr *is* a database. Databases store their stuff on a file system, so your data is gonna end up there in the end. Putting Solr indexes inside a database is like storing MySQL tables in Oracle. Upayavira On Fri, Jul 12, 2013, at 08:18 PM, Sagar Jadhav wrote: I think that makes a lot of sense as I was reading up on the SolrCloud technique. Thanks a lot Shawn for the validation. Thanks a lot everyone for helping me go in the right direction. I really appreciate all the inputs. I will now go back and get the exception for access to the filesystem. -- View this message in context: http://lucene.472066.n3.nabble.com/Save-Solr-index-in-database-tp4077649p4077673.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Problem using Term Component in solr
Is the vocabulary known? That is, do you know the abbreviations that will be used? If so, you could consider synonyms, in which case you'd go to tokenized titles and use phrase queries to get your matches... Regexes often don't scale extremely well, although the 4.x FST implementations are much faster than they used to be. It seems to me that regularizing the titles is a better idea than trying to fake it with regexes, but you know your problem space better than me... Best Erick On Fri, Jul 12, 2013 at 1:32 PM, Parul Gupta(Knimbus) parulgp...@gmail.com wrote: Hi, Ok I will not use bold text in my queries. I guess my question is not clear to you. What I am doing is: I have a live source, say 'A', and a stored database, say 'B'. A and B both have title fields in them. Consider A as a non-persistent Solr and B as a persistent Solr. I have to match the titles coming from A to the database B. Some titles from live source A come in short form, e.g. 'med. phys.' and 'phys. fluids', but corresponding to these titles my database B has the titles 'medical physics' and 'physics of fluids'. Since this type of difference occurs, A is not able to find the corresponding titles in B using the tokenized field 'title' with wildcards, hence I used the Term component first, which gives me the corresponding matched title in B. When I get the full title like 'medical physics', I fetch it from HTML and then search it again in a tokenized copy field of 'title', say 'titlenew', which brings me the result 'medical physics'. But I am failing to match 'phys. fluids' with 'physics of fluids' as it has a stop word in it when using [a-z0-9]*. Hope now you will get my issue... and will help.. thanks.. -- View this message in context: http://lucene.472066.n3.nabble.com/Problem-using-Term-Component-in-solr-tp4077200p4077628.html Sent from the Solr - User mailing list archive at Nabble.com.
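A sketch of the synonym approach Erick suggests, using only the two abbreviations from the thread; the field type name and file layout are assumptions. First the synonyms.txt entries:

```
med. phys., medical physics
phys. fluids, physics of fluids
```

Then a field type wired to apply them at index/query time:

```xml
<fieldType name="title_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- expand="true" indexes both the abbreviation and the full form -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```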
solr autodetectparser tikaconfig dataimporter error
i am using solr 3.5, tika-app-1.4 and tagsoup 1.2.1. when i try to import a file via xml i get this error; it doesn't matter what file format i try to index, txt, cfm, pdf all give the same error: SEVERE: Exception while processing: rec document : SolrInputDocument[{id=id(1.0)={myTest.txt}, title=title(1.0)={Beratungsseminar kundenbrief}, contents=contents(1.0)={wie kommuniziert man}, author=author(1.0)={Peter Z.}, path=path(1.0)={download/online}}]: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622) at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408) Caused by: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:122) at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596) ...
6 more Jul 11, 2013 5:23:36 PM org.apache.solr.common.SolrException log SEVERE: Full Import failed: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622) at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408) Caused by: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:122) at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596) ...
6 more Jul 11, 2013 5:23:36 PM org.apache.solr.update.DirectUpdateHandler2 rollback data-config.xml:
<dataConfig>
  <dataSource type="BinURLDataSource" name="data"/>
  <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="docImport.xml" forEach="/albums/album" dataSource="main">
      <field column="title" xpath="//title" />
      <field column="id" xpath="//file" />
      <field column="contents" xpath="//description" />
      <field column="path" xpath="//path" />
      <field column="Author" xpath="//author" />
      <entity processor="TikaEntityProcessor" url="file:///C:\web\development\tkb\internet\public\download\online\${rec.id}" dataSource="data" onerror="skip">
        <field column="contents" name="text" />
      </entity>
    </entity>
  </document>
</dataConfig>
the libs are included and declared in the logs; i have also tried tika-app 1.0 and tagsoup 1.2 with the same result. can someone please help, i don't know where to start looking for the error.
add to ContributorsGroup
Hi. Could you add me (KenGeis) to the Solr Wiki ContributorsGroup? I'd like to fix some typos. Thanks, Ken Geis
Re: add to ContributorsGroup
Done, Thanks for helping! Erick On Fri, Jul 12, 2013 at 4:30 PM, Ken Geis kg...@speakeasy.net wrote: Hi. Could you add me (KenGeis) to the Solr Wiki ContributorsGroup? I'd like to fix some typos. Thanks, Ken Geis
add to ContributorsGroup - Instructions for setting up SolrCloud on jboss
Hello, Can you please add me to the ContributorsGroup? I would like to add instructions for setting up SolrCloud using Jboss. thanks.
Re: add to ContributorsGroup - Instructions for setting up SolrCloud on jboss
username: saqib On Fri, Jul 12, 2013 at 2:35 PM, Ali, Saqib docbook@gmail.com wrote: Hello, Can you please add me to the ContributorsGroup? I would like to add instructions for setting up SolrCloud using Jboss. thanks.
Re: Norms
Norms stay in the index even if you delete all of the data. If you just changed the schema, emptied the index, and tested again, you've still got norms in there. You can examine the index with Luke to verify this. On 07/09/2013 08:57 PM, William Bell wrote: I have a field that has omitNorms=true, but when I look at debugQuery I see that the field is being normalized for the score. What can I do to turn off normalization in the score? I want a simple way to do 2 things: boost geodist() highest at 1 mile and lowest at 100 miles, plus add a boost for query=edgefield^5. I only want tf() and no queryNorm. I am not even sure I want idf(), but I can probably live with rare names being boosted. The results are being normalized. See below. I tried dismax and edismax - bf, bq and boost.
<requestHandler name="autoproviderdist" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">none</str>
    <str name="defType">edismax</str>
    <float name="tie">0.01</float>
    <str name="fl">display_name,city_state,prov_url,pwid,city_state_alternative</str>
    <!-- <str name="bq">_val_:sum(recip(geodist(store_geohash), .5, 6, 6), 0.1)^10</str> -->
    <str name="boost">sum(recip(geodist(store_geohash), .5, 6, 6), 0.1)</str>
    <int name="rows">5</int>
    <str name="q.alt">*:*</str>
    <str name="qf">name_edgy^.9 name_edge^.9 name_word</str>
    <str name="group">true</str>
    <str name="group.field">pwid</str>
    <str name="group.main">true</str>
    <!-- <str name="pf">name_edgy</str> do not turn on -->
    <str name="sort">score desc, last_name asc</str>
    <str name="d">100</str>
    <str name="pt">39.740112,-104.984856</str>
    <str name="sfield">store_geohash</str>
    <str name="hl">false</str>
    <str name="hl.fl">name_edgy</str>
    <str name="mm">2-1 4-2 6-3</str>
  </lst>
</requestHandler>
0.058555886 = queryNorm product of: 10.854807 = (MATCH) sum of: 1.8391232 = (MATCH) max plus 0.01 times others of: 1.8214592 = (MATCH) weight(name_edge:paul^0.9 in 231378), product of: 0.30982485 = queryWeight(name_edge:paul^0.9), product of: 0.9 = boost 5.8789964 = idf(docFreq=26567, maxDocs=3493655) 0.058555886 = queryNorm 5.8789964 = (MATCH) fieldWeight(name_edge:paul in 231378), product of: 1.0 = tf(termFreq(name_edge:paul)=1) 5.8789964 = idf(docFreq=26567, maxDocs=3493655) 1.0 = fieldNorm(field=name_edge, doc=231378) 1.7664119 = (MATCH) weight(name_edgy:paul^0.9 in 231378), product of: 0.30510724 = queryWeight(name_edgy:paul^0.9), product of: 0.9 = boost 5.789479 = idf(docFreq=29055, maxDocs=3493655) 0.058555886 = queryNorm 5.789479 = (MATCH) fieldWeight(name_edgy:paul in 231378), product of: 1.0 = tf(termFreq(name_edgy:paul)=1) 5.789479 = idf(docFreq=29055, maxDocs=3493655) 1.0 = fieldNorm(field=name_edgy, doc=231378) 9.015684 = (MATCH) max plus 0.01 times others of: 8.9352665 = (MATCH) weight(name_word:nutting in 231378), product of: 0.72333425 = queryWeight(name_word:nutting), product of: 12.352887 = idf(docFreq=40, maxDocs=3493655) 0.058555886 = queryNorm 12.352887 = (MATCH) fieldWeight(name_word:nutting in 231378), product of: 1.0 = tf(termFreq(name_word:nutting)=1) 12.352887 = idf(docFreq=40, maxDocs=3493655) 1.0 = fieldNorm(field=name_word, doc=231378) 8.04174 = (MATCH) weight(name_edgy:nutting^0.9 in 231378), product of: 0.65100086 = queryWeight(name_edgy:nutting^0.9), product of: 0.9 = boost 12.352887 = idf(docFreq=40, maxDocs=3493655) 0.058555886 = queryNorm 12.352887 = (MATCH) fieldWeight(name_edgy:nutting in 231378), product of: 1.0 = tf(termFreq(name_edgy:nutting)=1) 12.352887 = idf(docFreq=40, maxDocs=3493655) 1.0 = fieldNorm(field=name_edgy, doc=231378) 1.0855998 = sum(6.0/(0.5*float(geodist(39.74168747663498,-104.9849385023117,39.740112,-104.984856))+6.0),const(0.1))
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
I am getting a java.lang.OutOfMemoryError: Requested array size exceeds VM limit on certain queries. Please advise: 19:25:02,632 INFO [org.apache.solr.core.SolrCore] (http-oktst1509.company.tld/12.5.105.96:8180-9) [collection1] webapp=/solr path=/select params={sort=sent_date+asc&distrib=false&wt=javabin&version=2&rows=2147483647&df=text&fl=id&shard.url=12.5.105.96:8180/solr/collection1/&NOW=1373675102627&start=0&q=thread_id:1439513570014188310&isShard=true&fq=domain:company.tld+AND+owner:11782344&fsv=true} hits=1 status=0 QTime=1 19:25:02,637 ERROR [org.apache.solr.servlet.SolrDispatchFilter] (http-oktst1509.company.tld/12.5.105.96:8180-2) null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Requested array size exceeds VM limit at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:670) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:280) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:248) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:275) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:161) at org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:169) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:372) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:877) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:679) at
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:931) at java.lang.Thread.run(Thread.java:722) Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
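One detail worth noticing in the logged request is rows=2147483647: Lucene's top-N collectors size a priority queue to rows, so a rows value of Integer.MAX_VALUE asks for an array at the JVM's array-size limit before a single hit is returned. A rough back-of-envelope sketch (the per-entry byte count is an assumption, not a measured figure):

```java
public class RowsCostSketch {
    // Rough cost of asking Solr for rows=2147483647: top-N collectors size a
    // priority queue to `rows`, so the request tries to allocate backing
    // storage for that many entries up front.
    public static long queueBytes(long rows, long bytesPerEntry) {
        return rows * bytesPerEntry;
    }

    public static void main(String[] args) {
        long rows = 2147483647L;
        // Even at only 4 bytes per slot (one compressed object reference),
        // the backing array alone is ~8 GB, and the requested array length
        // itself sits right at the JVM's limit.
        System.out.println(queueBytes(rows, 4)); // 8589934588
    }
}
```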
zero-valued retrieval scores
when I search a keyword (such as apple), most of the docs carry 0.0 as score. Here is an example from explain: <str name="http://www.bloomberg.com/slideshow/2013-07-12/world-at-work-india.html"> 0.0 = (MATCH) fieldWeight(content:appl in 51), product of: 1.0 = tf(termFreq(content:appl)=1) 2.096877 = idf(docFreq=5190, maxDocs=15546) 0.0 = fieldNorm(field=content, doc=51) </str> Can somebody help me understand why fieldNorm is 0? What exactly is the formula for computing fieldNorm? Thanks!
Re: zero-valued retrieval scores
Did you put a boost of 0.0 on the documents, as opposed to the default of 1.0? x * 0.0 = 0.0 -- Jack Krupansky -Original Message- From: Joe Zhang Sent: Friday, July 12, 2013 10:31 PM To: solr-user@lucene.apache.org Subject: zero-valued retrieval scores when I search a keyword (such as apple), most of the docs carry 0.0 as score. Here is an example from explain: <str name="http://www.bloomberg.com/slideshow/2013-07-12/world-at-work-india.html"> 0.0 = (MATCH) fieldWeight(content:appl in 51), product of: 1.0 = tf(termFreq(content:appl)=1) 2.096877 = idf(docFreq=5190, maxDocs=15546) 0.0 = fieldNorm(field=content, doc=51) </str> Can somebody help me understand why fieldNorm is 0? What exactly is the formula for computing fieldNorm? Thanks!
Re: zero-valued retrieval scores
Yes, you are right, the boost on these documents is 0. I didn't provide it, though. I suppose the boost scores come from Nutch (yes, my Solr indexes crawled web docs). What could be wrong? Again, what exactly is the formula for fieldNorm? On Fri, Jul 12, 2013 at 8:46 PM, Jack Krupansky j...@basetechnology.com wrote: Did you put a boost of 0.0 on the documents, as opposed to the default of 1.0? x * 0.0 = 0.0 -- Jack Krupansky -Original Message- From: Joe Zhang Sent: Friday, July 12, 2013 10:31 PM To: solr-user@lucene.apache.org Subject: zero-valued retrieval scores when I search a keyword (such as apple), most of the docs carry 0.0 as score. Here is an example from explain: <str name="http://www.bloomberg.com/slideshow/2013-07-12/world-at-work-india.html"> 0.0 = (MATCH) fieldWeight(content:appl in 51), product of: 1.0 = tf(termFreq(content:appl)=1) 2.096877 = idf(docFreq=5190, maxDocs=15546) 0.0 = fieldNorm(field=content, doc=51) </str> Can somebody help me understand why fieldNorm is 0? What exactly is the formula for computing fieldNorm? Thanks!
Re: zero-valued retrieval scores
For the calculation of norm, see note number 6: http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html You would need to talk to the Nutch guys to see why THEY are setting document boost to 0.0. -- Jack Krupansky -Original Message- From: Joe Zhang Sent: Friday, July 12, 2013 11:57 PM To: solr-user@lucene.apache.org Subject: Re: zero-valued retrieval scores Yes, you are right, the boost on these documents is 0. I didn't provide it, though. I suppose the boost scores come from Nutch (yes, my Solr indexes crawled web docs). What could be wrong? Again, what exactly is the formula for fieldNorm? On Fri, Jul 12, 2013 at 8:46 PM, Jack Krupansky j...@basetechnology.com wrote: Did you put a boost of 0.0 on the documents, as opposed to the default of 1.0? x * 0.0 = 0.0 -- Jack Krupansky -Original Message- From: Joe Zhang Sent: Friday, July 12, 2013 10:31 PM To: solr-user@lucene.apache.org Subject: zero-valued retrieval scores when I search a keyword (such as apple), most of the docs carry 0.0 as score. Here is an example from explain: <str name="http://www.bloomberg.com/slideshow/2013-07-12/world-at-work-india.html"> 0.0 = (MATCH) fieldWeight(content:appl in 51), product of: 1.0 = tf(termFreq(content:appl)=1) 2.096877 = idf(docFreq=5190, maxDocs=15546) 0.0 = fieldNorm(field=content, doc=51) </str> Can somebody help me understand why fieldNorm is 0? What exactly is the formula for computing fieldNorm? Thanks!
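The note Jack points to boils down to norm = docBoost * fieldBoost * lengthNorm, where the classic (DefaultSimilarity) lengthNorm is 1/sqrt(number of terms in the field). A tiny sketch showing how a document boost of 0.0 zeroes fieldNorm; this is an illustrative re-implementation, not Lucene's actual code (Lucene also compresses the result into a single byte, losing precision):

```java
public class NormSketch {
    // Classic Lucene index-time norm, per note 6 of the TFIDFSimilarity
    // javadoc: norm = docBoost * fieldBoost * (1 / sqrt(numTermsInField)).
    public static float norm(float docBoost, float fieldBoost, int numTerms) {
        return docBoost * fieldBoost * (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // A document boost of 0.0 (as Nutch apparently supplied here) zeroes
        // the whole product, so fieldNorm and therefore the score are 0.
        System.out.println(norm(0.0f, 1.0f, 100)); // 0.0
        System.out.println(norm(1.0f, 1.0f, 100)); // 0.1
    }
}
```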
Re: zero-valued retrieval scores
Thanks, Jack! On Fri, Jul 12, 2013 at 9:37 PM, Jack Krupansky j...@basetechnology.com wrote: For the calculation of norm, see note number 6: http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html You would need to talk to the Nutch guys to see why THEY are setting document boost to 0.0. -- Jack Krupansky -Original Message- From: Joe Zhang Sent: Friday, July 12, 2013 11:57 PM To: solr-user@lucene.apache.org Subject: Re: zero-valued retrieval scores Yes, you are right, the boost on these documents is 0. I didn't provide it, though. I suppose the boost scores come from Nutch (yes, my Solr indexes crawled web docs). What could be wrong? Again, what exactly is the formula for fieldNorm? On Fri, Jul 12, 2013 at 8:46 PM, Jack Krupansky j...@basetechnology.com wrote: Did you put a boost of 0.0 on the documents, as opposed to the default of 1.0? x * 0.0 = 0.0 -- Jack Krupansky -Original Message- From: Joe Zhang Sent: Friday, July 12, 2013 10:31 PM To: solr-user@lucene.apache.org Subject: zero-valued retrieval scores when I search a keyword (such as apple), most of the docs carry 0.0 as score. Here is an example from explain: <str name="http://www.bloomberg.com/slideshow/2013-07-12/world-at-work-india.html"> 0.0 = (MATCH) fieldWeight(content:appl in 51), product of: 1.0 = tf(termFreq(content:appl)=1) 2.096877 = idf(docFreq=5190, maxDocs=15546) 0.0 = fieldNorm(field=content, doc=51) </str> Can somebody help me understand why fieldNorm is 0? What exactly is the formula for computing fieldNorm? Thanks!