How to reindex in solr
Hi all, I have my Solr index fully built, and I have now added a new field to the schema which is a copyField of another field. Please suggest how I can reindex Solr without going through the full process I used the first time, because some fields' data is very time-consuming to obtain. I have been trying to reindex from SolrJ; it indexes well except for the fields marked stored=false. On the production servers this reindexing is now failing with an out-of-memory error (swap space). Please suggest a good method to reindex from the Lucene indexes, even for fields with stored=false. -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-reindex-in-solr-tp3550871p3550871.html Sent from the Solr - User mailing list archive at Nabble.com.
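One way to keep memory bounded while re-adding documents is to walk the index in small pages with the start and rows parameters and re-submit each batch, rather than holding everything at once. A rough sketch of just the paging arithmetic (the document count and batch size are arbitrary example values):

```python
def pages(total_docs, batch_size):
    """Yield (start, rows) windows for walking an index with ?start=...&rows=..."""
    for start in range(0, total_docs, batch_size):
        yield start, min(batch_size, total_docs - start)

# e.g. 2,500 documents fetched 1,000 at a time:
# each window would become one query plus one re-add batch
for start, rows in pages(2500, 1000):
    print(start, rows)
```

Note that this only recovers fields with stored=true; values indexed with stored=false are not retrievable from the index this way, which is the limitation the question runs into.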
Re: Seek past EOF
We are using ext3 on Debian. Noticed today that I only need to reload the core to get it working again...

On 30 November 2011 19:59, Simon Willnauer simon.willna...@googlemail.com wrote:
Can you give us some details about what filesystem you are using? simon

On Wed, Nov 30, 2011 at 3:07 PM, Ruben Chadien ruben.chad...@aspiro.com wrote:
Happened again... I have 3 directories in my index dir:
4096 Nov 4 09:31 index.2004083156
4096 Nov 21 10:04 index.2021090440
4096 Nov 30 14:55 index.2029024919
As you can see, the first two are old and also empty; the last one, from today, contains 9 files, none of them 0 size, with a total size of 7 GB. The size of the index on the master is 14 GB. Any ideas on what to look for? Thanks, Ruben Chadien

On 29 November 2011 15:58, Mark Miller markrmil...@gmail.com wrote:
Hmm... I've seen a bug like this, but I don't think it would be tickled if you are replicating config files... It definitely looks related though... I'll try to dig around. Next time it happens, take a look on the slave for 0 size files - also whether the index dir on the slave is plain 'index' or has a timestamp as part of the name (eg timestamp.index).

On Tue, Nov 29, 2011 at 9:53 AM, Ruben Chadien ruben.chad...@aspiro.com wrote:
Hi, for the moment there are no 0 sized files, but all indexes are working now. I will have to look next time it breaks. Yes, the directory name is index, and it replicates the schema and a synonyms file. /Ruben Chadien

On 29 November 2011 15:29, Mark Miller markrmil...@gmail.com wrote:
Also, on your master, what is the name of the index directory? Just 'index'? And are you replicating config files as well or not?

On Nov 29, 2011, at 9:23 AM, Mark Miller wrote:
Does the problem index have any 0 size files in it?

On Nov 29, 2011, at 2:54 AM, Ruben Chadien wrote:
Hi all, after upgrading to Solr 3.4 we are having trouble with replication. The setup is one indexing master with a few slaves that replicate the indexes once every night.
The largest index is 20 GB, and the master and slaves are on the same DMZ. Almost every night one of the indexes (17 in total) fails after replication with a "seek past EOF" error:

SEVERE: Error during auto-warming of key:org.apache.solr.search.QueryResultKey@bda006e3 :java.io.IOException: seek past EOF
at org.apache.lucene.store.MMapDirectory$MMapIndexInput.seek(MMapDirectory.java:347)
at org.apache.lucene.index.SegmentTermEnum.seek(SegmentTermEnum.java:114)
at org.apache.lucene.index.TermInfosReader.seekEnum(TermInfosReader.java:203)
at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:273)
at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:210)
at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:507)
at org.apache.solr.search.SolrIndexReader.docFreq(SolrIndexReader.java:309)
at org.apache.lucene.search.TermQuery$TermWeight$1.add(TermQuery.java:56)
at org.apache.lucene.util.ReaderUtil$Gather.run(ReaderUtil.java:77)
at org.apache.lucene.util.ReaderUtil$Gather.run(ReaderUtil.java:82)

After a restart the errors are gone. Anyone else seen this? Thanks, Ruben Chadien

- Mark Miller lucidimagination.com

-- *Ruben Chadien* Senior Developer, Mobile +47 900 35 371, ruben.chad...@aspiro.com, *Aspiro Music AS*, Øvre Slottsgate 25, P.O. Box 8710 Youngstorget, N-0028 Oslo, Tel +47 452 86 900, fax +47 22 37 36 59, www.aspiro.com/music
Problem with hunspell french dictionary
Hi, I'm trying to add the HunspellStemFilterFactory to my Solr project, on a fresh new download of Solr 3.5. I downloaded the French dictionary here (found via http://wiki.services.openoffice.org/wiki/Dictionaries#French_.28France.2C_29): http://www.dicollecte.org/download/fr/hunspell-fr-moderne-v4.3.zip

But when I start Solr and go to the Solr Analysis page, an error occurs. Here is the trace:

java.lang.RuntimeException: Unable to load hunspell data! [dictionary=en_GB.dic,affix=fr-moderne.aff]
at org.apache.solr.analysis.HunspellStemFilterFactory.inform(HunspellStemFilterFactory.java:82)
at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:546)
at org.apache.solr.schema.IndexSchema.init(IndexSchema.java:126)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:461)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:316)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:207)
at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:130)
at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:94)
at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)
at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1282)
at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
at org.mortbay.jetty.Server.doStart(Server.java:224)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:985)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.mortbay.start.Main.invokeMain(Main.java:194)
at org.mortbay.start.Main.start(Main.java:534)
at org.mortbay.start.Main.start(Main.java:441)
at org.mortbay.start.Main.main(Main.java:119)
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 3
at java.lang.String.charAt(Unknown Source)
at org.apache.lucene.analysis.hunspell.HunspellDictionary$DoubleASCIIFlagParsingStrategy.parseFlags(HunspellDictionary.java:382)
at org.apache.lucene.analysis.hunspell.HunspellDictionary.parseAffix(HunspellDictionary.java:165)
at org.apache.lucene.analysis.hunspell.HunspellDictionary.readAffixFile(HunspellDictionary.java:121)
at org.apache.lucene.analysis.hunspell.HunspellDictionary.init(HunspellDictionary.java:64)
at org.apache.solr.analysis.HunspellStemFilterFactory.inform(HunspellStemFilterFactory.java:46)

I can't find where the problem is. It seems like my dictionary isn't well formed for hunspell, but I tried two different dictionaries and had the same problem. I also tried with an English dictionary, and... it works! So I think my French dictionary is wrong for hunspell, but I don't know why... Can you help me?
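One detail worth checking: the trace reports dictionary=en_GB.dic paired with affix=fr-moderne.aff, i.e. an English .dic with a French .aff. Both files of the same language normally need to be configured together. A hypothetical schema.xml fragment showing a consistent pair (the field type name is made up; the file names follow the downloaded archive):

```xml
<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- dictionary and affix should come from the same hunspell package -->
    <filter class="solr.HunspellStemFilterFactory"
            dictionary="fr-moderne.dic"
            affix="fr-moderne.aff"
            ignoreCase="true"/>
  </analyzer>
</fieldType>
```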
Re: mysolr python client
Sounds great for a Python project I'm involved in right now. I'll take a deeper look at it. thx marco 2011/11/30 Marco Martinez mmarti...@paradigmatecnologico.com Hi all, For anyone interested, recently I've been using a new Solr client for Python. It's easy and pretty well documented. If you're interested its site is: *http://mysolr.redtuna.org/* bye! Marco Martínez Bautista http://www.paradigmatecnologico.com Avenida de Europa, 26. Ática 5. 3ª Planta 28224 Pozuelo de Alarcón Tel.: 91 352 59 42
Re: mysolr python client
On 11/30/2011 05:40 PM, Marco Martinez wrote: For anyone interested, recently I've been using a new Solr client for Python. It's easy and pretty well documented. If you're interested its site is: http://mysolr.redtuna.org/ Do you know what advantages it has over pysolr or solrpy? On the page it only says mysolr was born to be a fast and easy-to-use client for Apache Solr’s API and because existing Python clients didn’t fulfill these conditions. Thanks, Jens
Re: Weird docs-id clustering output in Solr 1.4.1
Hi Stanislaw, did you already have time to create a patch? If not, can you please tell me which lines in which class in the source code are relevant? Thanks and regards, Vadim Kisselmann

2011/11/29 Vadim Kisselmann v.kisselm...@googlemail.com:
Hi, the quick and dirty way sounds good :) It would be great if you could send me a patch for 1.4.1. By the way, I tested Solr 3.5 with my 1.4.1 test index. I can search and optimize, but clustering doesn't work (java.lang.Integer cannot be cast to java.lang.String). The uniqueKey for my docs is the id (sint). This was the error message:

Problem accessing /solr/select/. Reason: Carrot2 clustering failed
org.apache.solr.common.SolrException: Carrot2 clustering failed
at org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.cluster(CarrotClusteringEngine.java:217)
at org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:91)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String
at org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.getDocuments(CarrotClusteringEngine.java:364)
at org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.cluster(CarrotClusteringEngine.java:201)
... 23 more

In this case it's better for me to upgrade/patch the 1.4.1 version. Best regards, Vadim

2011/11/29 Stanislaw Osinski stanislaw.osin...@carrotsearch.com:
But my actual live system works on Solr 1.4.1. I can only change my solrconfig.xml and integrate new packages... I'll check the possibility of upgrading from 1.4.1 to 3.5 with the same index (without reindexing) with luceneMatchVersion 2.9. I hope it works...
Another option would be to check out the Solr 1.4.1 source code, fix the issue and recompile the clustering component. The quick and dirty way would be to convert all identifiers to strings in the clustering component before they are returned for serialization (I can send you a patch that does this). The proper way would be to fix the root cause of the problem, but I'd need to dig deeper into the code to find it. Staszek
Re: Problem with hunspell french dictionary
It seems there's a problem with the code parsing the dictionary. Can you open a JIRA issue with the same information so we can look into fixing it?

On Thu, Dec 1, 2011 at 10:14 PM, Nathan Castelein nathan.castel...@gmail.com wrote:
Hi, I'm trying to add the HunspellStemFilterFactory to my Solr project, on a fresh new download of Solr 3.5. I downloaded the French dictionary here (found via http://wiki.services.openoffice.org/wiki/Dictionaries#French_.28France.2C_29): http://www.dicollecte.org/download/fr/hunspell-fr-moderne-v4.3.zip
But when I start Solr and go to the Solr Analysis page, an error occurs: java.lang.RuntimeException: Unable to load hunspell data! [dictionary=en_GB.dic,affix=fr-moderne.aff], caused by java.lang.StringIndexOutOfBoundsException: String index out of range: 3 (the full stack trace is in the original message above). I can't find where the problem is. It seems like my dictionary isn't well formed for hunspell, but I tried two different dictionaries and had the same problem. I also tried with an English dictionary, and... it works! So I think my French dictionary is wrong for hunspell, but I don't know why... Can you help me?

-- Chris Male | Software Developer | DutchWorks | www.dutchworks.nl
Error in New Solr version
Hi, I am migrating from Solr 1.4 to Solr 3.2. I am getting the error below in my logs: org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.component.CollapseComponent'. I could not find a satisfactory solution on Google. Please help. Thanks, Pawan
Re: Error in New Solr version
Hi, comment out the lines with the collapse component in your solrconfig.xml if you don't need it. Otherwise, you're missing the right jars for this component, or the paths to these jars in your solrconfig.xml are wrong. Regards, vadim 2011/12/1 Pawan Darira pawan.dar...@gmail.com Hi, I am migrating from Solr 1.4 to Solr 3.2. I am getting the error below in my logs: org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.component.CollapseComponent'. I could not find a satisfactory solution on Google. Please help. Thanks, Pawan
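If the component is still needed, the pieces to check in solrconfig.xml look roughly like this (the jar name, path, and component name are illustrative, not taken from the original message):

```xml
<!-- the jar providing the component; adjust the path for your layout -->
<lib path="../../lib/collapse-component.jar"/>

<!-- comment out or remove this registration if collapsing is not needed -->
<searchComponent name="collapse"
                 class="org.apache.solr.handler.component.CollapseComponent"/>
```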
Re: make fuzzy search for phrase
Any solutions? I am just stuck on this. :( -- View this message in context: http://lucene.472066.n3.nabble.com/make-fuzzy-search-for-phrase-tp3542079p3551203.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: mysolr python client
Hi Marco, Great! Maybe you can add it to the Solr wiki? ( http://wiki.apache.org/solr/IntegratingSolr). Regards, Marc. On Thu, Dec 1, 2011 at 10:42 AM, Jens Grivolla j+...@grivolla.net wrote: On 11/30/2011 05:40 PM, Marco Martinez wrote: For anyone interested, recently I've been using a new Solr client for Python. It's easy and pretty well documented. If you're interested its site is: http://mysolr.redtuna.org/ Do you know what advantages it has over pysolr or solrpy? On the page it only says "mysolr was born to be a fast and easy-to-use client for Apache Solr's API and because existing Python clients didn't fulfill these conditions." Thanks, Jens
Re: mysolr python client
Done! Marco Martínez Bautista http://www.paradigmatecnologico.com Avenida de Europa, 26. Ática 5. 3ª Planta 28224 Pozuelo de Alarcón Tel.: 91 352 59 42 2011/12/1 Marc SCHNEIDER marc.schneide...@gmail.com Hi Marco, Great! Maybe you can add it on the Solr wiki? ( http://wiki.apache.org/solr/IntegratingSolr). Regards, Marc. On Thu, Dec 1, 2011 at 10:42 AM, Jens Grivolla j+...@grivolla.net wrote: On 11/30/2011 05:40 PM, Marco Martinez wrote: For anyone interested, recently I've been using a new Solr client for Python. It's easy and pretty well documented. If you're interested its site is: http://mysolr.redtuna.org/ Do you know what advantages it has over pysolr or solrpy? On the page it only says mysolr was born to be a fast and easy-to-use client for Apache Solr’s API and because existing Python clients didn’t fulfill these conditions. Thanks, Jens
Re: Solr and Ping PHP
Hi, I know it's been a while since you posted this question but I'm experiencing the same problem with my instance of Solr (sometimes ping returns false for no visible reason) and I just wonder if you found the solution. Thank you. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-and-Ping-PHP-tp2254214p3550917.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Weird docs-id clustering output in Solr 1.4.1
Hi Vadim, I've had limited connectivity, so I couldn't check out the complete 1.4.1 code and test the changes. Here's what you can try. In this file:

http://svn.apache.org/viewvc/lucene/solr/tags/release-1.4.1/contrib/clustering/src/main/java/org/apache/solr/handler/clustering/carrot2/CarrotClusteringEngine.java?revision=957515&view=markup

around line 216 you will see:

for (Document doc : docs) {
  docList.add(doc.getField(solrId));
}

You need to change this to:

for (Document doc : docs) {
  docList.add(doc.getField(solrId).toString());
}

Let me know if this did the trick. Cheers, S.

On Thu, Dec 1, 2011 at 10:43, Vadim Kisselmann v.kisselm...@googlemail.com wrote:
Hi Stanislaw, did you already have time to create a patch? If not, can you please tell me which lines in which class in the source code are relevant? Thanks and regards, Vadim Kisselmann

2011/11/29 Vadim Kisselmann v.kisselm...@googlemail.com:
Hi, the quick and dirty way sounds good :) It would be great if you could send me a patch for 1.4.1. By the way, I tested Solr 3.5 with my 1.4.1 test index. I can search and optimize, but clustering doesn't work (java.lang.Integer cannot be cast to java.lang.String). The uniqueKey for my docs is the id (sint). The error was the "Carrot2 clustering failed" SolrException, caused by java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String at CarrotClusteringEngine.getDocuments (quoted in full earlier in the thread). In this case it's better for me to upgrade/patch the 1.4.1 version. Best regards, Vadim

2011/11/29 Stanislaw Osinski stanislaw.osin...@carrotsearch.com:
But my actual live system works on Solr 1.4.1. I can only change my solrconfig.xml and integrate new packages... I'll check the possibility of upgrading from 1.4.1 to 3.5 with the same index (without reindexing) with luceneMatchVersion 2.9. I hope it works...
Another option would be to check out the Solr 1.4.1 source code, fix the issue and recompile the clustering component. The quick and dirty way would be to convert all identifiers to strings in the clustering component before they are returned for serialization (I can send you a patch that does this). The proper way would be to fix the root cause of the problem, but I'd need to dig deeper into the code to find it. Staszek
Re: make fuzzy search for phrase
What did you do to install it? What code line did you start from — Solr 1.4? 3.1? A fresh trunk update? What jar? The usual method of applying a patch is to get the entire source tree, apply the patch, and then recompile all of Solr. Perhaps this page will help: http://wiki.apache.org/solr/HowToContribute Note that this patch is a zip file, not in the usual patch format, so doing this may be a bit tricky. Best, Erick On Thu, Dec 1, 2011 at 6:00 AM, meghana meghana.rav...@amultek.com wrote: Any solutions? I am just stuck on this. :( -- View this message in context: http://lucene.472066.n3.nabble.com/make-fuzzy-search-for-phrase-tp3542079p3551203.html Sent from the Solr - User mailing list archive at Nabble.com.
highlight issue
Hi, I am indexing around 2000 names using Solr, with the highlight flag on while querying. For some names the search substring is appended at the start of the highlighted value. Suppose my search query is "Rak" and my database has the name "Rakesh Chaturvedi": I am getting <em>Rak</em><em>Rak</em>esh Chaturvedi as the response. Same with the following names:

Search "Dhar" -- highlight: <em>Dhar</em><em>Dhar</em>mesh Darshan
Search "Suda" -- highlight: <em>Suda</em><em>Suda</em>rshan Faakir

Can someone help me? I am using the following filters for index and query:

<fieldType name="text_autofill" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" preserveOriginal="1"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" preserveOriginal="1"/>
  </analyzer>
</fieldType>

Thanks and Regards, Radha Krishna Reddy.
Re: when using group=true facet numbers are incorrect
https://issues.apache.org/jira/browse/SOLR-2898 has been created for this. Thanx Martijn! -- View this message in context: http://lucene.472066.n3.nabble.com/when-using-group-true-facet-numbers-are-incorrect-tp3488605p3551741.html Sent from the Solr - User mailing list archive at Nabble.com.
(fq=field1:val1 AND field2:val2) VS fq=field1:val1&fq=field2:val2 and filterCache
Hello, is there any difference in the way things are stored in the filterCache if I do fq=(field1:val1 AND field2:val2) or fq=field1:val1&fq=field2:val2, even though these are logically identical? What gets stored exactly? Also, can you point me to where in the Solr source code this processing happens? Thank you. Antoine.
Configuring the DistributedUpdateProcessor
I am currently looking at the latest solrcloud branch and was wondering if there was any documentation on configuring the DistributedUpdateProcessor? What specifically in solrconfig.xml needs to be added/modified to make distributed indexing work?
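On the solrcloud branch, distributed indexing was wired in through an update request processor chain in solrconfig.xml. A sketch along these lines (the chain name and exact processor ordering are assumptions; check the branch's example config for the authoritative version):

```xml
<updateRequestProcessorChain name="distrib-update">
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- forwards each add/delete to the shard leader / replicas -->
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```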
Re: (fq=field1:val1 AND field2:val2) VS fq=field1:val1&fq=field2:val2 and filterCache
Hello, quoting http://wiki.apache.org/solr/SolrCaching#filterCache: "The filter cache stores the results of any filter queries (fq parameters) that Solr is explicitly asked to execute. Each filter is executed and cached separately. When it's time to use them to limit the number of results returned by a query, this is done using set intersections." Finding what best suits your needs probably depends on how field1:val1 and field2:val2 vary together, i.e. whether there is a correlation between issuing field2:val2 and field1:val1 having been issued (or the other way around). Hope this helps ;-) Tanguy Le 01/12/2011 16:01, Antoine LE FLOC'H a écrit : Hello, is there any difference in the way things are stored in the filterCache if I do fq=(field1:val1 AND field2:val2) or fq=field1:val1&fq=field2:val2, even though these are logically identical? What gets stored exactly? Also, can you point me to where in the Solr source code this processing happens? Thank you. Antoine.
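The caching model described in the wiki quote can be sketched with plain Python sets (an illustration only, not Solr's actual DocSet code; the doc-id sets are made up):

```python
# each distinct fq string becomes one cache entry holding its matching doc ids
filter_cache = {}

def cached_filter(fq, compute):
    """Return the cached doc set for an fq, computing and caching it on a miss."""
    if fq not in filter_cache:
        filter_cache[fq] = frozenset(compute())
    return filter_cache[fq]

# two separate fq params: two independently reusable entries,
# intersected at request time
docs_f1 = cached_filter("field1:val1", lambda: {1, 2, 3, 5})
docs_f2 = cached_filter("field2:val2", lambda: {2, 3, 7})
combined_two = docs_f1 & docs_f2

# one fq with AND: a single entry keyed by the whole query string,
# reusable only when exactly the same combined filter is issued again
combined_one = cached_filter("field1:val1 AND field2:val2", lambda: {2, 3})
```

The practical difference is reuse: with separate fq parameters, each filter can be reused by any later query that shares it, while the combined form is reused only on an exact repeat.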
Re: highlight issue
Suppose my search query is "Rak" and my database has the name "Rakesh Chaturvedi": I am getting <em>Rak</em><em>Rak</em>esh Chaturvedi as the response. Same with the following names:

Search "Dhar" -- highlight: <em>Dhar</em><em>Dhar</em>mesh Darshan
Search "Suda" -- highlight: <em>Suda</em><em>Suda</em>rshan Faakir

Can someone help me? I am using the following filters for index and query:

<fieldType name="text_autofill" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" preserveOriginal="1"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" preserveOriginal="1"/>
  </analyzer>
</fieldType>

I don't think the Highlighter can support an n-gram field. Can you try commenting out EdgeNGramFilterFactory, re-indexing, and then highlighting?

koji -- Check out Query Log Visualizer for Apache Solr http://www.rondhuit-demo.com/loganalyzer/loganalyzer.html http://www.rondhuit.com/en/
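To see why the prefix can appear twice, it may help to look at what the index-side analyzer emits: KeywordTokenizer keeps the whole name as one token, WordDelimiterFilter with preserveOriginal adds overlapping tokens, and EdgeNGramFilter then emits every front gram over the same original offsets. A rough Python sketch of the gram generation only (not the actual Lucene implementation):

```python
def edge_ngrams(term, min_gram=1, max_gram=50):
    """Front-side grams, roughly what EdgeNGramFilterFactory(side="front") emits."""
    top = min(max_gram, len(term))
    return [term[:n] for n in range(min_gram, top + 1)]

# all grams of "rakesh" start at offset 0 of the original value; a query for
# "rak" can therefore match several overlapping tokens, and a highlighter that
# marks each match can double the prefix: <em>Rak</em><em>Rak</em>esh ...
print(edge_ngrams("rakesh"))  # ['r', 'ra', 'rak', 'rake', 'rakes', 'rakesh']
```

Since the grams overlap the original token, Koji's suggestion of removing EdgeNGramFilterFactory (or highlighting against a non-grammed copy of the field) avoids the overlapping matches.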
Solr cache size information
Hello, if anybody can help, I'd like to confirm a few things about Solr's cache configuration — specifically how to calculate a cache's size in memory relative to its size in solrconfig.xml.

For the document cache: size in memory = size in solrconfig.xml * average size of all fields defined in the fl parameter?
For the filter cache: size in memory = size in solrconfig.xml * WHAT (the size of an id)? (I don't use the facet.enum method.)
For the query result cache: size in memory = size in solrconfig.xml * the size of an id?

I would also like to know the relation between Solr's cache sizes and the JVM max heap size. If anyone has an answer or a link for further reading to suggest, it would be greatly appreciated. Thanks, Elisabeth
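As a rough back-of-the-envelope model (a sketch under common assumptions, not an authoritative answer): a filterCache entry is often held as a bitset over all documents, i.e. about maxDoc/8 bytes per entry, and a queryResultCache entry holds a window of int doc ids (queryResultWindowSize). The entry counts and index size below are made-up example values:

```python
def filter_cache_bytes(num_entries, max_doc):
    # each cached filter ~ one bit per document in the index
    return num_entries * (max_doc // 8)

def query_result_cache_bytes(num_entries, window_size, bytes_per_doc_id=4):
    # each entry holds up to queryResultWindowSize int doc ids
    return num_entries * window_size * bytes_per_doc_id

# 512 cached filters over a 10M-doc index: ~640 MB worst case
print(filter_cache_bytes(512, 10_000_000))   # 640000000
# 512 query results with a window of 50 ids: ~100 KB
print(query_result_cache_bytes(512, 50))     # 102400
```

These totals all come out of the JVM heap, so the configured cache sizes bound (part of) the heap you must provision; the document cache adds the stored fields of each cached document on top.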
Re: Weird docs-id clustering output in Solr 1.4.1
Hi Stanislaw, unfortunately it doesn't work. I changed the line 216 with the new toString()-part and rebuild the source. still the same behavior, without errors(because of changes). an another line to change? Thanks and regards Vadim 2011/12/1 Stanislaw Osinski stanislaw.osin...@carrotsearch.com Hi Vadim, I've had limited connectivity, so I couldn't check out the complete 1.4.1 code and test the changes. Here's what you can try: In this file: http://svn.apache.org/viewvc/lucene/solr/tags/release-1.4.1/contrib/clustering/src/main/java/org/apache/solr/handler/clustering/carrot2/CarrotClusteringEngine.java?revision=957515view=markup around line 216 you will see: for (Document doc : docs) { docList.add(doc.getField(solrId)); } You need to change this to: for (Document doc : docs) { docList.add(doc.getField(solrId).toString()); } Let me know if this did the trick. Cheers, S. On Thu, Dec 1, 2011 at 10:43, Vadim Kisselmann v.kisselm...@googlemail.comwrote: Hi Stanislaw, did you already have time to create a patch? If not, can you tell me please which lines in which class in source code are relevant? Thanks and regards Vadim Kisselmann 2011/11/29 Vadim Kisselmann v.kisselm...@googlemail.com Hi, the quick and dirty way sound good:) It would be great if you can send me a patch for 1.4.1. By the way, i tested Solr. 3.5 with my 1.4.1 test index. I can search and optimize, but clustering doesn't work (java.lang.Integer cannot be cast to java.lang.String) My uniqieKey for my docs it the id(sint). These here was the error message: Problem accessing /solr/select/. 
Reason: Carrot2 clustering failed

org.apache.solr.common.SolrException: Carrot2 clustering failed
    at org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.cluster(CarrotClusteringEngine.java:217)
    at org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:91)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String
    at org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.getDocuments(CarrotClusteringEngine.java:364)
    at org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.cluster(CarrotClusteringEngine.java:201)
    ... 23 more

In this case it's better for me to upgrade/patch the 1.4.1 version. Best regards Vadim 2011/11/29 Stanislaw Osinski stanislaw.osin...@carrotsearch.com But my actual live system works on Solr 1.4.1. I can only change my solrconfig.xml and integrate new packages... I'll check the possibility of upgrading from 1.4.1 to 3.5 with the same index (without reindexing) with luceneMatchVersion 2.9. I hope it works... Another option would be to check out the Solr 1.4.1 source code, fix the issue and recompile the clustering component. The quick and dirty way would be to convert all identifiers to strings in the
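The root cause in the trace above is a plain Integer-to-String cast failure: the 1.4.1 clustering code effectively casts the uniqueKey value to String, which breaks when the id field is numeric (sint). A minimal stdlib sketch of why the cast fails and why the suggested toString() change works (the 348231 id is borrowed from the thread; the class name is invented for illustration):

```java
// Illustrates the ClassCastException from the stack trace above: a numeric
// uniqueKey arrives as an Integer, and casting it to String fails at runtime;
// calling toString() instead works for any value type.
public class ClusteringIdCastDemo {
    public static void main(String[] args) {
        Object solrId = Integer.valueOf(348231); // id value as handed over by Solr

        boolean castFailed = false;
        try {
            String s = (String) solrId; // roughly what the old clustering code did
            System.out.println(s);
        } catch (ClassCastException e) {
            castFailed = true; // java.lang.Integer cannot be cast to java.lang.String
        }

        String fixed = solrId.toString(); // the fix suggested in this thread
        System.out.println(castFailed + " " + fixed); // prints: true 348231
    }
}
```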
switching on hl.requireFieldMatch reducing highlighted fields returned
I have a query which is highlighting 3 snippets in 1 field, and 1 snippet in another field. By enabling hl.requireFieldMatch, only the latter highlighted field is returned. From this...

lst name=highlighting lst name=348231 arr name=content_stemmed str plc Whetstone Temporary [hl-on]Sales[hl-off] Assistant Customer service Cashier work 08 /str str and customer queries. 07 / 99 – 2003 Debenhams Central London [hl-on]Sales[hl-off] Adviser Customer /str str Central London [hl-on]Sales[hl-off] Assistant Customer service; Visual merchandising; Dealing /str str with telephone enquiries; Assisted in the [hl-on]production[hl-off] of jewellery, e.g. setting stones /str /arr arr name=skills_stemmed str [hl-on]product[hl-off] knowledge [hl-on]sales[hl-off] experience /str /arr /lst /lst

to this...

lst name=highlighting lst name=348231 arr name=skills_stemmed strproduct knowledge [hl-on]sales[hl-off] experience/str /arr /lst /lst

I'm doing this so the word product and its variants are NOT highlighted - they match against a different field.
Re: mysolr python client
Hi Jens, Our objective with mysolr was to create a pythonic Apache Solr binding. But we have also been working on speed and concurrency. We always use the Python QueryResponseWriter, because it saves us dependencies (an XML or JSON parser). We would also like to create a completely concurrent API, but at the moment only querying is working. Our main goal is to keep evolving mysolr with the feedback we receive from the community. I hope I have answered your questions. Thanks for your interest, Rubén Abad rua...@gmail.com On Thu, Dec 1, 2011 at 10:42 AM, Jens Grivolla j+...@grivolla.net wrote: On 11/30/2011 05:40 PM, Marco Martinez wrote: For anyone interested, recently I've been using a new Solr client for Python. It's easy and pretty well documented. If you're interested its site is: http://mysolr.redtuna.org/ Do you know what advantages it has over pysolr or solrpy? On the page it only says mysolr was born to be a fast and easy-to-use client for Apache Solr’s API and because existing Python clients didn’t fulfill these conditions. Thanks, Jens
DataImportHandler w/ multivalued fields
Hello Solr Community! I am implementing a data connection to Solr through the Data Import Handler and non-multivalued fields are working correctly, but multivalued fields are not getting indexed properly. I am new to DataImportHandler, but from what I could find, the entity is the way to go for multivalued fields. The weird thing is that data is being indexed for one row, meaning only the first raw_tag gets populated. Anyone have any ideas? Thanks, Briggs

This is the relevant part of the schema:

<field name="raw_tag" type="text_en_lessAggressive" indexed="true" stored="false" multivalued="true"/>
<field name="raw_tag_string" type="string" indexed="false" stored="true" multivalued="true"/>
<copyField source="raw_tag" dest="raw_tag_string"/>

And the relevant part of data-import.xml:

<document name="merchant">
  <entity name="site" query="select * from site">
    <field column="siteId" name="siteId" />
    <field column="domain" name="domain" />
    <field column="aliasFor" name="aliasFor" />
    <field column="title" name="title" />
    <field column="description" name="description" />
    <field column="requests" name="requests" />
    <field column="requiresModeration" name="requiresModeration" />
    <field column="blocked" name="blocked" />
    <field column="affiliateLink" name="affiliateLink" />
    <field column="affiliateTracker" name="affiliateTracker" />
    <field column="affiliateNetwork" name="affiliateNetwork" />
    <field column="cjMerchantId" name="cjMerchantId" />
    <field column="thumbNail" name="thumbNail" />
    <field column="updateRankings" name="updateRankings" />
    <field column="couponCount" name="couponCount" />
    <field column="category" name="category" />
    <field column="adult" name="adult" />
    <field column="rank" name="rank" />
    <field column="redirectsTo" name="redirectsTo" />
    <field column="wwwRequired" name="wwwRequired" />
    <field column="avgSavings" name="avgSavings" />
    <field column="products" name="products" />
    <field column="nameChecked" name="nameChecked" />
    <field column="tempFlag" name="tempFlag" />
    <field column="created" name="created" />
    <field column="enableSplitTesting" name="enableSplitTesting" />
    <field column="affiliateLinklock" name="affiliateLinklock" />
    <field column="hasMobileSite" name="hasMobileSite" />
    <field column="blockSite" name="blockSite" />
    <entity name="merchant_tags" pk="siteId" query="select raw_tag, freetags.id, freetagged_objects.object_id as siteId from freetags inner join freetagged_objects on freetags.id=freetagged_objects.tag_id where freetagged_objects.object_id='${site.siteId}'">
      <field column="raw_tag" name="raw_tag"/>
    </entity>
  </entity>
</document>
spatial search or null
Hi, how would I go about constructing a solr 3.2 spatial query that would return documents that are in a specified radius OR documents that have no location information. The query would have a similar result as this: q=City:San Diego OR -City:['' TO *] Thanks
RE: Solr cache size information
For Filter cache size in memory = size in solrconfig.xml * WHAT (the size of an id) ??? (I don't use facet.enum method) As I understand it, size is the number of queries that will be cached. In my short experience, the memory consumed is data dependent: if a huge number of documents match an fq, then the cached entry will be very large; if you get a single match, the cached result takes much less memory. I don't know if there is a way to bound the cache by memory rather than by number of entries. I think all of the Solr caches behave this way, but I am not sure.
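To put rough numbers on the point above: for filters that match many documents, a filterCache entry is typically held as a bitset over all documents in the index, i.e. about maxDoc / 8 bytes. The figures below are hypothetical, and the per-entry model is a worst-case approximation, not an exact formula:

```java
// Back-of-envelope worst case for filterCache memory: assume every cached
// entry is a full bitset of maxDoc bits (maxDoc / 8 bytes). Real usage is
// usually lower, because sparse results can be stored more compactly.
public class FilterCacheEstimate {
    public static void main(String[] args) {
        long maxDoc = 10_000_000L;       // hypothetical index size
        long cacheSize = 512L;           // "size" attribute from solrconfig.xml
        long bytesPerEntry = maxDoc / 8; // 1,250,000 bytes per dense entry
        long worstCaseBytes = bytesPerEntry * cacheSize;
        System.out.println(worstCaseBytes); // prints: 640000000 (~640 MB)
    }
}
```

This kind of estimate is mainly useful for sanity-checking the cache "size" attribute against the JVM heap you have allocated.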
Re: spatial search or null
Recently had this myself... http://wiki.apache.org/solr/SpatialSearch#How_to_combine_with_a_sub-query_to_expand_results -- IntelCompute Web Design and Online Marketing http://www.intelcompute.com -Original Message- From: dan whelan d...@adicio.com Subject: spatial search or null [original message quoted above snipped]
Re: DataImportHandler w/ multivalued fields
In addition, I tried a query like the one below and changed the column definition to field column=raw_tag name=raw_tag splitBy=, / and still no luck. It is indexing the full content now, but not as multivalued. It seems like the splitBy isn't working properly. select group_concat(freetags.raw_tag separator ', ') as raw_tag, site.* from site left outer join (freetags inner join freetagged_objects) on (freetags.id = freetagged_objects.tag_id and site.siteId = freetagged_objects.object_id) group by site.siteId Am I doing something wrong? Thanks, Briggs Thompson On Thu, Dec 1, 2011 at 11:46 AM, Briggs Thompson w.briggs.thomp...@gmail.com wrote: Hello Solr Community! I am implementing a data connection to Solr through the Data Import Handler and non-multivalued fields are working correctly, but multivalued fields are not getting indexed properly. I am new to DataImportHandler, but from what I could find, the entity is the way to go for multivalued fields. The weird thing is that data is being indexed for one row, meaning only the first raw_tag gets populated. Anyone have any ideas?
Thanks, Briggs [quoted schema and data-import.xml snipped]
Re: DataImportHandler w/ multivalued fields
Hi Briggs, By saying multivalued fields are not getting indexed properly, do you mean to say that you are not able to search on those fields? Have you tried actually searching your Solr index for those multivalued terms to make sure it returns the search results? One possibility could be that the multivalued fields are getting indexed correctly and are searchable. However, since your schema.xml has a raw_tag field whose stored attribute is set to false, you may not be able to see those fields in the results. On Thu, Dec 1, 2011 at 1:43 PM, Briggs Thompson w.briggs.thomp...@gmail.com wrote: In addition, I tried a query like the one below and changed the column definition to field column=raw_tag name=raw_tag splitBy=, / and still no luck. It is indexing the full content now, but not as multivalued. It seems like the splitBy isn't working properly. select group_concat(freetags.raw_tag separator ', ') as raw_tag, site.* from site left outer join (freetags inner join freetagged_objects) on (freetags.id = freetagged_objects.tag_id and site.siteId = freetagged_objects.object_id) group by site.siteId Am I doing something wrong? Thanks, Briggs Thompson On Thu, Dec 1, 2011 at 11:46 AM, Briggs Thompson w.briggs.thomp...@gmail.com wrote: Hello Solr Community! I am implementing a data connection to Solr through the Data Import Handler and non-multivalued fields are working correctly, but multivalued fields are not getting indexed properly. I am new to DataImportHandler, but from what I could find, the entity is the way to go for multivalued fields. The weird thing is that data is being indexed for one row, meaning only the first raw_tag gets populated. Anyone have any ideas?
Thanks, Briggs [quoted schema and data-import.xml snipped] -- Thanks and Regards Rahul A. Warawdekar
Re: DataImportHandler w/ multivalued fields
Hey Rahul, Thanks for the response. I actually just figured it out, thankfully :). To answer your question, raw_tag is indexed and not stored (tokenized), and there is a copyField from raw_tag to raw_tag_string which is used for facets. That *should have* been displayed in the results. The silly mistake I made was not camel-casing multiValued, which was clearly the source of the problem. The second email I sent, changing the query and using splitBy for the multivalued field, had an error in it in the form of a missing attribute: transformer=RegexTransformer in the entity declaration. Anyhow, thanks for the quick response! Briggs On Thu, Dec 1, 2011 at 12:57 PM, Rahul Warawdekar rahul.warawde...@gmail.com wrote: Hi Briggs, By saying multivalued fields are not getting indexed properly, do you mean to say that you are not able to search on those fields? Have you tried actually searching your Solr index for those multivalued terms to make sure it returns the search results? One possibility could be that the multivalued fields are getting indexed correctly and are searchable. However, since your schema.xml has a raw_tag field whose stored attribute is set to false, you may not be able to see those fields in the results. On Thu, Dec 1, 2011 at 1:43 PM, Briggs Thompson w.briggs.thomp...@gmail.com wrote: In addition, I tried a query like the one below and changed the column definition to field column=raw_tag name=raw_tag splitBy=, / and still no luck. It is indexing the full content now, but not as multivalued. It seems like the splitBy isn't working properly. select group_concat(freetags.raw_tag separator ', ') as raw_tag, site.* from site left outer join (freetags inner join freetagged_objects) on (freetags.id = freetagged_objects.tag_id and site.siteId = freetagged_objects.object_id) group by site.siteId Am I doing something wrong? Thanks, Briggs Thompson On Thu, Dec 1, 2011 at 11:46 AM, Briggs Thompson w.briggs.thomp...@gmail.com wrote: Hello Solr Community!
[rest of quoted thread snipped]
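For reference, the splitBy fix discussed in this thread works by regex-splitting one concatenated column into multiple field values. A stdlib sketch of that behavior (the tag values are invented for illustration, and this only approximates DIH's RegexTransformer rather than calling it):

```java
import java.util.Arrays;
import java.util.List;

// Approximates what DIH's RegexTransformer does with a splitBy regex: the
// single group_concat'ed column is split into one value per tag, which is
// what a multiValued field needs to receive.
public class SplitBySketch {
    public static void main(String[] args) {
        String rawTag = "coupons, shopping, deals"; // hypothetical group_concat output
        List<String> values = Arrays.asList(rawTag.split(",\\s*"));
        System.out.println(values); // prints: [coupons, shopping, deals]
    }
}
```

Without the transformer declared on the entity, the splitBy attribute is ignored and the whole concatenated string lands in the field as one value, which matches the symptom reported above.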
Dealing with dashes with solr.PatternReplaceCharFilterFactory
Hi all, We're encountering a problem with querying terms with dashes (and other non-alphanumeric characters). For example, we use PatternReplaceCharFilterFactory to replace dashes with blank characters for both index and query, however any terms with dashes in them will not return any results. For example: searching for 'cdka' won't return any results, even though 'cdka-1' should be indexed. This is similar to the problem posted here ( http://stackoverflow.com/questions/6459695/solr-ngramtokenizerfactory-and-patternreplacecharfilterfactory-analyzer-result) without a response. The following is the relevant part of the schema:

<fieldType name="edge_ngram" class="solr.TextField" positionIncrementGap="1">
  <analyzer type="index">
    <charfilter class="solr.PatternReplaceCharFilterFactory" pattern="-" replacement=""/>
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" splitOnNumerics="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front" />
  </analyzer>
  <analyzer type="query">
    <charfilter class="solr.PatternReplaceCharFilterFactory" pattern="-" replacement=""/>
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" splitOnNumerics="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
  </analyzer>
</fieldType>

<fields>
  <field name="names_auto" type="edge_ngram" indexed="true" stored="true" multiValued="false" />
  ..
</fields>

Thanks for any help anyone can provide! Aaron
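For what it's worth, here is a stdlib approximation of what the index-side chain above is intended to produce for 'cdka-1': dash stripped, lowercased, then front edge n-grams of 2 to 15 characters. It suggests 'cdka' should be among the indexed grams, which points at a query-side or configuration mismatch rather than the n-gram logic itself. This sketch only mimics the analyzers; it does not invoke Solr's actual filter classes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Mimics the index-side analysis above: strip dashes (PatternReplaceCharFilter),
// lowercase (LowerCaseFilter), then emit front edge n-grams of length 2..15
// (EdgeNGramFilter with side="front").
public class EdgeNGramSketch {
    static List<String> analyze(String term) {
        String t = term.replace("-", "").toLowerCase(Locale.ROOT);
        List<String> grams = new ArrayList<>();
        for (int n = 2; n <= Math.min(15, t.length()); n++) {
            grams.add(t.substring(0, n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(analyze("cdka-1")); // prints: [cd, cdk, cdka, cdka1]
    }
}
```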
Re: Configuring the Distributed
On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson jej2...@gmail.com wrote: I am currently looking at the latest solrcloud branch and was wondering if there was any documentation on configuring the DistributedUpdateProcessor? What specifically in solrconfig.xml needs to be added/modified to make distributed indexing work? Hi Jaime - take a look at solrconfig-distrib-update.xml in solr/core/src/test-files You need to enable the update log, add an empty replication handler def, and an update chain with solr.DistributedUpdateProcessFactory in it. -- - Mark http://www.lucidimagination.com
Re: Error in New Solr version
You are using the uncommitted field collapse component for 1.4.x. On 3.x the field collapse component is not that anymore. You must remove it and configure the out-of-the-box one. On Thu, Dec 1, 2011 at 11:34 AM, Vadim Kisselmann v.kisselm...@googlemail.com wrote: Hi, comment out the lines with the collapse component in your solrconfig.xml if you don't need it. Otherwise, you're missing the right jars for this component, or the paths to these jars in your solrconfig.xml are wrong. Regards, Vadim 2011/12/1 Pawan Darira pawan.dar...@gmail.com Hi I am migrating from Solr 1.4 to Solr 3.2. I am getting the below error in my logs org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.component.CollapseComponent Could not find a satisfactory solution on Google. Please help thanks Pawan -- Un saludo, Samuel García.
Re: mysolr python client
Nice job, pythonic solr access!! Thanks for the effort On Thu, Dec 1, 2011 at 5:53 PM, Rubén Abad rua...@gmail.com wrote: [quoted message snipped] -- Whether it's science, technology, personal experience, true love, astrology, or gut feelings, each of us has confidence in something that we will never fully comprehend. --Roy H. William
Re: Configuring the Distributed
Thanks I will try this first thing in the morning. On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller markrmil...@gmail.com wrote: [quoted message snipped]
Re: Configuring the Distributed
Another question, is there any support for repartitioning of the index if a new shard is added? What is the recommended approach for handling this? It seemed that the hashing algorithm (and probably any) would require the index to be repartitioned should a new shard be added. On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson jej2...@gmail.com wrote: [quoted thread snipped]
Re: Configuring the Distributed
Not yet - we don't plan on working on this until a lot of other stuff is working solid at this point. But someone else could jump in! There are a couple of ways to go about it that I know of: A more long term solution may be to start using micro shards - each index starts as multiple indexes. This makes it pretty fast to move micro shards around as you decide to change partitions. It's also less flexible, as you are limited by the number of micro shards you start with. A more simple and likely first step is to use an index splitter. We already have one in lucene contrib - we would just need to modify it so that it splits based on the hash of the document id. This is super flexible, but splitting will obviously take a little while on a huge index. The current index splitter is a multi-pass splitter - good enough to start with, but with most files under codec control these days, we may be able to make a single-pass splitter soon as well. Eventually you could imagine using both options - micro shards that could also be split as needed. Though I still wonder if micro shards will be worth the extra complications myself... Right now though, the idea is that you should pick a good number of partitions to start given your expected data ;) Adding more replicas is trivial though. - Mark On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson jej2...@gmail.com wrote: Another question, is there any support for repartitioning of the index if a new shard is added? What is the recommended approach for handling this? It seemed that the hashing algorithm (and probably any) would require the index to be repartitioned should a new shard be added. On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson jej2...@gmail.com wrote: Thanks I will try this first thing in the morning.
On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller markrmil...@gmail.com wrote: [quoted message snipped] -- - Mark http://www.lucidimagination.com
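The splitting-by-hash idea discussed above can be sketched with plain hashing arithmetic: route each document id by hash modulo the shard count, and note that doubling the modulus divides each shard's documents between exactly two new shards. This uses String.hashCode purely for illustration; it is not Solr's actual routing code:

```java
// Sketch of hash-based doc routing and shard splitting: with 2n shards,
// every doc lands in a shard derived from its original one, because
// (h mod 2n) mod n == h mod n -- a split only ever divides one shard
// into two halves, it never scatters its docs across other shards.
public class HashSplitSketch {
    static int shardOf(String docId, int numShards) {
        return Math.floorMod(docId.hashCode(), numShards);
    }

    public static void main(String[] args) {
        int numShards = 4;
        for (String id : new String[] {"doc-1", "doc-2", "doc-3"}) {
            int before = shardOf(id, numShards);
            int after = shardOf(id, numShards * 2); // after splitting every shard
            // `after` is always either `before` or `before + numShards`
            System.out.println(id + ": " + before + " -> " + after
                    + " (same half: " + (after % numShards == before) + ")");
        }
    }
}
```

This is also why a hash-aware index splitter is attractive: each existing shard can be split independently without touching documents that hash elsewhere.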
Re: Configuring the Distributed
I am not familiar with the index splitter that is in contrib, but I'll take a look at it soon. So the process sounds like it would be to run this on all of the current shards' indexes based on the hash algorithm. Is there also an index merger in contrib which could be used to merge indexes? I'm assuming this would be the process? On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller markrmil...@gmail.com wrote: [quoted message snipped]
Re: Configuring the Distributed
On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:

I am not familiar with the index splitter that is in contrib, but I'll take a look at it soon. So the process sounds like it would be to run this on all of the current shards' indexes based on the hash algorithm.

Not something I've thought deeply about myself yet, but I think the idea would be to split as many as you felt you needed to. If you wanted to keep the full balance always, this would mean splitting every shard at once, yes. But this depends on how many boxes (partitions) you are willing/able to add at a time. You might just split one index to start - now its hash range would be handled by two shards instead of one (if you have 3 replicas per shard, this would mean adding 3 more boxes). When you needed to expand again, you would split another index that was still handling its full starting range. As you grow, once you split every original index, you'd start again, splitting one of the now-half ranges.

Is there also an index merger in contrib which could be used to merge indexes? I'm assuming this would be the process?

You can merge with IndexWriter.addIndexes (Solr also has an admin command that can do this). But I'm not sure where this fits in?

- Mark Miller
lucidimagination.com
Re: Configuring the Distributed
Hmmm. This doesn't sound like the hashing algorithm that's on the branch, right? The algorithm you're mentioning sounds like there is some logic which is able to tell that a particular range should be distributed between 2 shards instead of 1. So it seems like a trade-off between repartitioning the entire index (on every shard) and having a custom hashing algorithm which is able to handle the situation where 2 or more shards map to a particular range.
Multithreaded DIH bug
I'm trying to use multiple threads with DIH but I keep receiving the following error: "Operation not allowed after ResultSet closed". Is there any way I can fix this?

Dec 1, 2011 4:38:47 PM org.apache.solr.common.SolrException log
SEVERE: Full Import failed:java.lang.RuntimeException: Error in multi-threaded import
        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: java.sql.SQLException: Operation not allowed after ResultSet closed
        at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:64)
        at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:339)
        at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$600(JdbcDataSource.java:228)
        at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:262)
        at org.apache.solr.handler.dataimport.CachedSqlEntityProcessor.getAllNonCachedRows(CachedSqlEntityProcessor.java:72)
        at org.apache.solr.handler.dataimport.EntityProcessorBase.getIdCacheData(EntityProcessorBase.java:201)
        at org.apache.solr.handler.dataimport.CachedSqlEntityProcessor.nextRow(CachedSqlEntityProcessor.java:60)
        at org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper.nextRow(ThreadedEntityProcessorWrapper.java:84)
        at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.runAThread(DocBuilder.java:449)
        at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.run(DocBuilder.java:402)
        at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.runAThread(DocBuilder.java:469)
        at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.access$000(DocBuilder.java:356)
        at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner$1.run(DocBuilder.java:409)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:636)
Caused by: java.sql.SQLException: Operation not allowed after ResultSet closed
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:987)
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:982)
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:927)
        at com.mysql.jdbc.ResultSetImpl.checkClosed(ResultSetImpl.java:794)
        at com.mysql.jdbc.ResultSetImpl.next(ResultSetImpl.java:7139)
        at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:331)
        ... 14 more
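The trace runs through ThreadedEntityProcessorWrapper and CachedSqlEntityProcessor, which points at DIH's threaded import path: it appears the JDBC ResultSet is being consumed after another thread has closed it. A commonly suggested workaround on Solr 3.x is to drop the `threads` attribute and fall back to a single-threaded import. The data-config.xml below is a hypothetical sketch - the table names, queries, and connection details are invented, not taken from the original post:

```xml
<!-- Hypothetical data-config.xml sketch; names are illustrative only. -->
<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb"
              user="solr" password="secret"/>
  <document>
    <!-- threads="4" here would trigger the multi-threaded import path that
         fails above; removing the attribute (single-threaded import) is the
         usual workaround until the threading bug is fixed. -->
    <entity name="item" query="SELECT id, name FROM item">
      <entity name="detail"
              processor="SqlEntityProcessor"
              query="SELECT detail FROM detail WHERE item_id='${item.id}'"/>
    </entity>
  </document>
</dataConfig>
```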
Re: fq=(field1:val1 AND field2:val2) VS fq=field1:val1&fq=field2:val2 and filterCache
On 12/1/2011 8:01 AM, Antoine LE FLOC'H wrote:

Is there any difference in the way things are stored in the filterCache if I do fq=(field1:val1 AND field2:val2) or fq=field1:val1&fq=field2:val2, even though these are logically identical? What gets stored exactly? Also, can you point me to where in the Solr source code this processing happens?

Your first example would result in one entry in filterCache, probably for +field1:val1 +field2:val2, which is what the parser ultimately reduces the query to. Your second example will result in two separate entries in filterCache. The second example takes more cache space, but it is also more reusable. If you started with a clean cache and sent fq=field2:val2&fq=field3:val3 immediately after sending your second example, one of the filter queries would be satisfied from the cache, so Solr would use fewer resources on the query as a whole. If you sent your first example and then fq=(field2:val2 AND field3:val3), there would be no speedup from the cache, because the new query wouldn't match the previous one at all.

Thanks,
Shawn
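Shawn's point can be illustrated with a toy cache in plain Java (this is not Solr code - the class and method names are invented for the sketch): each fq string is its own cache key, so separate clauses are individually reusable, while a combined clause only ever matches itself.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class FilterCacheToy {
    // Toy stand-in for Solr's filterCache: one entry per fq string.
    static Map<String, Set<Integer>> cache = new HashMap<>();
    static int misses = 0; // how many filters had to be computed from scratch

    static Set<Integer> filter(String fq) {
        return cache.computeIfAbsent(fq, k -> {
            misses++;
            return new HashSet<>(); // pretend to actually run the filter
        });
    }

    static void query(String... fqs) {
        for (String fq : fqs) filter(fq);
    }

    public static void main(String[] args) {
        // Combined fq: a single cache entry for the whole expression.
        query("field1:val1 AND field2:val2");   // 1 miss
        // Separate fqs: two entries, each reusable on its own.
        query("field1:val1", "field2:val2");    // 2 misses
        // A later query reuses field2:val2 from the cache.
        query("field2:val2", "field3:val3");    // 1 miss (only field3:val3)
        System.out.println(misses); // prints 4
    }
}
```

The combined-clause entry is never reused by the third query, which is exactly the trade-off Shawn describes: more cache space for the separate clauses, but better reuse.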
Re: Configuring the Distributed
Right now let's say you have one shard - everything there hashes to range X. Now you want to split that shard with an Index Splitter. You divide range X in two - giving you two ranges - then you start splitting. This is where the current Splitter needs a little modification. You decide which doc should go into which new index by rehashing each doc id in the index you are splitting - if its hash is greater than X/2, it goes into index1 - if it's less, index2. I think there are a couple of current Splitter impls, but one of them does something like: give me an id - now if the ids in the index are above that id, go to index1, if below, index2. We need to instead do a quick hash rather than a simple id compare. Why do you need to do this on every shard?

The other part we need that we don't have is to store hash range assignments in ZooKeeper - we don't do that yet because it's not needed yet. Instead we currently just calculate that on the fly (too often at the moment - on every request :) I intend to fix that of course). At the start, zk would say: for range X, go to this shard. After the split, it would say: for range less than X/2, go to the old node; for range greater than X/2, go to the new node.

- Mark
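The split decision Mark describes - rehash each doc id and compare it against the midpoint of the shard's old range - can be illustrated in a few lines of plain Java. This is a hypothetical sketch, not the contrib splitter's actual code: the hash function and range bounds are stand-ins.

```java
import java.util.ArrayList;
import java.util.List;

public class HashSplitSketch {
    // Stand-in hash: map a doc id into the positive 32-bit space.
    static int hash(String docId) {
        return docId.hashCode() & 0x7fffffff;
    }

    // The split decision: docs hashing into the upper half of the old
    // range go to index1, the rest to index2.
    static boolean upperHalf(String docId, int mid) {
        return hash(docId) >= mid;
    }

    public static void main(String[] args) {
        int mid = Integer.MAX_VALUE / 2; // midpoint of the old range [0, MAX)
        List<String> index1 = new ArrayList<>();
        List<String> index2 = new ArrayList<>();
        for (String docId : new String[] {"doc-1", "doc-2", "doc-3", "doc-4"}) {
            (upperHalf(docId, mid) ? index1 : index2).add(docId);
        }
        // Every doc lands in exactly one of the two new indexes.
        System.out.println(index1.size() + index2.size()); // prints 4
    }
}
```

The existing contrib splitter compares raw ids against a pivot id; swapping that comparison for a hash comparison like `upperHalf` is the modification being discussed.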
Re: Configuring the Distributed
Yes, the ZK method seems much more flexible. Adding a new shard would simply be updating the range assignments in ZK. Where is this currently on the list of things to accomplish? I don't have time to work on this now, but if you (or anyone) could provide direction I'd be willing to work on this when I have spare time. I guess a JIRA detailing where/how to do this could help. Not sure if the design has been thought out that far though.
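The range bookkeeping Mark describes could conceivably be stored in ZooKeeper as something like the JSON below. This layout is purely hypothetical - at the time of this thread the branch computes assignments on the fly and stores no such node - and it only illustrates the before/after of splitting a shard's range X into two halves:

```json
{
  "before_split": {
    "shard1": { "hashRange": [0, 2147483647] }
  },
  "after_split": {
    "shard1":   { "hashRange": [0, 1073741823] },
    "shard1_1": { "hashRange": [1073741824, 2147483647] }
  }
}
```

With assignments stored like this, "adding a new shard" becomes an update to the range map rather than a full repartition of every index.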
Re: Configuring the Distributed
Of course, resharding is almost never necessary if you use micro-shards. Micro-shards are shards small enough that you can fit 20 or more on a node. If you have that many on each node, then adding a new node consists of moving some shards to the new machine rather than moving lots of little documents. Much faster. As in thousands of times faster.
Re: Configuring the Distributed
In this case we are still talking about moving a whole index at a time rather than lots of little documents. You split the index into two, and then ship one of them off. The extra cost you can avoid with micro-sharding is the cost of splitting the index - which could be significant for a very large index. I have not done any tests though.

The cost of 20 micro-shards is that you will always have tons of segments unless you are very heavily merging - and even in the very unusual case of each micro-shard being optimized, you have essentially 20 segments. That's the best case - the normal case is likely in the hundreds. This can be a fairly significant % hit at search time. You also have the added complexity of managing 20 indexes per node in Solr code.

I think that both options have their +/-'s and eventually we could perhaps support both. To kick things off though, adding another partition should be a rare event if you plan carefully, and I think many will be able to handle the cost of splitting (you might even mark the replica you are splitting on so that it's not part of queries while it's 'busy' splitting).

- Mark
Re: Configuring the Distributed
Sorry - missed something - you also have the added cost of shipping the new half index to all of the replicas of the original shard with the splitting method. Unless you somehow split on every replica at the same time - then of course you wouldn't be able to avoid the 'busy' replica, and it would probably be fairly hard to juggle. On Dec 1, 2011, at 9:37 PM, Mark Miller wrote: In this case we are still talking about moving a whole index at a time rather than lots of little documents. You split the index into two, and then ship one of them off. The extra cost you can avoid with micro sharding will be the cost of splitting the index - which could be significant for a very large index. I have not done any tests though. The cost of 20 micro-shards is that you will always have tons of segments unless you are very heavily merging - and even in the very unusual case of each micro shard being optimized, you have essentially 20 segments. Thats best case - normal case is likely in the hundreds. This can be a fairly significant % hit at search time. You also have the added complexity of managing 20 indexes per node in solr code. I think that both options have there +/-'s and eventually we could perhaps support both. To kick things off though, adding another partition should be a rare event if you plan carefully, and I think many will be able to handle the cost of splitting (you might even mark the replica you are splitting on so that it's not part of queries while its 'busy' splitting). - Mark On Dec 1, 2011, at 9:17 PM, Ted Dunning wrote: Of course, resharding is almost never necessary if you use micro-shards. Micro-shards are shards small enough that you can fit 20 or more on a node. If you have that many on each node, then adding a new node consists of moving some shards to the new machine rather than moving lots of little documents. Much faster. As in thousands of times faster. 
On Thu, Dec 1, 2011 at 5:51 PM, Jamie Johnson jej2...@gmail.com wrote: Yes, the ZK method seems much more flexible. Adding a new shard would be simply updating the range assignments in ZK. Where is this currently on the list of things to accomplish? I don't have time to work on this now, but if you (or anyone) could provide direction I'd be willing to work on this when I had spare time. I guess a JIRA detailing where/how to do this could help. Not sure if the design has been thought out that far though. On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller markrmil...@gmail.com wrote: Right now let's say you have one shard - everything there hashes to range X. Now you want to split that shard with an Index Splitter. You divide range X in two - giving you two ranges - then you start splitting. This is where the current Splitter needs a little modification. You decide which doc should go into which new index by rehashing each doc id in the index you are splitting - if its hash is greater than X/2, it goes into index1 - if it's less, index2. I think there are a couple current Splitter impls, but one of them does something like: give me an id - now if the ids in the index are above that id, go to index1; if below, index2. We need to instead do a quick hash rather than a simple id compare. Why do you need to do this on every shard? The other part we need that we don't have is to store hash range assignments in ZooKeeper - we don't do that yet because it's not needed yet. Instead we currently just calculate that on the fly (too often at the moment - on every request :) I intend to fix that of course). At the start, ZK would say: for range X, go to this shard. After the split, it would say: for range less than X/2 go to the old node, for range greater than X/2 go to the new node. - Mark On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote: hmmm. This doesn't sound like the hashing algorithm that's on the branch, right? 
The algorithm you're mentioning sounds like there is some logic which is able to tell that a particular range should be distributed between 2 shards instead of 1. So it seems like a trade-off between repartitioning the entire index (on every shard) and having a custom hashing algorithm which is able to handle the situation where 2 or more shards map to a particular range. On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller markrmil...@gmail.com wrote: On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote: I am not familiar with the index splitter that is in contrib, but I'll take a look at it soon. So the process sounds like it would be to run this on all of the current shards' indexes based on the hash algorithm. Not something I've thought deeply about myself yet, but I think the idea would be to split as many as you felt you needed to. If you wanted to keep the full balance always, this would mean splitting every shard at once, yes. But this depends on how many boxes (partitions) you are willing/able to add at a time.
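Mark's split rule - rehash each doc id and compare it against the midpoint of the shard's hash range - can be sketched in a few lines. This is an illustrative model only: the hash function (CRC32 here) and the range bounds are assumptions for the sketch, not what the SolrCloud branch actually uses.

```java
import java.util.zip.CRC32;

public class HashSplit {
    // Hash a doc id into an unsigned 32-bit value (hypothetical choice of hash).
    static long hash(String docId) {
        CRC32 crc = new CRC32();
        crc.update(docId.getBytes());
        return crc.getValue(); // 0 .. 2^32 - 1
    }

    // Decide which of the two split indexes a doc belongs to after splitting
    // the shard's range [lo, hi] at its midpoint: 0 for the lower half,
    // 1 for the upper half.
    static int targetIndex(String docId, long lo, long hi) {
        long mid = lo + (hi - lo) / 2;
        return hash(docId) <= mid ? 0 : 1;
    }

    public static void main(String[] args) {
        long lo = 0, hi = (1L << 32) - 1;
        for (String id : new String[] { "doc1", "doc2", "doc3" }) {
            System.out.println(id + " -> index" + targetIndex(id, lo, hi));
        }
    }
}
```

The point is that the split decision depends only on the doc id's hash, so replaying it over an existing index always produces the same partitioning, unlike a simple "ids above/below this id" compare.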
Re: Configuring the Distributed
So I couldn't resist, I attempted to do this tonight. I used the solrconfig you mentioned (as is, no modifications), I set up a 2 shard cluster in collection1, I sent 1 doc to 1 of the shards, updated it and sent the update to the other. I don't see the modifications though; I only see the original document. The following is the test:

public void update() throws Exception {
    String key = "1";
    SolrInputDocument solrDoc = new SolrInputDocument();
    solrDoc.setField("key", key);
    solrDoc.addField("content", "initial value");
    SolrServer server = servers.get("http://localhost:8983/solr/collection1");
    server.add(solrDoc);
    server.commit();
    solrDoc = new SolrInputDocument();
    solrDoc.addField("key", key);
    solrDoc.addField("content", "updated value");
    server = servers.get("http://localhost:7574/solr/collection1");
    UpdateRequest ureq = new UpdateRequest();
    ureq.setParam("update.chain", "distrib-update-chain");
    ureq.add(solrDoc);
    ureq.setParam("shards", "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
    ureq.setParam("self", "foo");
    ureq.setAction(ACTION.COMMIT, true, true);
    server.request(ureq);
    System.out.println("done");
}

key is my unique field in schema.xml. What am I doing wrong? On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson jej2...@gmail.com wrote: Yes, the ZK method seems much more flexible. Adding a new shard would be simply updating the range assignments in ZK. Where is this currently on the list of things to accomplish? I don't have time to work on this now, but if you (or anyone) could provide direction I'd be willing to work on this when I had spare time. I guess a JIRA detailing where/how to do this could help. Not sure if the design has been thought out that far though. On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller markrmil...@gmail.com wrote: Right now lets say you have one shard - everything there hashes to range X. Now you want to split that shard with an Index Splitter. You divide range X in two - giving you two ranges - then you start splitting. 
This is where the current Splitter needs a little modification. You decide which doc should go into which new index by rehashing each doc id in the index you are splitting - if its hash is greater than X/2, it goes into index1 - if its less, index2. I think there are a couple current Splitter impls, but one of them does something like, give me an id - now if the id's in the index are above that id, goto index1, if below, index2. We need to instead do a quick hash rather than simple id compare. Why do you need to do this on every shard? The other part we need that we dont have is to store hash range assignments in zookeeper - we don't do that yet because it's not needed yet. Instead we currently just simply calculate that on the fly (too often at the moment - on every request :) I intend to fix that of course). At the start, zk would say, for range X, goto this shard. After the split, it would say, for range less than X/2 goto the old node, for range greater than X/2 goto the new node. - Mark On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote: hmmm.This doesn't sound like the hashing algorithm that's on the branch, right? The algorithm you're mentioning sounds like there is some logic which is able to tell that a particular range should be distributed between 2 shards instead of 1. So seems like a trade off between repartitioning the entire index (on every shard) and having a custom hashing algorithm which is able to handle the situation where 2 or more shards map to a particular range. On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller markrmil...@gmail.com wrote: On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote: I am not familiar with the index splitter that is in contrib, but I'll take a look at it soon. So the process sounds like it would be to run this on all of the current shards indexes based on the hash algorithm. Not something I've thought deeply about myself yet, but I think the idea would be to split as many as you felt you needed to. 
If you wanted to keep the full balance always, this would mean splitting every shard at once, yes. But this depends on how many boxes (partitions) you are willing/able to add at a time. You might just split one index to start - now its hash range would be handled by two shards instead of one (if you have 3 replicas per shard, this would mean adding 3 more boxes). When you needed to expand again, you would split another index that was still handling its full starting range. As you grow, once you split every original index, you'd start again, splitting one of the now half ranges.
Re: Configuring the Distributed
It's not full of details yet, but there is a JIRA issue here: https://issues.apache.org/jira/browse/SOLR-2595 On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson jej2...@gmail.com wrote: Yes, the ZK method seems much more flexible. Adding a new shard would be simply updating the range assignments in ZK. Where is this currently on the list of things to accomplish? I don't have time to work on this now, but if you (or anyone) could provide direction I'd be willing to work on this when I had spare time. I guess a JIRA detailing where/how to do this could help. Not sure if the design has been thought out that far though. On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller markrmil...@gmail.com wrote: Right now lets say you have one shard - everything there hashes to range X. Now you want to split that shard with an Index Splitter. You divide range X in two - giving you two ranges - then you start splitting. This is where the current Splitter needs a little modification. You decide which doc should go into which new index by rehashing each doc id in the index you are splitting - if its hash is greater than X/2, it goes into index1 - if its less, index2. I think there are a couple current Splitter impls, but one of them does something like, give me an id - now if the id's in the index are above that id, goto index1, if below, index2. We need to instead do a quick hash rather than simple id compare. Why do you need to do this on every shard? The other part we need that we dont have is to store hash range assignments in zookeeper - we don't do that yet because it's not needed yet. Instead we currently just simply calculate that on the fly (too often at the moment - on every request :) I intend to fix that of course). At the start, zk would say, for range X, goto this shard. After the split, it would say, for range less than X/2 goto the old node, for range greater than X/2 goto the new node. 
- Mark On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote: hmmm.This doesn't sound like the hashing algorithm that's on the branch, right? The algorithm you're mentioning sounds like there is some logic which is able to tell that a particular range should be distributed between 2 shards instead of 1. So seems like a trade off between repartitioning the entire index (on every shard) and having a custom hashing algorithm which is able to handle the situation where 2 or more shards map to a particular range. On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller markrmil...@gmail.com wrote: On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote: I am not familiar with the index splitter that is in contrib, but I'll take a look at it soon. So the process sounds like it would be to run this on all of the current shards indexes based on the hash algorithm. Not something I've thought deeply about myself yet, but I think the idea would be to split as many as you felt you needed to. If you wanted to keep the full balance always, this would mean splitting every shard at once, yes. But this depends on how many boxes (partitions) you are willing/able to add at a time. You might just split one index to start - now it's hash range would be handled by two shards instead of one (if you have 3 replicas per shard, this would mean adding 3 more boxes). When you needed to expand again, you would split another index that was still handling its full starting range. As you grow, once you split every original index, you'd start again, splitting one of the now half ranges. Is there also an index merger in contrib which could be used to merge indexes? I'm assuming this would be the process? You can merge with IndexWriter.addIndexes (Solr also has an admin command that can do this). But I'm not sure where this fits in? - Mark On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller markrmil...@gmail.com wrote: Not yet - we don't plan on working on this until a lot of other stuff is working solid at this point. 
But someone else could jump in! There are a couple ways to go about it that I know of: A more long term solution may be to start using micro shards - each index starts as multiple indexes. This makes it pretty fast to move micro shards around as you decide to change partitions. It's also less flexible as you are limited by the number of micro shards you start with. A more simple and likely first step is to use an index splitter. We already have one in Lucene contrib - we would just need to modify it so that it splits based on the hash of the document id. This is super flexible, but splitting will obviously take a little while on a huge index. The current index splitter is a multi-pass splitter - good enough to start with, but with most files under codec control these days, we may be able to make a single-pass splitter soon as well. Eventually you could imagine using both options - micro shards that could also
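The range bookkeeping Mark describes - ZooKeeper mapping each hash range to a shard, with a split replacing one range by two - amounts to an ordered map keyed by each range's lower bound. A minimal sketch, assuming illustrative shard names and a toy 0-99 hash space (not the real ZK layout):

```java
import java.util.TreeMap;

public class RangeAssignments {
    // Maps the lower bound of each hash range to the shard that owns it.
    private final TreeMap<Long, String> ranges = new TreeMap<>();

    // Record that the range starting at lowerBound belongs to shard.
    void assign(long lowerBound, String shard) { ranges.put(lowerBound, shard); }

    // Route a hash to the shard owning the range that contains it.
    String shardFor(long hash) { return ranges.floorEntry(hash).getValue(); }

    // Split the range [lowerBound, upperBound]: the upper half moves to newShard.
    void split(long lowerBound, long upperBound, String newShard) {
        long mid = lowerBound + (upperBound - lowerBound) / 2;
        ranges.put(mid + 1, newShard);
    }

    public static void main(String[] args) {
        RangeAssignments zk = new RangeAssignments();
        zk.assign(0, "shard1");                  // shard1 owns [0, 99]
        System.out.println(zk.shardFor(75L));    // shard1
        zk.split(0, 99, "shard2");               // shard2 now owns [50, 99]
        System.out.println(zk.shardFor(75L));    // shard2
        System.out.println(zk.shardFor(25L));    // shard1
    }
}
```

`TreeMap.floorEntry` finds the owning range in O(log n), which is the lookup a request router would do per update instead of recalculating the range-to-shard mapping on every request.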
Re: Configuring the Distributed
Hmm...sorry bout that - so my first guess is that right now we are not distributing a commit (easy to add, just have not done it). Right now I explicitly commit on each server for tests. Can you try explicitly committing on server1 after updating the doc on server 2? I can start distributing commits tomorrow - been meaning to do it for my own convenience anyhow. Also, you want to pass the sys property numShards=1 on startup. I think it defaults to 3. That will give you one leader and one replica. - Mark On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote: So I couldn't resist, I attempted to do this tonight, I used the solrconfig you mentioned (as is, no modifications), I setup a 2 shard cluster in collection1, I sent 1 doc to 1 of the shards, updated it and sent the update to the other. I don't see the modifications though I only see the original document. The following is the test public void update() throws Exception { String key = 1; SolrInputDocument solrDoc = new SolrInputDocument(); solrDoc.setField(key, key); solrDoc.addField(content, initial value); SolrServer server = servers .get(http://localhost:8983/solr/collection1;); server.add(solrDoc); server.commit(); solrDoc = new SolrInputDocument(); solrDoc.addField(key, key); solrDoc.addField(content, updated value); server = servers.get(http://localhost:7574/solr/collection1;); UpdateRequest ureq = new UpdateRequest(); ureq.setParam(update.chain, distrib-update-chain); ureq.add(solrDoc); ureq.setParam(shards, localhost:8983/solr/collection1,localhost:7574/solr/collection1); ureq.setParam(self, foo); ureq.setAction(ACTION.COMMIT, true, true); server.request(ureq); System.out.println(done); } key is my unique field in schema.xml What am I doing wrong? On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson jej2...@gmail.com wrote: Yes, the ZK method seems much more flexible. Adding a new shard would be simply updating the range assignments in ZK. Where is this currently on the list of things to accomplish? 
I don't have time to work on this now, but if you (or anyone) could provide direction I'd be willing to work on this when I had spare time. I guess a JIRA detailing where/how to do this could help. Not sure if the design has been thought out that far though. On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller markrmil...@gmail.com wrote: Right now lets say you have one shard - everything there hashes to range X. Now you want to split that shard with an Index Splitter. You divide range X in two - giving you two ranges - then you start splitting. This is where the current Splitter needs a little modification. You decide which doc should go into which new index by rehashing each doc id in the index you are splitting - if its hash is greater than X/2, it goes into index1 - if its less, index2. I think there are a couple current Splitter impls, but one of them does something like, give me an id - now if the id's in the index are above that id, goto index1, if below, index2. We need to instead do a quick hash rather than simple id compare. Why do you need to do this on every shard? The other part we need that we dont have is to store hash range assignments in zookeeper - we don't do that yet because it's not needed yet. Instead we currently just simply calculate that on the fly (too often at the moment - on every request :) I intend to fix that of course). At the start, zk would say, for range X, goto this shard. After the split, it would say, for range less than X/2 goto the old node, for range greater than X/2 goto the new node. - Mark On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote: hmmm.This doesn't sound like the hashing algorithm that's on the branch, right? The algorithm you're mentioning sounds like there is some logic which is able to tell that a particular range should be distributed between 2 shards instead of 1. 
So seems like a trade off between repartitioning the entire index (on every shard) and having a custom hashing algorithm which is able to handle the situation where 2 or more shards map to a particular range. On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller markrmil...@gmail.com wrote: On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote: I am not familiar with the index splitter that is in contrib, but I'll take a look at it soon. So the process sounds like it would be to run this on all of the current shards indexes based on the hash algorithm. Not something I've thought deeply about myself yet, but I think the idea would be to split as many as you felt you needed to. If you
Re: Configuring the Distributed
Thanks for the quick response. With that change (have not done numShards yet) shard1 got updated. But now when executing the following queries I get information back from both, which doesn't seem right:

http://localhost:7574/solr/select/?q=*:*
<doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>

http://localhost:8983/solr/select?q=*:*
<doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>

On Thu, Dec 1, 2011 at 10:21 PM, Mark Miller markrmil...@gmail.com wrote: Hmm...sorry bout that - so my first guess is that right now we are not distributing a commit (easy to add, just have not done it). Right now I explicitly commit on each server for tests. Can you try explicitly committing on server1 after updating the doc on server 2? I can start distributing commits tomorrow - been meaning to do it for my own convenience anyhow. Also, you want to pass the sys property numShards=1 on startup. I think it defaults to 3. That will give you one leader and one replica. - Mark On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote: So I couldn't resist, I attempted to do this tonight, I used the solrconfig you mentioned (as is, no modifications), I setup a 2 shard cluster in collection1, I sent 1 doc to 1 of the shards, updated it and sent the update to the other. I don't see the modifications though I only see the original document. 
The following is the test public void update() throws Exception { String key = 1; SolrInputDocument solrDoc = new SolrInputDocument(); solrDoc.setField(key, key); solrDoc.addField(content, initial value); SolrServer server = servers .get(http://localhost:8983/solr/collection1;); server.add(solrDoc); server.commit(); solrDoc = new SolrInputDocument(); solrDoc.addField(key, key); solrDoc.addField(content, updated value); server = servers.get(http://localhost:7574/solr/collection1;); UpdateRequest ureq = new UpdateRequest(); ureq.setParam(update.chain, distrib-update-chain); ureq.add(solrDoc); ureq.setParam(shards, localhost:8983/solr/collection1,localhost:7574/solr/collection1); ureq.setParam(self, foo); ureq.setAction(ACTION.COMMIT, true, true); server.request(ureq); System.out.println(done); } key is my unique field in schema.xml What am I doing wrong? On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson jej2...@gmail.com wrote: Yes, the ZK method seems much more flexible. Adding a new shard would be simply updating the range assignments in ZK. Where is this currently on the list of things to accomplish? I don't have time to work on this now, but if you (or anyone) could provide direction I'd be willing to work on this when I had spare time. I guess a JIRA detailing where/how to do this could help. Not sure if the design has been thought out that far though. On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller markrmil...@gmail.com wrote: Right now lets say you have one shard - everything there hashes to range X. Now you want to split that shard with an Index Splitter. You divide range X in two - giving you two ranges - then you start splitting. This is where the current Splitter needs a little modification. You decide which doc should go into which new index by rehashing each doc id in the index you are splitting - if its hash is greater than X/2, it goes into index1 - if its less, index2. 
I think there are a couple current Splitter impls, but one of them does something like, give me an id - now if the id's in the index are above that id, goto index1, if below, index2. We need to instead do a quick hash rather than simple id compare. Why do you need to do this on every shard? The other part we need that we dont have is to store hash range assignments in zookeeper - we don't do that yet because it's not needed yet. Instead we currently just simply calculate that on the fly (too often at the moment - on every request :) I intend to fix that of course). At the start, zk would say, for range X, goto this shard. After the split, it would say, for range less than X/2 goto the old node, for range greater than X/2 goto the new node. - Mark On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote: hmmm.This doesn't sound like the hashing algorithm that's on the branch, right? The algorithm you're mentioning sounds like there is some logic which is able to tell that a particular range should be distributed between 2 shards instead of 1. So seems like a trade off between repartitioning the entire index (on every shard) and having a custom hashing algorithm which is able to handle the situation where 2 or more shards map to a particular range. On
Re: Configuring the Distributed
Not sure offhand - but things will be funky if you don't specify the correct numShards. The instance to shard assignment should be using numShards to assign. But then the hash to shard mapping actually goes on the number of shards it finds registered in ZK (it doesn't have to, but really these should be equal). So basically you are saying, I want 3 partitions, but you are only starting up 2 nodes, and the code is just not happy about that I'd guess. For the system to work properly, you have to fire up at least as many servers as numShards. What are you trying to do? 2 partitions with no replicas, or one partition with one replica? In either case, I think you will have better luck if you fire up at least as many servers as the numShards setting. Or lower the numShards setting. This is all a work in progress by the way - what you are trying to test should work if things are setup right though. - Mark On Dec 1, 2011, at 10:40 PM, Jamie Johnson wrote: Thanks for the quick response. With that change (have not done numShards yet) shard1 got updated. But now when executing the following queries I get information back from both, which doesn't seem right http://localhost:7574/solr/select/?q=*:* docstr name=key1/strstr name=content_mvtxtupdated value/str/doc http://localhost:8983/solr/select?q=*:* docstr name=key1/strstr name=content_mvtxtupdated value/str/doc On Thu, Dec 1, 2011 at 10:21 PM, Mark Miller markrmil...@gmail.com wrote: Hmm...sorry bout that - so my first guess is that right now we are not distributing a commit (easy to add, just have not done it). Right now I explicitly commit on each server for tests. Can you try explicitly committing on server1 after updating the doc on server 2? I can start distributing commits tomorrow - been meaning to do it for my own convenience anyhow. Also, you want to pass the sys property numShards=1 on startup. I think it defaults to 3. That will give you one leader and one replica. 
- Mark On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote: So I couldn't resist, I attempted to do this tonight, I used the solrconfig you mentioned (as is, no modifications), I setup a 2 shard cluster in collection1, I sent 1 doc to 1 of the shards, updated it and sent the update to the other. I don't see the modifications though I only see the original document. The following is the test public void update() throws Exception { String key = 1; SolrInputDocument solrDoc = new SolrInputDocument(); solrDoc.setField(key, key); solrDoc.addField(content, initial value); SolrServer server = servers .get(http://localhost:8983/solr/collection1;); server.add(solrDoc); server.commit(); solrDoc = new SolrInputDocument(); solrDoc.addField(key, key); solrDoc.addField(content, updated value); server = servers.get(http://localhost:7574/solr/collection1;); UpdateRequest ureq = new UpdateRequest(); ureq.setParam(update.chain, distrib-update-chain); ureq.add(solrDoc); ureq.setParam(shards, localhost:8983/solr/collection1,localhost:7574/solr/collection1); ureq.setParam(self, foo); ureq.setAction(ACTION.COMMIT, true, true); server.request(ureq); System.out.println(done); } key is my unique field in schema.xml What am I doing wrong? On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson jej2...@gmail.com wrote: Yes, the ZK method seems much more flexible. Adding a new shard would be simply updating the range assignments in ZK. Where is this currently on the list of things to accomplish? I don't have time to work on this now, but if you (or anyone) could provide direction I'd be willing to work on this when I had spare time. I guess a JIRA detailing where/how to do this could help. Not sure if the design has been thought out that far though. On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller markrmil...@gmail.com wrote: Right now lets say you have one shard - everything there hashes to range X. Now you want to split that shard with an Index Splitter. 
You divide range X in two - giving you two ranges - then you start splitting. This is where the current Splitter needs a little modification. You decide which doc should go into which new index by rehashing each doc id in the index you are splitting - if its hash is greater than X/2, it goes into index1 - if its less, index2. I think there are a couple current Splitter impls, but one of them does something like, give me an id - now if the id's in the index are above that id, goto index1, if below, index2. We need to instead do a quick hash rather than simple id compare. Why do you need to
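Mark's point about numShards versus the number of running servers can be made concrete with a toy model. The round-robin assignment below (node i takes shard i % numShards) is an assumption for illustration, not SolrCloud's actual assignment code; it just shows that with numShards=3 and only 2 nodes, one partition ends up with no server at all.

```java
import java.util.HashSet;
import java.util.Set;

public class ShardCoverage {
    // Return the shards that have no server after numServers nodes register,
    // assuming each arriving node simply takes shard (i % numShards).
    static Set<Integer> uncoveredShards(int numServers, int numShards) {
        Set<Integer> covered = new HashSet<>();
        for (int i = 0; i < numServers; i++) {
            covered.add(i % numShards); // node i becomes leader or replica of this shard
        }
        Set<Integer> uncovered = new HashSet<>();
        for (int s = 0; s < numShards; s++) {
            if (!covered.contains(s)) uncovered.add(s);
        }
        return uncovered;
    }

    public static void main(String[] args) {
        // 2 nodes with numShards=3: shard 2 has no server, so part of the
        // hash space is simply unserved.
        System.out.println(uncoveredShards(2, 3)); // [2]
        // 4 nodes with numShards=3: all shards covered, the 4th node is a replica.
        System.out.println(uncoveredShards(4, 3)); // []
    }
}
```

This is why firing up at least as many servers as numShards (or lowering numShards) makes the behavior sane: every hash range then has at least one server.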
Re: Configuring the Distributed
Getting late - didn't really pay attention to your code I guess - why are you adding the first doc without specifying the distrib update chain? This is not really supported. It's going to just go to the server you specified - even with everything setup right, the update might then go to that same server or the other one depending on how it hashes. You really want to just always use the distrib update chain. I guess I don't yet understand what you are trying to test. Sent from my iPad On Dec 1, 2011, at 10:57 PM, Mark Miller markrmil...@gmail.com wrote: Not sure offhand - but things will be funky if you don't specify the correct numShards. The instance to shard assignment should be using numShards to assign. But then the hash to shard mapping actually goes on the number of shards it finds registered in ZK (it doesn't have to, but really these should be equal). So basically you are saying, I want 3 partitions, but you are only starting up 2 nodes, and the code is just not happy about that I'd guess. For the system to work properly, you have to fire up at least as many servers as numShards. What are you trying to do? 2 partitions with no replicas, or one partition with one replica? In either case, I think you will have better luck if you fire up at least as many servers as the numShards setting. Or lower the numShards setting. This is all a work in progress by the way - what you are trying to test should work if things are setup right though. - Mark On Dec 1, 2011, at 10:40 PM, Jamie Johnson wrote: Thanks for the quick response. With that change (have not done numShards yet) shard1 got updated. 
But now when executing the following queries I get information back from both, which doesn't seem right http://localhost:7574/solr/select/?q=*:* docstr name=key1/strstr name=content_mvtxtupdated value/str/doc http://localhost:8983/solr/select?q=*:* docstr name=key1/strstr name=content_mvtxtupdated value/str/doc On Thu, Dec 1, 2011 at 10:21 PM, Mark Miller markrmil...@gmail.com wrote: Hmm...sorry bout that - so my first guess is that right now we are not distributing a commit (easy to add, just have not done it). Right now I explicitly commit on each server for tests. Can you try explicitly committing on server1 after updating the doc on server 2? I can start distributing commits tomorrow - been meaning to do it for my own convenience anyhow. Also, you want to pass the sys property numShards=1 on startup. I think it defaults to 3. That will give you one leader and one replica. - Mark On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote: So I couldn't resist, I attempted to do this tonight, I used the solrconfig you mentioned (as is, no modifications), I setup a 2 shard cluster in collection1, I sent 1 doc to 1 of the shards, updated it and sent the update to the other. I don't see the modifications though I only see the original document. 
The following is the test public void update() throws Exception { String key = 1; SolrInputDocument solrDoc = new SolrInputDocument(); solrDoc.setField(key, key); solrDoc.addField(content, initial value); SolrServer server = servers .get(http://localhost:8983/solr/collection1;); server.add(solrDoc); server.commit(); solrDoc = new SolrInputDocument(); solrDoc.addField(key, key); solrDoc.addField(content, updated value); server = servers.get(http://localhost:7574/solr/collection1;); UpdateRequest ureq = new UpdateRequest(); ureq.setParam(update.chain, distrib-update-chain); ureq.add(solrDoc); ureq.setParam(shards, localhost:8983/solr/collection1,localhost:7574/solr/collection1); ureq.setParam(self, foo); ureq.setAction(ACTION.COMMIT, true, true); server.request(ureq); System.out.println(done); } key is my unique field in schema.xml What am I doing wrong? On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson jej2...@gmail.com wrote: Yes, the ZK method seems much more flexible. Adding a new shard would be simply updating the range assignments in ZK. Where is this currently on the list of things to accomplish? I don't have time to work on this now, but if you (or anyone) could provide direction I'd be willing to work on this when I had spare time. I guess a JIRA detailing where/how to do this could help. Not sure if the design has been thought out that far though. On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller markrmil...@gmail.com wrote: Right now lets say you have one shard - everything there hashes to range X. Now you want to split that shard with an Index Splitter. You divide range X in two - giving you
Possible to facet across two indices, or document types in single index?
Hello: I'm trying to relate together two different types of documents. Currently I have 'node' documents that reside in one index (core), and 'product mapping' documents that are in another index. The product mapping index is used to map tenant products to nodes. The nodes are canonical content that gets updated every quarter, whereas the product mappings can change at any time. I put them in two indexes because (1) canonical content changes rarely, and I don't want product mapping changes to affect it (commit, re-open searchers etc.), and (2) I would like to support multiple tenants mapping products to the same canonical content to avoid duplication (a few GB). This arrangement has worked well thus far, but only in the sense that for each node result returned, I can query the product mapping index to determine the products mapped to the node. I combine this information within my application and return it to the client. This works okay in that there are only 5-20 results returned per page (start, rows). But now I'm being asked to facet the product categories (a multi-valued field within a product mapping document) along with other facets defined in the canonical content. Can this be done with Solr 3.5.0? I've been looking into sub-queries, function queries etc. Also, I've seen various postings indicating that one needs to denormalize more. I don't want to add product information as fields to the canonical content. Not only does that defeat my objective (1) above, but Solr does not support incremental updates of document fields. So, one approach is to issue my query to the canonical index and get all of the document IDs (could be 1000s), and then issue a filter query to the product mapping index with all of these IDs and have Solr facet the product categories. Is that efficient? I suppose I could use HTTP POST (via SolrJ) to convey that payload of IDs? 
I could then take the facet results of that query and combine them with the canonical index results and return them to the client. That may be doable, but then let's say the user clicks on a product category facet value to narrow the node results to only those mapped to category XYZ. This will not affect the query issued against the canonical content index. Instead, I think I'd have to go through the canonical results and eliminate the nodes that are not associated with product category XYZ. Then, if the current page of results is inadequate (rows=10, but 3 nodes were eliminated), I'd have to go back to the canonical index to get more rows, eliminate some again perhaps, get more etc. That sounds unappealing and low performing. Is there a Solr way to do this? My Packt Apache Solr 3 Enterprise Search Server book (page 34) states regarding separate indices: "If you do develop separate schemas and if you need to search across your indices in one search then you must perform a distributed search, described in the last chapter. A distributed search is usually a feature employed for a large corpus but it applies here too." But in that chapter it goes on to talk about dealing with sharding, replication etc. to support a large corpus, not necessarily tying together two different indexes. Is it possible to accomplish my goal in a less ugly way than I outlined above? Since we only have a single tenant to worry about, I could use a combined index at least for a few months (separate fields per document type, IDs are unique among them all) if that makes a difference. Thanks! Jeff -- Jeff Schmidt 535 Consulting j...@535consulting.com http://www.535consulting.com (650) 423-1068
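The two-step approach described above (query the canonical index for IDs, then filter the product mapping index by those IDs and facet there) can be sketched as plain string-building. This is only a sketch: the field name "node_id" is a hypothetical field on the product-mapping documents, and with thousands of IDs the resulting filter would be sent as a POST body (e.g. via SolrJ) rather than in the URL.

```java
import java.util.List;
import java.util.StringJoiner;

public class CrossIndexFilter {
    // Build a filter query restricting the product-mapping index to the
    // node IDs returned by the canonical-content query, producing e.g.
    // node_id:(n1 OR n2 OR n3). "node_id" is a hypothetical field name.
    static String buildNodeFilter(List<String> nodeIds) {
        StringJoiner fq = new StringJoiner(" OR ", "node_id:(", ")");
        for (String id : nodeIds) {
            fq.add(id);
        }
        return fq.toString();
    }

    public static void main(String[] args) {
        // This string would become the fq parameter of a facet request
        // (facet=true, facet.field=product_category) against the mapping index.
        System.out.println(buildNodeFilter(List.of("n1", "n2", "n3")));
    }
}
```

Whether this performs acceptably depends heavily on how many IDs the canonical query returns; the narrowing-by-facet-click problem described above is not solved by this sketch.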
XPathEntityProcessor, Fields without Content, and Null-backup
Hello Solr and Solr-Users, I can't confidently say I completely understand all that these classes so boldly tackle (that is, XPathEntityProcessor and XPathRecordReader), but there may be someone who does. Nonetheless, I think I've got some or most of this right, and more likely there are more someones like that. So, I won't qualify everything I say with a "maybe" -- let this disclaimer cover them all. Whenever mapping an XML file into a Solr index, within the XPathRecordReader (used by the XPathEntityProcessor within the DataImportHandler), if (A) a field is perceived to be null and is multivalued, it is pushed a value of null (on top of any other values it previously had). Otherwise (B), for multivalued fields, any found value is pushed onto its existing list of values, and the field is marked as found within the frame (a.k.a. record). In general, when the end-tag of a record is seen, (C) the XPathRecordReader clears all of the field values which have been marked as found, as tidiness is a value and they are supposedly no longer useful. However, suppose that for a given record and multivalued field, a value is never found (though it may have been found for other fields in the record): only (A) will have occurred, never will (B) have occurred, the field will never have been marked as found, and thus (C) never will have occurred for the field. So, the field will remain, with its list of nulls. This list of nulls will grow until either the last record or a non-null value is seen. And so, (1) an out-of-memory error may occur, given sufficiently many records and a mortal computer. Moreover, (2) a transformer cannot reliably depend on the number of nulls in the field (and this information cannot be guaranteed to be determined by some other value). I will try to provide more information, if this seems an issue and if there doesn't seem to be an answer. 
At this point, if I understand the problem correctly, it seems the answer is to 'mark' those null fields, considering 'null' an added value. Thanks, Michael Watts
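A simplified model of the accumulation described above (this is a toy illustration, not the actual XPathRecordReader code) shows how the null list grows once per record when a multivalued field is never found:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NullGrowth {
    // Toy model: per record, step (A) pushes a null onto a multivalued
    // field that was not found; (B) never marks it found, so (C) never
    // clears it at record end. Returns the resulting list size.
    static int accumulate(int records) {
        Map<String, List<Object>> fields = new HashMap<>();
        fields.put("neverFound", new ArrayList<>());
        for (int i = 0; i < records; i++) {
            fields.get("neverFound").add(null); // (A): null pushed each record
            // (B)/(C) never run for this field -- it is never marked as found
        }
        return fields.get("neverFound").size();
    }

    public static void main(String[] args) {
        // One null per record: with enough records this is the OOM risk (1)
        // and the unreliable null count (2) described in the post.
        System.out.println(accumulate(100_000)); // prints 100000
    }
}
```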
Re: Configuring the Distributed
Really just trying to do a simple add and update test; the chain missing is just proof of my not understanding exactly how this is supposed to work. I modified the code to this:

    String key = "1";
    SolrInputDocument solrDoc = new SolrInputDocument();
    solrDoc.setField("key", key);
    solrDoc.addField("content_mvtxt", "initial value");
    SolrServer server = servers.get("http://localhost:8983/solr/collection1");
    UpdateRequest ureq = new UpdateRequest();
    ureq.setParam("update.chain", "distrib-update-chain");
    ureq.add(solrDoc);
    ureq.setParam("shards", "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
    ureq.setParam("self", "foo");
    ureq.setAction(ACTION.COMMIT, true, true);
    server.request(ureq);
    server.commit();

    solrDoc = new SolrInputDocument();
    solrDoc.addField("key", key);
    solrDoc.addField("content_mvtxt", "updated value");
    server = servers.get("http://localhost:7574/solr/collection1");
    ureq = new UpdateRequest();
    ureq.setParam("update.chain", "distrib-update-chain");
    // ureq.deleteById("8060a9eb-9546-43ee-95bb-d18ea26a6285");
    ureq.add(solrDoc);
    ureq.setParam("shards", "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
    ureq.setParam("self", "foo");
    ureq.setAction(ACTION.COMMIT, true, true);
    server.request(ureq);
    // server.add(solrDoc);
    server.commit();

    server = servers.get("http://localhost:8983/solr/collection1");
    server.commit();
    System.out.println("done");

but I'm still seeing the doc appear on both shards. After the first commit I see the doc on 8983 with "initial value". After the second commit I see the updated value on 7574 and the old value on 8983. After the final commit the doc on 8983 gets updated. Is there something wrong with my test? On Thu, Dec 1, 2011 at 11:17 PM, Mark Miller markrmil...@gmail.com wrote: Getting late - didn't really pay attention to your code I guess - why are you adding the first doc without specifying the distrib update chain? This is not really supported. 
It's going to just go to the server you specified - even with everything set up right, the update might then go to that same server or the other one depending on how it hashes. You really want to just always use the distrib update chain. I guess I don't yet understand what you are trying to test. Sent from my iPad On Dec 1, 2011, at 10:57 PM, Mark Miller markrmil...@gmail.com wrote: Not sure offhand - but things will be funky if you don't specify the correct numShards. The instance to shard assignment should be using numShards to assign. But then the hash to shard mapping actually goes on the number of shards it finds registered in ZK (it doesn't have to, but really these should be equal). So basically you are saying, I want 3 partitions, but you are only starting up 2 nodes, and the code is just not happy about that I'd guess. For the system to work properly, you have to fire up at least as many servers as numShards. What are you trying to do? 2 partitions with no replicas, or one partition with one replica? In either case, I think you will have better luck if you fire up at least as many servers as the numShards setting. Or lower the numShards setting. This is all a work in progress by the way - what you are trying to test should work if things are set up right though. - Mark On Dec 1, 2011, at 10:40 PM, Jamie Johnson wrote: Thanks for the quick response. With that change (have not done numShards yet) shard1 got updated. But now when executing the following queries I get information back from both, which doesn't seem right:

    http://localhost:7574/solr/select/?q=*:*
        <doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>

    http://localhost:8983/solr/select?q=*:*
        <doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>

On Thu, Dec 1, 2011 at 10:21 PM, Mark Miller markrmil...@gmail.com wrote: Hmm...sorry bout that - so my first guess is that right now we are not distributing a commit (easy to add, just have not done it). 
Right now I explicitly commit on each server for tests. Can you try explicitly committing on server1 after updating the doc on server 2? I can start distributing commits tomorrow - been meaning to do it for my own convenience anyhow. Also, you want to pass the sys property numShards=1 on startup. I think it defaults to 3. That will give you one leader and one replica. - Mark On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote: So I
Re: Configuring the Distributed
Well, this goes both ways. It is not that unusual to take a node down for maintenance of some kind or even to have a node failure. In that case, it is very nice to have the load from the lost node be spread fairly evenly across the remaining cluster. Regarding the cost of having several micro-shards, they are also an opportunity for threading the search. Most sites don't have enough queries coming in to occupy all of the cores in modern machines so threading each query can actually be a substantial benefit in terms of query time. On Thu, Dec 1, 2011 at 6:37 PM, Mark Miller markrmil...@gmail.com wrote: To kick things off though, adding another partition should be a rare event if you plan carefully, and I think many will be able to handle the cost of splitting (you might even mark the replica you are splitting on so that it's not part of queries while its 'busy' splitting).
Re: Configuring the Distributed
With micro-shards, you can use random numbers for all placements with minor constraints like avoiding replicas sitting in the same rack. Since the number of shards never changes, things stay very simple. On Thu, Dec 1, 2011 at 6:44 PM, Mark Miller markrmil...@gmail.com wrote: Sorry - missed something - you also have the added cost of shipping the new half index to all of the replicas of the original shard with the splitting method. Unless you somehow split on every replica at the same time - then of course you wouldn't be able to avoid the 'busy' replica, and it would probably be fairly hard to juggle. On Dec 1, 2011, at 9:37 PM, Mark Miller wrote: In this case we are still talking about moving a whole index at a time rather than lots of little documents. You split the index into two, and then ship one of them off. The extra cost you can avoid with micro-sharding will be the cost of splitting the index - which could be significant for a very large index. I have not done any tests though. The cost of 20 micro-shards is that you will always have tons of segments unless you are very heavily merging - and even in the very unusual case of each micro-shard being optimized, you have essentially 20 segments. That's best case - normal case is likely in the hundreds. This can be a fairly significant % hit at search time. You also have the added complexity of managing 20 indexes per node in Solr code. I think that both options have their +/-'s and eventually we could perhaps support both. To kick things off though, adding another partition should be a rare event if you plan carefully, and I think many will be able to handle the cost of splitting (you might even mark the replica you are splitting on so that it's not part of queries while it's 'busy' splitting). - Mark On Dec 1, 2011, at 9:17 PM, Ted Dunning wrote: Of course, resharding is almost never necessary if you use micro-shards. Micro-shards are shards small enough that you can fit 20 or more on a node. 
If you have that many on each node, then adding a new node consists of moving some shards to the new machine rather than moving lots of little documents. Much faster. As in thousands of times faster. On Thu, Dec 1, 2011 at 5:51 PM, Jamie Johnson jej2...@gmail.com wrote: Yes, the ZK method seems much more flexible. Adding a new shard would be simply updating the range assignments in ZK. Where is this currently on the list of things to accomplish? I don't have time to work on this now, but if you (or anyone) could provide direction I'd be willing to work on this when I had spare time. I guess a JIRA detailing where/how to do this could help. Not sure if the design has been thought out that far though. On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller markrmil...@gmail.com wrote: Right now let's say you have one shard - everything there hashes to range X. Now you want to split that shard with an Index Splitter. You divide range X in two - giving you two ranges - then you start splitting. This is where the current Splitter needs a little modification. You decide which doc should go into which new index by rehashing each doc id in the index you are splitting - if its hash is greater than X/2, it goes into index1 - if it's less, index2. I think there are a couple current Splitter impls, but one of them does something like: give me an id - now if the ids in the index are above that id, go to index1, if below, index2. We need to instead do a quick hash rather than a simple id compare. Why do you need to do this on every shard? The other part we need that we don't have is to store hash range assignments in ZooKeeper - we don't do that yet because it's not needed yet. Instead we currently just simply calculate that on the fly (too often at the moment - on every request :) I intend to fix that of course). At the start, ZK would say, for range X, go to this shard. After the split, it would say, for range less than X/2 go to the old node, for range greater than X/2 go to the new node. 
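The rehash-based split Mark describes can be sketched in a few lines. This is only an illustration: the hash function and range size below are placeholders, not Solr's actual hashing, and "index1"/"index2" follow the naming in the message above.

```java
public class SplitByHash {
    // Placeholder hash range "X"; Solr's real range and hash differ.
    static final int RANGE = 1 << 16;

    // Placeholder hash of a doc id into [0, RANGE).
    static int hash(String docId) {
        return Math.floorMod(docId.hashCode(), RANGE);
    }

    // Rehash each doc id in the index being split: hashes in the upper
    // half of the range go to index1, the lower half to index2. Note this
    // compares hashes, not raw ids, as the message above calls for.
    static int targetIndex(String docId) {
        return hash(docId) >= RANGE / 2 ? 1 : 2;
    }

    public static void main(String[] args) {
        for (String id : new String[] {"doc-a", "doc-b", "doc-c"}) {
            System.out.println(id + " -> index" + targetIndex(id));
        }
    }
}
```

After the split, the ZK range assignments would route range [0, X/2) to one node and [X/2, X) to the other, matching the routing rule this method encodes.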
- Mark On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote: Hmmm. This doesn't sound like the hashing algorithm that's on the branch, right? The algorithm you're mentioning sounds like there is some logic which is able to tell that a particular range should be distributed between 2 shards instead of 1. So it seems like a trade-off between repartitioning the entire index (on every shard) and having a custom hashing algorithm which is able to handle the situation where 2 or more shards map to a particular range. On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller markrmil...@gmail.com wrote: On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote: I am not familiar with the index splitter that is in contrib, but I'll take a look at it soon. So the process sounds like it would be to run this on all of the current shards