Re: Tomcat creates a thread for each SOLR core
Hello again,

The current situation is: after setting the two options so that cores are not loaded on startup, and with ramBufferSizeMB=32, Tomcat is stable and responsive, and threads reach a maximum of 60. Browsing and storing are fast. I should note that I have many cores, each with a small number of documents. Unfortunately, the problem of new core creation taking 20 minutes still exists. The next step will be downgrading to Java 7u25. Any other suggestions will be highly appreciated. Thanks in advance. P.S. The previous Solr version from which I updated was 3.6.

Regards, Atanas Atanasov

On Thu, Apr 10, 2014 at 6:06 PM, Shawn Heisey s...@elyograg.org wrote:

On 4/10/2014 12:40 AM, Atanas Atanasov wrote: I need some help. After updating to Solr 4.4, the Tomcat process is consuming about 2 GB of memory and CPU usage is about 40% for roughly 10 minutes after startup. The bigger problem, however, is that I have about 1000 cores, and it seems that a thread is created for each core. The process has more than 1000 threads and everything is extremely slow. Creating or unloading a core, even one without documents, takes about 20 minutes. Searching is more or less good, but storing also takes a long time. Is there some configuration I missed or got wrong? There aren't many calls. I use 64-bit Tomcat 7, Solr 4.4, and the latest 64-bit Java. The machine has 24 GB of RAM and a 16-core CPU, and runs Windows Server 2008 R2. The index is updated every 30 seconds / 10,000 documents. I haven't checked the number of threads before the update, because I didn't have to; it was working just fine. Any suggestion will be highly appreciated. Thank you in advance.

If creating a core takes 20 minutes, that sounds to me like the JVM is doing constant full garbage collections to free up enough memory for basic system operation. It could also be explained by temporary worker threads having to wait to execute because the servlet container will not allow them to run.

When indexing is happening, each core will set aside some memory for buffering index updates. By default, the value of ramBufferSizeMB is 100. If all your cores are indexing at once, multiply the indexing buffer by 1000, and you'll require 100GB of heap memory. You'll need to greatly reduce that buffer size. This buffer was 32MB by default in 4.0 and earlier. If you are not setting this value, this change sounds like it might fully explain what you are seeing. https://issues.apache.org/jira/browse/SOLR-4074

What version did you upgrade from? Solr 4.x is a very different beast than earlier major versions. I believe there may have been some changes made to reduce memory usage in versions after 4.4.0.

The Jetty that comes with Solr is configured to allow 10,000 threads. Most people don't have that many, even on a temporary basis, but bad things happen when the servlet container will not allow Solr to start as many as it requires. I believe that the typical default maxThreads value you'll find in a servlet container config is 200.

Erick's right about a 6GB heap being very small for what you are trying to do. Putting 1000 cores on one machine is something I would never try. If it became a requirement I had to deal with, I wouldn't try it unless the machine had a lot more CPU cores, hundreds of gigabytes of RAM, and a lot of extremely fast disk space. If this worked before a Solr upgrade, I'm amazed. Congratulations to you for fine work!

NB: Oracle Java 7u25 is what you should be using. 7u40 through 7u51 have known bugs that affect Solr/Lucene. These should be fixed by 7u60.
A pre-release of that is available now, and it should be generally available in May 2014. Thanks, Shawn
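For reference, a minimal sketch of the solrconfig.xml change discussed above; the 32MB value matches what Atanas set, but the right number depends on heap size and core count:

    <!-- solrconfig.xml: per-core indexing buffer (the default is 100 in Solr 4.x) -->
    <indexConfig>
      <ramBufferSizeMB>32</ramBufferSizeMB>
    </indexConfig>

With ~1000 cores indexing concurrently, the worst-case heap demand is roughly cores x ramBufferSizeMB, which is why the buffer must shrink as the core count grows.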
[ANNOUNCE] Apache Solr 4.7.2 released.
April 2014, Apache Solr™ 4.7.2 available.

The Lucene PMC is pleased to announce the release of Apache Solr 4.7.2.

Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites.

Solr 4.7.2 is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr 4.7.2 includes 2 bug fixes, as well as Lucene 4.7.2 and its bug fixes. See the CHANGES.txt file included with the release for a full list of changes and further details. Please report any feedback to the mailing lists (http://lucene.apache.org/solr/discussion.html).

Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access.
Re: filter capabilities are limited?
The variables being compared are of string type. I cannot apply mathematical functions to them without first performing a type conversion (string to integer). Will I be able to build such a scheme without changing the filter?
Re: Solr join and lucene scoring
Thank you for the clarification. We really need scoring with Solr joins, but as you can see I'm not a specialist in Solr development. We would like to hire somebody with more experience to write a QParser plugin for scoring in joins and donate the source code to the community. Any suggestions where we could find somebody with the fitting experience?

Zitat von Mikhail Khludnev mkhlud...@griddynamics.com:

On Wed, Apr 9, 2014 at 1:33 PM, m...@preselect-media.com wrote: Hello Mikhail, thanks for the clarification. I'm a little bit confused by the answer of Alvaro, but my own tests didn't result in a proper score, so I think you're right and it's still not implemented. What do you mean by the impedance between Lucene and Solr?

It's an old story, and unfortunately obvious: using Lucene's code in Solr might not be straightforward. I haven't looked at this problem in particular; it's just a caveat.

Why isn't scoring in joins implemented in Solr anyway, when Lucene offers a solution for it?

As you can see, these are two separate implementations. It seems like the Solr guys just didn't care about scoring (and here I share their point). It's just an exercise for someone who needs it.

Best regards, Moritz

Zitat von Mikhail Khludnev mkhlud...@griddynamics.com:

On Thu, Apr 3, 2014 at 1:42 PM, m...@preselect-media.com wrote: Hello, referencing this issue: https://issues.apache.org/jira/browse/SOLR-4307 Is it still not possible to use scoring with the Solr query-time join?

It's still not implemented: https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/search/JoinQParserPlugin.java#L549

Do I still have to write my own plugin, or is there a plugin somewhere I could use? I never wrote a plugin for Solr before, so I would prefer not to start from scratch.

The right approach from my POV is to use Lucene's join https://github.com/apache/lucene-solr/blob/trunk/lucene/join/src/java/org/apache/lucene/search/join/JoinUtil.java in a new QParser, but solving the impedance between Lucene and Solr might be tricky.

THX, Moritz

-- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com

-- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
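To make Mikhail's suggestion concrete, here is a hedged, untested sketch of such a QParser plugin against the Solr/Lucene 4.x APIs. The class name and local-parameter names are made up for illustration; error handling and cross-core wiring are deliberately minimal:

    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.join.JoinUtil;
    import org.apache.lucene.search.join.ScoreMode;
    import org.apache.solr.common.params.SolrParams;
    import org.apache.solr.common.util.NamedList;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.search.QParser;
    import org.apache.solr.search.QParserPlugin;
    import org.apache.solr.search.SyntaxError;

    // Usage (hypothetical): q={!scorejoin from=parent_id to=id}some query
    public class ScoringJoinQParserPlugin extends QParserPlugin {

      @Override
      public void init(NamedList args) {}

      @Override
      public QParser createParser(String qstr, SolrParams localParams,
                                  SolrParams params, SolrQueryRequest req) {
        return new QParser(qstr, localParams, params, req) {
          @Override
          public Query parse() throws SyntaxError {
            String fromField = localParams.get("from");
            String toField = localParams.get("to");
            // Parse the inner query with the default parser.
            Query fromQuery = subQuery(qstr, null).getQuery();
            try {
              // ScoreMode.Max carries the best "from"-side score to each "to" doc;
              // this is the piece the stock JoinQParserPlugin does not do.
              return JoinUtil.createJoinQuery(fromField, true, toField,
                  fromQuery, req.getSearcher(), ScoreMode.Max);
            } catch (java.io.IOException e) {
              throw new RuntimeException(e);
            }
          }
        };
      }
    }

Registering it would look something like <queryParser name="scorejoin" class="com.example.ScoringJoinQParserPlugin"/> in solrconfig.xml.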
Autocomplete with Case-insensitive feature
Hi All, I have been trying out the autocomplete feature in Solr 4.7.1 using the Suggester. I have configured it to display phrase suggestions as well. The problem is: if I type game, I get suggestions of game or phrases containing game. But if I type Game, *no suggestion is displayed at all*. How can I get case-insensitive suggestions? I have defined the fields in schema.xml like this:

    <field name="name_autocomplete" type="text_auto" indexed="true" stored="true" multiValued="true"/>
    <copyField source="name" dest="name_autocomplete"/>

    <fieldType name="text_auto" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="4"
                outputUnigrams="true" outputUnigramsIfNoShingles="true"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.TrimFilterFactory"/>
      </analyzer>
    </fieldType>
Re: Autocomplete with Case-insensitive feature
Hi, Configure LowerCaseFilterFactory into the query side of your type config.

Dmitry

On Tue, Apr 15, 2014 at 10:50 AM, Sunayana sunayana...@wipro.com wrote: Hi All, I have been trying out the autocomplete feature in Solr 4.7.1 using the Suggester. I have configured it to display phrase suggestions as well. The problem is: if I type game, I get suggestions of game or phrases containing game. But if I type Game, *no suggestion is displayed at all*. How can I get case-insensitive suggestions?

-- Dmitry Blog: http://dmitrykan.blogspot.com Twitter: http://twitter.com/dmitrykan
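A hedged sketch of that change, using the field type from the original message; only the query analyzer gains a filter:

    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.TrimFilterFactory"/>
    </analyzer>

Since the index analyzer already lowercases, lowercasing the query side makes Game and game analyze to the same terms.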
Re: More Robust Search Timeouts (to Kill Zombie Queries)?
Looking at this, sharding seems to be the best and simplest option to handle such queries.

On Wed, Apr 2, 2014 at 1:26 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

Hello Salman,

Let me drop a few thoughts on http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%3c856ac15f0903272054q2dbdbd19kea3c5ba9e105b...@mail.gmail.com%3E

There are two aspects to this question: 1. dealing with long-running processing (thread divergence actions http://docs.oracle.com/javase/specs/jls/se5.0/html/memory.html#65310) and 2. the actual time checking. Terminating or aborting a thread (2.) is just a way of tracking time externally and sending interrupt(), which the thread should react on; they don't do that now, and we return to the core issue (1.).

Solr's timeAllowed is the proper way to handle these things; the only problem is that it assumes that only the core search is long-running, but in your case rewriting MultiTermQuery-s takes a huge amount of time. Let's consider this problem. First of all, MultiTermQuery.rewrite() is nearly a design issue: after a heavy rewrite occurs, the result is thrown away once the search is done. I think the most straightforward way to address this issue is by caching these expensive queries. Solr does it well: http://wiki.apache.org/solr/CommonQueryParameters#fq However, only for http://en.wikipedia.org/wiki/Conjunctive_normal_form -like queries; there is a workaround that allows caching disjunction legs, see http://blog.griddynamics.com/2014/01/segmented-filter-cache-in-solr.html

If you still want to run expensively rewritten queries, you need to implement a timeout check (similar to TimeLimitingCollector) for the TermsEnum returned from MultiTermQuery.getTermsEnum(). Wrapping the actual TermsEnum is the good way; to apply queries injecting the time-limiting wrapper TermsEnum, you might consider overriding methods like SolrQueryParserBase.newWildcardQuery(Term) or post-processing the query tree after parsing.

On Mon, Mar 31, 2014 at 2:24 PM, Salman Akram salman.ak...@northbaysolutions.net wrote: Anyone?

On Wed, Mar 26, 2014 at 7:55 PM, Salman Akram salman.ak...@northbaysolutions.net wrote: With reference to this thread http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%3c856ac15f0903272054q2dbdbd19kea3c5ba9e105b...@mail.gmail.com%3E I wanted to know if there was any response to that, or if Chris Harris himself can comment on what he ended up doing, that would be great!

-- Regards, Salman Akram

-- Regards, Salman Akram

-- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com

-- Regards, Salman Akram
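A hedged sketch of the TermsEnum wrapper Mikhail describes, against the Lucene 4.x API; the class name and timeout policy are illustrative, and a production version would throw a dedicated exception type that the request handler catches:

    import java.io.IOException;
    import org.apache.lucene.index.FilterAtomicReader;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;

    // Aborts term enumeration (e.g. wildcard expansion) once a deadline passes.
    final class TimeLimitedTermsEnum extends FilterAtomicReader.FilterTermsEnum {
      private final long deadlineNanos;

      TimeLimitedTermsEnum(TermsEnum in, long timeoutMillis) {
        super(in);
        this.deadlineNanos = System.nanoTime() + timeoutMillis * 1000000L;
      }

      @Override
      public BytesRef next() throws IOException {
        if (System.nanoTime() > deadlineNanos) {
          throw new RuntimeException("term expansion exceeded the time limit");
        }
        return in.next();
      }
    }

The wrapper would be installed by subclassing the relevant MultiTermQuery (or the query parser methods Mikhail mentions) so that getTermsEnum() returns the wrapped enum.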
Re: Autocomplete with Case-insensitive feature
Hi, Did you mean changing the field type to this?

    <fieldType name="text_auto" class="solr.TextField" positionIncrementGap="100"
               indexed="true" stored="false" multiValued="true">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="4"
                outputUnigrams="true" outputUnigramsIfNoShingles="true"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.TrimFilterFactory"/>
      </analyzer>
    </fieldType>

This did not work out for me.
Indexing Big Data With or Without Solr
Hi All, I have worked with Solr 3.5 to implement real-time search on some 100GB of data. That worked fine but was a little slow on complex queries (multiple grouped/joined queries). Now I want to index some real Big Data (around 4 TB or even more). Can SolrCloud be the solution for it? If not, what could be the best possible solution in this case?

*Stats for the previous implementation:* It was a master-slave architecture with normal standalone multiple instances of Solr 3.5. There were around 12 Solr instances running on different machines.

*Things to consider for the next implementation:* Since all the data is sensor data, duplication and uniqueness are factors.

*Really urgent, please take the call on priority with a set of feasible solutions.*

Regards
Re: Class not found ICUFoldingFilter (SOLR-4852)
Hello Shawn, Thanks for your reply. Yes, I have defined ${solr.solr.home} explicitly, and all the mentioned jars are present in ${solr.solr.home}/lib. solr.log also shows that those files are getting added once (grep icu4 solr.log). I can see these lines in the log:

INFO - 2014-04-15 15:40:21.448; org.apache.solr.core.SolrResourceLoader; Adding 'file:/solr/lib/icu4j-49.1.jar' to classloader
INFO - 2014-04-15 15:40:21.454; org.apache.solr.core.SolrResourceLoader; Adding 'file:/solr/lib/lucene-analyzers-icu-4.3.1.jar' to classloader
INFO - 2014-04-15 15:40:21.454; org.apache.solr.core.SolrResourceLoader; Adding 'file:/solr/lib/lucene-analyzers-morfologik-4.3.1.jar' to classloader
INFO - 2014-04-15 15:40:21.455; org.apache.solr.core.SolrResourceLoader; Adding 'file:/solr/lib/lucene-analyzers-smartcn-4.3.1.jar' to classloader
INFO - 2014-04-15 15:40:21.455; org.apache.solr.core.SolrResourceLoader; Adding 'file:/solr/lib/lucene-analyzers-stempel-4.3.1.jar' to classloader
INFO - 2014-04-15 15:40:21.455; org.apache.solr.core.SolrResourceLoader; Adding 'file:/solr/lib/lucene-analyzers-uima-4.3.1.jar' to classloader

But I still get the same exception: ICUFoldingFilter not found. However, copying those files to WEB-INF/lib works fine for me.

Thanks, Ronak

On Fri, Apr 11, 2014 at 3:14 PM, ronak kirit ronak...@gmail.com wrote: Hello, I am facing the same issue discussed at SOLR-4852. I am getting the error below:

Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.apache.lucene.analysis.icu.ICUFoldingFilter
at org.apache.lucene.analysis.icu.ICUFoldingFilterFactory.create(ICUFoldingFilterFactory.java:50)
at org.apache.solr.analysis.TokenizerChain.createComponents(TokenizerChain.java:67)

I am using solr-4.3.1. As discussed at SOLR-4852, I had all the jars at $(SOLR_HOME)/lib and there is no reference to lib via any solrconfig.xml or schema.xml. I have also tried setting sharedLib=foo, but that also didn't work. However, if I removed all the files below:

icu4j-49.1.jar
lucene-analyzers-icu-4.3.1.jar
lucene-analyzers-morfologik-4.3.1.jar
lucene-analyzers-smartcn-4.3.1.jar
lucene-analyzers-stempel-4.3.1.jar
lucene-analyzers-uima-4.3.1.jar
solr-analysis-extras-4.3.1.jar

from $(solrhome)/lib and moved them to solr-webapp/webapp/WEB-INF/lib, things are working fine.

Any guess? Any help?

Thanks, Ronak
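For reference, a hedged sketch of the other loading mechanism relevant to this thread: per-core <lib/> directives in solrconfig.xml, which make SolrResourceLoader load the jars explicitly (the directory path and regexes are illustrative):

    <!-- solrconfig.xml: load the analysis-extras jars for this core -->
    <lib dir="/solr/lib" regex="icu4j-.*\.jar"/>
    <lib dir="/solr/lib" regex="lucene-analyzers-icu-.*\.jar"/>

Whether this avoids the classloader problem tracked in SOLR-4852 depends on the setup; the symptom there is the same jar being visible to two classloaders at once.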
Re: Error Arising from when I start to crawl
Hi Ridwan, This error is not related to Solr; Solr is used in the IndexerJob for Nutch. This error is thrown from InjectorJob and is related to Nutch and Gora. Check your HBase and Nutch configuration, and ensure that HBase runs correctly and that you use the correct version. For more accurate information, you should ask questions on the Nutch user list with more details.

2014-04-14 5:11 GMT+03:00 Alexandre Rafalovitch arafa...@gmail.com:

This is most definitely not a Solr issue, so you may want to check with Gora's list. However, as a quick general hint, your problem seems to be in this part: 3530@engr-MacBookProlocalhost. I assume it should be a server name there, but it seems to be two names joined together. So I would check where that (possibly the HBase listen address) is defined and ensure it is correct. Regards, Alex

On 14/04/2014 8:46 am, Ridwan Naibi ridwan.na...@gmail.com wrote: Hi there, I get the following error after I run the following command. Can you please let me know what the problem is? I have exhausted online tutorials trying to solve this issue. Thanks

engr@engr-MacBookPro:~/NUTCH_HOME/apache-nutch-2.2.1/runtime/local$ bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/ 2

InjectorJob: starting at 2014-04-14 02:28:56
InjectorJob: Injecting urlDir: urls/seed.txt
InjectorJob: org.apache.gora.util.GoraException: java.lang.RuntimeException: java.lang.IllegalArgumentException: Not a host:port pair: � 3530@engr-MacBookProlocalhost,43200,1397436949832
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:75)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:221)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)
Caused by: java.lang.RuntimeException: java.lang.IllegalArgumentException: Not a host:port pair: � 3530@engr-MacBookProlocalhost,43200,1397436949832
at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:127)
at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
... 7 more
Caused by: java.lang.IllegalArgumentException: Not a host:port pair: � 3530@engr-MacBookProlocalhost,43200,1397436949832
at org.apache.hadoop.hbase.HServerAddress.<init>(HServerAddress.java:60)
at org.apache.hadoop.hbase.MasterAddressTracker.getMasterAddress(MasterAddressTracker.java:63)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getMaster(HConnectionManager.java:354)
at org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:94)
at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:109)
... 9 more
Re: Analysis Tool Not Working for CharFilterFactory?
Which version of Solr? I think there was a bug in the UI. You can check the network traffic to confirm.

On 15/04/2014 5:32 pm, Steve Huckle steve.huc...@gmail.com wrote: I have used a CharFilterFactory in my schema.xml for the field type text_general, so that queries for cafe and café return the same results. It works correctly. Here's the relevant part of my schema.xml:

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

However, using the analysis tool within the admin UI, if I analyse text_general with any field values for index and query, the output for ST, SF and LCF are all empty. Is this a bug?

-- Steve Huckle If you print this email, eventually you'll want to throw it away. But there is no away. So don't print this email, even if you have to.
Re: multiple analyzers for one field
A blog post is a great idea, Alex! I think I should wait until I have a complete end-to-end implementation done before I write about it, though, because I'd also like to include some tips about configuring the new suggesters with Solr (the documentation on the wiki hasn't quite caught up yet, I think), and I don't have that working as I'd like just yet. But I will follow up with something soon; probably I will be able to share code on a public repo.

-Mike

On 04/14/2014 10:01 PM, Alexandre Rafalovitch wrote:

Hi Mike, Glad I was able to help. Good note about the PoolingReuseStrategy, I did not think of that either. Is there a blog post or a GitHub repository coming with more details on that? Sounds like something others may benefit from as well. Regards, Alex.

P.s. If you don't have your own blog, I'll be happy to host such an article on mine.

Personal website: http://www.outerthoughts.com/ Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency

On Tue, Apr 15, 2014 at 8:52 AM, Michael Sokolov msoko...@safaribooksonline.com wrote: I lost the original thread; sorry for the new / repeated topic, but I thought I would follow up to let y'all know that I ended up implementing Alex's idea of an UpdateRequestProcessor in order to apply different analysis to different fields when doing something analogous to copyFields. It was pretty straightforward, except that when there are multiple values, I ended up needing multiple copies of the same Analyzer. I had to implement a new PoolingReuseStrategy for the Analyzer to handle this, which I hadn't foreseen.

-Mike
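For readers following along, a hedged sketch of the copy-with-different-analysis approach Mike describes; all names are illustrative, and the real implementation (linked later in this thread) pools Analyzer instances, which this sketch omits:

    import java.io.IOException;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    public class CopyWithAnalysisProcessorFactory extends UpdateRequestProcessorFactory {
      @Override
      public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                                UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
          @Override
          public void processAdd(AddUpdateCommand cmd) throws IOException {
            SolrInputDocument doc = cmd.getSolrInputDocument();
            if (doc.getFieldValues("title") != null) {
              for (Object value : doc.getFieldValues("title")) {
                // A real version would run value through its own Analyzer here
                // (one instance per concurrent value, hence the pooling).
                doc.addField("title_suggest", value);
              }
            }
            super.processAdd(cmd);
          }
        };
      }
    }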
Re: multiple analyzers for one field
Your call, though from experience this sounds like either two or no blog posts. I certainly have killed a bunch of good articles by waiting for perfection :-)

On 15/04/2014 7:01 pm, Michael Sokolov msoko...@safaribooksonline.com wrote: A blog post is a great idea, Alex! I think I should wait until I have a complete end-to-end implementation done before I write about it, though, because I'd also like to include some tips about configuring the new suggesters with Solr (the documentation on the wiki hasn't quite caught up yet, I think), and I don't have that working as I'd like just yet. But I will follow up with something soon; probably I will be able to share code on a public repo. -Mike
Re: Indexing Big Data With or Without Solr
Hi Vineet; I've been using SolrCloud for this kind of Big Data and I think that you should consider using it. If you have any problems you can ask here. Thanks; Furkan KAMACI

2014-04-15 13:20 GMT+03:00 Vineet Mishra clearmido...@gmail.com: Hi All, I have worked with Solr 3.5 to implement real-time search on some 100GB of data. That worked fine but was a little slow on complex queries (multiple grouped/joined queries). Now I want to index some real Big Data (around 4 TB or even more). Can SolrCloud be the solution for it? If not, what could be the best possible solution in this case?

*Stats for the previous implementation:* It was a master-slave architecture with normal standalone multiple instances of Solr 3.5. There were around 12 Solr instances running on different machines.

*Things to consider for the next implementation:* Since all the data is sensor data, duplication and uniqueness are factors.

*Really urgent, please take the call on priority with a set of feasible solutions.*

Regards
Bug within the solr query parser (version 4.7.1)
Hi, I have updated my Solr instance from 4.5.1 to 4.7.1. Now the parsed query seems to be incorrect.

Query: q=*:*&fq=title:TE&debug=true

Before the update, the parsed filter query is +title:te +title:t +title:e. After the update, the parsed filter query is +((title:te title:t)/no_coord) +title:e. It seems like a bug within the query parser. I have also validated the parsed filter query with the analysis component; the result was +title:te +title:t +title:e. The behavior is the same for all special characters that split words into 2 parts. I use the following WordDelimiterFilter on the query side:

    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"
            splitOnNumerics="0" preserveOriginal="1"/>

Thanks. Johannes

Additional information:

Debug before the update:

    <lst name="debug">
      <str name="rawquerystring">*:*</str>
      <str name="querystring">*:*</str>
      <str name="parsedquery">MatchAllDocsQuery(*:*)</str>
      <str name="parsedquery_toString">*:*</str>
      <lst name="explain"/>
      <str name="QParser">LuceneQParser</str>
      <arr name="filter_queries">
        <str>(title:((TE)))</str>
      </arr>
      <arr name="parsed_filter_queries">
        <str>+title:te +title:t +title:e</str>
      </arr>
    ...

Debug after the update:

    <lst name="debug">
      <str name="rawquerystring">*:*</str>
      <str name="querystring">*:*</str>
      <str name="parsedquery">MatchAllDocsQuery(*:*)</str>
      <str name="parsedquery_toString">*:*</str>
      <lst name="explain"/>
      <str name="QParser">LuceneQParser</str>
      <arr name="filter_queries">
        <str>(title:((TE)))</str>
      </arr>
      <arr name="parsed_filter_queries">
        <str>+((title:te title:t)/no_coord) +title:e</str>
      </arr>
    ...

title-field definition:

    <fieldType name="text_title" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
      <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
                catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"
                splitOnNumerics="1" preserveOriginal="1" stemEnglishPossessive="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
                catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"
                splitOnNumerics="0" preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
clusterstate.json does not reflect current state of down versus active
Solr 4.7.1

I am trying to orchestrate a fast restart of a SolrCloud (4.7.1). I was hoping that clusterstate.json would reflect the up/down state of each core, as well as whether or not a given core was the leader.

clusterstate.json is not kept up to date with what I see going on in my logs, though; I see the leader election process play out. I would expect that the state would show down immediately for replicas on the node that I have shut down. Eventually, after about 30 minutes, all of the leader election processes complete and clusterstate.json gets updated to the true state for each replica.

Why does it take so long for clusterstate.json to reflect the correct state? Is there a better way to determine the state of the system?

(In my case, each node has upwards of 1,000 1-shard collections. There are two nodes in the cluster; each collection has 2 replicas.)

Thanks much. rich
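For reference, one way to inspect the raw cluster state while debugging this kind of issue is the zkcli script shipped with Solr 4.x; the zkhost value below is illustrative:

    # from example/scripts/cloud-scripts in the Solr distribution
    ./zkcli.sh -zkhost localhost:2181 -cmd get /clusterstate.json

The same data should also be visible in the admin UI under the Cloud / Tree view.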
Re: clusterstate.json does not reflect current state of down versus active
On 4/15/2014 8:58 AM, Rich Mayfield wrote: I am trying to orchestrate a fast restart of a SolrCloud (4.7.1). I was hoping that clusterstate.json would reflect the up/down state of each core, as well as whether or not a given core was the leader. clusterstate.json is not kept up to date with what I see going on in my logs, though; I see the leader election process play out. I would expect that the state would show down immediately for replicas on the node that I have shut down. Eventually, after about 30 minutes, all of the leader election processes complete and clusterstate.json gets updated to the true state for each replica. Why does it take so long for clusterstate.json to reflect the correct state? Is there a better way to determine the state of the system? (In my case, each node has upwards of 1,000 1-shard collections. There are two nodes in the cluster; each collection has 2 replicas.)

First, I'll admit that my experience with SolrCloud is not as extensive as my experience with non-cloud installs. I do have a SolrCloud (4.2.1) install, but it's the smallest possible redundant setup -- three servers: two run Solr and Zookeeper, the third runs Zookeeper only.

What are you trying to achieve with your restart? Can you just reload the collections one by one instead?

Assuming that reloading isn't going to work for some reason (rebooting for OS updates is one possibility), we need to determine why it takes so long for a node to stabilize. Here's a bunch of info about performance problems with Solr. I wrote it, so we can discuss it in depth if you like: http://wiki.apache.org/solr/SolrPerformanceProblems

I have three possible suspicions for the root of your problem. It is likely to be one of them, but it could be a combination of any or all of them. Because this happens at startup, I don't think it's likely that you're dealing with a GC problem caused by a very large heap.

1) The system is replaying 1000 transaction logs (possibly large, one for each core) at startup, and also possibly initiating index recovery using replication.
2) You don't have enough RAM to cache your index effectively.
3) Your Java heap is too small.

If your zookeeper ensemble does not use separate disks from your Solr data (or separate servers), there could be an issue with zookeeper client timeouts that's completely separate from any other problems.

I haven't addressed the fact that your cluster state doesn't update quickly. This might be a bug, but if we can deal with the slow startup/stabilization first, then we can see whether there's anything left to deal with on the cluster state.

Thanks, Shawn
Re: Empty documents in Solr\lucene 3.6
On 4/15/2014 9:41 AM, Alexey Kozhemiakin wrote: We've faced a strange data corruption issue with one of our clients' old Solr setups (3.6). When we do a query (id:X OR id:Y) we get 2 nodes; one contains normal doc data, the other is empty (<doc/>). We've looked inside the Lucene index using Luke - same story, one of the documents is empty. When we click on the 1st document, it shows nothing. http://snag.gy/O5Lgq.jpg Probably the files for stored data were corrupted? But the Luke index check says OK. Any clues how to troubleshoot the root cause?

Do you know for sure that the index was OK at some point? Do you know what might have happened when it became not OK, like a system crash?

If you have Solr logs from whatever event caused the problem, we might be able to figure it out ... but if you don't know when it happened or you don't have logs, it might not be possible to know what happened. The document may have simply been indexed incorrectly.

Thanks, Shawn
Empty documents in Solr\lucene 3.6
Dear Community,

We've faced a strange data corruption issue with one of our clients' old Solr setups (3.6). When we do a query (id:X OR id:Y) we get 2 nodes; one contains normal doc data, the other is empty (<doc/>). We've looked inside the Lucene index using Luke - same story, one of the documents is empty. When we click on the 1st document, it shows nothing. http://snag.gy/O5Lgq.jpg

Probably the files for stored data were corrupted? But the Luke index check says OK. Any clues how to troubleshoot the root cause?

Best regards, Alexey
Race condition in Leader Election
I see something similar where, given ~1000 shards, both nodes spend a LOT of time sorting through the leader election process. Roughly 30 minutes.

I too am wondering: if I force all leaders onto one node, then shut down both, then start up the node with all of the leaders on it first, then start up the other node, I think I would have a much faster startup sequence. Does that sound reasonable? And if so, is there a way to trigger the leader election process without taking the time to unload and recreate the shards?

Hi, When restarting a node in SolrCloud, I run into scenarios where both the replicas for a shard get into the recovering state and never come up, causing the error No servers hosting this shard. To fix this, I either unload one core or restart one of the nodes again so that one of them becomes the leader. Is there a way to force leader election for a shard in SolrCloud? Is there a way to break ties automatically (without restarting nodes) to make a node the leader for the shard? Thanks, Nitin
RE: Empty documents in Solr\lucene 3.6
The system was up and running for a long time (months) without any updates. There were no crashes for sure; at least the support team says so. Logs indicate that at some point there was not enough disk space (caused by a weekend index optimization). Were there any other similar cases, or is it unique to us?

Alexey.

-----Original Message-----
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: Tuesday, April 15, 2014 18:50
To: solr-user@lucene.apache.org
Subject: Re: Empty documents in Solr\lucene 3.6

Do you know for sure that the index was OK at some point? Do you know what might have happened when it became not OK, like a system crash?

If you have Solr logs from whatever event caused the problem, we might be able to figure it out ... but if you don't know when it happened or you don't have logs, it might not be possible to know what happened. The document may have simply been indexed incorrectly.

Thanks, Shawn
Re: Race condition in Leader Election
We have to fix that then. -- Mark Miller about.me/markrmiller On April 15, 2014 at 12:20:03 PM, Rich Mayfield (mayfield.r...@gmail.com) wrote: I see something similar where, given ~1000 shards, both nodes spend a LOT of time sorting through the leader election process. Roughly 30 minutes. I too am wondering - if I force all leaders onto one node, then shut down both, then start up the node with all of the leaders on it first, then start up the other node, then I think I would have a much faster startup sequence. Does that sound reasonable? And if so, is there a way to trigger the leader election process without taking the time to unload and recreate the shards? Hi When restarting a node in solrcloud, i run into scenarios where both the replicas for a shard get into recovering state and never come up causing the error No servers hosting this shard. To fix this, I either unload one core or restart one of the nodes again so that one of them becomes the leader. Is there a way to force leader election for a shard for solrcloud? Is there a way to break ties automatically (without restarting nodes) to make a node as the leader for the shard? Thanks Nitin
Re: Empty documents in Solr\lucene 3.6
On 4/15/2014 10:22 AM, Alexey Kozhemiakin wrote: The system was up and running for long time(months) without any updates. There was no crashes for sure, at least support team says so. Logs indicate that at some point there was not enough disk space (caused by weekend index optimization). Software behavior becomes very difficult to define when a resource (RAM, disk space, etc) is completely exhausted. Even if Lucene's behavior is well defined (which I think it might be -- the index itself is NOT corrupt), Solr is another layer here, and I don't know whether its behavior is well defined. I suspect that it's not. This might explain what you're seeing. That might be the only information you'll get, if there's nothing else in the logs besides the inability to write to the disk. Thanks, Shawn
Re: What's the actual story with new morphline and hadoop contribs?
The Solr morphline jars are integrated with Solr by way of the Solr-specific solr/contrib/map-reduce module. Ingestion from Flume into Solr is available here: http://flume.apache.org/FlumeUserGuide.html#morphlinesolrsink

FWIW, for our purposes we see no role for DataImportHandler anymore.

Wolfgang.

On Apr 15, 2014, at 6:01 AM, Alexandre Rafalovitch arafa...@gmail.com wrote:

The use case I keep thinking about is Flume/Morphline replacing DataImportHandler. So, when I saw morphline shipped with Solr, I tried to understand whether it is a step towards that. As it is, I am still not sure I understand why those jars are shipped with Solr, if they do not actually integrate into Solr.

Regards, Alex.

Personal website: http://www.outerthoughts.com/ Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency

On Mon, Apr 14, 2014 at 8:36 PM, Wolfgang Hoschek whosc...@cloudera.com wrote: Currently all Solr morphline use cases I'm aware of run in processes outside of the Solr JVM, e.g. in Flume, in MapReduce, in the HBase Lily Indexer, etc. These ingestion processes generate Solr documents for Solr updates. Running in external processes is done to improve scalability, reliability, flexibility and reusability. Not everything needs to run inside of the Solr JVM. We haven't found a use case for it so far, but it would be easy to add an UpdateRequestProcessor that runs a morphline inside of the Solr JVM. Here is more background info:

http://kitesdk.org/docs/current/kite-morphlines/index.html
http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html
http://files.meetup.com/5139282/SHUG10%20-%20Search%20On%20Hadoop.pdf

Wolfgang.

On Apr 14, 2014, at 2:26 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: Hello, I saw that 4.7.1 has morphline and hadoop contribution libraries, but I can't figure out the degree to which they are useful to _Solr_ users. I found one hadoop example in the readme that does some sort of injection into Solr. Is that the only use case supported? I thought that maybe there is an UpdateRequestProcessor or Handler end-point or something that hooks into morphline to do similar/alternative work to DataImportHandler. But I can't see any entry points or examples for that. Anybody know what the story is and/or what the future holds?

Regards, Alex.

Personal website: http://www.outerthoughts.com/ Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
Re: What is Overseer?
: So, is Overseer really only an implementation detail or something that Solr
: Ops guys need to be very aware of?

Most people don't ever need to worry about the overseer - it's magic and it will take care of itself. The recent work on adding support for an overseer role in 4.7 was specifically for people who *want* to worry about it.

I've updated several places in the Solr ref guide to remove some misleading claims about the overseer (some old docs equated it to running embedded zookeeper) and add some more info to the glossary:

https://cwiki.apache.org/confluence/display/solr/Solr+Glossary#SolrGlossary-Overseer
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api15AddRole

-Hoss http://www.lucidworks.com/
cache warming questions
Hello, I have a few questions regarding how Solr caches are warmed. My understanding is that there are two ways to warm internal Solr caches (only one way for the document cache and the Lucene FieldCache):

Auto-warming - occurs when there is a current searcher handling requests and a new searcher is being prepared. When a new searcher is opened, its caches may be prepopulated, or autowarmed, with cached objects from caches in the old searcher. autowarmCount is the number of cached items that will be regenerated in the new searcher. http://wiki.apache.org/solr/SolrCaching#autowarmCount

Explicit warming - where the static warming queries specified in solrconfig.xml for the newSearcher and firstSearcher listeners are executed when a new searcher is being prepared.

What does it mean that items will be regenerated or prepopulated from the current searcher's cache into the new searcher's cache? I doubt it means copy, as the index has likely changed with a commit and possibly invalidated some contents of the cache. Are the queries, or filters, that define the contents of the current caches re-executed for the new searcher's caches?

For the case where auto-warming is configured, a current searcher is active, and static warming queries are defined, how do auto-warming and explicit warming work together? Or do they? Is only one type of warming activated to fill the caches?

Thanks, Matt
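For context, a hedged sketch of where both mechanisms live in solrconfig.xml; the sizes and the example query are illustrative:

    <!-- autowarming: regenerate the top 128 filter cache entries in the new searcher -->
    <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>

    <!-- explicit warming: static queries run whenever a new searcher opens -->
    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst><str name="q">popular query</str><str name="sort">price asc</str></lst>
      </arr>
    </listener>

For the filter and query result caches, autowarming re-executes the cached entries' keys (queries/filters) against the new searcher rather than copying values, which matches the regenerated wording in the wiki.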
Re: Question regarding solrj
Sorry for not replying!!! It was the wrong version of solrj that the client was using (as it was third-party code, we couldn't find out earlier). After fixing the version, things seem to be working fine. Thanks for your response!!!

On Sun, Apr 13, 2014 at 7:26 PM, Erick Erickson erickerick...@gmail.com wrote: You say you can't change the client. What is the client written in? What does it expect? Does it use the same version of SolrJ? Best, Erick

On Sun, Apr 13, 2014 at 6:40 AM, Prashant Golash prashant.gol...@gmail.com wrote: Thanks for your feedback. Following are some more details.

Version of solr: 4.3.0
Version of solrj: 4.3.0

The way I am returning the response to the client: RequestHolder is the object containing the post-processed request from the client (after renaming a few of the fields, and internal-to-external mapping of the fields).

*Snippet of code*

    WS.WSRequestHolder requestHolder = WS.url(url);
    // requestHolder processing of few fields
    return requestHolder.get().map(
        new F.Function<WS.Response, Result>() {
          @Override
          public Result apply(WS.Response response) throws Throwable {
            System.out.println("Response header: " + response.getHeader("Content-Type"));
            System.out.println("Response: " + response.getBody());
            return ok(response.asByteArray()).as(response.getHeader("Content-Type"));
          }
        }
    );

Thanks, Prashant

On Sun, Apr 13, 2014 at 3:35 AM, Furkan KAMACI furkankam...@gmail.com wrote: Hi; If you had a chance to change the code at the client side, I would suggest trying this: http://lucene.apache.org/solr/4_2_1/solr-solrj/org/apache/solr/client/solrj/impl/HttpSolrServer.html#setParser(org.apache.solr.client.solrj.ResponseParser)

There may be a problem with the character encoding of your Play app, and here is the relevant information: Javabin is a custom binary format used to write out Solr's response in a fast and efficient manner. As of Solr 3.1, the JavaBin format has changed to version 2. Version 2 serializes strings differently: instead of writing the number of UTF-16 characters followed by the bytes in Modified UTF-8, it writes the number of UTF-8 bytes followed by the bytes in UTF-8.

Which versions of Solr and Solrj do you use, respectively? On the other hand, if you give us more information I can help, because there may be other interesting things, as in here: https://issues.apache.org/jira/browse/SOLR-5744

Thanks; Furkan KAMACI

2014-04-12 22:18 GMT+03:00 Prashant Golash prashant.gol...@gmail.com: Hi Solr Gurus, I have some doubt related to the solrj client. My scenario is like this:

- There is a proxy server (Play app) which internally queries Solr.
- The proxy server is called from the client side, which uses the SolrJ library.

The issue is that I can't change the client code. I can only change configurations to call different servers, hence I need to use SolrJ.
- Results are successfully returned from my Play app in *java-bin* format without modifying them, but on the client side I am receiving this exception:

Caused by: java.lang.NullPointerException
at org.apache.solr.common.util.JavaBinCodec.readExternString(JavaBinCodec.java:689)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:188)
at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:112)
at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:41)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:385)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:90)
at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:310)
at com.ibm.commerce.foundation.internal.server.services.search.util.SearchQueryHelper.query(SearchQueryHelper.java:125)
at com.ibm.commerce.foundation.server.services.rest.search.processor.solr.SolrRESTSearchExpressionProcessor.performSearch(SolrRESTSearchExpressionProcessor.java:506)
at com.ibm.commerce.foundation.server.services.search.SearchServiceFacade.performSearch(SearchServiceFacade.java:193)

I am not sure if this exception is related to some issue in the response format, or with querying a non-Solr server from solrj. Let me know your thoughts.

Thanks, Prashant
Distributed commits in CloudSolrServer
I have a SolrCloud index, 1 shard, with a leader and one replica, and 3 ZKs. The Solr indexes are behind a load balancer. There is one CloudSolrServer client updating the indexes. The index schema includes 3 ExternalFileFields.

When the CloudSolrServer client issues a hard commit, I observe that the commits occur sequentially, not in parallel, on the leader and replica. The duration of each commit is about a minute. Most of this time is spent reloading the 3 ExternalFileField files. Because of the sequential commits, there is a period of time (1 minute+) when the index searchers will return different results, which can cause a bad user experience. This will get worse as replicas are added to handle auto-scaling. The goal is to keep all replicas in sync w.r.t. the user queries.

My questions:

1. Is there a reason that the distributed commits are done in sequence, not in parallel? Is there a way to change this behavior?
2. If instead the commits were done in parallel by a separate client via a GET to each Solr instance, how would this client get the host/port values for each Solr instance from zookeeper? Are there any downsides to doing commits this way?

Thanks, Peter
Re: multiple analyzers for one field
Ha! You were right. Thanks for the nudge; here's my post: http://blog.safariflow.com/2014/04/15/search-suggestions-with-solr-2/ there's code at http://github.com/safarijv/ifpress-solr-plugin cheers -Mike

On 04/15/2014 08:18 AM, Alexandre Rafalovitch wrote: Your call, though from experience this sounds like either two or no blog posts. I certainly have killed a bunch of good articles by waiting for perfection :-)
Re: Empty documents in Solr\lucene 3.6
Alexey,

1. Can you take a backup of the index and run the index checker with the -fix option? Does it modify the index at all?
2. Are all the missing fields configured as stored? Are they marked as required in the schema, or optional?

Dmitry

On Tue, Apr 15, 2014 at 7:22 PM, Alexey Kozhemiakin alexey_kozhemia...@epam.com wrote: The system was up and running for a long time (months) without any updates. There were no crashes for sure; at least the support team says so. Logs indicate that at some point there was not enough disk space (caused by a weekend index optimization). Were there any other similar cases, or is it unique to us?

Alexey.

-----Original Message-----
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: Tuesday, April 15, 2014 18:50
To: solr-user@lucene.apache.org
Subject: Re: Empty documents in Solr\lucene 3.6

Do you know for sure that the index was OK at some point? Do you know what might have happened when it became not OK, like a system crash? If you have Solr logs from whatever event caused the problem, we might be able to figure it out ... but if you don't know when it happened or you don't have logs, it might not be possible to know what happened. The document may have simply been indexed incorrectly.

Thanks, Shawn

-- Dmitry Blog: http://dmitrykan.blogspot.com Twitter: http://twitter.com/dmitrykan
Re: What is Overseer?
I should have suggested three levels in my question: 1) important to average users, 2) expert-only, and 3) internal implementation detail. Yes, expert-only does have a place, but it is good to mark features as such. -- Jack Krupansky -Original Message- From: Chris Hostetter Sent: Tuesday, April 15, 2014 1:48 PM To: solr-user@lucene.apache.org Subject: Re: What is Overseer? : So, is Overseer really only an implementation detail or something that Solr : Ops guys need to be very aware of? Most people don't ever need to worry about the overseer - it's magic and it will take care of itself. The recent work on adding support for an overseer role in 4.7 was specifically for people who *want* to worry about it. I've updated several places in the solr ref guide to remove some missleading claims about hte overseer (some old docs equated it to running embedded zookeeper) and add some more info to the glossary.. https://cwiki.apache.org/confluence/display/solr/Solr+Glossary#SolrGlossary-Overseer https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api15AddRole -Hoss http://www.lucidworks.com/
Re: Distributed commits in CloudSolrServer
Inline responses below.

-- Mark Miller about.me/markrmiller

On April 15, 2014 at 2:12:31 PM, Peter Keegan (peterlkee...@gmail.com) wrote:

I have a SolrCloud index, 1 shard, with a leader and one replica, and 3 ZKs. The Solr indexes are behind a load balancer. There is one CloudSolrServer client updating the indexes. The index schema includes 3 ExternalFileFields. When the CloudSolrServer client issues a hard commit, I observe that the commits occur sequentially, not in parallel, on the leader and replica. [...]

1. Is there a reason that the distributed commits are done in sequence, not in parallel? Is there a way to change this behavior?

The reason is that updates are currently done this way - it's the only safe way to do it without solving some more problems. I don't think you can easily change this. I think we should probably file a JIRA issue to track a better solution for commit handling. I think there are some complications because of how commits can be added on update requests, but it's something we probably want to try and solve before tackling *all* updates to replicas in parallel with the leader.

2. If instead the commits were done in parallel by a separate client via a GET to each Solr instance, how would this client get the host/port values for each Solr instance from zookeeper? Are there any downsides to doing commits this way?

Not really, other than the extra management.

Thanks, Peter
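For question 2, a hedged SolrJ sketch (4.x APIs) of reading replica base URLs from ZooKeeper via the same ZkStateReader that CloudSolrServer uses; the collection name and ZK addresses are illustrative:

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.cloud.ClusterState;
    import org.apache.solr.common.cloud.Replica;
    import org.apache.solr.common.cloud.Slice;
    import org.apache.solr.common.cloud.ZkStateReader;

    public class ParallelCommitSketch {
      public static void main(String[] args) throws Exception {
        CloudSolrServer cloud = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        cloud.connect();
        ClusterState state = cloud.getZkStateReader().getClusterState();
        for (Slice slice : state.getSlices("collection1")) {
          for (Replica replica : slice.getReplicas()) {
            // base_url from ZK, e.g. http://host:8983/solr
            String baseUrl = replica.getStr(ZkStateReader.BASE_URL_PROP);
            // Commit each replica directly; run these from separate threads
            // (an executor) to make the commits truly parallel.
            new HttpSolrServer(baseUrl + "/collection1").commit();
          }
        }
        cloud.shutdown();
      }
    }

As Mark notes, there is nothing unsafe about this beyond the extra management; the loop above is sequential and would need an executor to actually overlap the commits.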
Transformation on a numeric field
Hi All, I am looking for a way to index a numeric field, plus its value divided by 1000 in another numeric field. I thought about using a copyField with a PatternReplaceFilterFactory to keep only the first few digits (cutting the last three). Solr complains that I cannot have an analysis chain on a numeric field:

Core: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType truncated_salary: FieldType: TrieIntField (truncated_salary) does not support specifying an analyzer. Schema file is /data/solr/solr-no-cloud/Core1/schema.xml

Is there a way to accomplish this?

Thanks
Re: Transformation on a numeric field
Hello! You can achieve that using an update processor; for example, look here: http://wiki.apache.org/solr/ScriptUpdateProcessor What you would have to do, in general, is create a script that takes the value of the field, divides it by 1000, and puts it in another field - the target numeric field. -- Regards, Rafał Kuć Performance Monitoring * Log Analytics * Search Analytics Solr Elasticsearch Support * http://sematext.com/ Hi All, I am looking for a way to index a numeric field and its value divided by 1000 into another numeric field. I thought about using a CopyField with a PatternReplaceFilterFactory to keep only the first few digits (cutting the last three). Solr complains that I cannot have an analysis chain on a numeric field: Core: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType truncated_salary: FieldType: TrieIntField (truncated_salary) does not support specifying an analyzer. Schema file is /data/solr/solr-no-cloud/Core1/schema.xml Is there a way to accomplish this? Thanks
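For reference, the solrconfig.xml wiring looks roughly like the fragment below; the chain name and script file name are invented for this example, and the wiki page above is the authoritative reference for the syntax. The chain can be selected per request with update.chain=add-truncated-salary, or made the default for your update handler.

<updateRequestProcessorChain name="add-truncated-salary">
  <processor class="solr.StatelessScriptUpdateProcessorFactory">
    <str name="script">divide-by-1000.js</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>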
Re: Transformation on a numeric field
You can use an update processor. The stateless script update processor will let you write arbitrary JavaScript code, which can do this calculation. You should be able to figure it out from the wiki: http://wiki.apache.org/solr/ScriptUpdateProcessor My e-book has plenty of script examples for this processor as well. We could also write a generic script that takes a source and destination field name and then does a specified operation on it, like add an offset or multiply by a scale factor. -- Jack Krupansky -Original Message- From: Jean-Sebastien Vachon Sent: Tuesday, April 15, 2014 3:57 PM To: 'solr-user@lucene.apache.org' Subject: Transformation on a numeric field Hi All, I am looking for a way to index a numeric field and its value divided by 1000 into another numeric field. I thought about using a CopyField with a PatternReplaceFilterFactory to keep only the first few digits (cutting the last three). Solr complains that I cannot have an analysis chain on a numeric field: Core: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType truncated_salary: FieldType: TrieIntField (truncated_salary) does not support specifying an analyzer. Schema file is /data/solr/solr-no-cloud/Core1/schema.xml Is there a way to accomplish this? Thanks
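If you would rather not depend on a scripting engine, the same generic idea can be done as a small custom update processor in Java. This is a hedged, untested sketch: the class name and the source/dest/divisor parameter names are invented for illustration, not an existing Solr plugin.

import java.io.IOException;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class DivideFieldUpdateProcessorFactory extends UpdateRequestProcessorFactory {
    private String source;
    private String dest;
    private long divisor;

    @Override
    @SuppressWarnings("rawtypes")
    public void init(NamedList args) {
        // Read source/dest/divisor from the processor's config in solrconfig.xml
        source = (String) args.get("source");
        dest = (String) args.get("dest");
        divisor = Long.parseLong(args.get("divisor").toString());
    }

    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
            SolrQueryResponse rsp, UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
            @Override
            public void processAdd(AddUpdateCommand cmd) throws IOException {
                SolrInputDocument doc = cmd.getSolrInputDocument();
                Object val = doc.getFieldValue(source);
                if (val != null) {
                    // e.g. salary=52500, divisor=1000 -> truncated_salary=52
                    doc.setField(dest, Long.parseLong(val.toString()) / divisor);
                }
                super.processAdd(cmd);
            }
        };
    }
}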
Odd extra character duplicates in spell checking
Hi, I am going to make this question pretty short, so I don't overwhelm with technical details until the end. I suspect that some folks may be seeing this issue without the particular configuration we are using. What our problem is: 1. Correctly spelled words are returning as not spelled correctly, with the original, correctly spelled word plus a single oddball character appended, as multiple suggestions. 2. Incorrectly spelled words are returning correct spelling suggestions with a single oddball character appended, as multiple suggestions. 3. We're seeing this in Solr 4.5x and 4.7x. Example: each returned correction is the original word with a single extra character appended (shown here as a unicode code point in square brackets): correction=attitude[2d] correction=attitude[2f] correction=attitude[2026] Spurious characters: * Unicode Character 'HYPHEN-MINUS' (U+002D) * Unicode Character 'SOLIDUS' (U+002F) * Unicode Character 'HORIZONTAL ELLIPSIS' (U+2026) Anybody see anything like this? Anybody fix something like this? Thanks! —Ed

OK, here's the gory details:

What we are doing: We have developed an application that returns "did you mean" spelling alternatives against a specific (presumably misspelled) word. We're using the vocabulary of indexed pages of a specified book as the source of the alternatives, so this is not a general dictionary spell check; we are returning only matching alternatives. So when I say "correctly spelled" I mean they are words found on at least one page. We are using the collations so that we restrict ourselves to the pages in one book. We are having to check for and "fix up" these faulty results. That's not a robust or desirable solution.

We are using SolrJ to get the collations:

private static final String DID_YOU_MEAN_REQUEST_HANDLER = "/spell";
...
SolrQuery query = new SolrQuery(q);
query.set("spellcheck", true);
query.set(SpellingParams.SPELLCHECK_COUNT, 10);
query.set(SpellingParams.SPELLCHECK_COLLATE, true);
query.set(SpellingParams.SPELLCHECK_COLLATE_EXTENDED_RESULTS, true);
query.set("wt", "json");
query.setRequestHandler(DID_YOU_MEAN_REQUEST_HANDLER);
query.set("shards.qt", DID_YOU_MEAN_REQUEST_HANDLER);
query.set("shards.tolerant", true);
etc.

but we can duplicate the behavior without SolrJ with the collations/misspellingsAndCorrections below, e.g.:

solr/pg1/spell?q=+doc-id:(810500)+AND+attitudex&spellcheck=true&spellcheck.count=10&spellcheck.collate=true&spellcheck.collateExtendedResults=true&wt=json&qt=%2Fspell&shards.qt=%2Fspell&shards.tolerant=true

{"responseHeader":{"status":0,"QTime":60},"response":{"numFound":0,"start":0,"maxScore":0.0,"docs":[]},"spellcheck":{"suggestions":["attitudex",{"numFound":6,"startOffset":21,"endOffset":30,"origFreq":0,"suggestion":[{"word":"attitudes","freq":362486},{"word":"attitu dex","freq":4819},{"word":"atti tudex","freq":3254},{"word":"attit udex","freq":159},{"word":"attitude-","freq":1080},{"word":"attituden","freq":261}]},"correctlySpelled",false,"collation",["collationQuery","doc-id:(810500) AND attitude-","hits",2,"misspellingsAndCorrections",["attitudex","attitude-"]],"collation",["collationQuery","doc-id:(810500) AND attitude/","hits",2,"misspellingsAndCorrections",["attitudex","attitude/"]],"collation",["collationQuery","doc-id:(810500) AND attitude…","hits",2,"misspellingsAndCorrections",["attitudex","attitude…"]]]}}

The configuration is:

<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="df">text</str>
    <str name="spellcheck.dictionary">default</str>
    <str name="spellcheck.dictionary">wordbreak</str>
    <str name="spellcheck">on</str>
    <str name="spellcheck.extendedResults">true</str>
    <str name="spellcheck.count">10</str>
    <str name="spellcheck.alternativeTermCount">5</str>
    <str name="spellcheck.maxResultsForSuggest">5</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.collateExtendedResults">true</str>
    <str name="spellcheck.maxCollationTries">10</str>
    <str name="spellcheck.maxCollations">5</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

<lst name="spellchecker">
  <str name="name">wordbreak</str>
  <str name="classname">solr.WordBreakSolrSpellChecker</str>
  <str name="field">text</str>
  <str name="combineWords">true</str>
  <str name="breakWords">true</str>
  <int name="maxChanges">25</int>
  <int name="minBreakLength">3</int>
</lst>
<lst name="spellchecker">
  <str name="name">default</str>
  <str name="field">text</str>
  <str name="classname">solr.DirectSolrSpellChecker</str>
  <str name="distanceMeasure">internal</str>
  <float name="accuracy">0.2</float>
  <int name="maxEdits">2</int>
  <int name="minPrefix">1</int>
  <int name="maxInspections">25</int>
  <int name="minQueryLength">4</int>
  <float name="maxQueryFrequency">1</float>
</lst>

-- Ed Smiley, Senior Software Architect, eBooks ProQuest | 161 E Evelyn Ave | Mountain View, CA 94041 | USA | +1 650 475 8700 extension 3772
Re: svn vs GIT
I guess I should've double-checked it was still the case before saying anything, but I'm glad to be proven wrong. Yes, it worked nicely for me when I tried today, which should simplify my life a bit. On 4/14/14, 4:35 PM, Shawn Heisey s...@elyograg.org wrote: On 4/14/2014 12:56 PM, Ramkumar R. Aiyengar wrote: ant compile / ant -f solr dist / ant test certainly work, I use them with a git working copy. You trying something else? On 14 Apr 2014 19:36, Jeff Wartes jwar...@whitepages.com wrote: I vastly prefer git, but last I checked, (admittedly, some time ago) you couldn't build the project from the git clone. Some of the build scripts assumed some svn commands will work. The nightly-smoke build target uses svn. There is a related smoketest script that uses provided URL parameters (or svn if it's a checkout from svn and the parameters are not supplied) to obtain artifacts for testing. This may not be the only build target that uses facilities not available from git, but it's the only one that I know about for sure. Ordinary people should be able to use repositories cloned from the git.apache.org or github mirrors with no problem if they are not using exotic build targets or build scripts. When I tried 'ant precommit' it worked, but it did say at least once in what scrolled by that this was not an SVN checkout, so the '-check-svn-working-copy' build target (which is part of precommit) didn't work. Thanks, Shawn
Re: cache warming questions
bq: What does it mean that items will be regenerated or prepopulated from the current searcher's cache... You're right, the values aren't cached. They can't be since the internal Lucene document id is used to identify docs, and due to merging the internal ID may bear no relation to the old internal ID for a particular document. I find it useful to think of Solr's caches as a map where the key is the query and the value is some representation of the found documents. The details of the value don't matter, so I'll skip them. What matters is the key. Consider the filter cache. You put something like fq=price:[0 TO 100] on a URL. Solr then uses the fq clause as the key to the filterCache. Here's the sneaky bit. When you specify an autowarm count of N for the filterCache, when a new searcher is opened the first N keys from the map are re-executed in the new searcher's context and the results put into the new searcher's filterCache. bq: ...how do auto warming and explicit warming work together? They're orthogonal. IOW, the autowarming for each cache is executed as well as the newSearcher static warming queries. Use the static queries to do things like fill the sort caches etc. Incidentally, this bears on why there's a firstSearcher and newSearcher. The newSearcher queries are run in addition to the cache autowarms. firstSearcher static queries are only run when a Solr server is started the first time, and there are no cache entries to autowarm. So the firstSearcher queries might be quite a bit more complex than newSearcher queries. HTH, Erick On Tue, Apr 15, 2014 at 1:55 PM, Matt Kuiper matt.kui...@issinc.com wrote: Hello, I have a few questions regarding how Solr caches are warmed. My understanding is that there are two ways to warm internal Solr caches (only one way for document cache and lucene FieldCache): Auto warming - occurs when there is a current searcher handling requests and a new searcher is being prepared. When a new searcher is opened, its caches may be prepopulated or autowarmed with cached objects from caches in the old searcher. autowarmCount is the number of cached items that will be regenerated in the new searcher. http://wiki.apache.org/solr/SolrCaching#autowarmCount Explicit warming - where the static warming queries specified in Solrconfig.xml for newSearcher and firstSearcher listeners are executed when a new searcher is being prepared. What does it mean that items will be regenerated or prepopulated from the current searcher's cache to the new searcher's cache? I doubt it means copy, as the index has likely changed with a commit and possibly invalidated some contents of the cache. Are the queries, or filters, that define the contents of the current caches re-executed for the new searcher's caches? For the case where auto warming is configured, a current searcher is active, and static warming queries are defined, how do auto warming and explicit warming work together? Or do they? Is only one type of warming activated to fill the caches? Thanks, Matt
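To make the two mechanisms concrete, here is an illustrative solrconfig.xml fragment; the sizes, sort fields, and warming queries are arbitrary examples invented for this sketch, not recommendations. The autowarmCount on the filterCache drives the re-execution of the first N keys that Erick describes, while the listeners hold the static warming queries.

<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="sort">price asc</str></lst>
  </arr>
</listener>

<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="sort">price asc</str></lst>
    <lst><str name="q">*:*</str><str name="sort">popularity desc</str></lst>
  </arr>
</listener>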
Re: More Robust Search Timeouts (to Kill Zombie Queries)?
I have also experienced a similar problem on our cluster, so I went ahead and opened SOLR-5986 to track the issue. I know Apache Blur has implemented a mechanism to kill these long-running term enumerations; it would be fantastic if Solr could get a similar mechanism. -Steve On Apr 15, 2014, at 5:23 AM, Salman Akram salman.ak...@northbaysolutions.net wrote: Looking at this, sharding seems to be the best and simplest option to handle such queries. On Wed, Apr 2, 2014 at 1:26 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Hello Salman, Let me drop a few thoughts on http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%3c856ac15f0903272054q2dbdbd19kea3c5ba9e105b...@mail.gmail.com%3E There are two aspects to this question: 1. dealing with long-running processing (thread divergence actions http://docs.oracle.com/javase/specs/jls/se5.0/html/memory.html#65310) and 2. the actual time checking. Terminating or aborting a thread (2) is just a way of tracking time externally and sending interrupt(), which the thread should react to - which they don't do now - and so we return to the core issue (1). Solr's timeAllowed is the proper way to handle these things; the only problem is that it expects that only the core search is long-running, but in your case rewriting MultiTermQuery-s takes a huge amount of time. Let's consider this problem. First of all, MultiTermQuery.rewrite() is nearly a design issue: the heavy rewrite occurs, and then its result is thrown away after the search is done. I think the most straightforward way to address this issue is by caching these expensive queries. Solr does it well: http://wiki.apache.org/solr/CommonQueryParameters#fq However, that only works for http://en.wikipedia.org/wiki/Conjunctive_normal_form-like queries; there is a workaround that allows caching disjunction legs, see http://blog.griddynamics.com/2014/01/segmented-filter-cache-in-solr.html If you still want to run expensively rewritten queries, you need to implement a timeout check (similar to TimeLimitingCollector) for the TermsEnum returned from MultiTermQuery.getTermsEnum(). Wrapping the actual TermsEnum is the good way; to inject the time-limiting wrapper TermsEnum into queries, you might consider overriding methods like SolrQueryParserBase.newWildcardQuery(Term) or post-processing the query tree after parsing. On Mon, Mar 31, 2014 at 2:24 PM, Salman Akram salman.ak...@northbaysolutions.net wrote: Anyone? On Wed, Mar 26, 2014 at 7:55 PM, Salman Akram salman.ak...@northbaysolutions.net wrote: With reference to this thread http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%3c856ac15f0903272054q2dbdbd19kea3c5ba9e105b...@mail.gmail.com%3E I wanted to know if there was any response to that or if Chris Harris himself can comment on what he ended up doing, that would be great! -- Regards, Salman Akram -- Regards, Salman Akram -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com -- Regards, Salman Akram
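To make Mikhail's last suggestion concrete, here is a minimal, untested sketch of a deadline-checking TermsEnum wrapper built on Lucene 4.x's FilterAtomicReader.FilterTermsEnum. The class name and exception choice are invented for illustration; how you hook it in (e.g. from an overridden getTermsEnum() or a query-parser subclass) is up to you, as the thread discusses.

import java.io.IOException;
import org.apache.lucene.index.FilterAtomicReader.FilterTermsEnum;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

/** Wraps a TermsEnum and aborts enumeration once a wall-clock deadline passes. */
public class TimeLimitingTermsEnum extends FilterTermsEnum {
    private final long deadlineNanos;

    public TimeLimitingTermsEnum(TermsEnum in, long timeoutMillis) {
        super(in);
        this.deadlineNanos = System.nanoTime() + timeoutMillis * 1000000L;
    }

    private void checkDeadline() {
        // Mirrors the spirit of TimeLimitingCollector's TimeExceededException
        if (System.nanoTime() > deadlineNanos) {
            throw new RuntimeException("term enumeration exceeded its time budget");
        }
    }

    @Override
    public BytesRef next() throws IOException {
        checkDeadline();   // rewrite cost is dominated by this loop
        return in.next();
    }
}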
Tipping point of solr shards (Num of docs / size)
Hi Gurus, In my solr cluster I've multiple shards, each shard containing ~500,000,000 documents, total index size being ~1 TB. I was just wondering how much more I can keep adding to a shard before we reach a tipping point and performance starts to degrade? Also, as a best practice, what is the recommended number of docs / size for a shard? Txz in advance :) -- Thanks Regards, *Mukesh Jha me.mukesh@gmail.com*
Re: deleting large amount data from solr cloud
Another update: I removed the replicas - to avoid the replication doing a full copy. I am able to delete sizeable chunks of data. But the overall index size remains the same even after the deletes; it does not seem to go down. I understand that Solr would do this in the background - but I don't see the decrease in overall index size even after 1-2 hours. I can see a bunch of .del files in the index directory, but they do not seem to get cleaned up. Is there any way to monitor/follow the progress of index compaction? Also, does triggering optimize from the admin UI help to compact the index size on disk? Thanks Vinay On 14 April 2014 12:19, Vinay Pothnis poth...@gmail.com wrote: Some update: I removed the auto warm configurations for the various caches and reduced the cache sizes. I then issued a call to delete a day's worth of data (800K documents). There was no out of memory this time - but some of the nodes went into recovery mode. Was able to catch some logs this time around and this is what I see:

WARN [2014-04-14 18:11:00.381] [org.apache.solr.update.PeerSync] PeerSync: core=core1_shard1_replica2 url=http://host1:8983/solr too many updates received since start - startingUpdates no longer overlaps with our currentUpdates
INFO [2014-04-14 18:11:00.476] [org.apache.solr.cloud.RecoveryStrategy] PeerSync Recovery was not successful - trying replication. core=core1_shard1_replica2
INFO [2014-04-14 18:11:00.476] [org.apache.solr.cloud.RecoveryStrategy] Starting Replication Recovery. core=core1_shard1_replica2
INFO [2014-04-14 18:11:00.535] [org.apache.solr.cloud.RecoveryStrategy] Begin buffering updates. core=core1_shard1_replica2
INFO [2014-04-14 18:11:00.536] [org.apache.solr.cloud.RecoveryStrategy] Attempting to replicate from http://host2:8983/solr/core1_shard1_replica1/. core=core1_shard1_replica2
INFO [2014-04-14 18:11:00.536] [org.apache.solr.client.solrj.impl.HttpClientUtil] Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
INFO [2014-04-14 18:11:01.964] [org.apache.solr.client.solrj.impl.HttpClientUtil] Creating new http client, config:connTimeout=5000&socketTimeout=2&allowCompression=false&maxConnections=1&maxConnectionsPerHost=1
INFO [2014-04-14 18:11:01.969] [org.apache.solr.handler.SnapPuller] No value set for 'pollInterval'. Timer Task not started.
INFO [2014-04-14 18:11:01.973] [org.apache.solr.handler.SnapPuller] Master's generation: 1108645
INFO [2014-04-14 18:11:01.973] [org.apache.solr.handler.SnapPuller] Slave's generation: 1108627
INFO [2014-04-14 18:11:01.973] [org.apache.solr.handler.SnapPuller] Starting replication process
INFO [2014-04-14 18:11:02.007] [org.apache.solr.handler.SnapPuller] Number of files in latest index in master: 814
INFO [2014-04-14 18:11:02.007] [org.apache.solr.core.CachingDirectoryFactory] return new directory for /opt/data/solr/core1_shard1_replica2/data/index.20140414181102007
INFO [2014-04-14 18:11:02.008] [org.apache.solr.handler.SnapPuller] Starting download to NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/data/solr/core1_shard1_replica2/data/index.20140414181102007 lockFactory=org.apache.lucene.store.NativeFSLockFactory@5f6570fe; maxCacheMB=48.0 maxMergeSizeMB=4.0) fullCopy=true

So, it looks like the number of updates is too huge for regular replication and then it goes into a full copy of the index.
And since our index is very large (350G), this is causing the cluster to go into recovery mode forever - trying to copy that huge index. I also read in some thread http://lucene.472066.n3.nabble.com/Recovery-too-many-updates-received-since-start-td3935281.html that there is a limit of 100 documents. I wonder if this has been updated to make that configurable since that thread. If not, the only option I see is to do a trickle delete of 100 documents per second or something. Also - the other suggestion of using distributed=false might not help because the issue currently is that the replication is going to full copy. Any thoughts? Thanks Vinay On 14 April 2014 07:54, Vinay Pothnis poth...@gmail.com wrote: Yes, that is our approach. We did try deleting a day's worth of data at a time, and that resulted in OOM as well. Thanks Vinay On 14 April 2014 00:27, Furkan KAMACI furkankam...@gmail.com wrote: Hi; I mean you can divide the range (i.e. one week per delete instead of one month) and try to check whether you still get an OOM or not. Thanks; Furkan KAMACI 2014-04-14 7:09 GMT+03:00 Vinay Pothnis poth...@gmail.com: Aman, Yes - Will do! Furkan, What do you mean by 'bulk delete'? -Thanks Vinay On 12 April 2014 14:49, Furkan KAMACI furkankam...@gmail.com wrote: Hi; Do you get any problems when you index
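Coming back to the trickle-delete idea mentioned above, a minimal SolrJ sketch is below. The batch size, pause, ZK ensemble, collection name, and the timestamp field/range are all placeholders, and the commit/optimize behavior should be verified on a test cluster before running anything like this against production.

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class TrickleDelete {
    public static void main(String[] args) throws Exception {
        CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        solr.setDefaultCollection("core1");

        while (true) {
            // Fetch a small batch of ids in the doomed range
            SolrQuery q = new SolrQuery("timestamp:[* TO 2014-03-01T00:00:00Z]");
            q.setFields("id");
            q.setRows(100);
            QueryResponse rsp = solr.query(q);
            if (rsp.getResults().isEmpty()) break;

            List<String> ids = new ArrayList<String>();
            for (SolrDocument doc : rsp.getResults()) {
                ids.add((String) doc.getFieldValue("id"));
            }
            solr.deleteById(ids);
            solr.commit();        // keep each replication delta small
            Thread.sleep(1000);   // ~100 deletes/second, as discussed above
        }
        // Optional: merge away the .del'd docs once the deletes are done.
        // This is the programmatic equivalent of optimize in the admin UI.
        solr.optimize();
        solr.shutdown();
    }
}

The small batches keep the per-commit update count well under the peer-sync limit discussed above, which is the whole point of trickling.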
Re: Tipping point of solr shards (Num of docs / size)
You could look at this link to understand the factors that affect SolrCloud performance: http://wiki.apache.org/solr/SolrPerformanceProblems Especially the sections about RAM and disk cache. If the index grows too big for one node, it can lead to performance issues. From the looks of it, 500 million docs per shard may already be pushing it. How much does that translate to in terms of index size on disk per shard? -vinay On 15 April 2014 21:44, Mukesh Jha me.mukesh@gmail.com wrote: Hi Gurus, In my solr cluster I've multiple shards, each shard containing ~500,000,000 documents, total index size being ~1 TB. I was just wondering how much more I can keep adding to a shard before we reach a tipping point and performance starts to degrade? Also, as a best practice, what is the recommended number of docs / size for a shard? Txz in advance :) -- Thanks Regards, *Mukesh Jha me.mukesh@gmail.com*
Re: Tipping point of solr shards (Num of docs / size)
My index size per shard varies between 250 GB and 1 TB. The cluster is performing well even now, but I thought it's high time to change it, so that a shard doesn't get too big. On Wed, Apr 16, 2014 at 10:25 AM, Vinay Pothnis poth...@gmail.com wrote: You could look at this link to understand the factors that affect SolrCloud performance: http://wiki.apache.org/solr/SolrPerformanceProblems Especially the sections about RAM and disk cache. If the index grows too big for one node, it can lead to performance issues. From the looks of it, 500 million docs per shard may already be pushing it. How much does that translate to in terms of index size on disk per shard? -vinay On 15 April 2014 21:44, Mukesh Jha me.mukesh@gmail.com wrote: Hi Gurus, In my solr cluster I've multiple shards, each shard containing ~500,000,000 documents, total index size being ~1 TB. I was just wondering how much more I can keep adding to a shard before we reach a tipping point and performance starts to degrade? Also, as a best practice, what is the recommended number of docs / size for a shard? Txz in advance :) -- Thanks Regards, *Mukesh Jha me.mukesh@gmail.com* -- Thanks Regards, *Mukesh Jha me.mukesh@gmail.com*