Re: Solr substring search
Hi: I would start by looking at http://docs.lucidworks.com/display/solr/The+Standard+Query+Parser and at org.apache.lucene.queryparser.flexible.standard.StandardQueryParser.java. Hope it helps.

On Thu, Sep 5, 2013 at 11:30 PM, Scott Schneider scott_schnei...@symantec.com wrote:

Hello, I'm trying to find out how Solr runs a query for *foo*. Google tells me that you need to use NGramFilterFactory for that kind of substring search, but I find that even with very simple fieldTypes, it just works. (Perhaps because I'm testing on very small data sets, Solr is willing to look through all the keywords.) E.g., this works on the tutorial. Can someone tell me exactly how this works and/or point me to the Lucene code that implements this?

Thanks, Scott
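For readers finding this thread later: the NGramFilterFactory approach Scott mentions works by indexing every character substring of a term as its own token, so a substring query becomes an ordinary term lookup. A rough sketch of the idea in Python (not Solr code, just an illustration of what the filter emits):

```python
def char_ngrams(token, min_gram, max_gram):
    """Emit the character n-grams that an analyzer like Solr's
    NGramFilterFactory (minGramSize/maxGramSize) would index."""
    grams = []
    for n in range(min_gram, max_gram + 1):
        for i in range(len(token) - n + 1):
            grams.append(token[i:i + n])
    return grams

# Once "seafood" is indexed as its 3-grams, a plain term query for
# "foo" matches it -- no wildcard scan over the whole term dictionary.
print(char_ngrams("seafood", 3, 3))
```

Without n-grams, a query like *foo* falls back to enumerating terms in the term dictionary, which is why it "just works" on tiny data sets but degrades as the index grows.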
Re: charfilter doesn't do anything
the input string is a normal html page with the word Zahlungsverkehr in it and my query is ...solr/collection1/select?q=*

On 5. Sep 2013, at 9:57 PM, Jack Krupansky wrote:

And show us an input string and a query that fail.

-- Jack Krupansky

-Original Message- From: Shawn Heisey Sent: Thursday, September 05, 2013 2:41 PM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything

On 9/5/2013 10:03 AM, Andreas Owen wrote:

I would like to filter/replace a word during indexing but it doesn't do anything and I don't get an error. In schema.xml I have the following:

<field name="text_html" type="text_cutHtml" indexed="true" stored="true" multiValued="true"/>

<fieldType name="text_cutHtml" class="solr.TextField">
  <analyzer>
    <!-- <tokenizer class="solr.StandardTokenizerFactory"/> -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="Zahlungsverkehr" replacement="ASDFGHJK"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

My second question is: where can I say that the expression is multiline? In JavaScript I can use /m at the end of the pattern.

I don't know about your second question. I don't know if that will be possible, but I'll leave that to someone who's more expert than I. As for the first question, here's what I have. Did you reindex? That will be required. http://wiki.apache.org/solr/HowToReindex

Assuming that you did reindex, are you trying to search for ASDFGHJK in a field that contains more than just Zahlungsverkehr? The keyword tokenizer might not do what you expect - it tokenizes the entire input string as a single token, which means that you won't be able to search for single words in a multi-word field without wildcards, which are pretty slow.

Note that both the pattern and replacement are case sensitive. This is how regex works. You haven't used a lowercase filter, which means that you won't be able to search for asdfghjk. Use the analysis tab in the UI on your core to see what Solr does to your field text.

Thanks, Shawn
Re: unknown _stream_source_info while indexing rich doc in solr
I will try this,thanks -- View this message in context: http://lucene.472066.n3.nabble.com/unknown-stream-source-info-while-indexing-rich-doc-in-solr-tp4088136p4088490.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: solrcloud shards backup/restoration
The replication handler's backup command was built for pre-SolrCloud. It takes a snapshot of the index but it is unaware of the transaction log, which is a key component in SolrCloud. Hence unless you stop updates, commit your changes and then take a backup, you will likely miss some updates.

That being said, I'm curious to see how peer sync behaves when you try to restore from a snapshot. When you say that you haven't been successful in restoring, what exactly is the behaviour you observed?

On Fri, Sep 6, 2013 at 5:14 AM, Aditya Sakhuja aditya.sakh...@gmail.com wrote:

Hello, I was looking for a good backup / recovery solution for the SolrCloud indexes. I am mostly looking at restoring the indexes from the index snapshot, which can be taken using the replicationHandler's backup command. I am looking for something that works with SolrCloud 4.3 eventually, but it is still relevant if you tested with a previous version. I haven't been successful in having the restored index replicate across the new replicas after I restart all the nodes, with one node having the restored index. Is restoring the indexes on all the nodes the best way to do it?

-- Regards, -Aditya Sakhuja

-- Regards, Shalin Shekhar Mangar.
Re: Solr documents update on index
Yes, if a document with the same key exists, then the old document will be deleted and replaced with the new document. You can also partially update documents (we call it atomic updates), which reads the old document from the local index, updates it according to the request and then replaces the old document with the new one. See https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers#UploadingDatawithIndexHandlers-UpdatingOnlyPartofaDocument

On Fri, Sep 6, 2013 at 1:03 AM, Luis Portela Afonso meligalet...@gmail.com wrote:

Hi, I'm having a problem when Solr indexes. It is updating documents already indexed. Is this normal behavior? If a document with the same key already exists, is it supposed to be updated? I was thinking that it is supposed to update only if the information in the RSS has changed. Appreciate your help.

-- Sent from Gmail Mobile

-- Regards, Shalin Shekhar Mangar.
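For reference, an atomic update is sent as an ordinary update request whose field values are wrapped in operation maps such as set, add, or inc. A minimal sketch of the JSON body (the id and field names here are made up for illustration):

```python
import json

# Only "title" is rewritten; Solr preserves all other stored fields
# when it rebuilds the document from the local index.
doc = {"id": "feed-entry-42", "title": {"set": "Updated title"}}
payload = json.dumps([doc])
print(payload)
```

This body would be POSTed to the /update handler with Content-Type application/json.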
Re: bucket count for facets
Stats Component can give you a count of non-null values in a field. See https://cwiki.apache.org/confluence/display/solr/The+Stats+Component On Fri, Sep 6, 2013 at 12:28 AM, Steven Bower smb-apa...@alcyon.net wrote: Is there a way to get the count of buckets (ie unique values) for a field facet? the rudimentary approach of course is to get back all buckets, but in some cases this is a huge amount of data. thanks, steve -- Regards, Shalin Shekhar Mangar.
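A stats request is just a couple of extra parameters on a normal query; something like the following (the field name is hypothetical) returns count (non-null documents) and missing for the field:

```python
from urllib.parse import urlencode

# rows=0: we only want the stats section, not the documents.
params = {"q": "*:*", "rows": 0, "stats": "true", "stats.field": "price"}
query_string = urlencode(params)
print(query_string)
```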
Re: Odd behavior after adding an additional core.
Can you give exact steps to reproduce this problem? Also, are you sure you supplied numShards=4 while creating the collection? On Fri, Sep 6, 2013 at 12:20 AM, mike st. john mstj...@gmail.com wrote: using solr 4.4 , i used collection admin to create a collection 4shards replication - factor of 1 i did this so i could index my data, then bring in replicas later by adding cores via coreadmin i added a new core via coreadmin, what i noticed shortly after adding the core, the leader of the shard where the new replica was placed was marked active the new core marked as the leader and the routing was now set to implicit. i've replicated this on another solr setup as well. Any ideas? Thanks msj -- Regards, Shalin Shekhar Mangar.
Regarding reducing qtime
Hi, I am currently using Solr 3.5.0 with an indexed Wikipedia dump (50 GB) on Java 1.6. I am searching for tweets in Solr. Currently it takes an average of 210 milliseconds for each post, out of which 200 milliseconds are consumed by the Solr server (QTime). I used the jconsole monitoring tool. The reported stats are: heap usage of 10-50 MB, number of threads 10-20, number of classes around 3800,
monitoring Solr RAM with graphite
Hello! I remember some time ago people were interested in how Solr instances can be monitored with graphite. This blog post gives a hands-on example from my experience of monitoring Solr's RAM usage: http://dmitrykan.blogspot.fi/2013/09/monitoring-solr-with-graphite-and-carbon.html Please note that this is not Solr-native monitoring, i.e. Solr is treated as a black box. It can still satisfy a persistent monitoring need. Further stats can be added by querying Solr for cache usage and so on. Regards, Dmitry Kan
Re: Loading a SpellCheck dynamically
My guess is that you have a single request handler defined with all your language-specific spellcheck components. This is why you see spellcheck values from all spellcheckers. If the above is true, then I don't think there is a way to choose one specific spellchecker component. The alternative is to define multiple request handlers with a one-to-one mapping to the spellcheck components. Then you can send a request to one particular request handler and the corresponding spellcheck component will return its response.

On Thu, Sep 5, 2013 at 11:29 PM, Mr Havercamp mrhaverc...@gmail.com wrote:

I currently have multiple spellchecks configured in my solrconfig.xml to handle a variety of different spell suggestions in different languages. In the snippet below, I have a catch-all spellcheck as well as an English-only one for more accurate matching (i.e. my schema.xml is set up to copy English-only fields to an English-specific textSpell_en field, and I also copy to a generic textSpell field):

---solrconfig.xml---

<searchComponent name="spellcheck_en" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textSpell_en</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell_en</str>
    <str name="spellcheckIndexDir">./spellchecker_en</str>
    <str name="buildOnOptimize">true</str>
  </lst>
</searchComponent>

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textSpell</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
    <str name="buildOnOptimize">true</str>
  </lst>
</searchComponent>

My question is: when I query my Solr index, am I able to load, say, just spellcheck values from the spellcheck_en spellchecker rather than from both? This would be useful if I were to start implementing additional language spellchecks, e.g. spellcheck_ja, spellcheck_fr, etc. Thanks for any insights. Cheers, Hayden

-- Regards, Shalin Shekhar Mangar.
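With the one-handler-per-spellchecker layout suggested above, the client simply picks the handler path for the language it wants; a small sketch (handler names are hypothetical):

```python
from urllib.parse import urlencode

def spell_params(lang, q):
    """Route a spellcheck query to the request handler that owns the
    language-specific spellcheck component, falling back to a generic one."""
    handler = {"en": "/spell_en", "ja": "/spell_ja"}.get(lang, "/spell")
    return handler + "?" + urlencode({"q": q, "spellcheck": "on"})

print(spell_params("en", "whatevr"))
```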
Regarding improving performance of the solr
Hi, I am currently using Solr 3.5.0 with an indexed Wikipedia dump (50 GB) on Java 1.6. I am searching Solr with text (which is actually Twitter tweets). Currently it takes an average of 210 milliseconds for each post, out of which 200 milliseconds are consumed by the Solr server (QTime). I used the jconsole monitoring tool. The stats are: heap usage 10-50 MB, number of threads 10-20, number of classes around 3800, CPU usage 10-15%. Currently I am loading all the fields of the Wikipedia dump. I only need the Freebase category and the Wikipedia category. I want to know how to optimize the Solr server to improve the performance. Could you please help me with optimizing the performance? Thanks and Regards, Prabu
Re: Questions about Replication Factor on solrcloud
Comments inline:

On Wed, Sep 4, 2013 at 10:38 PM, Lisandro Montaño lisan...@itivitykids.com wrote:

Hi all, I'm currently working on deploying a solrcloud distribution on CentOS machines and wanted more guidance about the replication factor configuration. I have configured two servers with solrcloud over Tomcat and a third server as ZooKeeper. I have configured them successfully and have one server with collection1 available and the other with collection1_Shard1_Replica1.

How did you configure them this way? In particular, I'm confused as to why there is collection1 on the first node and collection1_Shard1_Replica1 on the other.

My questions are:

- Can I have 1 shard and 2 replicas on two machines? What are the limitations or considerations to define this?

Yes, you can have 1 shard and 2 replicas, one each on your two machines. That is the way it is configured by default. For example, this can be achieved if you create another collection (numShards=1&replicationFactor=2) using the collection API.

- How does a replica work? (there is not too much info about it)

All replicas (physical shards) are peers who decide on a leader using ZooKeeper. All updates are routed via the leader, who forwards (versioned) updates to the other replicas. A query can be served by any replica. If a replica goes down, then it will attempt to recover from the current leader and then start serving requests. If the leader goes down, then all the other replicas (after waiting for a certain time for the old leader to come back) decide on a new leader.

- When I import data into collection1 it works properly, but when I do it in collection1_Shard1_Replica1 it fails. Is that expected behavior? (Maybe if I had a better definition of replicas I would understand it better.)

Can you describe how it fails? Stack traces or excerpts from the Solr logs will help.

-- Regards, Shalin Shekhar Mangar.
Re: How to config SOLR server for spell check functionality
On Wed, Sep 4, 2013 at 4:56 PM, sebastian.manolescu sebastian.manole...@yahoo.com wrote:

I want to implement the spell check functionality offered by Solr using a MySQL database, but I don't understand how. Here is the basic flow of what I want to do. I have a simple inputText (in JSF) and if I type the word shwo, the response in the OutputLabel should be show. First of all, I'm using the following tools and frameworks: JBoss application server 6.1, Eclipse, JPA, JSF (PrimeFaces).

Steps I've done until now:

Step 1: Download the Solr server from http://lucene.apache.org/solr/downloads.html and extract the content.

Step 2: Add an environment variable: Variable name: solr.solr.home Variable value: D:\JBOSS\solr-4.4.0\solr-4.4.0\example\solr --- where you have the Solr server

Step 3: Open the solr war and add an env-entry to solr.war\WEB-INF\web.xml (the easy way): solr/home D:\JBOSS\solr-4.4.0\solr-4.4.0\example\solr java.lang.String OR import the project, change and build the war.

Step 4: Browser: localhost:8080/solr/ and the Solr console appears. Until now all works well.
I have found some useful code (in my opinion) that returns: [collection1] webapp=/solr path=/spell params={spellcheck=on&q=whatever&wt=javabin&qt=/spell&version=2&spellcheck.build=true} hits=0 status=0 QTime=16

Here is the code that gives the result above:

SolrServer solr;
try {
    solr = new CommonsHttpSolrServer("http://localhost:8080/solr");
    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("qt", "/spell");
    params.set("q", "whatever");
    params.set("spellcheck", "on");
    params.set("spellcheck.build", "true");
    QueryResponse response = solr.query(params);
    SpellCheckResponse spellCheckResponse = response.getSpellCheckResponse();
    if (!spellCheckResponse.isCorrectlySpelled()) {
        for (Suggestion suggestion : response.getSpellCheckResponse().getSuggestions()) {
            System.out.println("original token: " + suggestion.getToken()
                + " - alternatives: " + suggestion.getAlternatives());
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}

Questions:

1. How do I make the database connection with my DB and search the content to see if there are any words that could match?

You can either write SolrJ code to index data into Solr or you can use DataImportHandler. http://wiki.apache.org/solr/DIHQuickStart http://wiki.apache.org/solr/DataImportHandler

2. How do I make the configuration (solrconfig.xml, schema.xml, etc.)?

You must first edit the schema.xml according to your data. See https://cwiki.apache.org/confluence/display/solr/Documents%2C+Fields%2C+and+Schema+Design

3. How do I send a string from my view (xhtml) so that the Solr server knows what to look for?

For search, you can use the SolrJ java client. https://cwiki.apache.org/confluence/display/solr/Searching http://wiki.apache.org/solr/Solrj#Reading_Data_from_Solr

You seem to have done your homework and have found most of the resources. We will be able to help you in a better way if you ask specific questions instead.

-- Regards, Shalin Shekhar Mangar.
Re: bucket count for facets
Understood, what I need is a count of the unique values in a field and that field is multi-valued (which makes stats component a non-option) On Fri, Sep 6, 2013 at 4:22 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: Stats Component can give you a count of non-null values in a field. See https://cwiki.apache.org/confluence/display/solr/The+Stats+Component On Fri, Sep 6, 2013 at 12:28 AM, Steven Bower smb-apa...@alcyon.net wrote: Is there a way to get the count of buckets (ie unique values) for a field facet? the rudimentary approach of course is to get back all buckets, but in some cases this is a huge amount of data. thanks, steve -- Regards, Shalin Shekhar Mangar.
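In the meantime, the unique-value count can be derived client-side from a facet response, at the cost of pulling back every bucket (facet.limit=-1, facet.mincount=1), which is exactly the overhead Steven wants to avoid; a sketch:

```python
def bucket_count(facet_field):
    """facet_field is the flat [value, count, value, count, ...] list
    Solr returns under facet_counts/facet_fields for one field."""
    assert len(facet_field) % 2 == 0
    return len(facet_field) // 2

# e.g. a multi-valued field whose facet response had four buckets:
print(bucket_count(["1", 3, "2", 2, "20", 1, "66", 1]))
```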
Restrict Parsing duplicate file in Solr
Hi, I am new to Solr. I am looking for an option to restrict duplicate file indexing in Solr. Please let me know if it can be done with any configuration change. -- View this message in context: http://lucene.472066.n3.nabble.com/Restrict-Parsing-duplicate-file-in-Solr-tp4088471.html Sent from the Solr - User mailing list archive at Nabble.com.
Store 2 dimensional array( of int values) in solr 4.0
hi All, I'm trying to store a 2-dimensional array in Solr (version 4.0). Basically I have the following data: [[20121108, 1],[20121110, 7],[2012, 2],[20121112, 2]] ... The inner array is used to keep some count, say X, for that particular day. Currently, I'm using the following field to store this data:

<field name="dataX" type="string" indexed="true" stored="true" multiValued="true"/>

and I'm using the Python library pySolr to store the data. Currently the data that gets stored looks like this (an array of strings):

<arr name="dataX"><str>[20121108, 1]</str><str>[20121110, 7]</str><str>[2012, 2]</str><str>[20121112, 2]</str><str>[20121113, 2]</str><str>[20121116, 1]</str></arr>

Is there a way I can store the 2-dimensional array so that the inner arrays contain int values, like the example at the beginning, such that the final/stored data in Solr looks something like:

<arr name="dataX">
  <arr name="index"><int>20121108</int><int>7</int></arr>
  <arr name="index"><int>20121110</int><int>12</int></arr>
  <arr name="index"><int>20121110</int><int>12</int></arr>
</arr>

Just a guess: I think for this case we need to add one more field (the index, for instance) for each inner array, which will again be multivalued (and will store int values only)? How do I add the actual 2-dimensional array, how do I pass the inner arrays, and how do I store the full doc that contains this 2-dimensional array? Please help me sort out this issue. Please share your views and point me in the right direction. Any help would be highly appreciated. I found similar things on the web, but not the one I'm looking for: http://lucene.472066.n3.nabble.com/Two-dimensional-array-in-Solr-schema-td4003309.html Thanks
Re: Solr documents update on index
Hi, but I'm indexing RSS feeds. I want Solr to index them without changing the existing information of a document with the same uniqueKey. The best approach would be for Solr to update the doc only if changes are detected, but I can live without that. I really would like Solr not to update the document if it already exists. I'm using the DataImportScheduler so that Solr launches the scheduled index runs. Appreciate any possible help.

On Sep 6, 2013, at 9:16 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

Yes, if a document with the same key exists, then the old document will be deleted and replaced with the new document. You can also partially update documents (we call it atomic updates), which reads the old document from the local index, updates it according to the request and then replaces the old document with the new one. See https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers#UploadingDatawithIndexHandlers-UpdatingOnlyPartofaDocument

On Fri, Sep 6, 2013 at 1:03 AM, Luis Portela Afonso meligalet...@gmail.com wrote: Hi, I'm having a problem when Solr indexes. It is updating documents already indexed. Is this normal behavior? If a document with the same key already exists, is it supposed to be updated? I was thinking that it is supposed to update only if the information in the RSS has changed. Appreciate your help -- Sent from Gmail Mobile

-- Regards, Shalin Shekhar Mangar.
SOLR 3.6.1 auto complete sorting
Hi, we have implemented the Auto Complete feature on our site. Below are the Solr config details.

schema.xml:

<fieldType class="solr.TextField" name="text_auto" positionIncrementGap="100">
  <analyzer type="index">
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" maxGramSize="30" minGramSize="1"/>
  </analyzer>
  <analyzer type="query">
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="dams_id" type="string" indexed="true" stored="true"/>
<field name="published_date" type="date" indexed="true" stored="false"/>
<field name="ph_su" type="text_auto" indexed="true" stored="true" multiValued="true"/>

<!-- Copy fields for Auto Complete -->
<copyField source="title" dest="ph_su"/>
<copyField source="product_catalogue" dest="ph_su"/>
<copyField source="product_category_name" dest="ph_su"/>

The Solr query is: q=ph_su%3Aepub&start=0&rows=10&fl=dams_id&wt=json&indent=on&hl=true&hl.fl=ph_su&hl.simple.pre=b&hl.simple.post=/b

The requirement is to sort the results based on relevance and the latest published products for the search term. I tried the parameters below but nothing worked:

sort = dams_id desc,published_date desc
order_by = dams_id desc,published_date desc

Please let me know how to sort the results by relevance and published date descending. Thanks, Poornima
Re: Store 2 dimensional array( of int values) in solr 4.0
First you need to tell us how you wish to use and query the data. That will largely determine how the data must be stored. Give us a few example queries showing how you would like your application to access the data.

Note that Lucene has only simple multivalued fields - no structure or nesting within a single field other than a list of scalar values. But you can always store a complex structure as a BSON blob or JSON string if all you want is to store and retrieve it in its entirety without querying its internal structure. And note that Lucene queries are field level - does a field contain or match a scalar value.

-- Jack Krupansky

-Original Message- From: A Geek Sent: Friday, September 06, 2013 7:10 AM To: solr user Subject: Store 2 dimensional array( of int values) in solr 4.0

hi All, I'm trying to store a 2-dimensional array in Solr (version 4.0). Basically I have the following data: [[20121108, 1],[20121110, 7],[2012, 2],[20121112, 2]] ... The inner array is used to keep some count, say X, for that particular day. Currently, I'm using the following field to store this data:

<field name="dataX" type="string" indexed="true" stored="true" multiValued="true"/>

and I'm using the Python library pySolr to store the data. Currently the data that gets stored looks like this (an array of strings):

<arr name="dataX"><str>[20121108, 1]</str><str>[20121110, 7]</str><str>[2012, 2]</str><str>[20121112, 2]</str><str>[20121113, 2]</str><str>[20121116, 1]</str></arr>

Is there a way I can store the 2-dimensional array so that the inner arrays contain int values, like the example at the beginning, such that the final/stored data in Solr looks something like:

<arr name="dataX">
  <arr name="index"><int>20121108</int><int>7</int></arr>
  <arr name="index"><int>20121110</int><int>12</int></arr>
  <arr name="index"><int>20121110</int><int>12</int></arr>
</arr>

Just a guess: I think for this case we need to add one more field (the index, for instance) for each inner array, which will again be multivalued (and will store int values only)? How do I add the actual 2-dimensional array, how do I pass the inner arrays, and how do I store the full doc that contains this 2-dimensional array? Please help me sort out this issue. Please share your views and point me in the right direction. Any help would be highly appreciated. I found similar things on the web, but not the one I'm looking for: http://lucene.472066.n3.nabble.com/Two-dimensional-array-in-Solr-schema-td4003309.html Thanks
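Jack's store-it-whole suggestion can be sketched with plain JSON serialization: the entire 2-D structure goes into one stored string field (dataX from the thread) and is decoded after retrieval, with the inner ints surviving the round trip - at the price of the structure being opaque to Solr queries:

```python
import json

data = [[20121108, 1], [20121110, 7], [20121112, 2]]

# One string field value holds the whole nested structure...
stored = json.dumps(data)

# ...and the client decodes it after retrieval; the inner values
# come back as ints, but Solr cannot query inside the blob.
restored = json.loads(stored)
print(restored == data)
```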
Re: Restrict Parsing duplicate file in Solr
Explain what you mean by restricting duplicate file indexing. Solr doesn't work at the file level - only documents (rows or records) and fields and values.

-- Jack Krupansky

-Original Message- From: shabbir Sent: Friday, September 06, 2013 12:24 AM To: solr-user@lucene.apache.org Subject: Restrict Parsing duplicate file in Solr

Hi, I am new to Solr. I am looking for an option to restrict duplicate file indexing in Solr. Please let me know if it can be done with any configuration change. -- View this message in context: http://lucene.472066.n3.nabble.com/Restrict-Parsing-duplicate-file-in-Solr-tp4088471.html Sent from the Solr - User mailing list archive at Nabble.com.
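One common way to get overwrite-instead-of-duplicate behavior (which may be what the original question is after) is to make the uniqueKey a hash of the file's content, so re-posting identical bytes replaces the existing document; Solr also ships a SignatureUpdateProcessorFactory for server-side deduplication. A client-side sketch of the hashing idea:

```python
import hashlib

def content_signature(data: bytes) -> str:
    """Deterministic document id derived from raw file bytes; posting
    the same content twice produces the same id, so the second add
    overwrites rather than duplicates."""
    return hashlib.md5(data).hexdigest()

print(content_signature(b"hello"))
```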
Re: charfilter doesn't do anything
Is there any chance that you changed your schema since you indexed the data? If so, re-index the data. If a * query finds nothing, that implies that the default field is empty. Are you sure the df parameter is set to the field containing your data? Show us your request handler definition and a sample of your actual Solr input (Solr XML or JSON?) so that we can see what fields are being populated.

-- Jack Krupansky

-Original Message- From: Andreas Owen Sent: Friday, September 06, 2013 4:01 AM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything

the input string is a normal html page with the word Zahlungsverkehr in it and my query is ...solr/collection1/select?q=*

On 5. Sep 2013, at 9:57 PM, Jack Krupansky wrote:

And show us an input string and a query that fail. -- Jack Krupansky

-Original Message- From: Shawn Heisey Sent: Thursday, September 05, 2013 2:41 PM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything

On 9/5/2013 10:03 AM, Andreas Owen wrote:

I would like to filter/replace a word during indexing but it doesn't do anything and I don't get an error. In schema.xml I have the following:

<field name="text_html" type="text_cutHtml" indexed="true" stored="true" multiValued="true"/>

<fieldType name="text_cutHtml" class="solr.TextField">
  <analyzer>
    <!-- <tokenizer class="solr.StandardTokenizerFactory"/> -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="Zahlungsverkehr" replacement="ASDFGHJK"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

My second question is: where can I say that the expression is multiline? In JavaScript I can use /m at the end of the pattern.

I don't know about your second question. I don't know if that will be possible, but I'll leave that to someone who's more expert than I. As for the first question, here's what I have. Did you reindex? That will be required. http://wiki.apache.org/solr/HowToReindex

Assuming that you did reindex, are you trying to search for ASDFGHJK in a field that contains more than just Zahlungsverkehr? The keyword tokenizer might not do what you expect - it tokenizes the entire input string as a single token, which means that you won't be able to search for single words in a multi-word field without wildcards, which are pretty slow.

Note that both the pattern and replacement are case sensitive. This is how regex works. You haven't used a lowercase filter, which means that you won't be able to search for asdfghjk. Use the analysis tab in the UI on your core to see what Solr does to your field text.

Thanks, Shawn
Re: charfilter doesn't do anything
i've managed to get it working if I use the RegexTransformer and the string is on the same line in my Tika entity. But when the string spans multiple lines it isn't working, even though I tried (?s) to set the DOTALL flag.

<entity name="tika" processor="TikaEntityProcessor" url="${rec.url}" dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" transformer="RegexTransformer">
  <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;" replaceWith="QQQ" sourceColName="text"/>
</entity>

Then I tried it like this and I get a stack overflow:

<field column="text_html" regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;" replaceWith="QQQ" sourceColName="text"/>

In JavaScript this works, but maybe only because I used a small string.

On 6. Sep 2013, at 2:55 PM, Jack Krupansky wrote:

Is there any chance that you changed your schema since you indexed the data? If so, re-index the data. If a * query finds nothing, that implies that the default field is empty. Are you sure the df parameter is set to the field containing your data? Show us your request handler definition and a sample of your actual Solr input (Solr XML or JSON?) so that we can see what fields are being populated. -- Jack Krupansky

-Original Message- From: Andreas Owen Sent: Friday, September 06, 2013 4:01 AM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything

the input string is a normal html page with the word Zahlungsverkehr in it and my query is ...solr/collection1/select?q=*

On 5. Sep 2013, at 9:57 PM, Jack Krupansky wrote: And show us an input string and a query that fail. -- Jack Krupansky

-Original Message- From: Shawn Heisey Sent: Thursday, September 05, 2013 2:41 PM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything

On 9/5/2013 10:03 AM, Andreas Owen wrote: I would like to filter/replace a word during indexing but it doesn't do anything and I don't get an error. In schema.xml I have the following:

<field name="text_html" type="text_cutHtml" indexed="true" stored="true" multiValued="true"/>

<fieldType name="text_cutHtml" class="solr.TextField">
  <analyzer>
    <!-- <tokenizer class="solr.StandardTokenizerFactory"/> -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="Zahlungsverkehr" replacement="ASDFGHJK"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

My second question is: where can I say that the expression is multiline? In JavaScript I can use /m at the end of the pattern.

I don't know about your second question. I don't know if that will be possible, but I'll leave that to someone who's more expert than I. As for the first question, here's what I have. Did you reindex? That will be required. http://wiki.apache.org/solr/HowToReindex

Assuming that you did reindex, are you trying to search for ASDFGHJK in a field that contains more than just Zahlungsverkehr? The keyword tokenizer might not do what you expect - it tokenizes the entire input string as a single token, which means that you won't be able to search for single words in a multi-word field without wildcards, which are pretty slow.

Note that both the pattern and replacement are case sensitive. This is how regex works. You haven't used a lowercase filter, which means that you won't be able to search for asdfghjk. Use the analysis tab in the UI on your core to see what Solr does to your field text.

Thanks, Shawn
RE: Regarding improving performance of the solr
Have you checked the hit ratio of the different caches? Try to tune them to get rid of all evictions if possible. Tuning the size of the caches and warming your searcher can give you a pretty good improvement. You might also want to check your analysis chain to make sure you're not doing anything unnecessary.

-Original Message- From: prabu palanisamy [mailto:pr...@serendio.com] Sent: September-06-13 4:55 AM To: solr-user@lucene.apache.org Subject: Regarding improving performance of the solr

Hi, I am currently using Solr 3.5.0 with an indexed Wikipedia dump (50 GB) on Java 1.6. I am searching Solr with text (which is actually Twitter tweets). Currently it takes an average of 210 milliseconds for each post, out of which 200 milliseconds are consumed by the Solr server (QTime). I used the jconsole monitoring tool. The stats are: heap usage 10-50 MB, number of threads 10-20, number of classes around 3800, CPU usage 10-15%. Currently I am loading all the fields of the Wikipedia dump. I only need the Freebase category and the Wikipedia category. I want to know how to optimize the Solr server to improve the performance. Could you please help me with optimizing the performance? Thanks and Regards, Prabu
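The hit ratio mentioned above is simply hits divided by lookups on a cache's stats; a quick sketch of the arithmetic (the numbers are invented for illustration - read the real ones off the admin stats page):

```python
# Shape mirrors the per-cache stats Solr exposes (lookups, hits,
# evictions); a low ratio or nonzero evictions suggests resizing.
stats = {"lookups": 10000, "hits": 9200, "evictions": 150}

hit_ratio = stats["hits"] / stats["lookups"]
needs_tuning = stats["evictions"] > 0
print(round(hit_ratio, 2), needs_tuning)
```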
Re: Solr Cell Question
It's always frustrating when someone replies with "Why not do it a completely different way?". But I will anyway :). There's no requirement at all that you send things to Solr to make Solr Cell (aka Tika) do its tricks. Since you're already in SolrJ anyway, why not just parse on the client? This has the advantage of allowing you to offload the Tika processing from Solr, which can be quite expensive. You can use the same Tika jars that come with Solr or download whatever version from the Tika project you want. That way, you can exercise much better control over what's done. Here's a skeletal program with indexing from a DB mixed in, but it shouldn't be hard at all to pull the DB parts out. http://searchhub.org/dev/2012/02/14/indexing-with-solrj/ FWIW, Erick

On Thu, Sep 5, 2013 at 5:28 PM, Jamie Johnson jej2...@gmail.com wrote:

Is it possible to configure Solr Cell to only extract and store the body of a document when indexing? I'm currently doing the following, which I thought would work:

ModifiableSolrParams params = new ModifiableSolrParams();
params.set("defaultField", "content");
params.set("xpath", "/xhtml:html/xhtml:body/descendant::node()");
ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
up.setParams(params);
FileStream f = new FileStream(new File(..));
up.addContentStream(f);
up.setAction(ACTION.COMMIT, true, true);
solrServer.request(up);

But the result of content is as follows:

<arr name="content_mvtxt">
  <str/>
  <str>null</str>
  <str>ISO-8859-1</str>
  <str>text/plain; charset=ISO-8859-1</str>
  <str>Just a little test</str>
</arr>

What I had hoped for was just:

<arr name="content_mvtxt">
  <str>Just a little test</str>
</arr>
Facet Count and RegexTransformer splitBy
Hi guyz, Just a quick question: I have a field that has CSV values in the database. So I will use the DataImportHandler and will index it using RegexTransformer's splitBy attribute. However, since this is the first time I am doing it, I just wanted to be sure if it will work for Facet Count? For example: From query results (say this is the values in that field): row 1 = 1,2,3,4 row 2 = 1,4,5,3 row 3 = 2,1,20,66 . . . . so facet count will get me: '1' = 3 occurrence '2' = 2 occur. . . .and so on. -- Regards, Raheel Hasan
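The counting the question describes can be simulated directly: once RegexTransformer's splitBy has turned each CSV row into separate values of a multivalued field, the facet component counts how many documents carry each value. A quick sketch of the expected counts, using the rows from the question:

```python
# Sketch: what facet counting over a splitBy-tokenized multivalued field
# would produce, simulated with collections.Counter. In Solr the counting
# is done by the facet component once splitBy="," has split the CSV.
from collections import Counter

rows = ["1,2,3,4", "1,4,5,3", "2,1,20,66"]

# What splitBy="," yields: one value per token, per document
values = [v for row in rows for v in row.split(",")]
facet_counts = Counter(values)
```

So yes — provided the field is indexed and the split actually happens at import time, faceting on it gives '1' = 3, '2' = 2, and so on.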
RE: Solr 4.3 Startup with Multiple Cores Hangs on Registering Core
Thanks for clearing that up Erick. The updateLog XML element isn't present in any of the solrconfig.xml files, so I don't believe this is enabled. I posted the directory listing of all of the core data directories in a prior post, but there are no files/folders found that contain tlog in the name of them. -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Friday, September 06, 2013 9:18 AM To: solr-user@lucene.apache.org Subject: Re: Solr 4.3 Startup with Multiple Cores Hangs on Registering Core bq: I'm actually not using the transaction log (or the NRTCachingDirectoryFactory); it's currently set up to use the MMapDirectoryFactory, This isn't relevant to whether you're using the update log or not, this is just how the index is handled. Look for something in your solrconfig.xml like: updateLog str name=dir${solr.ulog.dir:}/str /updateLog The other thing to check is if you have files in a tlog directory that's a sibling to your index directory as Hoss suggested. You may well NOT have any transaction log, but it's something to check.
Re: solrcloud shards backup/restoration
I don't know that it's too bad though - its always been the case that if you do a backup while indexing, it's just going to get up to the last hard commit. With SolrCloud that will still be the case. So just make sure you do a hard commit right before taking the backup - yes, it might miss a few docs in the tran log, but if you are taking a back up while indexing, you don't have great precision in any case - you will roughly get a snapshot for around that time - even without SolrCloud, if you are worried about precision and getting every update into that backup, you want to stop indexing and commit first. But if you just want a rough snapshot for around that time, in both cases you can still just don't hard commit and take a snapshot. Mark Sent from my iPhone On Sep 6, 2013, at 1:13 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: The replication handler's backup command was built for pre-SolrCloud. It takes a snapshot of the index but it is unaware of the transaction log which is a key component in SolrCloud. Hence unless you stop updates, commit your changes and then take a backup, you will likely miss some updates. That being said, I'm curious to see how peer sync behaves when you try to restore from a snapshot. When you say that you haven't been successful in restoring, what exactly is the behaviour you observed? On Fri, Sep 6, 2013 at 5:14 AM, Aditya Sakhuja aditya.sakh...@gmail.com wrote: Hello, I was looking for a good backup / recovery solution for the solrcloud indexes. I am more looking for restoring the indexes from the index snapshot, which can be taken using the replicationHandler's backup command. I am looking for something that works with solrcloud 4.3 eventually, but still relevant if you tested with a previous version. I haven't been successful in have the restored index replicate across the new replicas, after I restart all the nodes, with one node having the restored index. Is restoring the indexes on all the nodes the best way to do it ? 
-- Regards, -Aditya Sakhuja -- Regards, Shalin Shekhar Mangar.
Re: solrcloud shards backup/restoration
Phone typing. The end should not say don't hard commit - it should say do a hard commit and take a snapshot. Mark Sent from my iPhone On Sep 6, 2013, at 7:26 AM, Mark Miller markrmil...@gmail.com wrote: I don't know that it's too bad though - its always been the case that if you do a backup while indexing, it's just going to get up to the last hard commit. With SolrCloud that will still be the case. So just make sure you do a hard commit right before taking the backup - yes, it might miss a few docs in the tran log, but if you are taking a back up while indexing, you don't have great precision in any case - you will roughly get a snapshot for around that time - even without SolrCloud, if you are worried about precision and getting every update into that backup, you want to stop indexing and commit first. But if you just want a rough snapshot for around that time, in both cases you can still just don't hard commit and take a snapshot. Mark Sent from my iPhone On Sep 6, 2013, at 1:13 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: The replication handler's backup command was built for pre-SolrCloud. It takes a snapshot of the index but it is unaware of the transaction log which is a key component in SolrCloud. Hence unless you stop updates, commit your changes and then take a backup, you will likely miss some updates. That being said, I'm curious to see how peer sync behaves when you try to restore from a snapshot. When you say that you haven't been successful in restoring, what exactly is the behaviour you observed? On Fri, Sep 6, 2013 at 5:14 AM, Aditya Sakhuja aditya.sakh...@gmail.com wrote: Hello, I was looking for a good backup / recovery solution for the solrcloud indexes. I am more looking for restoring the indexes from the index snapshot, which can be taken using the replicationHandler's backup command. I am looking for something that works with solrcloud 4.3 eventually, but still relevant if you tested with a previous version. 
I haven't been successful in have the restored index replicate across the new replicas, after I restart all the nodes, with one node having the restored index. Is restoring the indexes on all the nodes the best way to do it ? -- Regards, -Aditya Sakhuja -- Regards, Shalin Shekhar Mangar.
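Mark's "do a hard commit and take a snapshot" sequence boils down to two HTTP calls against a core. The sketch below just builds those request URLs; host, port, and core name are placeholders, and the replication handler must be configured for the backup command to work.

```python
# Sketch: the commit-then-backup sequence the thread recommends, expressed
# as the two HTTP requests you would issue. Base URL is a placeholder.

def backup_requests(base="http://localhost:8983/solr/collection1"):
    # 1. Hard commit first, so the snapshot includes everything up to "now"
    #    (anything still only in the transaction log would otherwise be missed).
    commit_url = base + "/update?commit=true"
    # 2. Then ask the replication handler for an index snapshot.
    backup_url = base + "/replication?command=backup"
    return [commit_url, backup_url]

urls = backup_requests()
```

Issuing them in the opposite order is exactly the failure mode Shalin describes: the snapshot is unaware of the transaction log, so uncommitted updates never make it into the backup.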
Re: Solr substring search
Yah, you're getting away with it due to the small data size. As your data grows, the underlying mechanisms have to enumerate every term in the field in order to find terms that match so it can get _very_ expensive with large data sets. Best to bite the bullet early or, better yet, see if you really need to support this use-case. Best, Erick On Fri, Sep 6, 2013 at 2:58 AM, Alvaro Cabrerizo topor...@gmail.com wrote: Hi: I would start looking: http://docs.lucidworks.com/display/solr/The+Standard+Query+Parser And the org.apache.lucene.queryparser.flexible.standard.StandardQueryParser.java Hope it helps. On Thu, Sep 5, 2013 at 11:30 PM, Scott Schneider scott_schnei...@symantec.com wrote: Hello, I'm trying to find out how Solr runs a query for *foo*. Google tells me that you need to use NGramFilterFactory for that kind of substring search, but I find that even with very simple fieldTypes, it just works. (Perhaps because I'm testing on very small data sets, Solr is willing to look through all the keywords.) e.g. This works on the tutorial. Can someone tell me exactly how this works and/or point me to the Lucene code that implements this? Thanks, Scott
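The cost difference Erick describes can be sketched in a few lines: a leading-wildcard query like *foo* must examine every term in the field's term dictionary, while an n-gram index pays at index time so the query becomes an exact token lookup. This is an illustration of the principle, not Lucene's actual implementation.

```python
# Sketch: why *foo* "just works" on small data but gets expensive.
# A substring wildcard must scan the whole term dictionary (left),
# whereas NGramFilter-style indexing makes "foo" an indexed token of
# every matching term, so matching is a cheap exact lookup (right).

def wildcard_scan(terms, sub):
    # O(number of terms): every term must be examined
    return [t for t in terms if sub in t]

def ngrams(term, n=3):
    # What an n-gram filter emits at index time (illustrative)
    return {term[i:i + n] for i in range(len(term) - n + 1)}

terms = ["food", "seafood", "barfoo", "bar"]
scan_hits = wildcard_scan(terms, "foo")
# With a trigram index, "foo" is itself a token of every matching term:
gram_hits = [t for t in terms if "foo" in ngrams(t)]
```

Both approaches return the same documents; only the work per query differs, which is why the tutorial-sized index hides the problem.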
RE: Store 2 dimensional array( of int values) in solr 4.0
Hi, Thanks for the quick reply. Sure, please find below the details as per your query. Essentially, I want to retrieve the doc through JSON [using JSON format as SOLR result output] and want JSON to pick the data from the dataX field as a two-dimensional array of ints. When I store the data as shown below, it shows up in JSON as an array of strings, where each inner array is rendered as a string (because that's how the field is configured and I'm storing it, not finding any other option). Following is the current JSON output that I'm able to fetch: dataX:["[20130614, 2]","[20130615, 11]","[20130616, 1]","[20130617, 1]","[20130619, 8]","[20130620, 5]","[20130623, 5]"] whereas I want to fetch dataX as something like: dataX:[[20130614, 2],[20130615, 11],[20130616, 1],[20130617, 1],[20130619, 8],[20130620, 5],[20130623, 5]] As can be seen, dataX is essentially a 2D array where each inner array holds two ints, one being a date and the other being a count. Please point me in the right direction. Appreciate your time. Thanks.

From: j...@basetechnology.com To: solr-user@lucene.apache.org Subject: Re: Store 2 dimensional array( of int values) in solr 4.0 Date: Fri, 6 Sep 2013 08:44:06 -0400 First you need to tell us how you wish to use and query the data. That will largely determine how the data must be stored. Give us a few example queries of how you would like your application to be able to access the data. Note that Lucene has only simple multivalued fields - no structure or nesting within a single field other than a list of scalar values. But you can always store a complex structure as a BSON blob or JSON string if all you want is to store and retrieve it in its entirety without querying its internal structure. And note that Lucene queries are field level - does a field contain or match a scalar value.
-- Jack Krupansky -Original Message- From: A Geek Sent: Friday, September 06, 2013 7:10 AM To: solr user Subject: Store 2 dimensional array( of int values) in solr 4.0 hi All, I'm trying to store a 2 dimensional array in SOLR [version 4.0]. Basically I've the following data: [[20121108, 1],[20121110, 7],[2012, 2],[20121112, 2]] ... The inner array being used to keep some count say X for that particular day. Currently, I'm using the following field to store this data: field name=dataX type=string indexed=true stored=true multiValued=true/ and I'm using python library pySolr to store the data. Currently the data that gets stored looks like this(its array of strings) arr name=dataXstr[20121108, 1]/strstr[20121110, 7]/strstr[2012, 2]/strstr[20121112, 2]/strstr[20121113, 2]/strstr[20121116, 1]/str/arr Is there a way, i can store the 2 dimensional array and the inner array can contain int values, like the one shown in the beginning example, such that the the final/stored data in SOLR looks something like: arr name=dataX arr name=indexint20121108/int int 7 /int /arr arr name=indexint 20121110/intint 12 /int/arr arr name=indexint 20121110/intint 12 /int/arr /arr Just a guess, I think for this case, we need to add one more field[the index for instance], for each inner array which will again be multivalued (which will store int values only)? How do I add the actual 2 dimensional array, how to pass the inner arrays and how to store the full doc that contains this 2 dimensional array. Please help me out sort this issue. Please share your views and point me in the right direction. Any help would be highly appreciated. I found similar things on the web, but not the one I'm looking for: http://lucene.472066.n3.nabble.com/Two-dimensional-array-in-Solr-schema-td4003309.html Thanks
Re: Invalid Version when slave node pull replication from master node
Whoa! You should _not_ be using replication with SolrCloud. You can use replication just fine with 4.4, just like you would have in 3.x say, but in that case you should not be using the zkHost or zkRun parameters, should not have a ZooKeeper ensemble running etc. In SolrCloud, all updates are routed to all the nodes at index time, otherwise it couldn't support, say, NRT processing. This makes replication not only unnecessary, but I wouldn't want to try to predict what problems it would cause. So keep a sharp distinction between running Solr 4x and SolrCloud. The latter is specifically enabled when you specify zkHost or zkRun when you start Solr as per the SolrCloud page. Best, Erick

On Wed, Sep 4, 2013 at 11:32 PM, YouPeng Yang yypvsxf19870...@gmail.com wrote: Hi all, I solved the problem by adding the coreName explicitly according to http://wiki.apache.org/solr/SolrReplication#Replicating_solrconfig.xml. But I want to make sure: is it necessary to set the coreName explicitly? Is there any SolrJ API to pull the replication on the slave node from the master node? regards

2013/9/5 YouPeng Yang yypvsxf19870...@gmail.com Hi again, I'm using Solr4.4.

2013/9/5 YouPeng Yang yypvsxf19870...@gmail.com Hi solr users, I'm testing the replication within SolrCloud. I just uncommented the replication section separately on the master and slave node. The replication section setting on the master node: <lst name="master"> <str name="replicateAfter">commit</str> <str name="replicateAfter">startup</str> <str name="confFiles">schema.xml,stopwords.txt</str> </lst> and on the slave node: <lst name="slave"> <str name="masterUrl">http://10.7.23.124:8080/solr/#/</str> <str name="pollInterval">00:00:50</str> </lst> After startup, an error comes out on the slave node: 80110110 [snapPuller-70-thread-1] ERROR org.apache.solr.handler.SnapPuller ?.Master at: http://10.7.23.124:8080/solr/#/ is not available. Index fetch failed.
Exception: Invalid version (expected 2, but 60) or the data in not in 'javabin' format Could anyone help me to solve the problem ? regards
Re: Tweaking boosts for more search results variety
Thank you Jack for the suggestion. We can try group by site. But considering that the number of sites is only about 1000 against the index size of 5 million, one can expect most of the hits would be hidden, and for certain specific keywords only a handful of actual results could be displayed if results are grouped by site. We already group on a signature field to identify duplicate content in these 5 million+ docs, but there the number of duplicates is only about 3-5% maximum. Is there any workaround for these limitations with grouping? Thanks Shyam

On Thu, Sep 5, 2013 at 9:16 PM, Jack Krupansky j...@basetechnology.com wrote: The grouping (field collapsing) feature somewhat addresses this - group by a site field and then if more than one or a few top pages are from the same site they get grouped or collapsed so that you can see more sites in a few results. See: http://wiki.apache.org/solr/FieldCollapsing https://cwiki.apache.org/confluence/display/solr/Result+Grouping -- Jack Krupansky

-Original Message- From: Sai Gadde Sent: Thursday, September 05, 2013 2:27 AM To: solr-user@lucene.apache.org Subject: Tweaking boosts for more search results variety Our index is aggregated content from various sites on the web. We want a good user experience by showing multiple sites in the search results. In our setup we are seeing most of the results from the same site at the top. Here is some information regarding queries and schema: site - String field. We have about 1000 sites in the index. sitetype - String field. We have 3 site types. omitNorms=true for both fields. Doc count varies largely based on site and sitetype, by a factor of 10 - 1000 times. Total index size is about 5 million docs. Solr Version: 4.0 In our queries we have a fixed and preferential boost for certain sites. sitetype has different and fixed boosts for its 3 possible values.
We turned off Inverse Document Frequency (IDF) for these boosts to work properly. Other text fields are boosted based on search keywords only. With this setup we often see a bunch of hits from a single site followed by next etc., Is there any solution to see results from variety of sites and still keep the preferential boosts in place?
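When grouping is too coarse (most hits hidden behind 1000 site groups), one workaround is client-side diversification: keep the boost ordering, but cap how many results any single site contributes to the top of the page and demote the rest. The sketch below is illustrative post-processing, not a Solr feature.

```python
# Sketch of a client-side workaround: cap how many hits a single site may
# contribute near the top, preserving the preferential boost order
# otherwise. Overflow hits are demoted to the tail rather than dropped.

def diversify(hits, per_site=2):
    """hits: (site, doc_id) pairs, already in boost-ranked order."""
    kept, counts, overflow = [], {}, []
    for site, doc_id in hits:
        if counts.get(site, 0) < per_site:
            counts[site] = counts.get(site, 0) + 1
            kept.append((site, doc_id))
        else:
            overflow.append((site, doc_id))
    return kept + overflow

hits = [("a", 1), ("a", 2), ("a", 3), ("b", 4), ("c", 5)]
reranked = diversify(hits)
```

This trades a little ranking fidelity within a site for visible variety across sites, without touching the boost configuration.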
Re: SolrCloud 4.x hangs under high update volume
Markus: See: https://issues.apache.org/jira/browse/SOLR-5216 On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma markus.jel...@openindex.iowrote: Hi Mark, Got an issue to watch? Thanks, Markus -Original message- From:Mark Miller markrmil...@gmail.com Sent: Wednesday 4th September 2013 16:55 To: solr-user@lucene.apache.org Subject: Re: SolrCloud 4.x hangs under high update volume I'm going to try and fix the root cause for 4.5 - I've suspected what it is since early this year, but it's never personally been an issue, so it's rolled along for a long time. Mark Sent from my iPhone On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt t...@elementspace.com wrote: Hey guys, I am looking into an issue we've been having with SolrCloud since the beginning of our testing, all the way from 4.1 to 4.3 (haven't tested 4.4.0 yet). I've noticed other users with this same issue, so I'd really like to get to the bottom of it. Under a very, very high rate of updates (2000+/sec), after 1-12 hours we see stalled transactions that snowball to consume all Jetty threads in the JVM. This eventually causes the JVM to hang with most threads waiting on the condition/stack provided at the bottom of this message. At this point SolrCloud instances then start to see their neighbors (who also have all threads hung) as down w/Connection Refused, and the shards become down in state. Sometimes a node or two survives and just returns 503s no server hosting shard errors. As a workaround/experiment, we have tuned the number of threads sending updates to Solr, as well as the batch size (we batch updates from client - solr), and the Soft/Hard autoCommits, all to no avail. Turning off Client-to-Solr batching (1 update = 1 call to Solr), which also did not help. Certain combinations of update threads and batch sizes seem to mask/help the problem, but not resolve it entirely. Our current environment is the following: - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7. - 3 x Zookeeper instances, external Java 7 JVM. 
- 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard and a replica of 1 shard). - Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a good day. - 5000 max jetty threads (well above what we use when we are healthy), Linux-user threads ulimit is 6000. - Occurs under Jetty 8 or 9 (many versions). - Occurs under Java 1.6 or 1.7 (several minor versions). - Occurs under several JVM tunings. - Everything seems to point to Solr itself, and not a Jetty or Java version (I hope I'm wrong). The stack trace that is holding up all my Jetty QTP threads is the following, which seems to be waiting on a lock that I would very much like to understand further: java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for 0x0007216e68d8 (a java.util.concurrent.Semaphore$NonfairSync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303) at java.util.concurrent.Semaphore.acquire(Semaphore.java:317) at org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61) at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418) at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368) at org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300) at org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96) at org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462) at org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178) at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486) at
Re: Solr 4.3 Startup with Multiple Cores Hangs on Registering Core
bq: I'm actually not using the transaction log (or the NRTCachingDirectoryFactory); it's currently set up to use the MMapDirectoryFactory, This isn't relevant to whether you're using the update log or not, this is just how the index is handled. Look for something in your solrconfig.xml like: <updateLog> <str name="dir">${solr.ulog.dir:}</str> </updateLog> The other thing to check is if you have files in a tlog directory that's a sibling to your index directory as Hoss suggested. You may well NOT have any transaction log, but it's something to check.
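The "look for updateLog in your solrconfig.xml" check can be automated with a small script. The element name matches what Erick describes; the sample config below is a minimal stand-in for a real solrconfig.xml, where the element normally sits inside <updateHandler>.

```python
# Sketch: check whether the updateLog element is present in a
# solrconfig.xml, using the stdlib XML parser. The sample config is a
# minimal stand-in; real files nest updateLog under <updateHandler>.
import xml.etree.ElementTree as ET

solrconfig = """
<config>
  <updateHandler>
    <updateLog>
      <str name="dir">${solr.ulog.dir:}</str>
    </updateLog>
  </updateHandler>
</config>
"""

root = ET.fromstring(solrconfig)
update_log_enabled = root.find(".//updateLog") is not None
```

If the element is absent and there is no tlog directory beside the index directory, the transaction log really is disabled, as the original poster concluded.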
Re: Facet Count and RegexTransformer splitBy
Facet counts are per field - your counts are scattered across different fields. There are additional capabilities in the facet component, but first you should describe exactly what your requirements are. -- Jack Krupansky -Original Message- From: Raheel Hasan Sent: Friday, September 06, 2013 9:58 AM To: solr-user@lucene.apache.org Subject: Facet Count and RegexTransformersplitBy Hi guyz, Just a quick question: I have a field that has CSV values in the database. So I will use the DataImportHandler and will index it using RegexTransformer's splitBy attribute. However, since this is the first time I am doing it, I just wanted to be sure if it will work for Facet Count? For example: From query results (say this is the values in that field): row 1 = 1,2,3,4 row 2 = 1,4,5,3 row 3 = 2,1,20,66 . . . . so facet count will get me: '1' = 3 occurrence '2' = 2 occur. . . .and so on. -- Regards, Raheel Hasan
Re: charfilter doesn't do anything
On 9/6/2013 7:09 AM, Andreas Owen wrote: i've managed to get it working if i use the regexTransformer and the string is on the same line in my tika entity. but when the string is multilined it isn't working, even though i tried ?s to set the dotall flag. <entity name="tika" processor="TikaEntityProcessor" url="${rec.url}" dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" transformer="RegexTransformer"> <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;" replaceWith="QQQ" sourceColName="text" /> </entity> then i tried it like this and i get a stack overflow: <field column="text_html" regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;" replaceWith="QQQ" sourceColName="text" /> in javascript this works, but maybe because i only used a small string.

Sounds like we've got an XY problem here. http://people.apache.org/~hossman/#xyproblem How about you tell us *exactly* what you'd actually like to have happen and then we can find a solution for you? It sounds a little bit like you're interested in stripping all the HTML tags out. Perhaps the HTMLStripCharFilter? http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory Something that I already said: By using the KeywordTokenizer, you won't be able to search for individual words on your HTML input. The entire input string is treated as a single token, and therefore ONLY exact entire-field matches (or certain wildcard matches) will be possible. http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory Note that no matter what you do to your data with the analysis chain, Solr will always return the text that was originally indexed in search results. If you need to affect what gets stored as well, perhaps you need an Update Processor. Thanks, Shawn
Re: Store 2 dimensional array( of int values) in solr 4.0
You still haven't supplied any queries. If all you really need is the JSON as a blob, simply store it as a string and parse the JSON in your application layer. -- Jack Krupansky -Original Message- From: A Geek Sent: Friday, September 06, 2013 10:30 AM To: solr user Subject: RE: Store 2 dimensional array( of int values) in solr 4.0 Hi,Thanks for the quick reply. Sure, please find below the details as per your query. Essentially, I want to retrieve the doc through JSON [using JSON format as SOLR result output]and want JSON to pick the the data from the dataX field as a two dimensional array of ints. When I store the data as show below, it shows up in JSON array of strings where the internal array is basically shown as strings (because thats how the field is configured and I'm storing, not finding any other option). Following is the current JSON output that I'm able to fetch: dataX:[[20130614, 2],[20130615, 11],[20130616, 1],[20130617, 1],[20130619, 8],[20130620, 5],[20130623, 5]] whereas I want to fetch the dataX as something like: dataX:[[20130614, 2],[20130615, 11],[20130616, 1],[20130617, 1],[20130619, 8],[20130620, 5],[20130623, 5]] as can be seen, the dataX is essentially a 2D array where the internal array is of two ints, one being date and other being the count. Please point me in the right direction. Appreciate your time. Thanks. From: j...@basetechnology.com To: solr-user@lucene.apache.org Subject: Re: Store 2 dimensional array( of int values) in solr 4.0 Date: Fri, 6 Sep 2013 08:44:06 -0400 First you need to tell us how you wish to use and query the data. That will largely determine how the data must be stored. Give us a few example queries of how you would like your application to be able to access the data. Note that Lucene has only simple multivalued fields - no structure or nesting within a single field other that a list of scalar values. 
But you can always store a complex structure as a BSON blob or JSON string if all you want is to store and retrieve it in its entirety without querying its internal structure. And note that Lucene queries are field level - does a field contain or match a scalar value. -- Jack Krupansky -Original Message- From: A Geek Sent: Friday, September 06, 2013 7:10 AM To: solr user Subject: Store 2 dimensional array( of int values) in solr 4.0 hi All, I'm trying to store a 2 dimensional array in SOLR [version 4.0]. Basically I've the following data: [[20121108, 1],[20121110, 7],[2012, 2],[20121112, 2]] ... The inner array being used to keep some count say X for that particular day. Currently, I'm using the following field to store this data: field name=dataX type=string indexed=true stored=true multiValued=true/ and I'm using python library pySolr to store the data. Currently the data that gets stored looks like this(its array of strings) arr name=dataXstr[20121108, 1]/strstr[20121110, 7]/strstr[2012, 2]/strstr[20121112, 2]/strstr[20121113, 2]/strstr[20121116, 1]/str/arr Is there a way, i can store the 2 dimensional array and the inner array can contain int values, like the one shown in the beginning example, such that the the final/stored data in SOLR looks something like: arr name=dataX arr name=indexint20121108/int int 7 /int /arr arr name=indexint 20121110/intint 12 /int/arr arr name=indexint 20121110/intint 12 /int/arr /arr Just a guess, I think for this case, we need to add one more field[the index for instance], for each inner array which will again be multivalued (which will store int values only)? How do I add the actual 2 dimensional array, how to pass the inner arrays and how to store the full doc that contains this 2 dimensional array. Please help me out sort this issue. Please share your views and point me in the right direction. Any help would be highly appreciated. 
I found similar things on the web, but not the one I'm looking for: http://lucene.472066.n3.nabble.com/Two-dimensional-array-in-Solr-schema-td4003309.html Thanks
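Jack's "store it as a JSON string and parse in your application layer" suggestion round-trips cleanly, and the inner values come back as ints rather than the stringified inner arrays the poster is seeing. A minimal sketch of the blob approach:

```python
# Sketch of the JSON-blob approach: serialize the whole 2-D array into
# one stored string field, and parse it back in the application layer so
# the inner values stay ints. The field itself would be a plain stored
# string in the schema; Solr never queries inside it.
import json

data_x = [[20130614, 2], [20130615, 11], [20130616, 1]]

stored_value = json.dumps(data_x)          # what goes into the string field
round_tripped = json.loads(stored_value)   # what the app does with the result
```

The trade-off is exactly as Jack states: the structure is opaque to Lucene, so you can retrieve it whole but cannot query the inner dates or counts individually.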
Re: Regarding improving performance of the solr
On 9/6/2013 2:54 AM, prabu palanisamy wrote: I am currently using solr -3.5.0, indexed wikipedia dump (50 gb) with java 1.6. I am searching the solr with text (which is actually twitter tweets) . Currently it takes average time of 210 millisecond for each post, out of which 200 millisecond is consumed by solr server (QTime). I used the jconsole monitor tool. If the size of all your Solr indexes on disk is in the 50GB range of your wikipedia dump, then for ideal performance, you'll want to have 50GB of free memory so the OS can cache your index. You might be able to get by with 25-30GB of free memory, depending on your index composition. Note that this is memory over and above what you allocate to the Solr JVM, and memory used by other processes on the machine. If you do have other services on the same machine, note that those programs might ALSO require OS disk cache RAM. http://wiki.apache.org/solr/SolrPerformanceProblems#OS_Disk_Cache Thanks, Shawn
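Shawn's sizing advice reduces to simple arithmetic: ideal free RAM for the OS disk cache is roughly the on-disk index size, and you may get by with about half of it. The 0.5 lower bound below is taken from the "25-30GB for a 50GB index" hedge in the email, not an exact rule.

```python
# Back-of-the-envelope version of the OS-disk-cache sizing advice:
# ideal free RAM ~= on-disk index size; you can often get by with
# roughly half, depending on index composition. This memory is in
# ADDITION to the Solr JVM heap and other processes.

def disk_cache_target_gb(index_size_gb):
    """Return (get-by, ideal) free-RAM targets in GB."""
    return (0.5 * index_size_gb, 1.0 * index_size_gb)

low, ideal = disk_cache_target_gb(50)  # the 50GB wikipedia index
```

For the 50GB index in question this gives a 25-50GB free-memory target, which explains why a JVM heap of tens of megabytes is not the knob to turn here.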
Re: charfilter doesn't do anything
ok, i have html pages of the form <html>...<!--body-->content i want<!--/body-->...</html>. i want to extract (index and store) only what is between the body-comments. i thought the regexTransformer would be best because xpath doesn't work in tika and i can't nest an XPathEntityProcessor to use xpath. what i have also found out is that the htmlparser from tika cuts my body-comments out and tries to make well-formed html, which i would like to switch off.

On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote: On 9/6/2013 7:09 AM, Andreas Owen wrote: i've managed to get it working if i use the regexTransformer and the string is on the same line in my tika entity. but when the string is multilined it isn't working, even though i tried ?s to set the dotall flag. <entity name="tika" processor="TikaEntityProcessor" url="${rec.url}" dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" transformer="RegexTransformer"> <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;" replaceWith="QQQ" sourceColName="text" /> </entity> then i tried it like this and i get a stack overflow: <field column="text_html" regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;" replaceWith="QQQ" sourceColName="text" /> in javascript this works, but maybe because i only used a small string. Sounds like we've got an XY problem here. http://people.apache.org/~hossman/#xyproblem How about you tell us *exactly* what you'd actually like to have happen and then we can find a solution for you? It sounds a little bit like you're interested in stripping all the HTML tags out. Perhaps the HTMLStripCharFilter? http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory Something that I already said: By using the KeywordTokenizer, you won't be able to search for individual words on your HTML input. The entire input string is treated as a single token, and therefore ONLY exact entire-field matches (or certain wildcard matches) will be possible.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory Note that no matter what you do to your data with the analysis chain, Solr will always return the text that was originally indexed in search results. If you need to affect what gets stored as well, perhaps you need an Update Processor. Thanks, Shawn
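On the multiline-regex part of this thread: in standard regex syntax the (?s) inline flag ("dotall") makes "." match newlines, and a lazy .+? avoids the pathological backtracking that the ((.|\n|\r)+) alternation can trigger on large inputs. A minimal sketch of extracting a body that spans lines:

```python
# Sketch of multiline body extraction: (?s) turns on dotall so "."
# matches newlines, and the lazy ".+?" stops at the first closing tag,
# avoiding the catastrophic backtracking that `((.|\n|\r)+)` invites.
import re

html = "<html>\n<body>\nline one\nline two\n</body>\n</html>"
match = re.search(r"(?s)<body>(.+?)</body>", html)
body = match.group(1).strip() if match else ""
```

Whether Solr's RegexTransformer honors (?s) in its regex attribute is a separate question from the pattern itself; the pattern shape above is what the user's JavaScript /m experiment was reaching for.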
CRLF Invalid Exception ?
Has anyone ever hit this when adding documents to SOLR? What does it mean?

ERROR [http-8983-6] 2013-09-06 10:09:32,700 SolrException.java (line 108) org.apache.solr.common.SolrException: Invalid CRLF
        at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:175)
        at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1817)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:663)
        at com.datastax.bdp.cassandra.index.solr.CassandraDispatchFilter.execute(CassandraDispatchFilter.java:176)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
        at com.datastax.bdp.cassandra.index.solr.CassandraDispatchFilter.doFilter(CassandraDispatchFilter.java:139)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at com.datastax.bdp.cassandra.audit.SolrHttpAuditLogFilter.doFilter(SolrHttpAuditLogFilter.java:194)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at com.datastax.bdp.cassandra.index.solr.auth.CassandraAuthorizationFilter.doFilter(CassandraAuthorizationFilter.java:95)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at com.datastax.bdp.cassandra.index.solr.auth.DseAuthenticationFilter.doFilter(DseAuthenticationFilter.java:102)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
        at java.lang.Thread.run(Thread.java:722)
Caused by: com.ctc.wstx.exc.WstxIOException: Invalid CRLF
        at com.ctc.wstx.sr.StreamScanner.throwFromIOE(StreamScanner.java:708)
        at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1086)
        at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:387)
        at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:245)
        at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
        ... 30 more
Caused by: java.io.IOException: Invalid CRLF
        at org.apache.coyote.http11.filters.ChunkedInputFilter.parseCRLF(ChunkedInputFilter.java:352)
        at org.apache.coyote.http11.filters.ChunkedInputFilter.doRead(ChunkedInputFilter.java:151)
        at org.apache.coyote.http11.InternalInputBuffer.doRead(InternalInputBuffer.java:710)
        at org.apache.coyote.Request.doRead(Request.java:428)
        at org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:304)
        at org.apache.tomcat.util.buf.ByteChunk.substract(ByteChunk.java:403)
        at org.apache.catalina.connector.InputBuffer.read(InputBuffer.java:327)
        at org.apache.catalina.connector.CoyoteInputStream.read(CoyoteInputStream.java:162)
        at com.ctc.wstx.io.UTF8Reader.loadMore(UTF8Reader.java:365)
        at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:110)
        at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
        at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
        at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
        at com.ctc.wstx.sr.StreamScanner.loadMoreFromCurrent(StreamScanner.java:1046)
        at com.ctc.wstx.sr.StreamScanner.parseLocalName2(StreamScanner.java:1796)
        at com.ctc.wstx.sr.StreamScanner.parseLocalName(StreamScanner.java:1756)
        at com.ctc.wstx.sr.BasicStreamReader.handleStartElem(BasicStreamReader.java:2914)
        at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2848)
        at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1019)
        ... 33 more
Re: Facet Count and RegexTransformer splitBy
You're not being clear here - are the commas delimiting fields or do you have one value per row? Yes, you can tokenize a comma-delimited value in Solr. -- Jack Krupansky

-Original Message- From: Raheel Hasan Sent: Friday, September 06, 2013 11:54 AM To: solr-user@lucene.apache.org Subject: Re: Facet Count and RegexTransformer splitBy

Hi, What I want is very simple: The query results: row 1 = a,b,c,d row 2 = a,f,r,e row 3 = a,c,ff,e,b .. facet count needed: 'a' = 3 occurrences 'b' = 2 occur. 'c' = 2 occur. . . . I searched and found a solution here: http://stackoverflow.com/questions/9914483/solr-facet-multiple-words-with-comma-separated-values But I want to be sure if it will work.

On Fri, Sep 6, 2013 at 8:20 PM, Jack Krupansky j...@basetechnology.com wrote: Facet counts are per field - your counts are scattered across different fields. There are additional capabilities in the facet component, but first you should describe exactly what your requirements are. -- Jack Krupansky

-Original Message- From: Raheel Hasan Sent: Friday, September 06, 2013 9:58 AM To: solr-user@lucene.apache.org Subject: Facet Count and RegexTransformer splitBy

Hi guyz, Just a quick question: I have a field that has CSV values in the database. So I will use the DataImportHandler and will index it using RegexTransformer's splitBy attribute. However, since this is the first time I am doing it, I just wanted to be sure if it will work for Facet Count? For example: From query results (say this is the values in that field): row 1 = 1,2,3,4 row 2 = 1,4,5,3 row 3 = 2,1,20,66 . . . . so facet count will get me: '1' = 3 occurrences '2' = 2 occur. . . . and so on.

-- Regards, Raheel Hasan

-- Regards, Raheel Hasan
Re: Facet Count and RegexTransformer splitBy
It's a csv from the database. I will import it like this (say for example the field is 'emailids' and it contains a csv of email ids): <field column="mailId" splitBy="," sourceColName="emailids"/>

On Fri, Sep 6, 2013 at 9:01 PM, Jack Krupansky j...@basetechnology.com wrote: You're not being clear here - are the commas delimiting fields or do you have one value per row? Yes, you can tokenize a comma-delimited value in Solr. -- Jack Krupansky

-Original Message- From: Raheel Hasan Sent: Friday, September 06, 2013 11:54 AM To: solr-user@lucene.apache.org Subject: Re: Facet Count and RegexTransformer splitBy

Hi, What I want is very simple: The query results: row 1 = a,b,c,d row 2 = a,f,r,e row 3 = a,c,ff,e,b .. facet count needed: 'a' = 3 occurrences 'b' = 2 occur. 'c' = 2 occur. . . . I searched and found a solution here: http://stackoverflow.com/questions/9914483/solr-facet-multiple-words-with-comma-separated-values But I want to be sure if it will work.

On Fri, Sep 6, 2013 at 8:20 PM, Jack Krupansky j...@basetechnology.com wrote: Facet counts are per field - your counts are scattered across different fields. There are additional capabilities in the facet component, but first you should describe exactly what your requirements are. -- Jack Krupansky

-Original Message- From: Raheel Hasan Sent: Friday, September 06, 2013 9:58 AM To: solr-user@lucene.apache.org Subject: Facet Count and RegexTransformer splitBy

Hi guyz, Just a quick question: I have a field that has CSV values in the database. So I will use the DataImportHandler and will index it using RegexTransformer's splitBy attribute. However, since this is the first time I am doing it, I just wanted to be sure if it will work for Facet Count? For example: From query results (say this is the values in that field): row 1 = 1,2,3,4 row 2 = 1,4,5,3 row 3 = 2,1,20,66 . . . . so facet count will get me: '1' = 3 occurrences '2' = 2 occur. . . . and so on.

-- Regards, Raheel Hasan

-- Regards, Raheel Hasan

-- Regards, Raheel Hasan
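For the comma-delimited case being discussed, one way to get per-value facet counts is a field type whose analyzer tokenizes on commas, so each CSV value becomes its own indexed term. A minimal sketch (the type and field names here are illustrative, not from the thread):

```xml
<!-- Illustrative names; splits "a,b,c" into the separate terms a, b, c,
     so faceting on the field counts each value independently. -->
<fieldType name="csv_terms" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*"/>
  </analyzer>
</fieldType>
<field name="csvField" type="csv_terms" indexed="true" stored="true" multiValued="true"/>
```

Faceting on such a field with facet=true&facet.field=csvField should then return one bucket per CSV value, which is the counting behavior asked for above.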
Connection Established but waiting for response for a long time.
Hi, I'm running solr 4.0 but using a legacy distributed search setup. I set the shards parameter for search, but index into each solr shard directly. The problem I have been experiencing is building connections with the solr shards. If I run a query, by using wget, to get the number of records from each individual shard (50 of them) sequentially, the request will hang at some shards (seemingly random). The wget log will say the connection is established but waiting for response. At that point I thought that the Solr shard might be under high load, but the strange behavior happens when I send another request to the same shard (using wget again) from another thread: the response comes back, and will trigger something in Solr to send back the response for the first request I sent before. This also happens in my daily indexing. If I send a commit, it will sometimes hang. However, if I send another commit to the same shard, both commits will come back fine. I'm running Solr on the stock jetty server, and some time back my boss told me to set the maxIdleTime to 5000 for indexing purposes. I'm not sure if this has anything to do with the strange behavior that I'm seeing right now. Please help me resolve this issue. Thanks, Qun -- View this message in context: http://lucene.472066.n3.nabble.com/Connection-Established-but-waiting-for-response-for-a-long-time-tp4088587.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Facet Count and RegexTransformer splitBy
let me further elaborate:

[dbtable1] field1 = int, field2 = string (solr indexing = true), field3 = csv
[During import into solr] splitBy=,
[After import] solr will be searched for terms from field2.
[needed] counts of occurrences of each value in the csv

On Fri, Sep 6, 2013 at 9:35 PM, Raheel Hasan raheelhasan@gmail.com wrote: It's a csv from the database. I will import it like this (say for example the field is 'emailids' and it contains a csv of email ids): <field column="mailId" splitBy="," sourceColName="emailids"/>

On Fri, Sep 6, 2013 at 9:01 PM, Jack Krupansky j...@basetechnology.com wrote: You're not being clear here - are the commas delimiting fields or do you have one value per row? Yes, you can tokenize a comma-delimited value in Solr. -- Jack Krupansky

-Original Message- From: Raheel Hasan Sent: Friday, September 06, 2013 11:54 AM To: solr-user@lucene.apache.org Subject: Re: Facet Count and RegexTransformer splitBy

Hi, What I want is very simple: The query results: row 1 = a,b,c,d row 2 = a,f,r,e row 3 = a,c,ff,e,b .. facet count needed: 'a' = 3 occurrences 'b' = 2 occur. 'c' = 2 occur. . . . I searched and found a solution here: http://stackoverflow.com/questions/9914483/solr-facet-multiple-words-with-comma-separated-values But I want to be sure if it will work.

On Fri, Sep 6, 2013 at 8:20 PM, Jack Krupansky j...@basetechnology.com wrote: Facet counts are per field - your counts are scattered across different fields. There are additional capabilities in the facet component, but first you should describe exactly what your requirements are. -- Jack Krupansky

-Original Message- From: Raheel Hasan Sent: Friday, September 06, 2013 9:58 AM To: solr-user@lucene.apache.org Subject: Facet Count and RegexTransformer splitBy

Hi guyz, Just a quick question: I have a field that has CSV values in the database. So I will use the DataImportHandler and will index it using RegexTransformer's splitBy attribute. However, since this is the first time I am doing it, I just wanted to be sure if it will work for Facet Count? For example: From query results (say this is the values in that field): row 1 = 1,2,3,4 row 2 = 1,4,5,3 row 3 = 2,1,20,66 . . . . so facet count will get me: '1' = 3 occurrences '2' = 2 occur. . . . and so on.

-- Regards, Raheel Hasan

-- Regards, Raheel Hasan

-- Regards, Raheel Hasan

-- Regards, Raheel Hasan
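The expected counts in this exchange are easy to sanity-check outside Solr. A small sketch (illustrative only, not DataImportHandler code) that mimics splitBy="," followed by term counting:

```python
from collections import Counter

# Rows as they might come back from the database, one CSV string per row
# (the sample values from the thread).
rows = ["a,b,c,d", "a,f,r,e", "a,c,ff,e,b"]

# Splitting each row on commas -- what RegexTransformer's splitBy="," does
# at import time -- and counting the resulting terms mirrors the facet
# counts Solr would report on the tokenized field.
counts = Counter(value for row in rows for value in row.split(","))

assert counts["a"] == 3  # 'a' appears in all three rows
assert counts["b"] == 2
assert counts["c"] == 2
```

Note that "f" and "ff" stay distinct terms, which is exactly the behavior wanted here.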
Re: CRLF Invalid Exception ?
Thanks. I realized there's an error in the ChunkedInputFilter... I'm not sure if this means there's a bug in the client library I'm using (solrj 4.3) or is a bug in the server SOLR 4.3? Or is there something in my data that's causing the issue? On Fri, Sep 6, 2013 at 1:02 PM, Chris Hostetter hossman_luc...@fucit.orgwrote: : Has anyone ever hit this when adding documents to SOLR? What does it mean? Always check for the root cause... : Caused by: java.io.IOException: Invalid CRLF : : at : org.apache.coyote.http11.filters.ChunkedInputFilter.parseCRLF(ChunkedInputFilter.java:352) ...so while Solr is trying to read XML off the InputStream from the client, an error is encountered by the ChunkedInputFilter. I suspect the client library you are using for the HTTP connection is claiming it's using chunking but isn't, or is doing something wrong with the chunking, or there is a bug in the ChunkedInputFilter. -Hoss
Re: Facet Count and RegexTransformer splitBy
basically, a field having a csv... and find counts / number of occurrences of each csv value..

On Fri, Sep 6, 2013 at 8:54 PM, Raheel Hasan raheelhasan@gmail.com wrote: Hi, What I want is very simple: The query results: row 1 = a,b,c,d row 2 = a,f,r,e row 3 = a,c,ff,e,b .. facet count needed: 'a' = 3 occurrences 'b' = 2 occur. 'c' = 2 occur. . . . I searched and found a solution here: http://stackoverflow.com/questions/9914483/solr-facet-multiple-words-with-comma-separated-values But I want to be sure if it will work.

On Fri, Sep 6, 2013 at 8:20 PM, Jack Krupansky j...@basetechnology.com wrote: Facet counts are per field - your counts are scattered across different fields. There are additional capabilities in the facet component, but first you should describe exactly what your requirements are. -- Jack Krupansky

-Original Message- From: Raheel Hasan Sent: Friday, September 06, 2013 9:58 AM To: solr-user@lucene.apache.org Subject: Facet Count and RegexTransformer splitBy

Hi guyz, Just a quick question: I have a field that has CSV values in the database. So I will use the DataImportHandler and will index it using RegexTransformer's splitBy attribute. However, since this is the first time I am doing it, I just wanted to be sure if it will work for Facet Count? For example: From query results (say this is the values in that field): row 1 = 1,2,3,4 row 2 = 1,4,5,3 row 3 = 2,1,20,66 . . . . so facet count will get me: '1' = 3 occurrences '2' = 2 occur. . . . and so on.

-- Regards, Raheel Hasan

-- Regards, Raheel Hasan

-- Regards, Raheel Hasan
Re: CRLF Invalid Exception ?
: Has anyone ever hit this when adding documents to SOLR? What does it mean? Always check for the root cause... : Caused by: java.io.IOException: Invalid CRLF : : at : org.apache.coyote.http11.filters.ChunkedInputFilter.parseCRLF(ChunkedInputFilter.java:352) ...so while Solr is trying to read XML off the InputStream from the client, an error is encountered by the ChunkedInputFilter. I suspect the client library you are using for the HTTP connection is claiming it's using chunking but isn't, or is doing something wrong with the chunking, or there is a bug in the ChunkedInputFilter. -Hoss
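Hoss's chunking diagnosis is easier to follow with the wire format in view: in HTTP chunked transfer encoding, each chunk is a hex length, CRLF, the data, then a mandatory trailing CRLF, followed eventually by a terminating zero-length chunk. A parser along these lines (a simplified sketch, not Tomcat's actual ChunkedInputFilter) fails with exactly this class of error when a chunk's trailing CRLF is missing:

```python
def dechunk(raw: bytes) -> bytes:
    """Decode an HTTP chunked body; raise if a chunk's trailing CRLF is missing."""
    body, pos = b"", 0
    while True:
        eol = raw.index(b"\r\n", pos)
        size = int(raw[pos:eol], 16)           # chunk size line is hexadecimal
        if size == 0:
            return body                        # terminating zero-length chunk
        start = eol + 2
        body += raw[start:start + size]
        pos = start + size
        if raw[pos:pos + 2] != b"\r\n":        # every chunk must end with CRLF
            raise IOError("Invalid CRLF")
        pos += 2

# A well-formed chunked body decodes cleanly:
assert dechunk(b"4\r\nWiki\r\n0\r\n\r\n") == b"Wiki"
```

If the client announces Transfer-Encoding: chunked but then writes the body without the per-chunk CRLFs, the server-side check above is the one that trips, which matches Hoss's list of suspects.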
SOLR 4.x vs 3.x parsedquery differences
I'm migrating from 3.x to 4.x and I'm running some queries to verify that everything works like before. I've found however that the query galaxy s3 is giving much fewer results. In 3.x numFound=1628, in 4.x numFound=70. Here's the relevant schema part:

<fieldtype name="text_pt" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="-" replacement="IIIHYPHENIII"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="IIIHYPHENIII" replacement="-"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" preserveOriginal="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="false" words="portugueseStopWords.txt"/>
    <filter class="solr.BrazilianStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="-" replacement="IIIHYPHENIII"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="IIIHYPHENIII" replacement="-"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" ignoreCase="true" synonyms="portugueseSynonyms.txt" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" preserveOriginal="1" catenateNumbers="0" catenateAll="0" protected="protwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="false" words="portugueseStopWords.txt"/>
    <filter class="solr.BrazilianStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>

The synonyms involved in this query are:
siii, s3
galaxy, galax

My default search operator is AND (in both versions, even if it's deprecated in 4.x), and the output of the debug is:

SOLR 3.x:
<str name="parsedquery">+(title_search_pt:galaxy title_search_pt:galax) +MultiPhraseQuery(title_search_pt:(sii s3 s) 3)</str>

SOLR 4.x:
<str name="parsedquery">+((title_search_pt:galaxy title_search_pt:galax)/no_coord) +(+title_search_pt:sii +title_search_pt:s3 +title_search_pt:s +title_search_pt:3)</str>

The weird thing is that it does not return results like 'galaxy s3'. This is the debug query:

no match on required clause (+title_search_pt:sii +title_search_pt:s3 +title_search_pt:s +title_search_pt:3)
(NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s),
*no match on required clause (title_search_pt:sii)*
(NON-MATCH) no matching term
(MATCH) weight(title_search_pt:s3 in 1834535)
(MATCH) weight(title_search_pt:s in 1834535)
(MATCH) weight(title_search_pt:3 in 1834535)

How is it that sii is *required* when it should be OR'ed with s and s3? The analysis output shows that sii has token position 2, like its synonyms, like so: galaxy sii 3 galax s3 s

Thanks, Raúl Cardozo.
Re: unknown _stream_source_info while indexing rich doc in solr
: it shows type as undefined for dynamic field ignored_*, and I am using

That means the running solr instance does not know anything about a dynamic field named ignored_* -- it doesn't exist.

: but on the admin page it shows schema :

the page showing the schema file just tells you what's on disk -- it has no way of knowing if you modified that file after starting up solr. ... Wait a minute ... i see your problem now...

: </fields>
: <dynamicField name="ignored_*" type="ignored" indexed="false" stored="true"
: multiValued="true"/>

...your <dynamicField/> declaration needs to be inside your <fields> block.

-Hoss
Re: SOLR 4.x vs 3.x parsedquery differences
: I'm migrating from 3.x to 4.x and I'm running some queries to verify that : everything works like before. I've found however that the query galaxy s3 : is giving much less results. In 3.x numFound=1628, in 4.x numFound=70. is your entire schema 100% identical in both cases? what is the luceneMatchVersion set to in your solrconfig.xml? By the looks of your debug output, it appears that you are using autoGeneratePhraseQueries=true in 3x, but have it set to false in 4x -- but the fieldType you posted here shows it set to false : fieldtype name=text_pt class=solr.TextField : positionIncrementGap=100 autoGeneratePhraseQueries=false ...i haven't tried to reproduce your specific situation, but that configuration doesn't smell right compared with what you are showing for the 3x output... : SOLR 3.x : : str name=parsedquery+(title_search_pt:galaxy : title_search_pt:galax) +MultiPhraseQuery(title_search_pt:(sii s3 s) : 3)/str : : SOLR 4.x : : str name=parsedquery+((title_search_pt:galaxy : title_search_pt:galax)/no_coord) +(+title_search_pt:sii : +title_search_pt:s3 +title_search_pt:s +title_search_pt:3)/str -Hoss
Re: CRLF Invalid Exception ?
For what it's worth... I just updated to solrj 4.4 (even though my server is solr 4.3) and it seems to have fixed the issue. Thanks for the help!

On Fri, Sep 6, 2013 at 1:41 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

: I'm not sure if this means there's a bug in the client library I'm using : (solrj 4.3) or is a bug in the server SOLR 4.3? Or is there something in : my data that's causing the issue?

It's unlikely that an error in the data you pass to SolrJ methods would be causing this problem -- i'm pretty sure it's not even a problem with the raw xml data being streamed, it appears to be a problem with how that data is getting chunked across the wire. My best guess is that the most likely causes are either...

* a bug in the HttpClient version you are using on the client side
* a bug in the ChunkedInputFilter you are using on the server side
* a misconfiguration on the HttpClient object you are using with SolrJ (ie: claiming it's sending chunked when it's not?)

-Hoss
Re: SOLR 4.x vs 3.x parsedquery differences
Besides liking or not the behaviour we are getting in 3.x, I'm required to keep everything working as close as possible as before. Have no idea why this is happening, but setting that field to true solved the issue, now I get the exact same amount of items in both queries! I wouldn't bother checking why that was so since we'll be moving away from the older version, which shows the inconsistency. But thanks a million. If you have a SO user I can mark yours as answer here: http://stackoverflow.com/questions/18661996/solr-4-x-vs-3-x-parsedquery-differences Cheers

On Sep 6, 2013 4:15 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

: Our schema is identical except the version. : In 3.x it's 1.1 and in 4.x it's 1.5.

That's kind of a significant difference to leave out -- independent of the question you are asking about here, it's going to make quite a few differences in how things are being parsed, and what defaults are.

If i'm understanding correctly: you like the behavior you are getting from Solr 3.x where phrases are generated automatically for you. what i can't understand, is how/why phrases are being generated automatically for you if you have that 'autoGeneratePhraseQueries=false' on your fieldType in your 3x schema ... that makes no sense to me. if you didn't have autoGeneratePhraseQueries specified at all, then the 'version=1.1' would explain it (up to version=1.3, the default for autoGeneratePhraseQueries was true, but in version=1.4 and above, it defaults to false) but with an explicit 'autoGeneratePhraseQueries=false' i can't explain why 3x works the way you say it works for you.

Bottom line: if you *want* the auto generated phrase query behavior in 4.x, you should just set 'autoGeneratePhraseQueries=true' on your fieldType.

: : I'm migrating from 3.x to 4.x and I'm running some queries to verify that : : everything works like before. I've found however that the query galaxy : s3 : : is giving much less results.
In 3.x numFound=1628, in 4.x numFound=70. : : is your entire schema 100% identical in both cases? : what is the luceneMatchVersion set to in your solrconfig.xml? : : : By the looks of your debug output, it appears that you are using : autoGeneratePhraseQueries=true in 3x, but have it set to false in 4x -- : but the fieldType you posted here shows it set to false : : : fieldtype name=text_pt class=solr.TextField : : positionIncrementGap=100 autoGeneratePhraseQueries=false : : ...i haven't tried to reproduce your specific situation, but that : configuration doesn't smell right compared with what you are showing for : the 3x output... : : : SOLR 3.x : : : : str name=parsedquery+(title_search_pt:galaxy : : title_search_pt:galax) +MultiPhraseQuery(title_search_pt:(sii s3 s) : : 3)/str : : : : SOLR 4.x : : : : str name=parsedquery+((title_search_pt:galaxy : : title_search_pt:galax)/no_coord) +(+title_search_pt:sii : : +title_search_pt:s3 +title_search_pt:s +title_search_pt:3)/str : : : -Hoss : : -Hoss
Re: CRLF Invalid Exception ?
: I'm not sure if this means there's a bug in the client library I'm using : (solrj 4.3) or is a bug in the server SOLR 4.3? Or is there something in : my data that's causing the issue?

It's unlikely that an error in the data you pass to SolrJ methods would be causing this problem -- i'm pretty sure it's not even a problem with the raw xml data being streamed, it appears to be a problem with how that data is getting chunked across the wire. My best guess is that the most likely causes are either...

* a bug in the HttpClient version you are using on the client side
* a bug in the ChunkedInputFilter you are using on the server side
* a misconfiguration on the HttpClient object you are using with SolrJ (ie: claiming it's sending chunked when it's not?)

-Hoss
Re: SolrCloud 4.x hangs under high update volume
Hey guys, (copy of my post to SOLR-5216)

We tested this patch and unfortunately encountered some serious issues after a few hours of 500 update-batches/sec. Our update batch is 10 docs, so we are writing about 5000 docs/sec total, using autoCommit to commit the updates (no explicit commits).

Our environment: Solr 4.3.1 w/SOLR-5216 patch. Jetty 9, Java 1.7. 3 solr instances, 1 per physical server. 1 collection. 3 shards. 2 replicas (each instance is a leader and a replica). Soft autoCommit is 1000ms. Hard autoCommit is 15000ms.

After about 6 hours of stress-testing this patch, we see many of these stalled transactions (below), and the Solr instances start to see each other as down, flooding our Solr logs with Connection Refused exceptions, and otherwise no obviously-useful logs that I could see. I did notice some stalled transactions on both /select and /update, however. This never occurred without this patch.

Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9

Lastly, I have a summary of the ERROR-severity logs from this 24-hour soak. My script normalizes the ERROR-severity stack traces and returns them in order of occurrence. Summary of my solr.log: http://pastebin.com/pBdMAWeb

Thanks! Tim Vaillancourt

On 6 September 2013 07:27, Markus Jelsma markus.jel...@openindex.io wrote: Thanks! -Original message- From:Erick Erickson erickerick...@gmail.com Sent: Friday 6th September 2013 16:20 To: solr-user@lucene.apache.org Subject: Re: SolrCloud 4.x hangs under high update volume Markus: See: https://issues.apache.org/jira/browse/SOLR-5216 On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi Mark, Got an issue to watch?
Thanks, Markus

-Original message- From:Mark Miller markrmil...@gmail.com Sent: Wednesday 4th September 2013 16:55 To: solr-user@lucene.apache.org Subject: Re: SolrCloud 4.x hangs under high update volume

I'm going to try and fix the root cause for 4.5 - I've suspected what it is since early this year, but it's never personally been an issue, so it's rolled along for a long time. Mark Sent from my iPhone

On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt t...@elementspace.com wrote:

Hey guys, I am looking into an issue we've been having with SolrCloud since the beginning of our testing, all the way from 4.1 to 4.3 (haven't tested 4.4.0 yet). I've noticed other users with this same issue, so I'd really like to get to the bottom of it.

Under a very, very high rate of updates (2000+/sec), after 1-12 hours we see stalled transactions that snowball to consume all Jetty threads in the JVM. This eventually causes the JVM to hang with most threads waiting on the condition/stack provided at the bottom of this message. At this point SolrCloud instances then start to see their neighbors (who also have all threads hung) as down w/Connection Refused, and the shards become down in state. Sometimes a node or two survives and just returns 503s no server hosting shard errors.

As a workaround/experiment, we have tuned the number of threads sending updates to Solr, as well as the batch size (we batch updates from client - solr), and the Soft/Hard autoCommits, all to no avail. Turning off Client-to-Solr batching (1 update = 1 call to Solr) also did not help. Certain combinations of update threads and batch sizes seem to mask/help the problem, but not resolve it entirely.

Our current environment is the following:
- 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
- 3 x Zookeeper instances, external Java 7 JVM.
- 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard and a replica of 1 shard).
- Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a good day.
- 5000 max jetty threads (well above what we use when we are healthy), Linux-user threads ulimit is 6000.
- Occurs under Jetty 8 or 9 (many versions).
- Occurs under Java 1.6 or 1.7 (several minor versions).
- Occurs under several JVM tunings.
- Everything seems to point to Solr itself, and not a Jetty or Java version (I hope I'm wrong).

The stack trace that is holding up all my Jetty QTP threads is the following, which seems to be waiting on a lock that I would very much like to understand further:

java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for 0x0007216e68d8 (a java.util.concurrent.Semaphore$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at
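For reference, the soft (1000ms) and hard (15000ms) autoCommit intervals described in this thread map onto solrconfig.xml settings roughly like this (a sketch, not the poster's actual config):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- hard commit: flush updates to stable storage every 15s, without
       opening a new searcher -->
  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- soft commit: make recent updates visible to search every 1s -->
  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
</updateHandler>
```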
RE: Solr 4.3 Startup with Multiple Cores Hangs on Registering Core
: Sorry for the multi-post, seems like the .tdump files didn't get : attached. I've tried attaching them as .txt files this time.

Interesting ... it looks like 2 of your cores are blocked in loading while waiting for the searchers to open ... not clear if it's a deadlock or why though - in both cases the coreLoaderThread is trying to register stuff with JMX, which is asking for stats right off the bat (not sure why), which requires accessing the searcher and is waiting for that to be available. but then you also have newSearcher listener events which are using the spellcheck component which is blocked waiting for that searcher as well.

Do all of your cores have newSearcher event listeners configured or just 2 (i'm trying to figure out if it's a timing fluke that these two are stalled, or if it's something special about the configs)

Can you try removing the newSearcher listeners to confirm that that does in fact make the problem go away?

With the newSearcher listeners in place, can you try setting spellcheck=false as a query param on the newSearcher listeners you have configured and see if that works around the problem?

Assuming it's just 2 cores using these listeners: can you reproduce this problem with a simpler setup where only one of the affected cores is in use? can you reproduce using Solr 4.4?

It would be helpful if you could create a jira and attach...
* your complete configs -- or at least some configs similar to yours that are complete enough to reproduce the startup problem.
* some sample data (based on your initial description, i'm guessing there at least needs to be a handful of docs in the index -- and most likely they need to match your warming query -- but we don't need your actual indexes, just some docs that will work with your configs that we can index and restart to see the problem.
* these thread dumps.

-Hoss
Re: SOLR 4.x vs 3.x parsedquery differences
: Our schema is identical except the version. : In 3.x it's 1.1 and in 4.x it's 1.5.

That's kind of a significant difference to leave out -- independent of the question you are asking about here, it's going to make quite a few differences in how things are being parsed, and what defaults are.

If i'm understanding correctly: you like the behavior you are getting from Solr 3.x where phrases are generated automatically for you. what i can't understand, is how/why phrases are being generated automatically for you if you have that 'autoGeneratePhraseQueries=false' on your fieldType in your 3x schema ... that makes no sense to me. if you didn't have autoGeneratePhraseQueries specified at all, then the 'version=1.1' would explain it (up to version=1.3, the default for autoGeneratePhraseQueries was true, but in version=1.4 and above, it defaults to false) but with an explicit 'autoGeneratePhraseQueries=false' i can't explain why 3x works the way you say it works for you.

Bottom line: if you *want* the auto generated phrase query behavior in 4.x, you should just set 'autoGeneratePhraseQueries=true' on your fieldType.

: : I'm migrating from 3.x to 4.x and I'm running some queries to verify that : : everything works like before. I've found however that the query galaxy : s3 : : is giving much less results. In 3.x numFound=1628, in 4.x numFound=70. : : is your entire schema 100% identical in both cases? : what is the luceneMatchVersion set to in your solrconfig.xml? : : : By the looks of your debug output, it appears that you are using : autoGeneratePhraseQueries=true in 3x, but have it set to false in 4x -- : but the fieldType you posted here shows it set to false : : : fieldtype name=text_pt class=solr.TextField : : positionIncrementGap=100 autoGeneratePhraseQueries=false : : ...i haven't tried to reproduce your specific situation, but that : configuration doesn't smell right compared with what you are showing for : the 3x output...
: : : SOLR 3.x
: : :
: : : str name=parsedquery+(title_search_pt:galaxy
: : : title_search_pt:galax) +MultiPhraseQuery(title_search_pt:(sii s3 s) 3)/str
: : :
: : : SOLR 4.x
: : :
: : : str name=parsedquery+((title_search_pt:galaxy
: : : title_search_pt:galax)/no_coord) +(+title_search_pt:sii
: : : +title_search_pt:s3 +title_search_pt:s +title_search_pt:3)/str
:
: -Hoss

-Hoss
Re: SOLR 4.x vs 3.x parsedquery differences
On 9/6/2013 12:46 PM, Fermin Silva wrote: Our schema is identical except the version. In 3.x it's 1.1 and in 4.x it's 1.5. Also in solrconfig.xml we have no lucene version for 3.x (so it's using 2_4 i believe) and in 4.x we fixed it to 4_4. The autoGeneratePhraseQueries parameter didn't exist before schema version 1.4. I'm fairly sure that for your schema that is at version 1.1, the autoGeneratePhraseQueries value specified in the field definition will be ignored and the actual value that gets used will be true, which goes along with what Hoss has said. See the comment about the version in the example schema on any 4.x Solr download. Thanks, Shawn
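For readers following along, the fix Hoss and Shawn describe amounts to a one-attribute change in schema.xml. A minimal sketch — the analyzer details are illustrative placeholders, not taken from the original schema:

```xml
<!-- schema.xml sketch: on a version="1.5" schema, autoGeneratePhraseQueries
     defaults to false, so it must be enabled explicitly to get the 3.x
     auto-phrase behavior back. Analyzer details below are illustrative. -->
<schema name="example" version="1.5">
  <types>
    <fieldtype name="text_pt" class="solr.TextField"
               positionIncrementGap="100"
               autoGeneratePhraseQueries="true">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
      </analyzer>
    </fieldtype>
  </types>
</schema>
```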
RE: Solr 4.3 Startup with Multiple Cores Hangs on Registering Core
: Do all of your cores have newSearcher event listeners configured or just
: 2 (i'm trying to figure out if it's a timing fluke that these two are stalled, or if it's something special about the configs)

All of my cores have both the newSearcher and firstSearcher event listeners configured. (The firstSearcher actually doesn't have any queries configured against it, so it probably should just be removed altogether.)

: Can you try removing the newSearcher listeners to confirm that that does in fact make the problem go away?

Removing the newSearcher listeners does not make the problem go away; however, removing the firstSearcher listener (even if the newSearcher listener is still configured) does make the problem go away.

: With the newSearcher listeners in place, can you try setting spellcheck=false as a query param on the newSearcher listeners you have configured and
: see if that works around the problem?

Adding the spellcheck=false param to the firstSearcher listener does appear to work around the problem.

: Assuming it's just 2 cores using these listeners: can you reproduce this problem with a simpler setup where only one of the affected cores is in use?

Since it's not just these two cores, I'm not sure how to produce much of a simpler setup. I did attempt to limit how many cores are loaded in the solr.xml, and found that if I cut it down to 56, it was able to load successfully (without any of the above config changed). If I cut it down to 57 cores, it doesn't hang at registering core any more; it actually gets as far as QuerySenderListener sending requests to Searcher@2f28849 main{StandardDirectoryReader(... If 58+ cores are loaded at start up, that's when it begins to hang at registering core. However, it always hangs on the *last* core configured in the solr.xml, regardless of how many cores are being loaded.

: can you reproduce using Solr 4.4?
: It would be helpful if you could create a jira and attach...
: * your complete configs -- or at least some configs similar to yours that are complete enough to reproduce the startup problem.
: * some sample data (based on your initial description, i'm guessing there at least needs to be a handful of docs in the index -- and most likely they need to match your warming query -- but we don't need your actual indexes, just some docs that will work with your configs that we can index and restart to see the problem).
: * these thread dumps.

I can likely get to this early next week, both checking into how this behaves using Solr 4.4 and submitting a JIRA with your requested info.
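The workaround that emerged in this thread can be sketched in solrconfig.xml; the warming query itself is a placeholder, and only the spellcheck=false param is the point:

```xml
<!-- solrconfig.xml sketch: pass spellcheck=false on the firstSearcher
     warming queries to avoid the startup hang described above. -->
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">your warming query here</str>
      <str name="spellcheck">false</str>
    </lst>
  </arr>
</listener>
```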
collections api setting dataDir
is there any way to change the dataDir while creating a collection via the collection api?
Re: SolrCloud 4.x hangs under high update volume
Did you ever get to index that long before without hitting the deadlock?

There really isn't anything negative the patch could be introducing, other than allowing for some more threads to possibly run at once. If I had to guess, I would say it's likely this patch fixes the deadlock issue and you're seeing another issue - which looks like the system cannot keep up with the requests for some reason - perhaps due to some OS networking settings or something (more guessing). Connection refused happens generally when there is nothing listening on the port.

Do you see anything interesting change with the rest of the system? CPU usage spikes or something like that?

Clamping down further on the overall number of threads might help (which would require making something configurable).

How many nodes are listed in zk under live_nodes?

Mark

Sent from my iPhone

On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt t...@elementspace.com wrote:

Hey guys,

(copy of my post to SOLR-5216)

We tested this patch and unfortunately encountered some serious issues after a few hours of 500 update-batches/sec. Our update batch is 10 docs, so we are writing about 5000 docs/sec total, using autoCommit to commit the updates (no explicit commits).

Our environment:
- Solr 4.3.1 w/SOLR-5216 patch.
- Jetty 9, Java 1.7.
- 3 solr instances, 1 per physical server.
- 1 collection.
- 3 shards.
- 2 replicas (each instance is a leader and a replica).
- Soft autoCommit is 1000ms.
- Hard autoCommit is 15000ms.

After about 6 hours of stress-testing this patch, we see many of these stalled transactions (below), and the Solr instances start to see each other as down, flooding our Solr logs with Connection Refused exceptions, and otherwise no obviously-useful logs that I could see. I did notice some stalled transactions on both /select and /update, however. This never occurred without this patch.
Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9

Lastly, I have a summary of the ERROR-severity logs from this 24-hour soak. My script normalizes the ERROR-severity stack traces and returns them in order of occurrence. Summary of my solr.log: http://pastebin.com/pBdMAWeb

Thanks!

Tim Vaillancourt

On 6 September 2013 07:27, Markus Jelsma markus.jel...@openindex.io wrote: Thanks!

-Original message- From:Erick Erickson erickerick...@gmail.com Sent: Friday 6th September 2013 16:20 To: solr-user@lucene.apache.org Subject: Re: SolrCloud 4.x hangs under high update volume

Markus: See: https://issues.apache.org/jira/browse/SOLR-5216

On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi Mark, Got an issue to watch? Thanks, Markus

-Original message- From:Mark Miller markrmil...@gmail.com Sent: Wednesday 4th September 2013 16:55 To: solr-user@lucene.apache.org Subject: Re: SolrCloud 4.x hangs under high update volume

I'm going to try and fix the root cause for 4.5 - I've suspected what it is since early this year, but it's never personally been an issue, so it's rolled along for a long time.

Mark

Sent from my iPhone

On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt t...@elementspace.com wrote:

Hey guys, I am looking into an issue we've been having with SolrCloud since the beginning of our testing, all the way from 4.1 to 4.3 (haven't tested 4.4.0 yet). I've noticed other users with this same issue, so I'd really like to get to the bottom of it.

Under a very, very high rate of updates (2000+/sec), after 1-12 hours we see stalled transactions that snowball to consume all Jetty threads in the JVM. This eventually causes the JVM to hang with most threads waiting on the condition/stack provided at the bottom of this message.
At this point SolrCloud instances then start to see their neighbors (who also have all threads hung) as down w/Connection Refused, and the shards become down in state. Sometimes a node or two survives and just returns 503s no server hosting shard errors.

As a workaround/experiment, we have tuned the number of threads sending updates to Solr, as well as the batch size (we batch updates from client - solr), and the Soft/Hard autoCommits, all to no avail. We also tried turning off Client-to-Solr batching (1 update = 1 call to Solr), which did not help. Certain combinations of update threads and batch sizes seem to mask/help the problem, but not resolve it entirely.

Our current environment is the following:
- 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
- 3 x Zookeeper instances, external Java 7 JVM.
- 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard and a replica of 1 shard).
- Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a good day.
- 5000 max jetty threads (well above what we use when we are healthy), Linux-user threads
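For reference, the commit settings Tim describes (1000ms soft, 15000ms hard, no explicit commits) would look roughly like this in solrconfig.xml; this is a sketch of the described setup, not their actual config file:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>15000</maxTime>        <!-- hard commit every 15s -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>1000</maxTime>         <!-- soft commit every 1s for visibility -->
  </autoSoftCommit>
</updateHandler>
```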
Re: SolrCloud 4.x hangs under high update volume
Okay, thanks, useful info. Getting on a plane, but I'll look more at this soon. That 10k thread spike is good to know - that's no good and could easily be part of the problem. We want to keep that from happening.

Mark

Sent from my iPhone

On Sep 6, 2013, at 2:05 PM, Tim Vaillancourt t...@elementspace.com wrote:

Hey Mark,

The farthest we've made it at the same batch size/volume was 12 hours without this patch, but that isn't consistent. Sometimes we would only get to 6 hours or less. During the crash I can see an amazing spike in threads to 10k, which is essentially our ulimit for the JVM, but I strangely see no OutOfMemory: cannot open native thread errors that always follow this. Weird!

We also notice a spike in CPU around the crash. The instability caused some shard recovery/replication though, so that CPU may be a symptom of the replication, or is possibly the root cause. The CPU spikes from about 20-30% utilization (system + user) to 60% fairly sharply, so the CPU, while spiking, isn't quite pinned (very beefy Dell R720s - 16 core Xeons, whole index is in 128GB RAM, 6xRAID10 15k).

More on resources: our disk I/O seemed to spike about 2x during the crash (about 1300kbps written to 3500kbps), but this may have been the replication, or ERROR logging (we generally log nothing due to WARN-severity unless something breaks).
Lastly, I found this stack trace occurring frequently, and have no idea what it is (may be useful or not):

java.lang.IllegalStateException
    at org.eclipse.jetty.server.Response.resetBuffer(Response.java:964)
    at org.eclipse.jetty.server.Response.sendError(Response.java:325)
    at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:692)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:445)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
    at java.lang.Thread.run(Thread.java:724)

On your live_nodes question, I don't have historical data on this from when the crash occurred, which I guess is what you're looking for. I could add this to our monitoring for future tests, however.

I'd be glad to continue further testing, but I think first more monitoring is needed to understand this further. Could we come up with a list of metrics that would be useful to see following another test and successful crash? Metrics needed:

1) # of live_nodes.
2) Full stack traces.
3) CPU used by Solr's JVM specifically (instead of system-wide).
4) Solr's JVM thread count (already done)
5) ?

Cheers,

Tim Vaillancourt

On 6 September 2013 13:11, Mark Miller markrmil...@gmail.com wrote: Did you ever get to index that long before without hitting the deadlock? There really isn't anything negative the patch could be introducing, other than allowing for some more threads to possibly run at once. If I had to guess, I would say it's likely this patch fixes the deadlock issue and you're seeing another issue - which looks like the system cannot keep up with the requests for some reason - perhaps due to some OS networking settings or something (more guessing). Connection refused happens generally when there is nothing listening on the port. Do you see
Re: Odd behavior after adding an additional core.
hi,

curl 'http://192.168.0.1:8983/solr/admin/collections?action=CREATE&name=collectionx&numShards=4&replicationFactor=1&collection.configName=config1'

after that, i added approx 100k documents, verified they were in the index and distributed across the shards. i then decided to start adding some replicas via coreadmin.

curl 'http://192.168.0.1:8983/solr/admin/cores?action=CREATE&name=collectionx_ex_replica1&collection=collectionx&collection.configName=config1'

Adding the core produced the following: it took away leader status from the leader on the shard it was replicating, inserted itself as down, and changed the doc routing to implicit.

Thanks.

On Fri, Sep 6, 2013 at 4:24 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: Can you give exact steps to reproduce this problem? Also, are you sure you supplied numShards=4 while creating the collection?

On Fri, Sep 6, 2013 at 12:20 AM, mike st. john mstj...@gmail.com wrote: using solr 4.4, i used collection admin to create a collection: 4 shards, replication factor of 1. i did this so i could index my data, then bring in replicas later by adding cores via coreadmin.

i added a new core via coreadmin. what i noticed shortly after adding the core: the leader of the shard where the new replica was placed was marked active, the new core marked as the leader, and the routing was now set to implicit. i've replicated this on another solr setup as well.

Any ideas?

Thanks msj

-- Regards, Shalin Shekhar Mangar.
Re: unknown _stream_source_info while indexing rich doc in solr
it shows type as undefined for dynamic field ignored_*, and I am using the default collection1 core, but on the admin page it shows schema:

<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
  <field name="author" type="string" indexed="true" stored="true" multiValued="true"/>
  <field name="comments" type="text" indexed="true" stored="true" multiValued="false"/>
  <field name="keywords" type="text" indexed="true" stored="true" multiValued="false"/>
  <field name="contents" type="string" indexed="true" stored="true" multiValued="false"/>
  <field name="title" type="text" indexed="true" stored="true" multiValued="false"/>
  <field name="revision_number" type="string" indexed="true" stored="true" multiValued="false"/>
</fields>
<dynamicField name="ignored_*" type="ignored" indexed="false" stored="true" multiValued="true"/>

--
View this message in context: http://lucene.472066.n3.nabble.com/unknown-stream-source-info-while-indexing-rich-doc-in-solr-tp4088136p4088591.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: SolrCloud 4.x hangs under high update volume
Hey Mark,

The farthest we've made it at the same batch size/volume was 12 hours without this patch, but that isn't consistent. Sometimes we would only get to 6 hours or less. During the crash I can see an amazing spike in threads to 10k, which is essentially our ulimit for the JVM, but I strangely see no OutOfMemory: cannot open native thread errors that always follow this. Weird!

We also notice a spike in CPU around the crash. The instability caused some shard recovery/replication though, so that CPU may be a symptom of the replication, or is possibly the root cause. The CPU spikes from about 20-30% utilization (system + user) to 60% fairly sharply, so the CPU, while spiking, isn't quite pinned (very beefy Dell R720s - 16 core Xeons, whole index is in 128GB RAM, 6xRAID10 15k).

More on resources: our disk I/O seemed to spike about 2x during the crash (about 1300kbps written to 3500kbps), but this may have been the replication, or ERROR logging (we generally log nothing due to WARN-severity unless something breaks).
Lastly, I found this stack trace occurring frequently, and have no idea what it is (may be useful or not):

java.lang.IllegalStateException
    at org.eclipse.jetty.server.Response.resetBuffer(Response.java:964)
    at org.eclipse.jetty.server.Response.sendError(Response.java:325)
    at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:692)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:445)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
    at java.lang.Thread.run(Thread.java:724)

On your live_nodes question, I don't have historical data on this from when the crash occurred, which I guess is what you're looking for. I could add this to our monitoring for future tests, however.

I'd be glad to continue further testing, but I think first more monitoring is needed to understand this further. Could we come up with a list of metrics that would be useful to see following another test and successful crash? Metrics needed:

1) # of live_nodes.
2) Full stack traces.
3) CPU used by Solr's JVM specifically (instead of system-wide).
4) Solr's JVM thread count (already done)
5) ?

Cheers,

Tim Vaillancourt

On 6 September 2013 13:11, Mark Miller markrmil...@gmail.com wrote: Did you ever get to index that long before without hitting the deadlock? There really isn't anything negative the patch could be introducing, other than allowing for some more threads to possibly run at once. If I had to guess, I would say it's likely this patch fixes the deadlock issue and you're seeing another issue - which looks like the system cannot keep up with the requests for some reason - perhaps due to some OS networking settings or something (more guessing). Connection refused happens generally when there is nothing listening on the port. Do you see anything interesting change with the rest of the system? CPU usage spikes or something like that? Clamping down further on the overall number of threads might help (which would require making something configurable). How many nodes are listed in zk under live_nodes? Mark Sent from my iPhone On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt t...@elementspace.com wrote: Hey guys,
Re: SolrCloud 4.x hangs under high update volume
Enjoy your trip, Mark! Thanks again for the help! Tim On 6 September 2013 14:18, Mark Miller markrmil...@gmail.com wrote: Okay, thanks, useful info. Getting on a plane, but ill look more at this soon. That 10k thread spike is good to know - that's no good and could easily be part of the problem. We want to keep that from happening. Mark Sent from my iPhone On Sep 6, 2013, at 2:05 PM, Tim Vaillancourt t...@elementspace.com wrote: Hey Mark, The farthest we've made it at the same batch size/volume was 12 hours without this patch, but that isn't consistent. Sometimes we would only get to 6 hours or less. During the crash I can see an amazing spike in threads to 10k which is essentially our ulimit for the JVM, but I strangely see no OutOfMemory: cannot open native thread errors that always follow this. Weird! We also notice a spike in CPU around the crash. The instability caused some shard recovery/replication though, so that CPU may be a symptom of the replication, or is possibly the root cause. The CPU spikes from about 20-30% utilization (system + user) to 60% fairly sharply, so the CPU, while spiking isn't quite pinned (very beefy Dell R720s - 16 core Xeons, whole index is in 128GB RAM, 6xRAID10 15k). More on resources: our disk I/O seemed to spike about 2x during the crash (about 1300kbps written to 3500kbps), but this may have been the replication, or ERROR logging (we generally log nothing due to WARN-severity unless something breaks). 
[...]
Re: solrcloud shards backup/restoration
Thanks Shalin and Mark for your responses. I am on the same page about the conventions for taking the backup. However, I am less sure about the restoration of the index. Let's say we have 3 shards across 3 solrcloud servers.

1. I am assuming we should take a backup from each of the shard leaders to get a complete collection. Do you think that will get the complete index (not worrying about what is not hard committed at the time of backup)?

2. How do we go about restoring the index in a fresh solrcloud cluster? From the structure of the snapshot I took, I did not see any replication.properties or index.properties which I see normally on healthy solrcloud cluster nodes. If I have the snapshot named snapshot.20130905, does snapshot.20130905/* go into data/index?

Thanks
Aditya

On Fri, Sep 6, 2013 at 7:28 AM, Mark Miller markrmil...@gmail.com wrote: Phone typing. The end should not say don't hard commit - it should say do a hard commit and take a snapshot. Mark Sent from my iPhone

On Sep 6, 2013, at 7:26 AM, Mark Miller markrmil...@gmail.com wrote: I don't know that it's too bad though - it's always been the case that if you do a backup while indexing, it's just going to get up to the last hard commit. With SolrCloud that will still be the case. So just make sure you do a hard commit right before taking the backup - yes, it might miss a few docs in the tran log, but if you are taking a backup while indexing, you don't have great precision in any case - you will roughly get a snapshot for around that time - even without SolrCloud, if you are worried about precision and getting every update into that backup, you want to stop indexing and commit first. But if you just want a rough snapshot for around that time, in both cases you can still just don't hard commit and take a snapshot. Mark Sent from my iPhone

On Sep 6, 2013, at 1:13 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: The replication handler's backup command was built for pre-SolrCloud.
It takes a snapshot of the index but it is unaware of the transaction log, which is a key component in SolrCloud. Hence unless you stop updates, commit your changes and then take a backup, you will likely miss some updates.

That being said, I'm curious to see how peer sync behaves when you try to restore from a snapshot. When you say that you haven't been successful in restoring, what exactly is the behaviour you observed?

On Fri, Sep 6, 2013 at 5:14 AM, Aditya Sakhuja aditya.sakh...@gmail.com wrote: Hello, I was looking for a good backup / recovery solution for the solrcloud indexes. I am more looking for restoring the indexes from the index snapshot, which can be taken using the replicationHandler's backup command. I am looking for something that works with solrcloud 4.3 eventually, but still relevant if you tested with a previous version.

I haven't been successful in having the restored index replicate across the new replicas, after I restart all the nodes, with one node having the restored index. Is restoring the indexes on all the nodes the best way to do it?

-- Regards, -Aditya Sakhuja

-- Regards, Shalin Shekhar Mangar.

-- Regards, -Aditya Sakhuja
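As a sketch of the backup step being discussed — the replication handler's backup command against each shard leader, preceded by a hard commit as Mark advises. Host, core, and snapshot names are illustrative, and this obviously requires a running Solr to try:

```shell
# On each shard leader: hard-commit first so the snapshot reflects recent updates
# (tlog-only updates are still missed, per the discussion above).
curl 'http://host1:8983/solr/collection1/update?commit=true'
curl 'http://host1:8983/solr/collection1/replication?command=backup&name=20130905'
# Produces snapshot.20130905/ under the core's data directory.
```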
Unknown attribute id in add:allowDups
Hello,

I'm working with the PECL package, with Solr 4.3.1. I have a doc defined in my schema where id is the uniqueKey:

<field name="id" type="int" indexed="true" stored="true" required="true" multiValued="false"/>
<uniqueKey>id</uniqueKey>

I tried to add a doc to my index with the following code (simplified for the question):

$client = new SolrClient($options);
$doc = new SolrInputDocument();
$doc->addField('id', 12345);
$doc->addField('description', 'This is the content of the doc');
$updateResponse = $client->addDocument($doc);

When I do this, the doc is not added to the index, and I get the following error in the logs in admin:

Unknown attribute id in add:allowDups

However, I noticed that if I change the field to type string:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
...
$doc->addField('id', '12345');

the doc is added to the index, but I still get the error in the log. So first, I was wondering, is there some other way I should be setting this up so that id can be an int instead of a string? And then I was also wondering what this error is referring to. Is there some further way I need to define id? Or maybe define the uniqueKey differently? Any help would be much appreciated.

Thanks,
Brian
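On the int-vs-string part of the question: in the stock Solr 4.x example schema, the int type maps to solr.TrieIntField, and a Trie field is a legal uniqueKey as long as it is indexed and single-valued. A sketch of how that typically looks (names mirror the message; verify against your own schema, as this does not address the allowDups error itself):

```xml
<fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
<field name="id" type="int" indexed="true" stored="true" required="true" multiValued="false"/>
<uniqueKey>id</uniqueKey>
```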
Re: SOLR 4.x vs 3.x parsedquery differences
Hi, Our schema is identical except the version. In 3.x it's 1.1 and in 4.x it's 1.5. Also in solrconfig.xml we have no lucene version for 3.x (so it's using 2_4 i believe) and in 4.x we fixed it to 4_4. Thanks On Sep 6, 2013 3:34 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : I'm migrating from 3.x to 4.x and I'm running some queries to verify that : everything works like before. I've found however that the query galaxy s3 : is giving much less results. In 3.x numFound=1628, in 4.x numFound=70. is your entire schema 100% identical in both cases? what is the luceneMatchVersion set to in your solrconfig.xml? By the looks of your debug output, it appears that you are using autoGeneratePhraseQueries=true in 3x, but have it set to false in 4x -- but the fieldType you posted here shows it set to false : fieldtype name=text_pt class=solr.TextField : positionIncrementGap=100 autoGeneratePhraseQueries=false ...i haven't tried to reproduce your specific situation, but that configuration doesn't smell right compared with what you are showing for the 3x output... : SOLR 3.x : : str name=parsedquery+(title_search_pt:galaxy : title_search_pt:galax) +MultiPhraseQuery(title_search_pt:(sii s3 s) : 3)/str : : SOLR 4.x : : str name=parsedquery+((title_search_pt:galaxy : title_search_pt:galax)/no_coord) +(+title_search_pt:sii : +title_search_pt:s3 +title_search_pt:s +title_search_pt:3)/str -Hoss
Re: Facet Count and RegexTransformer splitBy
Hi,

What I want is very simple. The query results:

row 1 = a,b,c,d
row 2 = a,f,r,e
row 3 = a,c,ff,e,b
...

facet count needed:
'a' = 3 occurrences
'b' = 2 occurrences
'c' = 2 occurrences
...

I searched and found a solution here: http://stackoverflow.com/questions/9914483/solr-facet-multiple-words-with-comma-separated-values

But I want to be sure if it will work.

On Fri, Sep 6, 2013 at 8:20 PM, Jack Krupansky j...@basetechnology.com wrote: Facet counts are per field - your counts are scattered across different fields. There are additional capabilities in the facet component, but first you should describe exactly what your requirements are. -- Jack Krupansky

-Original Message- From: Raheel Hasan Sent: Friday, September 06, 2013 9:58 AM To: solr-user@lucene.apache.org Subject: Facet Count and RegexTransformer splitBy

Hi guys,

Just a quick question: I have a field that has CSV values in the database. So I will use the DataImportHandler and will index it using RegexTransformer's splitBy attribute. However, since this is the first time I am doing it, I just wanted to be sure if it will work for Facet Count?

For example, from query results (say these are the values in that field):

row 1 = 1,2,3,4
row 2 = 1,4,5,3
row 3 = 2,1,20,66
...

so facet count will get me:
'1' = 3 occurrences
'2' = 2 occurrences
...and so on.

-- Regards, Raheel Hasan
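The splitBy approach being asked about can be sketched as a DIH entity plus a multiValued schema field. Table, column, and field names below are illustrative, not from the original thread:

```xml
<!-- data-config.xml sketch: RegexTransformer's splitBy turns the CSV column
     into multiple values, one per token. -->
<entity name="items" transformer="RegexTransformer"
        query="SELECT id, tags FROM items">
  <field column="tags" splitBy=","/>
</entity>

<!-- schema.xml: the target field must be multiValued so each token is
     indexed (and therefore faceted) separately. -->
<field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>

<!-- Then a facet request counts each token individually:
     /select?q=*:*&facet=true&facet.field=tags -->
```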
Re: solrcloud shards backup/restoration
I wouldn't say I love this idea, but wouldn't it be safe to LVM-snapshot the Solr index? I think this may even work on a live server, depending on some file I/O details. Has anyone tried this? An in-Solr solution sounds more elegant, but considering the tlog concern Shalin mentioned, I think this may work as an interim solution.

Cheers!
Tim

On 6 September 2013 15:41, Aditya Sakhuja aditya.sakh...@gmail.com wrote:

Thanks Shalin and Mark for your responses. I am on the same page about the conventions for taking the backup. However, I am less sure about the restoration of the index. Let's say we have 3 shards across 3 SolrCloud servers.

1. I am assuming we should take a backup from each of the shard leaders to get a complete collection. Do you think that will get the complete index (not worrying about what is not hard-committed at the time of backup)?

2. How do we go about restoring the index in a fresh SolrCloud cluster? From the structure of the snapshot I took, I did not see any replication.properties or index.properties, which I normally see on healthy SolrCloud cluster nodes. If I have the snapshot named snapshot.20130905, does snapshot.20130905/* go into data/index?

Thanks
Aditya

On Fri, Sep 6, 2013 at 7:28 AM, Mark Miller markrmil...@gmail.com wrote:

Phone typing. The end should not say "don't hard commit" - it should say "do a hard commit and take a snapshot".

Mark

Sent from my iPhone

On Sep 6, 2013, at 7:26 AM, Mark Miller markrmil...@gmail.com wrote:

I don't know that it's too bad though - it's always been the case that if you do a backup while indexing, it's just going to get up to the last hard commit. With SolrCloud that will still be the case.
So just make sure you do a hard commit right before taking the backup - yes, it might miss a few docs in the tran log, but if you are taking a back up while indexing, you don't have great precision in any case - you will roughly get a snapshot for around that time - even without SolrCloud, if you are worried about precision and getting every update into that backup, you want to stop indexing and commit first. But if you just want a rough snapshot for around that time, in both cases you can still just don't hard commit and take a snapshot.

Mark

Sent from my iPhone

On Sep 6, 2013, at 1:13 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

The replication handler's backup command was built for pre-SolrCloud. It takes a snapshot of the index but it is unaware of the transaction log which is a key component in SolrCloud. Hence unless you stop updates, commit your changes and then take a backup, you will likely miss some updates.

That being said, I'm curious to see how peer sync behaves when you try to restore from a snapshot. When you say that you haven't been successful in restoring, what exactly is the behaviour you observed?

On Fri, Sep 6, 2013 at 5:14 AM, Aditya Sakhuja aditya.sakh...@gmail.com wrote:

Hello,

I was looking for a good backup / recovery solution for the solrcloud indexes. I am more looking for restoring the indexes from the index snapshot, which can be taken using the replicationHandler's backup command. I am looking for something that works with solrcloud 4.3 eventually, but still relevant if you tested with a previous version.

I haven't been successful in having the restored index replicate across the new replicas, after I restart all the nodes, with one node having the restored index. Is restoring the indexes on all the nodes the best way to do it?

-- Regards, -Aditya Sakhuja

-- Regards, Shalin Shekhar Mangar.

-- Regards, -Aditya Sakhuja
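[Editor's note: the "hard commit, then snapshot" sequence Mark describes boils down to two HTTP calls per shard leader. A sketch of the URLs involved -- host, port, core name, and backup location are hypothetical; adjust for your cluster and repeat against each leader to cover the whole collection:]

```java
// Builds the two requests in the commit-then-backup sequence.
// This only constructs the URLs; issue them with any HTTP client.
public class BackupSequence {
    // Step 1: hard commit, so recent updates move from the transaction
    // log into the index files that the snapshot will capture.
    static String commitUrl(String coreBase) {
        return coreBase + "/update?commit=true";
    }

    // Step 2: ask the replication handler for a snapshot; 'location' is
    // the directory where the snapshot.<timestamp> folder is written.
    static String backupUrl(String coreBase, String location) {
        return coreBase + "/replication?command=backup&location=" + location;
    }

    public static void main(String[] args) {
        String base = "http://localhost:8983/solr/collection1"; // hypothetical
        System.out.println(commitUrl(base));
        System.out.println(backupUrl(base, "/backups"));
    }
}
```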
Re: Solr Cloud hangs when replicating updates
Thanks a ton, Mark. I have tried SOLR-4816 and it didn't help. But I will try Mark's patch next week and see what happens.

-Kevin

On Thu, Sep 5, 2013 at 4:46 AM, Erick Erickson erickerick...@gmail.com wrote:

If you run into this again, try a jstack trace. You should see evidence of being stuck in SolrCmdDistributor on a variable called semaphore... On current 4x this is around line 420. If you're using SolrJ, then SOLR-4816 is another thing to try. But Mark's patch would be best of all to test. If that doesn't fix it, then the jstack suggestion would at least tell us if it's the issue we think it is.

FWIW, Erick

On Wed, Sep 4, 2013 at 12:51 PM, Mark Miller markrmil...@gmail.com wrote:

It would be great if you could give this patch a try: http://pastebin.com/raw.php?i=aaRWwSGP

- Mark

On Wed, Sep 4, 2013 at 8:31 AM, Kevin Osborn kevin.osb...@cbsi.com wrote:

Thanks. If there is anything I can do to help you resolve this issue, let me know.

-Kevin

On Wed, Sep 4, 2013 at 7:51 AM, Mark Miller markrmil...@gmail.com wrote:

I'll look at fixing the root issue for 4.5. I've been putting it off for way too long.

Mark

Sent from my iPhone

On Sep 3, 2013, at 2:15 PM, Kevin Osborn kevin.osb...@cbsi.com wrote:

I was having problems updating SolrCloud with a large batch of records. The records are coming in bursts with lulls between updates. At first, I just tried large updates of 100,000 records at a time. Eventually, this caused Solr to hang. When hung, I can still query Solr, but I cannot do any deletes or other updates to the index.

At first, my updates were going as SolrJ CSV posts. I have also tried local file updates and had similar results. I finally slowed things down to just use SolrJ's update feature, which is basically just JavaBin. I am also sending over just 100 at a time in 10 threads. Again, it eventually hung. Sometimes, Solr hangs in the first couple of chunks. Other times, it hangs right away.
These are my commit settings:

<autoCommit>
  <maxTime>15000</maxTime>
  <maxDocs>5000</maxDocs>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>3</maxTime>
</autoSoftCommit>

I have tried quite a few variations with the same results. I also tried various JVM settings with the same results. The only variable seems to be that reducing the cluster size from 2 to 1 is the only thing that helps.

I also did a jstack trace. I did not see any explicit deadlocks, but I did see quite a few threads in WAITING or TIMED_WAITING. It is typically something like this:

java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for 0x00074039a450 (a java.util.concurrent.Semaphore$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
    at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
    at org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
    at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
    at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
    at org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
    at org.apache.solr.update.SolrCmdDistributor.distribAdd(SolrCmdDistributor.java:139)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:474)
    at org.apache.solr.handler.loader.CSVLoaderBase.doAdd(CSVLoaderBase.java:395)
    at org.apache.solr.handler.loader.SingleThreadedCSVLoader.addDoc(CSVLoader.java:44)
    at org.apache.solr.handler.loader.CSVLoaderBase.load(CSVLoaderBase.java:364)
    at org.apache.solr.handler.loader.CSVLoader.load(CSVLoader.java:31)
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at
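[Editor's note: a toy reproduction of the pattern in the stack trace above. A bounded semaphore gates in-flight requests; in the real SolrCmdDistributor, permits are released as responses complete, so if responses never come back, every submitting thread parks in acquire() in the WAITING state shown. The class name and permit count here are made up for illustration:]

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Minimal model of a semaphore-gated submit path. When all permits are
// held by requests that never complete, further submits block -- the
// hang observed above.
public class SubmitGate {
    private final Semaphore permits;

    SubmitGate(int maxInFlight) {
        permits = new Semaphore(maxInFlight);
    }

    /** Try to submit a request; true if a permit was free within the timeout. */
    boolean trySubmit(long timeoutMs) {
        try {
            return permits.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Called when a request's response arrives, freeing a slot. */
    void release() {
        permits.release();
    }

    public static void main(String[] args) {
        SubmitGate gate = new SubmitGate(2);
        gate.trySubmit(0);  // first request in flight, never completes
        gate.trySubmit(0);  // second request in flight, never completes
        // A third submission finds no permit; a plain acquire() here would
        // park the thread forever, exactly as in the jstack output.
        System.out.println("third submit ok: " + gate.trySubmit(100));
        // prints "third submit ok: false"
    }
}
```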
Batch Solr Server
Does anyone know if there is such a thing as a BatchSolrServer object in the SolrJ code? I am currently using ConcurrentUpdateSolrServer, but it isn't doing quite what I expected. It will distribute the load of sending through the HTTP client across different threads and manage the connections, but it does not package the documents into bundles.

This can be done manually by calling solrServer.add(Collection<SolrInputDocument> documents), which creates an UpdateRequest object for the entire collection. When ConcurrentUpdateSolrServer gets to this UpdateRequest, it sends all of the documents together in a single HTTP call.

What I want to be able to do is call solrServer.add(SolrInputDocument document) and have the SolrServer grab the next batch (up to a specified size) and then create an UpdateRequest. This would reduce the number of individual requests the Solr servers have to handle, as well as any per-call HTTP overhead.

Would this kind of functionality be worthwhile to anyone else? Should I create such a SolrServer object?

-- View this message in context: http://lucene.472066.n3.nabble.com/Batch-Solr-Server-tp4088657.html Sent from the Solr - User mailing list archive at Nabble.com.
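[Editor's note: the batching behaviour proposed above can be sketched without any SolrJ dependency. The class name, flush threshold, and sender callback here are made up for illustration -- in practice the sender would be something like docs -> solrServer.add(docs), turning each full batch into a single UpdateRequest:]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Buffers individually added documents and hands each full batch to the
// sender in one call, the way a hypothetical BatchSolrServer could wrap
// solrServer.add(Collection<SolrInputDocument>).
public class BatchingBuffer<D> {
    private final int batchSize;
    private final Consumer<List<D>> sender;   // e.g. docs -> solrServer.add(docs)
    private final List<D> pending = new ArrayList<>();

    public BatchingBuffer(int batchSize, Consumer<List<D>> sender) {
        this.batchSize = batchSize;
        this.sender = sender;
    }

    // Add a single document; ships a batch once the threshold is reached.
    public void add(D doc) {
        pending.add(doc);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    // Ship whatever is pending (call once at the end of an update burst).
    public void flush() {
        if (!pending.isEmpty()) {
            sender.accept(new ArrayList<>(pending)); // one request per batch
            pending.clear();
        }
    }
}
```

With, say, batchSize = 100, a hundred add() calls result in one request instead of a hundred, which is exactly the reduction in per-call HTTP overhead described above.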