Re: Suggester - how to return exact match?
Hi,

I'd like to clarify our use case a bit more. We want to return the exact search query as a suggestion only if it is present in the index. So in my example we would expect to get the suggestion foo for the query foo, but no suggestion abc for the query abc (because abc is not in the dictionary).

To me this use case seems quite common. Say we have three products in our store: foo, foo 1, foo 2. If the user types foo in the product search, we want to suggest all our products in the dropdown.

Is this something we can do with the Solr suggester?

Mirko

2013/11/20 Developer bbar...@gmail.com:

Maybe there is a way to do this, but it doesn't make sense to return the same search query as a suggestion (the search query is not a suggestion, as it might or might not be present in the index).

AFAIK you can use various lookup algorithms to get the suggestion list, and they look up the terms based on the query value (some algorithms implement fuzzy logic too). So searching Foo will return FooBar, Foo2 but not foo.

You should fetch the suggestions only if numFound is greater than 0; otherwise you don't have any suggestions.

-- View this message in context: http://lucene.472066.n3.nabble.com/Suggester-how-to-return-exact-match-tp4102203p4102259.html
Sent from the Solr - User mailing list archive at Nabble.com.
SolrServerException while adding an invalid UNIQUE_KEY in solr 4.4
Hi,

I am using Solr 4.4 with ZooKeeper 3.3.5. While checking the error conditions of my application, I came across a strange issue.

Here is what I tried. I have three fields defined in my schema:

a) UNIQUE_KEY - of type solr.TrieLong
b) empId - of type solr.TrieLong
c) companyId - of type solr.TrieLong

How I am indexing: I am indexing using the SolrJ API, and the data for indexing is in a text file, delimited by the | symbol. My indexer Java program reads the text file line by line, splits the data by the | symbol, creates a SolrInputDocument object for every line of the file, and adds the fields with the values read from the file.

Now, intentionally, in the data file I put a String value (instead of a long value) for UNIQUE_KEY, something like:

123AB|111|222

When I index this data, I get the exception below:

org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request:[URL of my application]
  at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:333)
  at org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:318)
  at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
  at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
  at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Server at [URL of my application] returned non ok status:500, message:Internal Server Error
  at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:385)
  at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
  at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:264)

But when I correct the UNIQUE_KEY data and instead give string data for the other two long fields, I get a different exception:

org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: ERROR: [Error stating the field name for which it is mismatching]
  at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:424)
  at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
  at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:264)
  at org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:318)
  at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
  at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
  at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)

My question: during indexing, if Solr finds that the data given for any field does not match the field type declared in the schema, it should raise the same type of exception. But in the above case, a mismatch for UNIQUE_KEY raises SolrServerException, while a mismatch for any other field raises RemoteSolrException (which is an unchecked exception). Is this a bug in Solr, or is there a reason for throwing different exceptions in these two cases?

Expecting a positive reply.

Thanks,
Radha

-- View this message in context: http://lucene.472066.n3.nabble.com/SolrServerException-while-adding-an-invalid-UNIQUE-KEY-in-solr-4-4-tp4102346.html
Sent from the Solr - User mailing list archive at Nabble.com.
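Whichever exception the server ends up raising, rows like 123AB|111|222 can also be caught on the client before any document is sent. The following is only a minimal sketch of that idea in Python, not SolrJ code; the function name parse_row is made up for illustration, and the field names are taken from the schema described in the question.

```python
import re

# Field order as described in the question: UNIQUE_KEY|empId|companyId
FIELD_NAMES = ("UNIQUE_KEY", "empId", "companyId")
# Range of a signed 64-bit long, which is what a Trie long field stores.
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1
LONG_RE = re.compile(r"[+-]?[0-9]+")


def parse_row(line):
    """Return (document, errors) for one pipe-delimited input line.

    The document is None whenever at least one field is not a valid long.
    """
    values = line.rstrip("\n").split("|")
    if len(values) != len(FIELD_NAMES):
        return None, ["expected %d fields, got %d" % (len(FIELD_NAMES), len(values))]

    doc, errors = {}, []
    for name, raw in zip(FIELD_NAMES, values):
        if LONG_RE.fullmatch(raw) is None:
            errors.append("%s: %r is not a long" % (name, raw))
            continue
        number = int(raw)
        if not LONG_MIN <= number <= LONG_MAX:
            errors.append("%s: %r is out of range for a long" % (name, raw))
            continue
        doc[name] = number
    return (doc if not errors else None), errors
```

With this check in front of the indexer, parse_row("123AB|111|222") reports an error for UNIQUE_KEY and the row never reaches Solr. It does not explain why the two server-side exceptions differ.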
Best implementation for multi-price store?
Hi,

I've recently been asked to implement an application to search products from several stores, each store having different prices and stock for the same product.

So I have products that have the usual fields (name, description, brand, etc.) and also the number of units and the price for each store. I must be able to filter for a given store and order by stock or price for that store. The application should also allow increasing the number of stores, of store-dependent fields and of products without much work.

The numbers for the application are more or less 100 stores and 7M products.

I've been thinking of some ways of defining the index structure, but I don't know which one is better, as I think each one has its pros and cons.

1. *Each product-store as a document:* Denormalizing the information so that for every product and store I have a different document. Pros are that I can filter and order without problems and that adding a new store-dependent field is very easy. Cons are that the index goes from 7M documents to 700M and that most of the info is redundant, as most of the fields are repeated among stores.

2. *Each field-store as a field:* For example, for price I would have store1_price, store2_price, and so on. Pros are that the index stays at 7M documents, and I can still filter and sort by those fields. Cons are that I have to add some logic so that if I filter by one store I order by the associated price field, and that the number of fields grows as (number of store-dependent fields) x (number of stores). I don't know if having more fields affects performance, but adding new store-dependent fields will increase the number of fields even more.

3. *Join:* The first time I read about Solr joins I thought it was the way to go in this case, but after reading a bit more and doing some tests I'm not so sure about it... Maybe I've done it wrong, but I think it also denormalizes the info (so I will also have 700M documents), and besides, I can't order or filter by store fields.

I must say my preferred option is number 2: I don't duplicate information, I keep a relatively small number of documents, and I can filter and sort by the store fields. However, my main concern is that I don't know whether having too many fields in a document will be harmful to performance.

Which one do you think is the best approach for this application? Is there a better approach that I have missed?

Thanks in advance

--
Alejandro Marqués Rodríguez
Paradigma Tecnológico
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42
Parse eDisMax queries for keywords
Hi,

We would like to implement special handling for queries that contain certain keywords. Our particular use case: in the example query "Footitle season 1" we want to discover the keyword "season", get the subsequent number, and boost (or filter for) documents that match 1 on the field name=season.

We have two fields in our schema:

  <!-- titles contains titles -->
  <field name="title" type="text" indexed="true" stored="true" multiValued="false"/>
  <fieldType name="text" class="solr.TextField" omitNorms="true">
    <analyzer>
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- ... -->
    </analyzer>
  </fieldType>

  <field name="season" type="season_number" indexed="true" stored="false" multiValued="false"/>
  <!-- season contains season numbers -->
  <fieldType name="season_number" class="solr.TextField" omitNorms="true">
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PatternReplaceFilterFactory" pattern=".*(?:season) *0*([0-9]+).*" replacement="$1"/>
    </analyzer>
  </fieldType>

Our idea was to use a keyword tokenizer and a regex on the season field to extract the season number from the complete query.

However, we use an ExtendedDisMax query parser in our search handler:

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">edismax</str>
      <str name="qf">title season</str>
    </lst>
  </requestHandler>

The problem is that eDisMax tokenizes the query, so that our field season receives the tokens [Foo, season, 1] without any order, instead of the complete query.

How can we pass the complete (untokenized) query to the season field? We don't understand which tokenizer is used here and why our season field receives tokens instead of the complete query.

Or is there another approach to solve this use case with Solr?

Thanks,
Mirko
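One possible alternative to doing the extraction inside the analyzer is to detect the keyword on the client, before the query is sent, and pass the result as a boost query. The sketch below is in Python and rests on assumptions: the helper names extract_season and build_params are made up, bq is the eDisMax boost-query parameter, and the season field is assumed to hold the bare number so that season:1 can match.

```python
import re

# Same idea as the PatternReplaceFilterFactory pattern in the schema:
# the word "season", optional spaces, optional leading zeros, then the number.
SEASON_RE = re.compile(r"\bseason\s*0*([0-9]+)", re.IGNORECASE)


def extract_season(user_query):
    """Return the season number as a string, or None if no keyword is found."""
    match = SEASON_RE.search(user_query)
    return match.group(1) if match else None


def build_params(user_query):
    """Build request parameters; add a boost query only when a season is found."""
    params = {"defType": "edismax", "qf": "title", "q": user_query}
    season = extract_season(user_query)
    if season is not None:
        # Use "fq" instead of "bq" here to filter rather than boost.
        params["bq"] = "season:" + season
    return params
```

Because the whole user input is inspected before eDisMax splits it on whitespace, the order of "season" and the number is still available at that point.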
Re: facet method=enum and uninvertedfield limitations
What is the actual target speed you are pursuing? Is this for user suggestions or something of that sort? Content-based suggestions with faceting, especially on Solr 1.4, won't be lightning fast.

Have you looked at TermsComponent? http://wiki.apache.org/solr/TermsComponent

By shingles, which in the rest of the world are more commonly called ngrams, I meant a way of reducing the number of entities to iterate through. Say you only store bigrams or trigrams and facet on those (there are fewer of them).

Dmitry

On Wed, Nov 20, 2013 at 6:10 PM, Lemke, Michael SZ/HZA-ZSW lemke...@schaeffler.com wrote:

On Wednesday, November 20, 2013 7:37 AM, Dmitry Kan wrote:

Thanks for your reply.

Since you are faceting on a text field (is this correct?) you deal with a lot of unique values in it.

Yes, this is a text field and we experimented with reducing the index. As I said in my original question, the stripped-down index had 178,000 terms and it (fc) still didn't work. Is the number of terms the relevant quantity?

So your best bet is the enum method.

Hm, yes, that works, but I have to wait 4 minutes for the answer (with the original data). Not good.

Also, if you are on Solr 4.x, try building doc values in the index: this suits faceting well.

We are on Solr 1.4, so, no.

Otherwise start from your spec once again. Can you use shingles instead?

Possibly, but I don't know shingles. Although I'd prefer to use our original index, we are trying to build a specialized index just for this sort of query but still don't know what to look for.

A query like

q=word&facet.field=CONTENT&facet=true&facet.limit=10&facet.mincount=1&facet.method=fc&facet.prefix=a&rows=0

would give me the top ten results containing 'word' and something starting with 'a'. That's what I want. An empty facet.prefix should also work. Eventually, the query will be more complex, containing other fields and filter queries, but the basic function should be exactly like this. How can we achieve this?
Thanks,
Michael

On 19 Nov 2013 17:44, Lemke, Michael SZ/HZA-ZSW lemke...@schaeffler.com wrote:

On Friday, November 15, 2013 11:22 AM, Lemke, Michael SZ/HZA-ZSW wrote:

Judging from numerous replies this seems to be a tough question. Nevertheless, I'd really appreciate any help as we are stuck. We'd really like to know what in our index causes the facet.method=fc query to fail.

Thanks,
Michael

On Thu, November 14, 2013 7:26 PM, Yonik Seeley wrote:

On Thu, Nov 14, 2013 at 12:03 PM, Lemke, Michael SZ/HZA-ZSW lemke...@schaeffler.com wrote:

I am running into performance problems with faceted queries. If I do a

q=word&facet.field=CONTENT&facet=true&facet.limit=10&facet.mincount=1&facet.method=fc&facet.prefix=a&rows=0

I am getting an exception:

org.apache.solr.common.SolrException: Too many values for UnInvertedField faceting on field CONTENT
  at org.apache.solr.request.UnInvertedField.uninvert(UnInvertedField.java:384)
  at org.apache.solr.request.UnInvertedField.<init>(UnInvertedField.java:178)
  at org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:839)
  ...

I understand it's got something to do with a 24-bit limit somewhere in the code, but I don't understand enough of it to be able to construct a specialized index that can be queried with facet.method=enum.

You shouldn't need to do anything differently to try facet.method=enum (just replace facet.method=fc with facet.method=enum).

This is true, and facet.method=enum does indeed work. The problem is runtime. In particular, queries with an empty facet.prefix= run for many seconds, if not minutes. I initially asked about this here: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201310.mbox/%3c33ec3398272fbe47b64ee3b3e98f69a761427...@de011521.schaeffler.com%3E

It was suggested that fc is much faster than enum and I'd like to test that. We are still fairly free to design the index such that it performs well. But to do that we need to understand what is killing it.
You may also want to add the parameter facet.enum.cache.minDf=100000 to lower memory usage by only using the filter cache for terms that match more than 100K docs.

That helped a little; it cut my particular test down from 10 sec to 5 sec. But that is still too slow. Mind you, this is for an autosuggest feature.

Thanks for your reply.
Michael

--
Dmitry
Blog: http://dmitrykan.blogspot.com
Twitter: twitter.com/dmitrykan
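The facet parameters discussed in this thread are easy to get wrong when the query string is assembled by hand. As a small illustration, here is one way to build the request in Python with the standard library. The function name facet_prefix_url and the base URL are assumptions for the example; the parameter names and the CONTENT field come from the messages above.

```python
from urllib.parse import urlencode


def facet_prefix_url(base_url, word, prefix, method="enum", limit=10):
    """Build a facet-by-prefix request URL for the CONTENT field."""
    params = [
        ("q", word),
        ("rows", 0),                 # only the facet counts are needed
        ("facet", "true"),
        ("facet.field", "CONTENT"),
        ("facet.method", method),    # "enum" or "fc", as compared in the thread
        ("facet.prefix", prefix),    # may be empty
        ("facet.limit", limit),
        ("facet.mincount", 1),
    ]
    return base_url + "?" + urlencode(params)
```

Switching between the two faceting methods is then a one-argument change, for example facet_prefix_url(base, "word", "a", method="fc"), which makes timing comparisons between enum and fc less error-prone.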
RE: Best implementation for multi-price store?
Hi,

I'd go with (2) also, but using dynamic fields so you don't have to define all the storeX_price fields in your schema but rather just one *_price field. Then when you filter on store:store1 you'd know to sort by store1_price, and so forth for units. That should be pretty straightforward.

Hope that helps,
Robi

-----Original Message-----
From: Alejandro Marqués Rodríguez [mailto:amarq...@paradigmatecnologico.com]
Sent: Thursday, November 21, 2013 1:36 AM
To: solr-user@lucene.apache.org
Subject: Best implementation for multi-price store?

Hi,

I've recently been asked to implement an application to search products from several stores, each store having different prices and stock for the same product.

So I have products that have the usual fields (name, description, brand, etc.) and also the number of units and the price for each store. I must be able to filter for a given store and order by stock or price for that store. The application should also allow increasing the number of stores, of store-dependent fields and of products without much work.

The numbers for the application are more or less 100 stores and 7M products.

I've been thinking of some ways of defining the index structure, but I don't know which one is better, as I think each one has its pros and cons.

1. *Each product-store as a document:* Denormalizing the information so that for every product and store I have a different document. Pros are that I can filter and order without problems and that adding a new store-dependent field is very easy. Cons are that the index goes from 7M documents to 700M and that most of the info is redundant, as most of the fields are repeated among stores.

2. *Each field-store as a field:* For example, for price I would have store1_price, store2_price, and so on. Pros are that the index stays at 7M documents, and I can still filter and sort by those fields. Cons are that I have to add some logic so that if I filter by one store I order by the associated price field, and that the number of fields grows as (number of store-dependent fields) x (number of stores). I don't know if having more fields affects performance, but adding new store-dependent fields will increase the number of fields even more.

3. *Join:* The first time I read about Solr joins I thought it was the way to go in this case, but after reading a bit more and doing some tests I'm not so sure about it... Maybe I've done it wrong, but I think it also denormalizes the info (so I will also have 700M documents), and besides, I can't order or filter by store fields.

I must say my preferred option is number 2: I don't duplicate information, I keep a relatively small number of documents, and I can filter and sort by the store fields. However, my main concern is that I don't know whether having too many fields in a document will be harmful to performance.

Which one do you think is the best approach for this application? Is there a better approach that I have missed?

Thanks in advance

--
Alejandro Marqués Rodríguez
Paradigma Tecnológico
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42
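The "logic" needed for option 2 with dynamic fields can be quite small. The following Python sketch derives the filter and sort parameters from a store id; it assumes a store field listing the stores that carry a product (as in the store:store1 filter above) and dynamic fields named <store>_price and <store>_units. The function name store_query_params is made up for illustration.

```python
def store_query_params(store_id, sort_by="price", order="asc"):
    """Return the fq and sort parameters for one store.

    sort_by selects the per-store dynamic field, e.g. store1_price
    or store1_units.
    """
    if sort_by not in ("price", "units"):
        raise ValueError("sort_by must be 'price' or 'units'")
    if order not in ("asc", "desc"):
        raise ValueError("order must be 'asc' or 'desc'")
    return {
        "fq": "store:%s" % store_id,
        "sort": "%s_%s %s" % (store_id, sort_by, order),
    }
```

Adding a store then needs no schema change, only documents that carry the new store's dynamic fields. Adding a new store-dependent attribute needs one more dynamic field pattern and one more allowed sort_by value.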
Indexing data to a specific collection in Solr 4.5.0
Hi all:

I'm currently on a Solr 4.5.0 instance and running this tutorial, http://lucene.apache.org/solr/4_5_0/tutorial.html

My question is specific to indexing data as proposed in this tutorial:

$ java -jar post.jar solr.xml monitor.xml

The tutorial advises validating from your localhost:

http://localhost:8983/solr/collection1/select?q=solr&wt=xml

However, what if my Solr instance has both a collection1 and a collection2, and I want the XML files to be posted to collection2 only?

If possible, please advise.

Thanks,
Mark

IMPORTANT NOTICE: This e-mail message is intended to be received only by persons entitled to receive the confidential information it may contain. E-mail messages sent from Bridgepoint Education may contain information that is confidential and may be legally privileged. Please do not read, copy, forward or store this message unless you are an intended recipient of it. If you received this transmission in error, please notify the sender by reply e-mail and delete the message and any attachments.
Re: Facet field query on subset of documents
Hi Erick,

Thanks for the reply and sorry, my fault, I wasn't clear enough. I was wondering if there was a way to remove terms that would always be zero (because the term came from a document that didn't match the filter query).

Here's an example. I have a bunch of documents with fields 'manufacturer' and 'location'. If I set my filter query to manufacturer = Sony and no Sony documents had a value of 'Florida' for location, then I want 'Florida' NOT to show up in my facet field results. Instead, it shows up with a count of zero (and it'll always be zero because of my filter query).

Using mincount = 1 doesn't solve my problem because I don't want it to hide zeroes that came from documents that actually pass my filter query.

Does that make more sense?

On Thu, Nov 21, 2013 at 4:36 PM, Erick Erickson erickerick...@gmail.com wrote:

That's what faceting does. The facets are only tabulated for documents that satisfy the query, including all of the filter queries and any other criteria.

Otherwise, facet counts would be the same no matter what the query was.

Or I'm completely misunderstanding your question...

Best,
Erick

On Thu, Nov 21, 2013 at 4:22 PM, Luis Lebolo luis.leb...@gmail.com wrote:

Hi All,

Is it possible to perform a facet field query on a subset of documents (the subset being defined via a filter query, for instance)?

I understand that facet pivoting might work, but it would require that the subset be defined by some field hierarchy, e.g. manufacturer - price (then only look at the results for the manufacturer I'm interested in).

What if I wanted to define a more complex subset (where the name starts with A but ends with Z and some other field is greater than 5 and yet another field is not 'x', etc.)?

Ideally I would then define a facet field constraining query to include only terms from documents that pass this query.

Thanks,
Luis
Periodic Slowness on Solr Cloud
I'm doing some performance testing against an 8-node Solr cloud cluster, and I'm noticing some periodic slowness.

http://farm4.staticflickr.com/3668/10985410633_23e26c7681_o.png

I'm doing random test searches against an Alias Collection made up of four smaller (monthly) collections. Like this:

MasterCollection
|- Collection201308
|- Collection201309
|- Collection201310
|- Collection201311

The last collection is constantly updated. New documents are being added at the rate of about 3 documents per second.

I believe the slowness may be due to NRT, but I'm not sure. How should I investigate this?

If the slowness is related to NRT, how can I alleviate the issue without disabling NRT?

Thanks Much!

-Dave
RE: search with wildcard
I know it's documented that Lucene/Solr doesn't apply filters to queries with wildcards, but this seems to trip up a lot of users. I can also see why wildcards break a number of filters, but some filters (e.g. mapping charsets) could mostly or entirely work. The N-gram filter is another one that would be great to still run when there are wildcards. If you indexed 4-grams and the query is *testp*, you currently won't get any results; but the N-gram filter could have a wildcard mode that, in this case, would return just the first 4-gram as a token.

Is this something you've considered? It would have to be enabled in the core network, but disabled by default for existing filters; then it could be enabled one by one for existing filters. Apologies if the dev list is a better place for this.

Scott

-----Original Message-----
From: Ahmet Arslan [mailto:iori...@yahoo.com]
Sent: Thursday, November 21, 2013 8:40 AM
To: solr-user@lucene.apache.org
Subject: Re: search with wildcard

Hi Andreas,

If you don't want to use wildcards at query time, an alternative is to use NGrams at indexing time. This will produce a lot of tokens. For example, the 4-grams of your example:

Supertestplan = supe uper pert erte rtes *test* estp stpl tpla plan

Is that what you want? By the way, why do you want to search inside words?

  <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="4"/>

On Thursday, November 21, 2013 5:23 PM, Andreas Owen a...@conx.ch wrote:

I suppose I have to create another field with different tokenizers and set the boost very low so it doesn't really mess with my ranking, because the word is then in 2 fields. What kind of tokenizer can do the job?

From: Andreas Owen [mailto:a...@conx.ch]
Sent: Thursday, 21 November 2013 16:13
To: solr-user@lucene.apache.org
Subject: search with wildcard

I am querying "test" in Solr 4.3.1 over the field below and it's not finding all occurrences. It seems that if it is a substring of a word like Supertestplan it isn't found unless I use wildcards: *test*. This is right because of my tokenizer, but does someone know a way around this? I don't want to add wildcards because that messes up queries with multiple words.

  <fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/> <!-- remove common words -->
      <filter class="solr.GermanNormalizationFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="German"/> <!-- remove noun/adjective inflections like plural endings -->
    </analyzer>
  </fieldType>
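To make the n-gram suggestion concrete, here is a small Python sketch of what an n-gram filter produces for one token. It only illustrates the idea and is not Solr's implementation; the function name ngrams is made up, and the order in which Solr emits grams may differ.

```python
def ngrams(token, min_size, max_size):
    """Return all character n-grams of the lowercased token.

    Grams are listed by size first, then by position in the token.
    """
    token = token.lower()
    grams = []
    for size in range(min_size, max_size + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams
```

With min_size=3 and max_size=4, as in the NGramFilterFactory line above, the indexed grams of Supertestplan include "test", so a plain query for test can match without wildcards. The cost is index size: a 13-character token alone yields 11 three-grams and 10 four-grams.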
Re: Facet field query on subset of documents
That's what faceting does. The facets are only tabulated for documents that satisfy the query, including all of the filter queries and any other criteria.

Otherwise, facet counts would be the same no matter what the query was.

Or I'm completely misunderstanding your question...

Best,
Erick

On Thu, Nov 21, 2013 at 4:22 PM, Luis Lebolo luis.leb...@gmail.com wrote:

Hi All,

Is it possible to perform a facet field query on a subset of documents (the subset being defined via a filter query, for instance)?

I understand that facet pivoting might work, but it would require that the subset be defined by some field hierarchy, e.g. manufacturer - price (then only look at the results for the manufacturer I'm interested in).

What if I wanted to define a more complex subset (where the name starts with A but ends with Z and some other field is greater than 5 and yet another field is not 'x', etc.)?

Ideally I would then define a facet field constraining query to include only terms from documents that pass this query.

Thanks,
Luis
Re: Indexing data to a specific collection in Solr 4.5.0
add Durl=http://localhost:8983/solr/collection2/update when running post.jar

Sent from 189 Mail

Reyes, Mark mark.re...@bpiedu.com wrote:

Hi all:

I'm currently on a Solr 4.5.0 instance and running this tutorial, http://lucene.apache.org/solr/4_5_0/tutorial.html

My question is specific to indexing data as proposed in this tutorial:

$ java -jar post.jar solr.xml monitor.xml

The tutorial advises validating from your localhost:

http://localhost:8983/solr/collection1/select?q=solr&wt=xml

However, what if my Solr instance has both a collection1 and a collection2, and I want the XML files to be posted to collection2 only?

If possible, please advise.

Thanks,
Mark
search with wildcard
I am querying "test" in Solr 4.3.1 over the field below and it's not finding all occurrences. It seems that if it is a substring of a word like Supertestplan it isn't found unless I use wildcards: *test*. This is right because of my tokenizer, but does someone know a way around this? I don't want to add wildcards because that messes up queries with multiple words.

  <fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/> <!-- remove common words -->
      <filter class="solr.GermanNormalizationFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="German"/> <!-- remove noun/adjective inflections like plural endings -->
    </analyzer>
  </fieldType>
Re: Periodic Slowness on Solr Cloud
How real time is NRT? In particular, what are your commit settings?

And can you characterize "periodic slowness"? Queries that usually take 500ms now take 10s? Or 1s? How often? How are you measuring?

Details matter, a lot...

Best,
Erick

On Thu, Nov 21, 2013 at 6:03 PM, Dave Seltzer dselt...@tveyes.com wrote:

I'm doing some performance testing against an 8-node Solr cloud cluster, and I'm noticing some periodic slowness.

http://farm4.staticflickr.com/3668/10985410633_23e26c7681_o.png

I'm doing random test searches against an Alias Collection made up of four smaller (monthly) collections. Like this:

MasterCollection
|- Collection201308
|- Collection201309
|- Collection201310
|- Collection201311

The last collection is constantly updated. New documents are being added at the rate of about 3 documents per second.

I believe the slowness may be due to NRT, but I'm not sure. How should I investigate this?

If the slowness is related to NRT, how can I alleviate the issue without disabling NRT?

Thanks Much!

-Dave
Multiple similarity scores for the same text field
I have the following simplified setting: My schema contains one text field, named text. When I perform a query, I need to get the scores for the same text field but for different similarity functions (e.g. TFIDF, BM25..) and combine them externally using different weights. An obvious way to achieve this is to keep multiple copies of the text field in the schema for each similarity. I am wondering though whether there is a more space-efficient way of doing this. Thanks, Nikos
Re: Indexing data to a specific collection in Solr 4.5.0
You're leaving off the - in front of the D: -Durl. Try java -jar post.jar -help for a list of the available options.

On Thu, Nov 21, 2013 at 12:04 PM, Reyes, Mark mark.re...@bpiedu.com wrote:

So then,

$ java -jar post.jar Durl=http://localhost:8983/solr/collection2/update solr.xml monitor.xml

On 11/21/13, 8:14 AM, xiezhide xiezh...@gmail.com wrote:

add Durl=http://localhost:8983/solr/collection2/update when running post.jar

Sent from 189 Mail

Reyes, Mark mark.re...@bpiedu.com wrote:

Hi all:

I'm currently on a Solr 4.5.0 instance and running this tutorial, http://lucene.apache.org/solr/4_5_0/tutorial.html

My question is specific to indexing data as proposed in this tutorial:

$ java -jar post.jar solr.xml monitor.xml

The tutorial advises validating from your localhost:

http://localhost:8983/solr/collection1/select?q=solr&wt=xml

However, what if my Solr instance has both a collection1 and a collection2, and I want the XML files to be posted to collection2 only?

If possible, please advise.

Thanks,
Mark
Re: Suggester - how to return exact match?
This might not be a perfect solution, but you can use an edge n-gram filter: copy all your field data to a field of this type and use that field for suggestions.

  <fieldType name="text_autocomplete" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="250"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

http://localhost:8983/solr/core1/select?q=name:iphone

The above query will return:

iphone
iphone5c
iphone4g

-- View this message in context: http://lucene.472066.n3.nabble.com/Suggester-how-to-return-exact-match-tp4102203p4102521.html
Sent from the Solr - User mailing list archive at Nabble.com.
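Why this returns the exact match as well as the longer names can be seen from a small simulation. The Python sketch below mimics the analyzer pair above for single-token names: prefixes are generated at index time, while the typed text is only lowercased at query time. It is an illustration under those assumptions, not Solr code, and the function names edge_ngrams and suggest are made up.

```python
def edge_ngrams(token, min_size=1, max_size=250):
    """Return the prefixes of the lowercased token, shortest first."""
    token = token.lower()
    longest = min(len(token), max_size)
    return [token[:size] for size in range(min_size, longest + 1)]


def suggest(names, typed):
    """Return the names whose indexed prefixes contain the typed text."""
    typed = typed.lower()
    return [name for name in names if typed in edge_ngrams(name)]
```

Since the full token is its own longest prefix, a name that equals the typed text is always among the matches, which is the behaviour asked for at the start of this thread. A query for text that is not a prefix of any indexed name returns nothing.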
Re: Periodic Slowness on Solr Cloud
Yes, more details... Solr version, which garbage collector, how does heap usage look, CPU, etc.

- Mark

On Nov 21, 2013, at 6:46 PM, Erick Erickson erickerick...@gmail.com wrote:

How real time is NRT? In particular, what are your commit settings?

And can you characterize "periodic slowness"? Queries that usually take 500ms now take 10s? Or 1s? How often? How are you measuring?

Details matter, a lot...

Best,
Erick

On Thu, Nov 21, 2013 at 6:03 PM, Dave Seltzer dselt...@tveyes.com wrote:

I'm doing some performance testing against an 8-node Solr cloud cluster, and I'm noticing some periodic slowness.

http://farm4.staticflickr.com/3668/10985410633_23e26c7681_o.png

I'm doing random test searches against an Alias Collection made up of four smaller (monthly) collections. Like this:

MasterCollection
|- Collection201308
|- Collection201309
|- Collection201310
|- Collection201311

The last collection is constantly updated. New documents are being added at the rate of about 3 documents per second.

I believe the slowness may be due to NRT, but I'm not sure. How should I investigate this?

If the slowness is related to NRT, how can I alleviate the issue without disabling NRT?

Thanks Much!

-Dave
Re: SolrServerException while adding an invalid UNIQUE_KEY in solr 4.4
On 11/21/2013 1:57 AM, RadhaJayalakshmi wrote:

Hi, I am using Solr 4.4 with ZooKeeper 3.3.5. While I was checking the error conditions of my application, I came across a strange issue. Here is what I tried. I have three fields defined in my schema:

a) UNIQUE_KEY - of type solr.TrieLong
b) empId - of type solr.TrieLong
c) companyId - of type solr.TrieLong

How am I indexing: I am indexing using the SolrJ API, and the data for indexing is in a text file, delimited by the | symbol. My indexer Java program reads the text file line by line, splits the data on the | symbol, creates a SolrInputDocument object for every line of the file, and adds the fields with the values it read from the file.

Now, intentionally, I put String values (instead of long values) in the data file for UNIQUE_KEY, something like:

123AB|111|222

When I index this data, I get the exception below:

org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request:[URL of my application]
 at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:333)
 at org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:318)
 at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
 at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
 at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Server at [URL of my application] returned non ok status:500, message:Internal Server Error
 at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:385)
 at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
 at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:264)

But when I correct the UNIQUE_KEY data and instead give string data for the other two long fields, I get a different exception:

org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: ERROR: [Error stating the field name for which it is mismatching]
 at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:424)
 at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
 at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:264)
 at org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:318)
 at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
 at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
 at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
 at

What is my question here: during indexing, if Solr finds that, for any field, the field type declared in the schema does not match the data being given, then it should raise the same type of exception. But in the above case, if it finds a mismatch for UNIQUE_KEY, it raises SolrServerException. For all other fields, it raises RemoteSolrException (which is an unchecked exception). Is this a bug in Solr, or is there a reason for throwing different exceptions in these two cases? Expecting a positive reply. Thanks, Radha

The first exception is an error thrown directly from SolrJ. It was unable to find any server to deal with the request, so it threw its own SolrServerException wrapping the last RemoteSolrException (HTTP error 500) it received.

The second exception happened in a different place. In this case, the request made it past the server-side uniqueKey handling code and into the code that handles other fields, which from what I can see here returns a different error message and possibly a different HTTP code. Because it was different, SolrJ sent the RemoteSolrException up the chain to your application rather than catching and wrapping it in SolrServerException.

I am not surprised to hear that you get a different error for invalid data in the uniqueKey field than you do in other fields. Because of its nature, it must be handled in a different code path.

Thanks, Shawn
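One way to get uniform errors regardless of which server-side code path rejects the data is to validate each line in the indexer before building the document. The sketch below shows the idea in Python rather than SolrJ; the field names come from the message above, while the function name and the error wording are hypothetical.

```python
import re

# Field order in the pipe-delimited file, as described in the thread.
FIELD_NAMES = ("UNIQUE_KEY", "empId", "companyId")

# Range of a signed 64-bit long, which is what a long field can hold.
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1
LONG_PATTERN = re.compile(r"[+-]?[0-9]+")


def parse_line(line):
    """Split one data line and convert every field to an int, or raise ValueError."""
    values = line.rstrip("\n").split("|")
    if len(values) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(values)}")
    doc = {}
    for name, raw in zip(FIELD_NAMES, values):
        raw = raw.strip()
        if not LONG_PATTERN.fullmatch(raw):
            raise ValueError(f"field {name!r} is not a valid long: {raw!r}")
        number = int(raw)
        if not LONG_MIN <= number <= LONG_MAX:
            raise ValueError(f"field {name!r} is out of range for a long: {raw!r}")
        doc[name] = number
    return doc


for sample in ("123|111|222", "123AB|111|222"):
    try:
        print(parse_line(sample))
    except ValueError as error:
        print("rejected:", error)
```

With a check like this, a bad UNIQUE_KEY and a bad empId fail in the same way on the client, and the server-side exceptions only need to be handled as a fallback.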
Re: Split shard and stream sub-shards to remote nodes?
Hi,

On Wed, Nov 20, 2013 at 12:53 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

At the Lucene level, I think it would require a directory implementation which writes to a remote node directly. Otherwise, on the Solr side, we must move the leader itself to another node which has enough disk space and then split the index.

Hm, what about taking the source shard, splitting it, and sending the docs that come out of each sub-shard to a remote node at the Solr level, as if these documents were just being added (i.e. nothing at the Lucene level)?

Otis -- Performance Monitoring * Log Analytics * Search Analytics Solr & Elasticsearch Support * http://sematext.com/

On Wed, Nov 20, 2013 at 8:37 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

Do you think this is something that is actually implementable? If so, I'll open an issue. One use case where this may come in handy is when disk space is tight. If a shard is using 50% of the disk space on some node X, you can't really split that shard because the 2 new sub-shards will not fit on the local disk. Or is there some trick one could use in this situation? Thanks, Otis

On Wed, Nov 20, 2013 at 6:48 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

No, it is not supported yet. We can't split to a remote node directly. The best bet is to trigger a new leader election by unloading the leader node once all replicas are active.

On Wed, Nov 20, 2013 at 1:32 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

Hi, Is it possible to perform a shard split and stream data for the new/sub-shards to remote nodes, avoiding persistence of new/sub-shards on the local/source node first? Thanks, Otis

-- Regards, Shalin Shekhar Mangar.
Re: How to index X™ as &#8482; (HTML decimal entity)
And this is the exact problem. Some characters are stored as entities, some are not. When it is time to display, what else needs to be escaped? At a minimum, you would have to always store & as &amp; to avoid escaping the leading ampersand in the entities. You could store every single character as a numeric entity. Or you could store every non-ASCII character as a numeric entity. Or every non-Latin1 character. Plus ampersand, of course.

In these e-mails, we are distinguishing between ™ and &trade;. How would you do that? By storing &trade; as &amp;trade;.

To avoid all this double-think, always store text as Unicode code points, encoded with a standard Unicode method (UTF-8, etc.). When displaying, only make entities if the code points cannot be represented in the target character encoding. If you are sending things in US-ASCII, you will be sending lots of entities. A good encoding library has callbacks for characters that cannot be represented. You can use these callbacks to format out-of-charset code points as entities. I've done this in product code, it really works.

Finally, if you don't believe me, believe the XML Infoset, where numeric entities are always treated as Unicode code points.

The other way to go insane is storing local time in the database. Always store UTC and convert at the edges. wunder

On Nov 21, 2013, at 7:50 AM, Jack Krupansky j...@basetechnology.com wrote:

"Would you store 'a' as '&#65;'?" No, not in any case. -- Jack Krupansky

-----Original Message-----
From: Michael Sokolov
Sent: Thursday, November 21, 2013 8:56 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

I have to agree w/Walter. Use Unicode as a storage format. The entity encodings are for transfer/interchange. Encode/decode on the way in and out if you have to. Would you store "a" as "&#65;"? It makes it impossible to search for, for one thing. What if someone wants to search for the TM character? -Mike

On 11/20/13 12:07 PM, Jack Krupansky wrote:

AFAICT, it's not an extremely bad idea - using SGML/HTML as a format for storing text to be rendered. If you disagree - try explaining yourself. But maybe TM should be encoded as &trade;. Ditto for other named SGML entities. -- Jack Krupansky

-----Original Message-----
From: Walter Underwood
Sent: Wednesday, November 20, 2013 11:21 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

Again, I'd like to know why this is wanted. It sounds like an X-Y problem. Storing Unicode characters as XML/HTML encoded character references is an extremely bad idea. wunder

On Nov 20, 2013, at 5:01 AM, Jack Krupansky j...@basetechnology.com wrote:

Any analysis filtering affects the indexed value only, but the stored value would be unchanged from the original input value. An update processor lets you modify the original input value that will be stored. -- Jack Krupansky

-----Original Message-----
From: Uwe Reh
Sent: Wednesday, November 20, 2013 5:43 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

What about having a simple charfilter in the analyzer queue for indexing *and* searching? E.g.

<charFilter class="solr.PatternReplaceFilterFactory" pattern="™" replacement="&#8482;"/>

or

<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-specials.txt"/>

Uwe

On 19.11.2013 at 23:46, Developer wrote:

I have data coming in to SOLR as below.

<field name="displayName">X™ - Black</field>

I need to store the HTML entity (decimal) equivalent value (i.e. &#8482;) in SOLR rather than storing the original value. Is there a way to do this?

-- Walter Underwood wun...@wunderwood.org
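The "store Unicode, make entities only at the edge" approach described above can be sketched in a few lines of Python. This is an illustration, not anything Solr ships: the two helper names are invented here, and the `xmlcharrefreplace` error handler stands in for the kind of "callback for characters that cannot be represented" mentioned in the message.

```python
import html


def to_storage(text_with_entities):
    """Decode character references on the way in, so plain Unicode is what gets stored."""
    return html.unescape(text_with_entities)


def to_ascii_html(stored_text):
    """On the way out: escape markup characters first, then turn every code point
    that US-ASCII cannot represent into a decimal character reference."""
    escaped = html.escape(stored_text, quote=False)
    return escaped.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


stored = to_storage("X&#8482; - Black")
print(stored)                     # the trademark sign is stored as one code point
print(to_ascii_html(stored))      # entities appear only for an ASCII target
print(stored.encode("utf-8"))     # a UTF-8 target needs no entities at all
print(to_ascii_html("AT&T ™"))    # the ampersand is escaped exactly once
```

Because the ampersand is escaped only at output time, there is never a question of whether a stored "&" starts an entity or is literal text.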
Re: Indexing data to a specific collection in Solr 4.5.0
Reyes, Mark mark.re...@bpiedu.com wrote:

Hi all: I'm currently on a Solr 4.5.0 instance and running this tutorial, http://lucene.apache.org/solr/4_5_0/tutorial.html

My question is specific to indexing data as proposed in this tutorial:

$ java -jar post.jar solr.xml monitor.xml

The tutorial advises validating from your localhost: http://localhost:8983/solr/collection1/select?q=solr&wt=xml

However, what if my Solr core has both a collection1 and a collection2, and I want the XML files to be posted to collection2 only? If possible, please advise. Thanks, Mark
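For what it is worth, the update handler is addressed per collection, so the target is chosen by the URL the documents are posted to; with post.jar that URL is normally supplied through the url system property (for example -Durl=http://localhost:8983/solr/collection2/update). The Python sketch below only builds the equivalent HTTP request to show where the collection name goes. The helper names and the sample document are hypothetical, and the request is not sent unless you uncomment the last line against a running Solr.

```python
import urllib.request


def update_url(collection, base="http://localhost:8983/solr"):
    """Build the update handler URL for one collection."""
    return f"{base.rstrip('/')}/{collection}/update"


def build_post(collection, xml_bytes, commit=True):
    """Prepare (but do not send) a POST of an XML <add> message to that collection."""
    url = update_url(collection)
    if commit:
        url += "?commit=true"
    return urllib.request.Request(
        url,
        data=xml_bytes,
        headers={"Content-Type": "application/xml"},
        method="POST",
    )


# Made-up document, only to have a well-formed payload.
payload = b'<add><doc><field name="id">demo-1</field></doc></add>'
request = build_post("collection2", payload)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request)  # assumption: Solr is running locally
```

Validation then uses the same collection name in the select URL, i.e. /solr/collection2/select instead of /solr/collection1/select.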
Facet field query on subset of documents
Hi All, Is it possible to perform a facet field query on a subset of documents (the subset being defined via a filter query for instance)? I understand that facet pivoting might work, but it would require that the subset be defined by some field hierarchy, e.g. manufacturer - price (then only look at the results for the manufacturer I'm interested in). What if I wanted to define a more complex subset (where the name starts with A but ends with Z and some other field is greater than 5 and yet another field is not 'x', etc.)? Ideally I would then define a facet field constraining query to include only terms from documents that pass this query. Thanks, Luis
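As far as I know, facet counts are computed over the documents that match the main query together with all filter queries, so an arbitrarily complex subset can be expressed as a list of fq parameters. The sketch below builds such a request URL in Python; the field names (name, rating, color, manufacturer), the core name and the helper function are placeholders invented for this example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def facet_on_subset_url(facet_field, filters,
                        base="http://localhost:8983/solr/collection1/select"):
    """Facet on one field, counting only documents that pass every filter query."""
    params = [
        ("q", "*:*"),            # match everything, let the filters define the subset
        ("rows", "0"),           # only the facet counts are of interest
        ("facet", "true"),
        ("facet.field", facet_field),
        ("facet.mincount", "1"),  # hide terms that do not occur in the subset
    ]
    params += [("fq", clause) for clause in filters]
    return f"{base}?{urlencode(params)}"


filters = [
    "name:A*Z",          # name starts with A and ends with Z
    "rating:{5 TO *}",   # some other field is greater than 5
    "-color:x",          # yet another field is not 'x'
]
url = facet_on_subset_url("manufacturer", filters)
print(url)
print(parse_qs(urlsplit(url).query)["fq"])
```

Setting facet.mincount to 1 keeps terms with a zero count out of the response, which is usually what "only terms from documents that pass this query" means in practice.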
Re: Parse eDisMax queries for keywords
The query parser does its own tokenization and parsing before your analyzer's tokenizer and filters are called, ensuring that only one whitespace-delimited token is analyzed at a time. You're probably best off having an application-layer preprocessor for the query that enriches the query in the manner that you're describing. Or, simply settle for a heuristic approach that may give you 70% of what you want using only existing Solr features on the server side. -- Jack Krupansky

-----Original Message-----
From: Mirko
Sent: Thursday, November 21, 2013 5:30 AM
To: solr-user@lucene.apache.org
Subject: Parse eDisMax queries for keywords

Hi, We would like to implement special handling for queries that contain certain keywords. Our particular use case: in the example query "Footitle season 1" we want to discover the keyword "season", get the subsequent number, and boost (or filter for) documents that match "1" on <field name="season">. We have two fields in our schema:

<!-- titles contains titles -->
<field name="title" type="text" indexed="true" stored="true" multiValued="false"/>
<fieldType name="text" class="solr.TextField" omitNorms="true">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- ... -->
  </analyzer>
</fieldType>

<field name="season" type="season_number" indexed="true" stored="false" multiValued="false"/>
<!-- season contains season numbers -->
<fieldType name="season_number" class="solr.TextField" omitNorms="true">
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern=".*(?:season) *0*([0-9]+).*" replacement="$1"/>
  </analyzer>
</fieldType>

Our idea was to use a Keyword tokenizer and a regex on the season field to extract the season number from the complete query. However, we use an ExtendedDisMax query parser in our search handler:

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">title season</str>
  </lst>
</requestHandler>

The problem is that eDisMax tokenizes the query, so that our field season receives the tokens [Foo, season, 1] without any order, instead of the complete query. How can we pass the complete (untokenized) query to the season field? We don't understand which tokenizer is used here and why our season field received tokens instead of the complete query. Or is there another approach to solve this use case with Solr? Thanks, Mirko
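The application-layer preprocessor suggested in the reply could look roughly like the following Python sketch. It reuses the idea of the regex from the schema above, pulls "season N" out of the user's query before Solr sees it, and passes the number as an eDisMax boost query (bq). The function name, the boost value and the choice of bq over fq are assumptions made for illustration; the sketch also assumes the season field is indexed as a plain number without the regex analyzer.

```python
import re

# Same idea as the PatternReplaceFilter regex: the keyword, optional spaces,
# leading zeros dropped, then the season number.
SEASON_PATTERN = re.compile(r"\bseason\s*0*([0-9]+)\b", re.IGNORECASE)


def preprocess(user_query, boost=5):
    """Split a raw user query into eDisMax parameters, extracting 'season N' if present."""
    params = {"defType": "edismax", "qf": "title"}
    match = SEASON_PATTERN.search(user_query)
    if match is None:
        params["q"] = user_query.strip()
        return params
    season = match.group(1)
    # Remove the matched keyword and number, then normalise the whitespace.
    remaining = user_query[:match.start()] + " " + user_query[match.end():]
    remaining = " ".join(remaining.split())
    params["q"] = remaining or user_query.strip()
    params["bq"] = f"season:{season}^{boost}"  # use fq instead to filter rather than boost
    return params


print(preprocess("Footitle season 1"))
print(preprocess("Footitle Season 02 finale"))
print(preprocess("Footitle"))
```

Doing the extraction before the request is built sidesteps the question of how the query parser tokenizes, because the season field never has to see the complete query.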
Re: How to retain the original format of input document in search results in SOLR - Tomcat
Solr (actually Lucene) stores the input _exactly_ as it is entered, and returns it the same way. What you're seeing is almost certainly your display mechanism interpreting the results; whitespace is notoriously variable in terms of how it's displayed by various interpretations of the standard. For instance, HTML often just eats whitespace.

On Thu, Nov 21, 2013 at 1:33 AM, ramesh py pyrames...@gmail.com wrote:

Hi All, I am new to Apache Solr. Recently I was able to configure Solr with Tomcat successfully, and it is working fine except for the format of the search results, i.e. the search results are not displayed in the same format as the input document. I am doing the following:

1. Indexing the XML file into Solr.

2. The format of the XML is as below:

<doc>
  <field name="F1">some text</field>
  <field name="F2">
    Title1: descriptions of the title
    Title2 : description of the title2
    Title3 : description of title3
  </field>
  <field name="F3">some text</field>
</doc>

3. After indexing, the results are displayed in the format below:

F1 : some text
F2 : Title1: descriptions of the title Title2 : description of the title2 Title3 : description of title3
F3 : some text

Expected result:

F1 : some text
F2 : Title1: descriptions of the title
Title2 : description of the title2
Title3 : description of title3
F3 : some text

If we look at the F2 field, the format is getting changed: in the input, the F2 field has one line per subtitle, but in the result it is displayed as a single line. I would like the results to be displayed so that whenever a subtitle occurs in the XML file for any field, that subtitle is shown on the next line. Can anyone please help with this? Thanks in advance.

Regards, Ramesh P.Y pyrames...@gmail.com
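If the results are rendered into an HTML page, the line breaks are still present in the stored value; they only need to be made visible at display time. A minimal Python sketch of that display step, assuming the field value arrives as a string with newline characters (the function name is made up):

```python
import html


def render_field(value):
    """Escape the stored text for HTML and turn each line break into a <br> tag."""
    escaped = html.escape(value.strip("\n"))
    return "<br>\n".join(line.strip() for line in escaped.splitlines())


stored_f2 = (
    "\n"
    "    Title1: descriptions of the title\n"
    "    Title2 : description of the title2\n"
    "    Title3 : description of title3\n"
)
print(render_field(stored_f2))
```

An alternative with the same effect is to leave the text untouched and display it inside a <pre> element or with the CSS rule white-space: pre-wrap.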
Re: search with wildcard
Hi Andreas, If you don't want to use wildcards at query time, an alternative is to use NGrams at indexing time. This will produce a lot of tokens. For example, the 4-grams of your example:

Supertestplan = supe uper pert erte rtes *test* estp stpl tpla plan

Is that what you want? By the way, why do you want to search inside of words?

<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="4"/>

On Thursday, November 21, 2013 5:23 PM, Andreas Owen a...@conx.ch wrote:

I suppose I have to create another field with different tokenizers and set the boost very low so it doesn't really mess with my ranking, because the word is then in 2 fields. What kind of tokenizer can do the job?

From: Andreas Owen [mailto:a...@conx.ch]
Sent: Thursday, 21 November 2013 16:13
To: solr-user@lucene.apache.org
Subject: search with wildcard

I am querying "test" in Solr 4.3.1 over the field below and it's not finding all occurrences. It seems that if it is a substring of a word like "Supertestplan", it isn't found unless I use a wildcard: *test*. This is right because of my tokenizer, but does someone know a way around this? I don't want to add wildcards because that messes up queries with multiple words.

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/> <!-- remove common words -->
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/> <!-- remove noun/adjective inflections like plural endings -->
  </analyzer>
</fieldType>
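The token list above is easy to reproduce, which also gives a feel for how many tokens the filter generates. The Python sketch below is a simplified model: it lowercases the token itself (in the real chain that is the job of LowerCaseFilterFactory) and the order of emitted grams is not meant to match what Lucene produces.

```python
def ngrams(token, min_gram, max_gram):
    """Return every substring of the token whose length is between min_gram and max_gram."""
    token = token.lower()
    grams = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(token):
                grams.append(token[start:start + size])
    return grams


four_grams = ngrams("Supertestplan", 4, 4)
print(four_grams)

# The filter as configured above (minGramSize=3, maxGramSize=4) emits both sizes.
configured = ngrams("Supertestplan", 3, 4)
print(len(configured), "test" in configured)
```

One 13-letter word turns into 21 tokens with sizes 3 and 4, which is the index growth the reply warns about.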
Re: How to index X™ as &#8482; (HTML decimal entity)
OK - probably I should have said "A", or "&#97;" :) My point was just that there is not really anything special about special characters.

On 11/21/2013 10:50 AM, Jack Krupansky wrote:

"Would you store 'a' as '&#65;'?" No, not in any case. -- Jack Krupansky

-----Original Message-----
From: Michael Sokolov
Sent: Thursday, November 21, 2013 8:56 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

I have to agree w/Walter. Use Unicode as a storage format. The entity encodings are for transfer/interchange. Encode/decode on the way in and out if you have to. Would you store "a" as "&#65;"? It makes it impossible to search for, for one thing. What if someone wants to search for the TM character? -Mike

On 11/20/13 12:07 PM, Jack Krupansky wrote:

AFAICT, it's not an extremely bad idea - using SGML/HTML as a format for storing text to be rendered. If you disagree - try explaining yourself. But maybe TM should be encoded as &trade;. Ditto for other named SGML entities. -- Jack Krupansky

-----Original Message-----
From: Walter Underwood
Sent: Wednesday, November 20, 2013 11:21 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

Again, I'd like to know why this is wanted. It sounds like an X-Y problem. Storing Unicode characters as XML/HTML encoded character references is an extremely bad idea. wunder

On Nov 20, 2013, at 5:01 AM, Jack Krupansky j...@basetechnology.com wrote:

Any analysis filtering affects the indexed value only, but the stored value would be unchanged from the original input value. An update processor lets you modify the original input value that will be stored. -- Jack Krupansky

-----Original Message-----
From: Uwe Reh
Sent: Wednesday, November 20, 2013 5:43 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

What about having a simple charfilter in the analyzer queue for indexing *and* searching? E.g.

<charFilter class="solr.PatternReplaceFilterFactory" pattern="™" replacement="&#8482;"/>

or

<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-specials.txt"/>

Uwe

On 19.11.2013 at 23:46, Developer wrote:

I have data coming in to SOLR as below.

<field name="displayName">X™ - Black</field>

I need to store the HTML entity (decimal) equivalent value (i.e. &#8482;) in SOLR rather than storing the original value. Is there a way to do this?

-- Walter Underwood wun...@wunderwood.org
RE: search with wildcard
I suppose I have to create another field with different tokenizers and set the boost very low so it doesn't really mess with my ranking, because the word is then in 2 fields. What kind of tokenizer can do the job?

From: Andreas Owen [mailto:a...@conx.ch]
Sent: Thursday, 21 November 2013 16:13
To: solr-user@lucene.apache.org
Subject: search with wildcard

I am querying "test" in Solr 4.3.1 over the field below and it's not finding all occurrences. It seems that if it is a substring of a word like "Supertestplan", it isn't found unless I use a wildcard: *test*. This is right because of my tokenizer, but does someone know a way around this? I don't want to add wildcards because that messes up queries with multiple words.

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/> <!-- remove common words -->
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/> <!-- remove noun/adjective inflections like plural endings -->
  </analyzer>
</fieldType>
Re: search with wildcard
You might be able to make use of the dictionary compound word filter, but you will have to build up a dictionary of words to use: http://lucene.apache.org/core/4_5_1/analyzers-common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilterFactory.html

My e-book has some examples and a better description. -- Jack Krupansky

-----Original Message-----
From: Ahmet Arslan
Sent: Thursday, November 21, 2013 11:40 AM
To: solr-user@lucene.apache.org
Subject: Re: search with wildcard

Hi Andreas, If you don't want to use wildcards at query time, an alternative is to use NGrams at indexing time. This will produce a lot of tokens. For example, the 4-grams of your example:

Supertestplan = supe uper pert erte rtes *test* estp stpl tpla plan

Is that what you want? By the way, why do you want to search inside of words?

<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="4"/>

On Thursday, November 21, 2013 5:23 PM, Andreas Owen a...@conx.ch wrote:

I suppose I have to create another field with different tokenizers and set the boost very low so it doesn't really mess with my ranking, because the word is then in 2 fields. What kind of tokenizer can do the job?

From: Andreas Owen [mailto:a...@conx.ch]
Sent: Thursday, 21 November 2013 16:13
To: solr-user@lucene.apache.org
Subject: search with wildcard

I am querying "test" in Solr 4.3.1 over the field below and it's not finding all occurrences. It seems that if it is a substring of a word like "Supertestplan", it isn't found unless I use a wildcard: *test*. This is right because of my tokenizer, but does someone know a way around this? I don't want to add wildcards because that messes up queries with multiple words.

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/> <!-- remove common words -->
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/> <!-- remove noun/adjective inflections like plural endings -->
  </analyzer>
</fieldType>
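The dictionary approach trades index size for the effort of maintaining a word list. As a rough, simplified model of what a dictionary-based decompounder does (this is not the Lucene implementation; the size limits and the tiny dictionary are assumptions picked for the example), in Python:

```python
def decompound(token, dictionary, min_subword=2, max_subword=15):
    """Keep the original token and add every dictionary word found inside it."""
    lowered = token.lower()
    found = []
    for start in range(len(lowered)):
        for size in range(min_subword, max_subword + 1):
            end = start + size
            if end > len(lowered):
                break
            part = lowered[start:end]
            if part in dictionary:
                found.append(part)
    return [lowered] + found


# Hypothetical dictionary; a real one would hold the vocabulary of your documents.
words = {"super", "test", "plan"}
print(decompound("Supertestplan", words))
print(decompound("Testplan", words))
```

Compared with the n-gram approach, only four tokens are indexed for the compound word, and a plain query for "test" still matches it, but only as long as the dictionary contains the parts you expect people to search for.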
Re: Periodic Slowness on Solr Cloud
Lots of questions. Okay. In digging a little deeper and looking at the config, I see that <nrtMode>true</nrtMode> is commented out. I believe this is the default setting. So I don't know if NRT is enabled or not. Maybe just a red herring.

I don't know what garbage collector we're using. In this test I'm running Solr 4.5.1 using Jetty from the example directory. The CPU on the 8 nodes stays at around 70% use during the test. The nodes have 28GB of RAM. Java is using about 6GB and the rest is being used by the OS cache.

To perform the test we're running 200 concurrent threads in JMeter. The threads hit HAProxy, which load-balances the requests among the nodes. Each query is for a random word out of a list of about 10,000 words. Some of the queries have faceting turned on.

Because we're heavily loading the system, the queries are returning quite slowly. For a simple search, the average response time was 300ms. The peak response time was 11,000ms. The spikes in latency seem to occur about every 2.5 minutes.

I haven't spent that much time messing with SolrConfig, so most of the settings are the out-of-the-box defaults. Where should I start to look? Thanks so much! -Dave

On Thu, Nov 21, 2013 at 6:53 PM, Mark Miller markrmil...@gmail.com wrote:

Yes, more details… Solr version, which garbage collector, how does heap usage look, CPU, etc. - Mark

On Nov 21, 2013, at 6:46 PM, Erick Erickson erickerick...@gmail.com wrote:

How real time is NRT? In particular, what are your commit settings? And can you characterize periodic slowness? Queries that usually take 500ms now take 10s? Or 1s? How often? How are you measuring? Details matter, a lot... Best, Erick

On Thu, Nov 21, 2013 at 6:03 PM, Dave Seltzer dselt...@tveyes.com wrote:

I'm doing some performance testing against an 8-node Solr cloud cluster, and I'm noticing some periodic slowness. http://farm4.staticflickr.com/3668/10985410633_23e26c7681_o.png

I'm doing random test searches against an Alias Collection made up of four smaller (monthly) collections. Like this:

MasterCollection
|- Collection201308
|- Collection201309
|- Collection201310
|- Collection201311

The last collection is constantly updated. New documents are being added at the rate of about 3 documents per second. I believe the slowness may be due to NRT, but I'm not sure. How should I investigate this? If the slowness is related to NRT, how can I alleviate the issue without disabling NRT? Thanks Much! -Dave
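One way to turn "spikes about every 2.5 minutes" into a number that can be compared against commit intervals or GC logs is to extract the spike times from the load-test results and look at the gaps between them. The Python sketch below does that on synthetic data; the sample values, the threshold and the function names are all made up for illustration and are not taken from the actual test run.

```python
from statistics import median


def find_spikes(samples, threshold_ms):
    """Return the timestamps (in seconds) of samples slower than the threshold."""
    return [t for t, latency in samples if latency > threshold_ms]


def burst_starts(spike_times, min_gap_s=30.0):
    """Collapse spikes that follow each other closely into one burst per slowdown."""
    starts = []
    previous = None
    for t in sorted(spike_times):
        if previous is None or t - previous >= min_gap_s:
            starts.append(t)
        previous = t
    return starts


def burst_intervals(starts):
    """Seconds between the start of one slowdown and the start of the next."""
    return [later - earlier for earlier, later in zip(starts, starts[1:])]


# Synthetic data: one sample per second, 300 ms normally, 11 s during a slowdown.
samples = [
    (t, 11000 if t >= 150 and t % 150 in (0, 1) else 300)
    for t in range(600)
]
starts = burst_starts(find_spikes(samples, threshold_ms=5000))
intervals = burst_intervals(starts)
print(starts, intervals, median(intervals))
```

If the interval found this way lines up with a configured commit interval, or with full collections in the GC log, that points to where to look first; if it lines up with neither, the load generator or the proxy are worth checking as well.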
Re: How to index X™ as &#8482; (HTML decimal entity)
"there is not really anything special about special characters"

Well, the distinction was about named entities, which are indeed special. Besides, in general, for more sophisticated text processing, character types are a valid distinction. But all of this begs the question of the original question: "I need to store the HTML Entity (decimal) equivalent value (i.e. &#8482;) in SOLR rather than storing the original value." Maybe the original poster could clarify the nature of their need. -- Jack Krupansky

-----Original Message-----
From: Michael Sokolov
Sent: Thursday, November 21, 2013 11:37 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

OK - probably I should have said "A", or "&#97;" :) My point was just that there is not really anything special about special characters.

On 11/21/2013 10:50 AM, Jack Krupansky wrote:

"Would you store 'a' as '&#65;'?" No, not in any case. -- Jack Krupansky

-----Original Message-----
From: Michael Sokolov
Sent: Thursday, November 21, 2013 8:56 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

I have to agree w/Walter. Use Unicode as a storage format. The entity encodings are for transfer/interchange. Encode/decode on the way in and out if you have to. Would you store "a" as "&#65;"? It makes it impossible to search for, for one thing. What if someone wants to search for the TM character? -Mike

On 11/20/13 12:07 PM, Jack Krupansky wrote:

AFAICT, it's not an extremely bad idea - using SGML/HTML as a format for storing text to be rendered. If you disagree - try explaining yourself. But maybe TM should be encoded as &trade;. Ditto for other named SGML entities. -- Jack Krupansky

-----Original Message-----
From: Walter Underwood
Sent: Wednesday, November 20, 2013 11:21 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

Again, I'd like to know why this is wanted. It sounds like an X-Y problem. Storing Unicode characters as XML/HTML encoded character references is an extremely bad idea. wunder

On Nov 20, 2013, at 5:01 AM, Jack Krupansky j...@basetechnology.com wrote:

Any analysis filtering affects the indexed value only, but the stored value would be unchanged from the original input value. An update processor lets you modify the original input value that will be stored. -- Jack Krupansky

-----Original Message-----
From: Uwe Reh
Sent: Wednesday, November 20, 2013 5:43 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

What about having a simple charfilter in the analyzer queue for indexing *and* searching? E.g.

<charFilter class="solr.PatternReplaceFilterFactory" pattern="™" replacement="&#8482;"/>

or

<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-specials.txt"/>

Uwe

On 19.11.2013 at 23:46, Developer wrote:

I have data coming in to SOLR as below.

<field name="displayName">X™ - Black</field>

I need to store the HTML entity (decimal) equivalent value (i.e. &#8482;) in SOLR rather than storing the original value. Is there a way to do this?

-- Walter Underwood wun...@wunderwood.org
How to implement a conditional copyField working for partial updates ?
Hello, I'm using Solr 4.x. In my Solr schema I have the following fields defined:

<field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>
<field name="all" type="text_general" indexed="true" stored="false" multiValued="true" termVectors="true"/>
<field name="eng" type="text_en" indexed="true" stored="false" multiValued="true" termVectors="true"/>
<field name="ita" type="text_it" indexed="true" stored="false" multiValued="true" termVectors="true"/>
<field name="fre" type="text_fr" indexed="true" stored="false" multiValued="true" termVectors="true"/>
...
<copyField source="content" dest="all"/>

To fill in the language-specific fields, I use a custom update processor chain, with a custom ConditionalCopyProcessor that copies the content field into the appropriate language field, depending on the document language (as explained in http://wiki.apache.org/solr/UpdateRequestProcessor).

The problem is that this custom chain is applied to the update request document. It therefore works all right when inserting a new document or updating the whole document, but I lose the language-specific fields when I do a partial update (as those fields are not stored, and the request document contains only the updated fields). I would like to avoid setting the language-specific fields to stored=true, as the content field may hold big values.

Is there a way to have Solr execute my ConditionalCopyProcessor on the actual updated doc (the one resulting from Solr retrieving all stored values and merging them with the update request values), and not on the request doc?

Thanks a lot for your help. P. Lecuyer
Re: How to index X™ as #8482; (HTML decimal entity)
Ah... now I understand your perspective - you have taken a narrow view of what text is. A broader view is that it can contain formatting and special entities as well, or rich text in general. My read is that it all depends on the nature of the application and its requirements, not a one size fits all approach. The four main approaches being pure ASCII, Unicode/UTF-8, SGML for non-ASCII characters, and full HTML for formatting and rich text. And let the app needs determine which is most appropriate for each piece of text. The goal of SGML and HTML is not to hard-wire the final presentation, but simply to preserve some level of source format and structure, and then apply final presentation formatting on top of that. Some apps may opt to store the same information in multiple formats, such as one for raw text search, one for basic display, and one for detail display. I'm more of a platform guy than an app-specific guy - give the app developer tools that they can blend to meet their own requirements (or interests or tastes.) But Solr users should make no mistake, SGML entities are a perfectly valid intermediate format for rich text. -- Jack Krupansky -Original Message- From: Walter Underwood Sent: Thursday, November 21, 2013 11:44 AM To: solr-user@lucene.apache.org Subject: Re: How to index X™ as ™ (HTML decimal entity) And this is the exact problem. Some characters are stored as entities, some are not. When it is time to display, what else needs escaped? At a minimum, you would have to always store as amp; to avoid escaping the leading ampersand in the entities. You could store every single character as a numeric entity. Or you could store every non-ASCII character as a numeric entity. Or every non-Latin1 character. Plus ampersand, of course. In these e-mails, we are distinguishing between ™ and trade;. How would you do that? By storing trade; as amp;trade;. 
To avoid all this double-think, always store text as Unicode code points, encoded with a standard Unicode method (UTF-8, etc.). When displaying, only make entities if the codepoints cannot be represented in the target character encoding. If you are sending things in US-ASCII, you will be sending lots of entities. A good encoding library has callbacks for characters that cannot be represented. You can use these callbacks to format out-of-charset codepoints as entities. I've done this in product code, it really works. Finally, if you don't believe me, believe the XML Infoset, where numeric entities are always interpreted as Unicode codepoints. The other way to go insane is storing local time in the database. Always store UTC and convert at the edges. wunder On Nov 21, 2013, at 7:50 AM, Jack Krupansky j...@basetechnology.com wrote: Would you store "A" as &#65;? No, not in any case. -- Jack Krupansky -----Original Message----- From: Michael Sokolov Sent: Thursday, November 21, 2013 8:56 AM To: solr-user@lucene.apache.org Subject: Re: How to index X™ as &#8482; (HTML decimal entity) I have to agree w/Walter. Use Unicode as a storage format. The entity encodings are for transfer/interchange. Encode/decode on the way in and out if you have to. Would you store "A" as &#65;? It makes it impossible to search for, for one thing. What if someone wants to search for the TM character? -Mike On 11/20/13 12:07 PM, Jack Krupansky wrote: AFAICT, it's not an extremely bad idea - using SGML/HTML as a format for storing text to be rendered. If you disagree - try explaining yourself. But maybe TM should be encoded as &trade;. Ditto for other named SGML entities. -- Jack Krupansky -----Original Message----- From: Walter Underwood Sent: Wednesday, November 20, 2013 11:21 AM To: solr-user@lucene.apache.org Subject: Re: How to index X™ as &#8482; (HTML decimal entity) Again, I'd like to know why this is wanted. It sounds like an X-Y problem.
Storing Unicode characters as XML/HTML encoded character references is an extremely bad idea. wunder On Nov 20, 2013, at 5:01 AM, Jack Krupansky j...@basetechnology.com wrote: Any analysis filtering affects the indexed value only, but the stored value would be unchanged from the original input value. An update processor lets you modify the original input value that will be stored. -- Jack Krupansky -----Original Message----- From: Uwe Reh Sent: Wednesday, November 20, 2013 5:43 AM To: solr-user@lucene.apache.org Subject: Re: How to index X™ as &#8482; (HTML decimal entity) What about having a simple charfilter in the analyzer queue for indexing *and* searching? e.g. <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="™" replacement="&#8482;" /> or <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-specials.txt" /> Uwe On 19.11.2013 23:46, Developer wrote: I have data coming in to SOLR as below. <field name="displayName">X™ - Black</field> I need to store the HTML entity (decimal) equivalent value (i.e. &#8482;) in SOLR rather than
Re: How to index X™ as &#8482; (HTML decimal entity)
I have to agree w/Walter. Use Unicode as a storage format. The entity encodings are for transfer/interchange. Encode/decode on the way in and out if you have to. Would you store "A" as &#65;? It makes it impossible to search for, for one thing. What if someone wants to search for the TM character? -Mike On 11/20/13 12:07 PM, Jack Krupansky wrote: AFAICT, it's not an extremely bad idea - using SGML/HTML as a format for storing text to be rendered. If you disagree - try explaining yourself. But maybe TM should be encoded as &trade;. Ditto for other named SGML entities. -- Jack Krupansky -----Original Message----- From: Walter Underwood Sent: Wednesday, November 20, 2013 11:21 AM To: solr-user@lucene.apache.org Subject: Re: How to index X™ as &#8482; (HTML decimal entity) Again, I'd like to know why this is wanted. It sounds like an X-Y problem. Storing Unicode characters as XML/HTML encoded character references is an extremely bad idea. wunder On Nov 20, 2013, at 5:01 AM, Jack Krupansky j...@basetechnology.com wrote: Any analysis filtering affects the indexed value only, but the stored value would be unchanged from the original input value. An update processor lets you modify the original input value that will be stored. -- Jack Krupansky -----Original Message----- From: Uwe Reh Sent: Wednesday, November 20, 2013 5:43 AM To: solr-user@lucene.apache.org Subject: Re: How to index X™ as &#8482; (HTML decimal entity) What about having a simple charfilter in the analyzer queue for indexing *and* searching? e.g. <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="™" replacement="&#8482;" /> or <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-specials.txt" /> Uwe On 19.11.2013 23:46, Developer wrote: I have data coming in to SOLR as below. <field name="displayName">X™ - Black</field> I need to store the HTML entity (decimal) equivalent value (i.e. &#8482;) in SOLR rather than storing the original value. Is there a way to do this? -- Walter Underwood wun...@wunderwood.org
Re: How to index X™ as &#8482; (HTML decimal entity)
I know all about formatted text -- I worked at MarkLogic. That is why I mentioned the XML Infoset. Numeric entities are part of the final presentation, really, part of the encoding. They should never be stored. Always store the Unicode. Numeric and named entities are a convenience for tools and encodings that can't handle Unicode. That is all they are. wunder On Nov 21, 2013, at 9:02 AM, Jack Krupansky j...@basetechnology.com wrote: Ah... now I understand your perspective - you have taken a narrow view of what text is. A broader view is that it can contain formatting and special entities as well, or rich text in general. My read is that it all depends on the nature of the application and its requirements, not a one size fits all approach. The four main approaches being pure ASCII, Unicode/UTF-8, SGML for non-ASCII characters, and full HTML for formatting and rich text. And let the app needs determine which is most appropriate for each piece of text. The goal of SGML and HTML is not to hard-wire the final presentation, but simply to preserve some level of source format and structure, and then apply final presentation formatting on top of that. Some apps may opt to store the same information in multiple formats, such as one for raw text search, one for basic display, and one for detail display. I'm more of a platform guy than an app-specific guy - give the app developer tools that they can blend to meet their own requirements (or interests or tastes.) But Solr users should make no mistake, SGML entities are a perfectly valid intermediate format for rich text. -- Jack Krupansky -Original Message- From: Walter Underwood Sent: Thursday, November 21, 2013 11:44 AM To: solr-user@lucene.apache.org Subject: Re: How to index X™ as ™ (HTML decimal entity) And this is the exact problem. Some characters are stored as entities, some are not. When it is time to display, what else needs escaped? 
At a minimum, you would have to always store & as &amp; to avoid escaping the leading ampersand in the entities. You could store every single character as a numeric entity. Or you could store every non-ASCII character as a numeric entity. Or every non-Latin1 character. Plus ampersand, of course. In these e-mails, we are distinguishing between ™ and &trade;. How would you do that? By storing &trade; as &amp;trade;. To avoid all this double-think, always store text as Unicode code points, encoded with a standard Unicode method (UTF-8, etc.). When displaying, only make entities if the codepoints cannot be represented in the target character encoding. If you are sending things in US-ASCII, you will be sending lots of entities. A good encoding library has callbacks for characters that cannot be represented. You can use these callbacks to format out-of-charset codepoints as entities. I've done this in product code, it really works. Finally, if you don't believe me, believe the XML Infoset, where numeric entities are always interpreted as Unicode codepoints. The other way to go insane is storing local time in the database. Always store UTC and convert at the edges. wunder On Nov 21, 2013, at 7:50 AM, Jack Krupansky j...@basetechnology.com wrote: Would you store "A" as &#65;? No, not in any case. -- Jack Krupansky -----Original Message----- From: Michael Sokolov Sent: Thursday, November 21, 2013 8:56 AM To: solr-user@lucene.apache.org Subject: Re: How to index X™ as &#8482; (HTML decimal entity) I have to agree w/Walter. Use Unicode as a storage format. The entity encodings are for transfer/interchange. Encode/decode on the way in and out if you have to. Would you store "A" as &#65;? It makes it impossible to search for, for one thing. What if someone wants to search for the TM character? -Mike On 11/20/13 12:07 PM, Jack Krupansky wrote: AFAICT, it's not an extremely bad idea - using SGML/HTML as a format for storing text to be rendered. If you disagree - try explaining yourself.
But maybe TM should be encoded as &trade;. Ditto for other named SGML entities. -- Jack Krupansky -----Original Message----- From: Walter Underwood Sent: Wednesday, November 20, 2013 11:21 AM To: solr-user@lucene.apache.org Subject: Re: How to index X™ as &#8482; (HTML decimal entity) Again, I'd like to know why this is wanted. It sounds like an X-Y problem. Storing Unicode characters as XML/HTML encoded character references is an extremely bad idea. wunder On Nov 20, 2013, at 5:01 AM, Jack Krupansky j...@basetechnology.com wrote: Any analysis filtering affects the indexed value only, but the stored value would be unchanged from the original input value. An update processor lets you modify the original input value that will be stored. -- Jack Krupansky -----Original Message----- From: Uwe Reh Sent: Wednesday, November 20, 2013 5:43 AM To: solr-user@lucene.apache.org Subject:
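The "store Unicode, make entities only at the edge" approach argued for in this thread can be sketched in a few lines of Python. This is only an illustration, not anything posted by the participants: the function name render_for_charset and the sample value are made up, and Python's built-in "xmlcharrefreplace" error handler stands in for the encoding-library callback mentioned above.

```python
import html


def render_for_charset(text: str, charset: str) -> bytes:
    """Escape markup characters, then encode for the target charset.

    Code points the charset cannot represent are written as numeric
    character references by the built-in "xmlcharrefreplace" handler.
    """
    # Escape &, < and > first so that entities produced below stay unambiguous.
    escaped = html.escape(text, quote=False)
    return escaped.encode(charset, errors="xmlcharrefreplace")


# Hypothetical stored value: plain Unicode, no entities in storage.
stored = "X\u2122 - Black & White"

print(render_for_charset(stored, "ascii"))   # entity is created only here
print(render_for_charset(stored, "utf-8"))   # trademark sign goes out as UTF-8 bytes
```

Because the stored value stays plain Unicode, a search for the trademark character itself still works, and the ampersand question only comes up at output time.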
Re: Indexing data to a specific collection in Solr 4.5.0
So then, $ java -Durl=http://localhost:8983/solr/collection2/update -jar post.jar solr.xml monitor.xml On 11/21/13, 8:14 AM, xiezhide xiezh...@gmail.com wrote: Add -Durl=http://localhost:8983/solr/collection2/update when you run post.jar. (This email was sent from 189 Mail.) Reyes, Mark mark.re...@bpiedu.com wrote: Hi all: I’m currently on a Solr 4.5.0 instance and running this tutorial, http://lucene.apache.org/solr/4_5_0/tutorial.html My question is specific to indexing data as proposed in this tutorial, $ java -jar post.jar solr.xml monitor.xml The tutorial advises validating from your localhost, http://localhost:8983/solr/collection1/select?q=solr&wt=xml However, what if my Solr instance has both a collection1 and a collection2, and I want the XML files to be posted to collection2 only? If possible, please advise. Thanks, Mark IMPORTANT NOTICE: This e-mail message is intended to be received only by persons entitled to receive the confidential information it may contain. E-mail messages sent from Bridgepoint Education may contain information that is confidential and may be legally privileged. Please do not read, copy, forward or store this message unless you are an intended recipient of it. If you received this transmission in error, please notify the sender by reply e-mail and delete the message and any attachments.
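For anyone scripting this instead of using post.jar, the same idea (send the XML to the update handler of the collection you want) might look roughly like the Python sketch below. The helper names update_url and post_xml are invented for illustration, and the base URL and collection name are just the ones used in this thread.

```python
import urllib.request


def update_url(base_url: str, collection: str, commit: bool = True) -> str:
    """Build the update-handler URL for one specific collection."""
    url = f"{base_url.rstrip('/')}/{collection}/update"
    if commit:
        url += "?commit=true"
    return url


def post_xml(path: str, base_url: str, collection: str) -> int:
    """POST one Solr XML file to the given collection; returns the HTTP status."""
    with open(path, "rb") as handle:
        body = handle.read()
    request = urllib.request.Request(
        update_url(base_url, collection),
        data=body,
        headers={"Content-Type": "application/xml"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Example (needs a running Solr, so it is left commented out):
# for name in ("solr.xml", "monitor.xml"):
#     post_xml(name, "http://localhost:8983/solr", "collection2")
```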
Re: How to index X™ as &#8482; (HTML decimal entity)
Would you store "A" as &#65;? No, not in any case. -- Jack Krupansky -----Original Message----- From: Michael Sokolov Sent: Thursday, November 21, 2013 8:56 AM To: solr-user@lucene.apache.org Subject: Re: How to index X™ as &#8482; (HTML decimal entity) I have to agree w/Walter. Use Unicode as a storage format. The entity encodings are for transfer/interchange. Encode/decode on the way in and out if you have to. Would you store "A" as &#65;? It makes it impossible to search for, for one thing. What if someone wants to search for the TM character? -Mike On 11/20/13 12:07 PM, Jack Krupansky wrote: AFAICT, it's not an extremely bad idea - using SGML/HTML as a format for storing text to be rendered. If you disagree - try explaining yourself. But maybe TM should be encoded as &trade;. Ditto for other named SGML entities. -- Jack Krupansky -----Original Message----- From: Walter Underwood Sent: Wednesday, November 20, 2013 11:21 AM To: solr-user@lucene.apache.org Subject: Re: How to index X™ as &#8482; (HTML decimal entity) Again, I'd like to know why this is wanted. It sounds like an X-Y problem. Storing Unicode characters as XML/HTML encoded character references is an extremely bad idea. wunder On Nov 20, 2013, at 5:01 AM, Jack Krupansky j...@basetechnology.com wrote: Any analysis filtering affects the indexed value only, but the stored value would be unchanged from the original input value. An update processor lets you modify the original input value that will be stored. -- Jack Krupansky -----Original Message----- From: Uwe Reh Sent: Wednesday, November 20, 2013 5:43 AM To: solr-user@lucene.apache.org Subject: Re: How to index X™ as &#8482; (HTML decimal entity) What about having a simple charfilter in the analyzer queue for indexing *and* searching? e.g. <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="™" replacement="&#8482;" /> or <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-specials.txt" /> Uwe On 19.11.2013 23:46, Developer wrote: I have data coming in to SOLR as below.
<field name="displayName">X™ - Black</field> I need to store the HTML entity (decimal) equivalent value (i.e. &#8482;) in SOLR rather than storing the original value. Is there a way to do this? -- Walter Underwood wun...@wunderwood.org
RE: Periodic Slowness on Solr Cloud
Dave, you might want to connect JVisualVM and see if there's any pattern between latency and garbage collection. That's a frequent culprit for periodic hits in latency. More info here: http://docs.oracle.com/javase/6/docs/technotes/guides/visualvm/jmx_connections.html There are a couple of GC implementations in Java that can be tuned as needed. With JVisualVM you can also add the MBeans plugin to get a ton of performance stats out of Solr that might help debug latency issues. Doug From: Dave Seltzer Sent: 11/21/2013 8:42 PM To: solr-user@lucene.apache.org Subject: Re: Periodic Slowness on Solr Cloud Lots of questions. Okay. In digging a little deeper and looking at the config I see that <nrtMode>true</nrtMode> is commented out. I believe this is the default setting. So I don't know if NRT is enabled or not. Maybe just a red herring. I don't know what Garbage Collector we're using. In this test I'm running Solr 4.5.1 using Jetty from the example directory. CPU use on all 8 nodes stays around 70% during the test. The nodes have 28GB of RAM. Java is using about 6GB and the rest is being used by OS cache. To perform the test we're running 200 concurrent threads in JMeter. The threads hit HAProxy, which load-balances the requests among the nodes. Each query is for a random word out of a list of about 10,000 words. Some of the queries have faceting turned on. Because we're heavily loading the system the queries are returning quite slowly. For a simple search, the average response time was 300ms. The peak response time was 11,000ms. The spikes in latency seem to occur about every 2.5 minutes. I haven't spent that much time messing with SolrConfig, so most of the settings are the out-of-the-box defaults. Where should I start to look? Thanks so much! -Dave On Thu, Nov 21, 2013 at 6:53 PM, Mark Miller markrmil...@gmail.com wrote: Yes, more details… Solr version, which garbage collector, how does heap usage look, cpu, etc.
- Mark On Nov 21, 2013, at 6:46 PM, Erick Erickson erickerick...@gmail.com wrote: How real time is NRT? In particular, what are your commit settings? And can you characterize periodic slowness? Queries that usually take 500ms now take 10s? Or 1s? How often? How are you measuring? Details matter, a lot... Best, Erick On Thu, Nov 21, 2013 at 6:03 PM, Dave Seltzer dselt...@tveyes.com wrote: I'm doing some performance testing against an 8-node Solr cloud cluster, and I'm noticing some periodic slowness. http://farm4.staticflickr.com/3668/10985410633_23e26c7681_o.png I'm doing random test searches against an Alias Collection made up of four smaller (monthly) collections. Like this:
MasterCollection
|- Collection201308
|- Collection201309
|- Collection201310
|- Collection201311
The last collection is constantly updated. New documents are being added at the rate of about 3 documents per second. I believe the slowness may be due to NRT, but I'm not sure. How should I investigate this? If the slowness is related to NRT, how can I alleviate the issue without disabling NRT? Thanks Much! -Dave
a function query of time, frequency and score.
Hi, guys. I indexed 1000 documents, which have fields like title, ptime and frequency. The title is a text field, the ptime is a date field, and the frequency is an int field. The frequency field goes up and down: sometimes its value is 0, and sometimes it is 999. Right now, in my app, the query works well with a function query. The function query is implemented as the score multiplied by a decreasing date-weight array. However, I have no idea how to add the frequency to this formula, so could someone give me a clue? Thanks again! sling -- View this message in context: http://lucene.472066.n3.nabble.com/a-function-query-of-time-frequency-and-score-tp4102531.html Sent from the Solr - User mailing list archive at Nabble.com.
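One possible shape for such a formula, sketched in Python purely to show the arithmetic: keep the existing score times date weight, and multiply by a dampened frequency factor so that a frequency of 0 leaves the score unchanged and 999 does not swamp it. The function name and the choice of 1 + log10(1 + frequency) are assumptions for illustration, not something from the poster's setup. In Solr itself a similar expression could presumably be built from the product, sum and log function queries.

```python
import math


def blended_score(relevance: float, date_weight: float, frequency: int) -> float:
    """Combine relevance, a recency weight and a dampened frequency boost.

    frequency = 0   -> boost of 1.0 (score unchanged)
    frequency = 999 -> boost of 4.0 (1 + log10(1000))
    """
    frequency_boost = 1.0 + math.log10(1.0 + frequency)
    return relevance * date_weight * frequency_boost


# Hypothetical values: relevance 2.0 and date weight 0.5 for an older document.
print(blended_score(2.0, 0.5, 0))
print(blended_score(2.0, 0.5, 999))
```

The logarithm is one way to keep a field that swings between 0 and 999 from dominating the text relevance; a linear factor would make frequency the main ranking signal.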
Re: Periodic Slowness on Solr Cloud
Additional info on GC selection: http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html#available_collectors "If response time is more important than overall throughput and garbage collection pauses must be kept shorter than approximately one second, then select the concurrent collector with -XX:+UseConcMarkSweepGC. If only one or two processors are available, consider using incremental mode, described below." I'm not entirely certain of the implications of GC tuning for SolrCloud. I imagine distributed searching is going to be as slow as the slowest core being queried. I'd also be curious as to the root cause of any excess GC churn. It sounds like you're doing a ton of random queries. This probably creates a lot of evictions in your caches. There's nothing really worth caching, so the caches fill up and empty frequently, causing a lot of heap activity. If you expect to have high load and a ton of turnover in queries, then tuning down cache size might help minimize GC churn. Solr Meter is another great tool for your perf testing that can help get at some of these caching issues. It gives you some higher-level stats about cache eviction, etc. https://code.google.com/p/solrmeter/ -Doug On Thu, Nov 21, 2013 at 10:24 PM, Doug Turnbull dturnb...@opensourceconnections.com wrote: Dave, you might want to connect JVisualVM and see if there's any pattern between latency and garbage collection. That's a frequent culprit for periodic hits in latency. More info here: http://docs.oracle.com/javase/6/docs/technotes/guides/visualvm/jmx_connections.html There are a couple of GC implementations in Java that can be tuned as needed. With JVisualVM you can also add the MBeans plugin to get a ton of performance stats out of Solr that might help debug latency issues. Doug From: Dave Seltzer Sent: 11/21/2013 8:42 PM To: solr-user@lucene.apache.org Subject: Re: Periodic Slowness on Solr Cloud Lots of questions. Okay.
In digging a little deeper and looking at the config I see that <nrtMode>true</nrtMode> is commented out. I believe this is the default setting. So I don't know if NRT is enabled or not. Maybe just a red herring. I don't know what Garbage Collector we're using. In this test I'm running Solr 4.5.1 using Jetty from the example directory. CPU use on all 8 nodes stays around 70% during the test. The nodes have 28GB of RAM. Java is using about 6GB and the rest is being used by OS cache. To perform the test we're running 200 concurrent threads in JMeter. The threads hit HAProxy, which load-balances the requests among the nodes. Each query is for a random word out of a list of about 10,000 words. Some of the queries have faceting turned on. Because we're heavily loading the system the queries are returning quite slowly. For a simple search, the average response time was 300ms. The peak response time was 11,000ms. The spikes in latency seem to occur about every 2.5 minutes. I haven't spent that much time messing with SolrConfig, so most of the settings are the out-of-the-box defaults. Where should I start to look? Thanks so much! -Dave On Thu, Nov 21, 2013 at 6:53 PM, Mark Miller markrmil...@gmail.com wrote: Yes, more details… Solr version, which garbage collector, how does heap usage look, cpu, etc. - Mark On Nov 21, 2013, at 6:46 PM, Erick Erickson erickerick...@gmail.com wrote: How real time is NRT? In particular, what are your commit settings? And can you characterize periodic slowness? Queries that usually take 500ms now take 10s? Or 1s? How often? How are you measuring? Details matter, a lot... Best, Erick On Thu, Nov 21, 2013 at 6:03 PM, Dave Seltzer dselt...@tveyes.com wrote: I'm doing some performance testing against an 8-node Solr cloud cluster, and I'm noticing some periodic slowness. http://farm4.staticflickr.com/3668/10985410633_23e26c7681_o.png I'm doing random test searches against an Alias Collection made up of four smaller (monthly) collections.
Like this:
MasterCollection
|- Collection201308
|- Collection201309
|- Collection201310
|- Collection201311
The last collection is constantly updated. New documents are being added at the rate of about 3 documents per second. I believe the slowness may be due to NRT, but I'm not sure. How should I investigate this? If the slowness is related to NRT, how can I alleviate the issue without disabling NRT? Thanks Much! -Dave -- Doug Turnbull Search Big Data Architect OpenSource Connections http://o19s.com
Re: Periodic Slowness on Solr Cloud
Thanks Doug! One thing I'm not clear on is how I can tell whether this is in fact related to garbage collection. If you're right, and the cluster is only as slow as its slowest link, how do I determine that this is GC? Do I have to run the profiler on all eight nodes? Or is it a matter of turning on the correct logging and then watching and waiting? Thanks! -D On Thu, Nov 21, 2013 at 11:20 PM, Doug Turnbull dturnb...@opensourceconnections.com wrote: Additional info on GC selection: http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html#available_collectors "If response time is more important than overall throughput and garbage collection pauses must be kept shorter than approximately one second, then select the concurrent collector with -XX:+UseConcMarkSweepGC. If only one or two processors are available, consider using incremental mode, described below." I'm not entirely certain of the implications of GC tuning for SolrCloud. I imagine distributed searching is going to be as slow as the slowest core being queried. I'd also be curious as to the root cause of any excess GC churn. It sounds like you're doing a ton of random queries. This probably creates a lot of evictions in your caches. There's nothing really worth caching, so the caches fill up and empty frequently, causing a lot of heap activity. If you expect to have high load and a ton of turnover in queries, then tuning down cache size might help minimize GC churn. Solr Meter is another great tool for your perf testing that can help get at some of these caching issues. It gives you some higher-level stats about cache eviction, etc. https://code.google.com/p/solrmeter/ -Doug On Thu, Nov 21, 2013 at 10:24 PM, Doug Turnbull dturnb...@opensourceconnections.com wrote: Dave, you might want to connect JVisualVM and see if there's any pattern between latency and garbage collection. That's a frequent culprit for periodic hits in latency.
More info here: http://docs.oracle.com/javase/6/docs/technotes/guides/visualvm/jmx_connections.html There are a couple of GC implementations in Java that can be tuned as needed. With JVisualVM you can also add the MBeans plugin to get a ton of performance stats out of Solr that might help debug latency issues. Doug From: Dave Seltzer Sent: 11/21/2013 8:42 PM To: solr-user@lucene.apache.org Subject: Re: Periodic Slowness on Solr Cloud Lots of questions. Okay. In digging a little deeper and looking at the config I see that <nrtMode>true</nrtMode> is commented out. I believe this is the default setting. So I don't know if NRT is enabled or not. Maybe just a red herring. I don't know what Garbage Collector we're using. In this test I'm running Solr 4.5.1 using Jetty from the example directory. CPU use on all 8 nodes stays around 70% during the test. The nodes have 28GB of RAM. Java is using about 6GB and the rest is being used by OS cache. To perform the test we're running 200 concurrent threads in JMeter. The threads hit HAProxy, which load-balances the requests among the nodes. Each query is for a random word out of a list of about 10,000 words. Some of the queries have faceting turned on. Because we're heavily loading the system the queries are returning quite slowly. For a simple search, the average response time was 300ms. The peak response time was 11,000ms. The spikes in latency seem to occur about every 2.5 minutes. I haven't spent that much time messing with SolrConfig, so most of the settings are the out-of-the-box defaults. Where should I start to look? Thanks so much! -Dave On Thu, Nov 21, 2013 at 6:53 PM, Mark Miller markrmil...@gmail.com wrote: Yes, more details… Solr version, which garbage collector, how does heap usage look, cpu, etc. - Mark On Nov 21, 2013, at 6:46 PM, Erick Erickson erickerick...@gmail.com wrote: How real time is NRT? In particular, what are your commit settings? And can you characterize periodic slowness?
Queries that usually take 500ms now take 10s? Or 1s? How often? How are you measuring? Details matter, a lot... Best, Erick On Thu, Nov 21, 2013 at 6:03 PM, Dave Seltzer dselt...@tveyes.com wrote: I'm doing some performance testing against an 8-node Solr cloud cluster, and I'm noticing some periodic slowness. http://farm4.staticflickr.com/3668/10985410633_23e26c7681_o.png I'm doing random test searches against an Alias Collection made up of four smaller (monthly) collections. Like this:
MasterCollection
|- Collection201308
|- Collection201309
|- Collection201310
|- Collection201311
The last collection is constantly updated. New documents are being added at the rate of about 3 documents per second. I believe the slowness may be due to NRT, but I'm not sure. How should I investigate this? If the slowness is related to NRT, how can I
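On the "logging, then watch and wait" route: one low-effort way to test the GC theory is to turn on GC logging on each node (HotSpot options such as -verbose:gc, -XX:+PrintGCDetails and -XX:+PrintGCTimeStamps) and then check how many of the slow requests overlap a logged pause. The Python sketch below assumes the pauses and the slow requests have already been extracted as (start, duration) pairs in seconds on a common clock; that input format and the function name are made up for illustration, and the numbers are invented.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_seconds, duration_seconds)


def fraction_explained_by_gc(slow_requests: List[Interval],
                             gc_pauses: List[Interval]) -> float:
    """Return the share of slow requests that overlap at least one GC pause."""

    def overlaps(a: Interval, b: Interval) -> bool:
        # Two intervals overlap when each one starts before the other ends.
        return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]

    if not slow_requests:
        return 0.0
    hits = sum(
        1 for request in slow_requests
        if any(overlaps(request, pause) for pause in gc_pauses)
    )
    return hits / len(slow_requests)


# Invented sample data: two spikes line up with pauses, one does not.
slow = [(10.0, 11.0), (160.0, 9.0), (300.0, 2.0)]
pauses = [(12.0, 8.0), (162.0, 6.0)]
print(fraction_explained_by_gc(slow, pauses))
```

A ratio close to 1 on some node points at GC there; a ratio near 0 everywhere suggests looking at commits and cache warming instead.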
Re: SolrServerException while adding an invalid UNIQUE_KEY in solr 4.4
Thanks, Shawn, for your response. So, from your email, it seems that unique_key validation is handled differently from other field validation. But what I am not very clear about is what the unique_key has to do with finding the live server. If there is any mismatch in the unique_key, it throws a SolrServerException saying "No live servers found". Live servers are sourced from the clusterstate in ZooKeeper, whereas I feel the unique key is particular to a core/index. So I am looking to understand the nature of this exception. Please explain to me how the unique_key and live servers are related. -- View this message in context: http://lucene.472066.n3.nabble.com/SolrServerException-while-adding-an-invalid-UNIQUE-KEY-in-solr-4-4-tp4102346p4102533.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Periodic Slowness on Solr Cloud
On 11/21/2013 6:41 PM, Dave Seltzer wrote:
> In digging a little deeper and looking at the config I see that <nrtMode>true</nrtMode> is commented out. I believe this is the default setting. So I don't know if NRT is enabled or not. Maybe just a red herring.
I had never seen this setting before. The default is true. SolrCloud requires that it be set to true. Looks like it's a new parameter in 4.5, added by SOLR-4909. From what I can tell reading the issue, turning it off effectively disables soft commits. https://issues.apache.org/jira/browse/SOLR-4909 You've said that you are adding about 3 documents per second, but you haven't said anything about how often you are doing commits. Erick's question basically boils down to this: How quickly after indexing do you expect the changes to be visible on a search, and how often are you doing commits? Generally speaking (and ignoring the fact that nrtMode now exists), NRT is not something you enable, it's something you try to achieve, by using soft commits quickly and often, and by adjusting the configuration to make the commits go faster. If you are trying to keep the interval between indexing and document visibility down to less than a few seconds (especially if it's less than one second), then you are trying to achieve NRT. There's a lot of information on the following wiki page about performance problems. This specific link is to the last part of that page, which deals with slow commits: http://wiki.apache.org/solr/SolrPerformanceProblems#Slow_commits
> I don't know what Garbage Collector we're using. In this test I'm running Solr 4.5.1 using Jetty from the example directory.
If you aren't using any tuning parameters beyond setting the max heap, then you are using the default parallel collector. It's a poor choice for Solr unless your heap is very small. At 6GB, yours isn't very small. It's not particularly huge either, but not small.
> The CPU on the 8 nodes all stay around 70% use during the test. The nodes have 28GB of RAM.
> Java is using about 6GB and the rest is being used by OS cache.
How big is your index? If it's larger than about 30 GB, you probably need more memory. If it's much larger than about 40 GB, you definitely need more memory.
> To perform the test we're running 200 concurrent threads in JMeter. The threads hit HAProxy which loadbalances the requests among the nodes. Each query is for a random word out of a list of about 10,000 words. Some of the queries have faceting turned on.
That's a pretty high query load. If you want to get anywhere near top performance out of it, you'll want to have enough memory to fit your entire index into RAM. You'll also need to reduce the load introduced by indexing. A large part of the load from indexing comes from commits.
> Because we're heavily loading the system the queries are returning quite slowly. For a simple search, the average response time was 300ms. The peak response time was 11,000ms. The spikes in latency seem to occur about every 2.5 minutes.
I would bet that you're having one or both of the following issues:
1) Garbage collection issues from one or more of the following:
   a) Heap too small.
   b) Using the default GC instead of CMS with tuning.
2) General performance issues from one or more of the following:
   a) Not enough cache memory for your index size.
   b) Too-frequent commits.
   c) Commits taking a lot of time and resources due to cache warming.
With a high query and index load, any problems become magnified.
> I haven't spent that much time messing with SolrConfig, so most of the settings are the out-of-the-box defaults.
The defaults are very good for small to medium indexes and low to medium query load. If you have a big index and/or high query load, you'll generally need to tune. Thanks, Shawn
Re: Best implementation for multi-price store?
Hi Robert, That was the idea, dynamic fields, so, as you said, it is easier to sort and filter. Besides, with dynamic fields it would be easier to add new stores, as I wouldn't have to modify the schema :) Thanks for the answer! 2013/11/21 Petersen, Robert robert.peter...@mail.rakuten.com Hi, I'd go with (2) also but using dynamic fields so you don't have to define all the storeX_price fields in your schema but rather just one *_price field. Then when you filter on store:store1 you'd know to sort with store1_price and so forth for units. That should be pretty straightforward. Hope that helps, Robi -----Original Message----- From: Alejandro Marqués Rodríguez [mailto: amarq...@paradigmatecnologico.com] Sent: Thursday, November 21, 2013 1:36 AM To: solr-user@lucene.apache.org Subject: Best implementation for multi-price store? Hi, I've recently been asked to implement an application to search products from several stores, each store having different prices and stock for the same product. So I have products that have the usual fields (name, description, brand, etc.) and also number of units and price for each store. I must be able to filter for a given store and order by stock or price for that store. The application should also allow increasing the number of stores, the number of store-dependent fields and the number of products without much work. The numbers for the application are more or less 100 stores and 7M products. I've been thinking of some ways of defining the index structure but I don't know which one is better, as I think each one has its pros and cons. 1. *Each product-store as a document:* Denormalizing the information so for every product and store I have a different document. Pros are that I can filter and order without problems and that adding a new store-dependent field is very easy. Cons are that the index goes from 7M documents to 700M and that most of the info is redundant, as most of the fields are repeated among stores. 2.
*Each field-store as a field:* For example for price I would have store1_price, store2_price, Pros are that the index stays at 7M documents, and I can still filter and sort by those fields. Cons are that I have to add some logic so if I filter by one store I order for the associated price field, and that number of fields increases as number of store-depending fields x number of stores. I don't know if having more fields affects performance, but adding new store-depending fields will increase the number of fields even more 3. *Join:* First time I read about solr joins thought it was the way to go in this case, but after reading a bit more and doing some tests I'm not so sure about it... Maybe I've done it wrong but I think it also denormalizes the info (So I will also havee 700M documents) and besides I can't order or filter by store fields. I must say my preferred option is number 2, so I don't duplicate information, I keep a relatively small number of documents and I can filter and sort by the store fields. However, my main concern here is I don't know if having too many fields in a document will be harmful to performance. Which one do you think is the best approach for this application? Is there a better approach that I have missed? Thanks in advance -- Alejandro Marqués Rodríguez Paradigma Tecnológico http://www.paradigmatecnologico.com Avenida de Europa, 26. Ática 5. 3ª Planta 28224 Pozuelo de Alarcón Tel.: 91 352 59 42 -- Alejandro Marqués Rodríguez Paradigma Tecnológico http://www.paradigmatecnologico.com Avenida de Europa, 26. Ática 5. 3ª Planta 28224 Pozuelo de Alarcón Tel.: 91 352 59 42
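
The "add some logic" step of option 2, combined with the dynamic-field suggestion, could look roughly like the sketch below. It only derives the per-store field name and assembles the standard Solr q, fq and sort parameters as plain strings; it does not talk to Solr. The class and method names are hypothetical, and the store field and the "units" suffix are assumptions based on the store:store1 and store1_price examples in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Solr or SolrJ.
// Assumes a "store" field holding the store ids, plus dynamic fields
// matching patterns such as *_price and *_units (assumed naming).
class StoreQueryParams {

    // Builds the per-store field name, e.g. ("store1", "price") -> "store1_price".
    static String storeField(String storeId, String suffix) {
        return storeId + "_" + suffix;
    }

    // Assembles q, fq and sort so the filter and the sort always target the same store.
    static Map<String, String> build(String userQuery, String storeId,
                                     String sortBy, boolean ascending) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", userQuery);
        params.put("fq", "store:" + storeId);
        params.put("sort", storeField(storeId, sortBy) + (ascending ? " asc" : " desc"));
        return params;
    }

    public static void main(String[] args) {
        // Example: cheapest matches first within store1.
        System.out.println(build("tv", "store1", "price", true));
    }
}
```

Keeping the field-name derivation in one place means a new store needs no code or schema change, as long as its fields follow the same naming pattern.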
Re: SolrServerException while adding an invalid UNIQUE_KEY in solr 4.4
On 11/21/2013 9:51 PM, RadhaJayalakshmi wrote:

Thanks Shawn for your response. So, from your email, it seems that unique_key validation is handled differently from other field validation. But what I am not very clear about is what the unique_key has to do with finding the live server. If there is any mismatch in the unique_key, it throws a SolrServerException saying "No live servers found". Live servers are sourced from the ZooKeeper clusterstate, while the unique key, I feel, is particular to a core/index. So I am looking to understand the nature of this exception. Please explain to me how unique_key and live servers are related.

It's the HTTP error code, 500, which means internal server error. SolrJ interprets this to mean that there's something wrong with that server, which is what the HTTP protocol specification says it must do. That makes it try the next server. Because the problem is not actually a server issue, the next server returns the same error. This continues until it's tried them all and gives up.

The validation for other fields returns a different error, one that SolrJ interprets as a problem with the request, so it doesn't try other servers.

Strictly speaking, Solr probably should not return error 500 for unique key validation issues, which makes this a minor bug. The actual results are correct, because the update fails and the application is notified. If all possible exceptions are caught, then it all works correctly.

Thanks,
Shawn
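
The failover behaviour described above can be modelled with a small, self-contained sketch. This is not SolrJ code and makes no claim about SolrJ's actual implementation; every class and method name is hypothetical, and the server is reduced to a function returning an HTTP status code. It only illustrates why a 500 answer from every node ends in a "no live servers" error, while a request-level error stops at the first node.

```java
import java.util.List;
import java.util.function.Function;

// Simplified model of the failover behaviour described in the reply.
// NOT SolrJ code: all names here are hypothetical.
class FailoverSketch {

    // The request itself is bad (4xx); trying another server would not help.
    static class BadRequestException extends RuntimeException {
        BadRequestException(String msg) { super(msg); }
    }

    // Every server answered with a server-side error (5xx).
    static class NoLiveServersException extends RuntimeException {
        NoLiveServersException(String msg) { super(msg); }
    }

    // Tries each server in turn. 'sender' maps a server URL to the HTTP
    // status it returned. Returns the URL of the server that accepted the request.
    static String send(List<String> servers, Function<String, Integer> sender) {
        for (String server : servers) {
            int status = sender.apply(server);
            if (status >= 200 && status < 300) {
                return server; // success, stop here
            }
            if (status >= 400 && status < 500) {
                // Problem with the request: fail immediately, no failover.
                throw new BadRequestException("status " + status + " from " + server);
            }
            // 5xx or anything else: treat this server as unhealthy, try the next one.
        }
        throw new NoLiveServersException("No live servers available to handle this request");
    }
}
```

In this model, a document that makes every node answer 500 is sent once per node before the caller sees NoLiveServersException, which matches the symptom reported in this thread. Catching both exception types in the indexing code covers either outcome.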