Re: boosting words from specific list
I have not tried it, but I would check the option of using the SynonymFilter to duplicate certain query words. Another option - you can detect these words at index time (e.g. in an UpdateProcessor) and give those documents a document boost, in case that fits your logic. Or even make a copyField that contains only the whitelist words and query two fields on each query - the original and the copyField. With debugQuery you'll be able to get the scores and adjust your boosts. Small issue, many solutions. See what works for you. Manuel
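The two-field idea above can be sketched as query construction. This is a hedged illustration, not code from the thread: the field names `body` and `body_whitelist` and the boost value are assumptions.

```python
# Sketch of the copyField boosting approach: query the original field plus a
# copyField populated only with whitelisted words, so whitelist matches add
# extra score. Field names and the boost factor are illustrative assumptions.

def build_boosted_params(user_query: str, boost: float = 3.0) -> dict:
    return {
        "defType": "edismax",
        "q": user_query,
        # a match in body_whitelist contributes its score times the boost
        "qf": f"body body_whitelist^{boost}",
        "debugQuery": "true",  # inspect per-field score contributions, then tune
    }

params = build_boosted_params("quick brown fox")
```

With debugQuery=true the explain output shows each field's contribution, which is how you would calibrate the boost.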
Re: Searching and highlighting tens of fields
Right, it works! I was not aware of this functionality, or that it can be customized with the hl.requireFieldMatch param. Thanks
Searching and highlighting tens of fields
Hello, I need to expose search and highlighting capabilities over a few tens of fields. The edismax qf param makes it possible, but the performance of searching tens of words over tens of fields is problematic. I made a copyField (indexed, not stored) of these fields, which gives much better search performance but does not allow highlighting the original fields, which are stored. Is there any way of searching this copyField and highlighting other fields with any of the highlight components? BTW, I need to keep the field structure, so storing the copyField is not an alternative.
Re: Searching and highlighting tens of fields
Currently I use the classic highlighter, but I can change my postings format in order to work with another highlighting component if that leads to any solution.
Re: Searching and highlighting tens of fields
The slowdown occurs during search, not highlighting. Running a disjunctive query with 50 terms over 20 different posting lists is a hard task - harder than searching these 50 terms on a single (larger) posting list, as in the copyField case. With the edismax qf param, sure, hl.fl=* works as it should. In the copyField case it does not, as it is a non-stored field. There are no highlights on non-stored fields AFAIK. Is there a way to search the global copyField but highlight the original stored fields? On Wed, Jul 30, 2014 at 5:54 PM, Erick Erickson erickerick...@gmail.com wrote: Doesn't hl.fl work in this case? Or is highlighting the 10 fields the slowdown? Best, Erick On Wed, Jul 30, 2014 at 2:55 AM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Currently I use the classic highlighter, but I can change my postings format in order to work with another highlighting component if that leads to any solution
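One parameter combination worth trying is to query the copyField while pointing hl.fl at the stored originals with hl.requireFieldMatch=false. This is only a sketch - whether a given highlighter accepts it for a non-stored query field is exactly the open question in this thread, and the field names (`all_text`, `title`, `body`) are invented:

```python
# Sketch: search the aggregated copyField, ask the highlighter to work on the
# stored original fields instead. hl.requireFieldMatch=false permits
# highlighting fields other than the one the query matched.

def search_and_highlight(query: str) -> dict:
    return {
        "q": f"all_text:({query})",       # all_text: the indexed-only copyField
        "hl": "true",
        "hl.fl": "title,body",            # the stored originals to highlight
        "hl.requireFieldMatch": "false",
    }

params = search_and_highlight("solr highlighting")
```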
OCR - Saving multi-term position
Hello, Many of our indexed documents are scanned and OCR'ed documents. Unfortunately we were not able to improve the OCR quality much (less than 80% word accuracy) for various reasons, a fact which badly hurts retrieval quality. As we use an open-source OCR engine, we are thinking of expanding every scanned term in the output into its main possible variations to get a higher level of confidence. Is there any analyser that supports this kind of need, or should I make up a syntax and analyser of my own, e.g. the payload syntax? The quick brown fox -- The|1 Tlne|1 quick|2 quiok|2 browm|3 brown|3 fox|4 Thanks, Manuel
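The "variants at the same position" idea above can be modeled as a token stream of (term, position_increment) pairs, where increment 0 stacks a variant on the previous token's position - the same mechanism Lucene's SynonymFilter uses. A minimal sketch, with invented variant lists:

```python
# Sketch of stacking OCR variants at the same token position. A position
# increment of 0 places a token at the same position as the previous one,
# which keeps phrase/positional queries working across variants.

def expand_with_variants(words, variants):
    out = []
    for word in words:
        out.append((word, 1))            # the OCR'ed term advances the position
        for alt in variants.get(word, []):
            out.append((alt, 0))         # candidate variant, same position
    return out

stream = expand_with_variants(
    ["Tlne", "quiok", "browm", "fox"],
    {"Tlne": ["The"], "quiok": ["quick"], "browm": ["brown"]},
)
```

Resolving the increments reproduces the position numbering from the example in the post (The|1 Tlne|1 quick|2 quiok|2 ...).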
Re: OCR - Saving multi-term position
Thanks for your answers Erick and Michael. The term confidence level is an OCR output metric which tells, for every word, the odds that it is the actual scanned term. I wish the OCR program to output all the suspected words that sum up to above ~90% confidence of being the actual term, instead of outputting a single word as the default behaviour. I'm happy to hear this approach has been used before; I will implement an analyser that indexes these terms at the same position to enable positional queries. Hope it works out well. In case it does I will open a Jira ticket for it. If anyone else has had experience with this use case I'd love to hear about it, Manuel On Wed, Jul 2, 2014 at 7:28 PM, Erick Erickson erickerick...@gmail.com wrote: Problem here is that you wind up with a zillion unique terms in your index, which may lead to performance issues, but you probably already know that :). I've seen situations where running it through a dictionary helps. That is, does each term in the OCR match some dictionary? Problem here is that it then de-values terms that don't happen to be in the dictionary - names, for instance. But to answer your question: no, there really isn't a pre-built analysis chain that I know of that does this. The root issue is how to assign confidence? No clue for your specific domain. So payloads seem quite reasonable here. It happens there's a recent end-to-end example, see: http://searchhub.org/2014/06/13/end-to-end-payload-example-in-solr/ Best, Erick On Wed, Jul 2, 2014 at 7:58 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote: I don't have first-hand knowledge of how you implement that, but I bet a look at the WordDelimiterFilter would help you understand how to emit multiple terms with the same positions pretty easily. I've heard of this bag-of-word-variants approach to indexing poor-quality OCR output before, for findability reasons, and I heard it works out OK. Michael Della Bitta Applications Developer o: +1 646 532 3062 appinions inc.
Re: Compression vs FieldCache for doc ids retrieval
Is the issue SOLR-5478 what you were looking for?
Re: Application of different stemmers / stopword lists within a single field
Why not take advantage of your use case - the chars belong to different char classes. You can index this field into a single Solr field (no copyField) and apply an analysis chain that includes both languages' analysis - stopwords, stemmers etc. As every filter should apply only to its specific language (e.g. an Arabic stemmer should not stem a Latin word), you can perform cross-language search on this single field. On Mon, Apr 28, 2014 at 5:59 AM, Alexandre Rafalovitch arafa...@gmail.com wrote: If you can throw money at the problem: http://www.basistech.com/text-analytics/rosette/language-identifier/ . The Language Boundary Locator at the bottom of the page seems to be part/all of your solution. Otherwise, specifically for English and Arabic, you could play with Unicode ranges to try detecting text blocks: 1) Create an UpdateRequestProcessor chain that a) clones text into field_EN and field_AR, b) applies regular expression transformations that strip the English or Arabic Unicode text range correspondingly, so field_EN only has English characters left, etc. Of course, you need to decide what you want to do with occasional EN or neutral characters happening in the middle of Arabic text (numbers: Arabic or Indic? brackets, dashes, etc). But if you just index text, it might be ok even if it is not perfect. c) deletes empty fields, just in case not all of them have mixed language. 2) Use eDismax to search over both fields, each with its own processor. Regards, Alex. Personal website: http://www.outerthoughts.com/ Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency On Fri, Apr 25, 2014 at 5:34 PM, Timothy Hill timothy.d.h...@gmail.com wrote: This may not be a practically solvable problem, but the company I work for has a large number of lengthy mixed-language documents - for example, scholarly articles about Islam written in English but containing lengthy passages of Arabic.
Ideally, we would like users to be able to search both the English and Arabic portions of the text, using the full complement of language-processing tools such as stemming and stopword removal. The problem, of course, is that these two languages co-occur in the same field. Is there any way to apply different processing to different words or paragraphs within a single field through language detection? Is this to all intents and purposes impossible within Solr? Or is another approach (using language detection to split the single large field into language-differentiated smaller fields, for example) possible/recommended? Thanks, Tim Hill
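The Unicode-range splitting step suggested above (clone, then strip the other script) can be sketched with two regexes. This is a rough illustration: the Arabic block U+0600-U+06FF is only the core range, and real text may also need the Arabic Supplement and presentation-form blocks.

```python
import re

# Sketch of splitting mixed English/Arabic text by Unicode range, so each
# cloned field keeps only one script (the update-processor approach above).
ARABIC = re.compile(r"[\u0600-\u06FF]+")
LATIN = re.compile(r"[A-Za-z]+")

def split_by_script(text: str) -> dict:
    return {
        "field_EN": " ".join(LATIN.findall(text)),
        "field_AR": " ".join(ARABIC.findall(text)),
    }

fields = split_by_script("the history of العربية script")
```

Numbers, punctuation and other neutral characters are simply dropped here; a production processor would have to decide which clone gets them.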
Indexing useful N-grams and adding payloads
Hi, I have a performance and scoring problem with phrase queries: 1. Performance - phrase queries involving frequent terms are very slow due to the reading of large positional posting lists. 2. Scoring - I want to control the boost of phrase and entity (gazetteer) matches. Indexing all terms as both bi-grams and unigrams is out of the question in my use case, so I plan on indexing only the useful bi-grams. Part of this will be achieved by the CommonGrams filter, in which I put the frequent words, but I'm thinking of going one step further and also indexing every phrase query I have extracted from my query log and every entity from my gazetteers. To the latter (which are N-grams) I will also add a payload to control the boost. An example MappingCharFilter.txt would be: #phrase-query term1 term2 term3 => term1_term2_term3|1 #entity firstName lastName => firstName_lastName|2 One of the issues is that I have 100k-1M (depending on frequency) phrases/entities as above. I saw that MappingCharFilter is implemented as an FST, but I'm still concerned that iterating over the charBuffer for long documents might cause problems. Has anyone faced a similar issue? Is this mapping implementation reasonable performance-wise at query time? Thanks in advance, Manuel
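The mapping described above can be sketched as plain string rewriting: known phrases and entities collapse to single underscore-joined tokens carrying a payload boost after `|`. Note this longest-match-first loop is only a naive stand-in; the real MappingCharFilter resolves matches through its FST, not by repeated `replace` passes.

```python
# Sketch of the MappingCharFilter-style rewriting from the post: a phrase
# becomes one token so it hits a single posting list, and the |N suffix is a
# payload used later for boosting. Mappings here are the post's examples.

def apply_phrase_mappings(text: str, mappings: dict) -> str:
    # longest phrase first, so "term1 term2 term3" wins over "term1 term2"
    for phrase, boost in sorted(mappings.items(), key=lambda kv: -len(kv[0])):
        text = text.replace(phrase, phrase.replace(" ", "_") + f"|{boost}")
    return text

out = apply_phrase_mappings(
    "saw firstName lastName downtown",
    {"firstName lastName": 2, "term1 term2 term3": 1},
)
```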
Using payloads for expanded query terms
Hello, I'm trying to handle a situation with taxonomy search - that is, for each taxonomy I have a list of words with their boosts. These taxonomies are updated frequently, so I retrieve these scored lists at query time from an external service. My expectation would be: q={!some_query_parser}Cities_France OR Cities_England => q=max(Paris^0.5 Lyon^0.4 La Defense^0.3) OR max(London^0.5 Oxford^4) Implementation possibilities I thought about: 1. An adapted synonym filter, where query term boosts are encoded as payloads. 2. A query parser that handles the term expansion and weighting. The main drawback is that it forces me to stick to my own query parser. 3. Building the query outside Solr. What would you recommend? Thanks, Manuel
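Option 3 above (building the query outside Solr) is straightforward to sketch: expand each taxonomy term into a boosted disjunction fetched from the external service. The `max(...)` syntax mirrors the expectation written in the post, and the taxonomy contents are invented examples.

```python
# Sketch of client-side taxonomy expansion: each taxonomy name is replaced by
# a max() clause over its scored word list; unknown terms pass through.

def expand_taxonomies(terms, taxonomies):
    clauses = []
    for term in terms:
        scored = taxonomies.get(term)
        if scored:
            inner = " ".join(f"{word}^{boost}" for word, boost in scored)
            clauses.append(f"max({inner})")
        else:
            clauses.append(term)
    return " OR ".join(clauses)

q = expand_taxonomies(
    ["Cities_France", "Cities_England"],
    {"Cities_France": [("Paris", 0.5), ("Lyon", 0.4)],
     "Cities_England": [("London", 0.5), ("Oxford", 0.4)]},
)
```

The tradeoff versus options 1 and 2 is that the expansion logic lives in the client, so no custom Solr component has to be maintained, at the cost of larger query strings over the wire.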
Re: Solr 4.6.0: DocValues (distributed search)
In short, when running a distributed search every shard runs the query separately. Each shard's collector returns the topN (rows param) internal docIds of the matching documents. These topN docIds are converted to their uniqueKeys in the BinaryResponseWriter and sent to the frontend core (the one that received the query). This conversion is implemented by a StoredFieldVisitor, meaning the uniqueKeys are read from their stored field and not from docValues. As in our use case we have a high rows param, these conversions became a performance bottleneck. We implemented a user cache that stores the shard's uniqueKey docValues, i.e. a [docId, uniqueKey] mapping. This eliminates the need to access the stored field for these frequent conversions. You can have a look at the patch; feel free to comment: https://issues.apache.org/jira/browse/SOLR-5478 Best, Manuel On Thu, Jan 9, 2014 at 7:33 PM, ku3ia dem...@gmail.com wrote: Today I set up a simple SolrCloud with two shards. Seems the same. When I'm debugging a distributed search I can't catch a break-point at the Lucene codec file, but when I'm using faceted search everything looks fine - the debugger stops. Can anyone help me with my question? Thanks.
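The caching idea behind the patch can be sketched in a few lines: memoize the docId-to-uniqueKey mapping so the response writer stops visiting stored fields for every conversion. The reader function here stands in for a docValues lookup; names are illustrative, not the patch's actual classes.

```python
# Sketch of a [docId -> uniqueKey] user cache in the spirit of SOLR-5478.
# The first lookup reads through (e.g. from docValues); repeats are free.

class UniqueKeyCache:
    def __init__(self, read_unique_key):
        self._read = read_unique_key   # docId -> uniqueKey, e.g. a docValues read
        self._cache = {}
        self.misses = 0

    def get(self, doc_id):
        if doc_id not in self._cache:
            self._cache[doc_id] = self._read(doc_id)
            self.misses += 1
        return self._cache[doc_id]

keys = UniqueKeyCache(lambda d: f"doc-{d}")
```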
Sudden Solr crash after commit
In the last few days one of my Tomcat servlets, running only a Solr instance, crashed unexpectedly twice. Low memory usage, nothing written in the Tomcat log, and the last thing happening in the Solr log is 'end_commit_flush' followed by 'UnInverted multi-valued field' for the fields faceted during the newSearcher run. Right after this, Tomcat crashed leaving no trace. Has anyone experienced a similar issue before? Thanks, Manu
Re: Updating shard range in Zookeeper
The ZooKeeper client for Eclipse is the tool you're looking for. You can edit the clusterstate directly. http://www.massedynamic.org/mediawiki/index.php?title=Eclipse_Plug-in_for_ZooKeeper Another option is using the bundled zkcli script (distributed with Solr 4.5 and above) to upload a new clusterstate with a new shard range. Good luck
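For sanity-checking hand-edited ranges, the way Solr's compositeId router splits the signed 32-bit hash ring into contiguous shard ranges can be approximated like this. A sketch only: the hex formatting may differ from what clusterstate.json shows (Solr prints e.g. "0-7fffffff" without leading zeros).

```python
# Sketch: split the 32-bit hash ring into num_shards contiguous,
# non-overlapping ranges, printed as the unsigned hex used in clusterstate.

def shard_ranges(num_shards: int) -> list:
    step = (1 << 32) // num_shards
    ranges, lo = [], -(1 << 31)
    for i in range(num_shards):
        hi = (1 << 31) - 1 if i == num_shards - 1 else lo + step - 1
        ranges.append(f"{lo & 0xffffffff:08x}-{hi & 0xffffffff:08x}")
        lo = hi + 1
    return ranges
```

Comparing the output against an edited clusterstate is a quick way to confirm the ranges still cover the whole ring without overlap.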
Re: Sudden Solr crash after commit
Running Solr 4.3, sharded collection, Tomcat 7.0.39. Faceting on multivalued fields works perfectly fine; I was describing this log to emphasize that the servlet failed right after a new searcher was opened and the event listener finished running a warming faceting query.
Re: Bad fieldNorm when using morphologic synonyms
In order to set discountOverlaps to true you must have added <similarity class="solr.DefaultSimilarityFactory"/> to the schema.xml, which is commented out by default! As this param is false by default, the above situation is expected with correct positioning, as said. In order to fix the field norms you'd have to reindex with the similarity class, which initializes the param to true. Cheers, Manu
Re: Bad fieldNorm when using morphologic synonyms
Robert, your last reply is not accurate. It's true that the field norms and termVectors are independent. But this issue of higher norms in this case is expected with well-assigned positions. The lengthNorm is computed from FieldInvertState.length, which is the count of incrementToken() calls and not the number of positions! This is the case for WordDelimiterFilter or ReversedWildcardFilter, which do change the norm when expanding a term.
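The norm difference being discussed can be sketched numerically. A hedged model: FieldInvertState.length counts every emitted token, unless discountOverlaps is true, in which case position-increment-0 tokens (stacked synonyms) are excluded; DefaultSimilarity then encodes the length as 1/sqrt(length). The token example is invented.

```python
import math

# Sketch of lengthNorm with and without discountOverlaps. tokens are
# (term, position_increment) pairs; increment 0 marks a stacked overlap.

def length_norm(tokens, discount_overlaps: bool) -> float:
    if discount_overlaps:
        length = sum(1 for _, inc in tokens if inc > 0)
    else:
        length = len(tokens)
    return 1.0 / math.sqrt(length)

# "dogs" with a morphologic synonym "dog" stacked at the same position
tokens = [("the", 1), ("dogs", 1), ("dog", 0), ("bark", 1)]
```

With overlaps counted the field looks longer, so the norm (and thus the score of matches in it) is lower, which is the "bad fieldNorm" symptom in this thread.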
Re: distributed search is significantly slower than direct search
https://issues.apache.org/jira/browse/SOLR-5478 There it goes. On Mon, Nov 18, 2013 at 5:44 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Sure, I am out of office till the end of the week. I'll reply after I upload the patch
Re: distributed search is significantly slower than direct search
Sure, I am out of office till the end of the week. I'll reply after I upload the patch.
Re: distributed search is significantly slower than direct search
In order to accelerate BinaryResponseWriter.write we extended this writer class to implement the docId-to-id transformation via docValues (in memory), with no need to access the stored field for id reading, nor lazy loading of fields, which also has a cost. That should improve the read rate, as docValues are sequential, and should avoid disk IO. This docValues implementation is accessed during both query stages (as mentioned above) in case you ask for ids only, or only once, during the distributed search stage, in case you intend to ask for stored fields other than id. We just started testing it for performance. I would love to hear any opinions or performance tests of this implementation. Manu
Re: distributed search is significantly slower than direct search
It's surprising such a query takes so long; I would assume that after consistently trying q=*:* you should be getting cache hits and times should be faster. Try to see in the admin UI how your query/doc caches perform. Moreover, the query in itself just asks for the first 5000 docs that were indexed (returning the first [docid]s), so it seems all this time is wasted on transfer. Out of these 7 secs, how much is spent in the above method? What do you return by default? How big is every doc you display in your results? It might be that both collections work on the same resources. Try elaborating your use case. Anyway, it seems like you just made a test to see what the performance hit would be in a distributed environment, so I'll try to explain some things we encountered in our benchmarks, with a case that at least resembles yours in the number of docs fetched. We reclaim 2000 docs on every query, running over 40 shards. This means every shard actually transfers 2000 docs to our frontend on every document-match request (the first one you were referring to). Even if lazily loaded, reading 2000 ids (on 40 servers) and lazy-loading the fields is a tough job. Waiting for the slowest shard to respond, then sorting the docs and reloading (lazily or not) the top 2000 docs might take a long time. Our times are 4-8 secs, but it's not possible to compare the cases directly. We've done a few steps that improved it along the way, steps that led to others. These were our starters: 1. Profile these queries from different servers and Solr instances; try putting your finger on which collection is working hard and why. Check if you're stuck on components that don't add value for you but are used by default. 2. Consider eliminating the doc cache. It loads lots of (partly) lazy documents whose probability of secondary usage is low. There's no such thing as popular docs when requesting so many docs. You may be using your memory in a better way. 3.
Bottleneck check - inner server metrics such as cpu user / iowait, packets transferred over the network, page faults etc. are excellent for understanding whether the disk/network/cpu is slowing you down. Then upgrade hardware on one of the shards to check if it helps, comparing the upgraded shard's qTime to the others. 4. Warm up the index after committing - try to benchmark how queries perform before and after some warm-up, let's say a few hundred queries (from your previous system), in order to warm up the OS cache (assuming you're using NRTCachingDirectoryFactory). Good luck, Manu On Wed, Nov 13, 2013 at 2:38 PM, Erick Erickson erickerick...@gmail.com wrote: One thing you can try, and this is more diagnostic than a cure, is to return just the id field (and ensure that lazy field loading is true). That'll tell you whether the issue is actually fetching the document off disk and decompressing, although frankly that's unlikely since you can get your 5,000 rows from a single machine quickly. The code you found where Solr is spending its time, is that on the routing core or on the shards? I actually have a hard time understanding how that code could take a long time, doesn't seem right. You are transferring 5,000 docs across the network, so it's possible that your network is just slow, that's certainly a difference between the local and remote case, but that's a stab in the dark. Not much help I know, Erick On Wed, Nov 13, 2013 at 2:52 AM, Elran Dvir elr...@checkpoint.com wrote: Erick, Thanks for your response. We are upgrading our system using Solr. We need to preserve old functionality. Our client displays 5K documents and groups them. Is there a way to refactor code in order to improve distributed document fetching? Thanks. -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Wednesday, October 30, 2013 3:17 AM To: solr-user@lucene.apache.org Subject: Re: distributed search is significantly slower than direct search You can't.
There will inevitably be some overhead in the distributed case. That said, 7 seconds is quite long. 5,000 rows is excessive, and probably where your issue is. You're having to go out and fetch the docs across the wire. Perhaps there is some batching that could be done there; I don't know whether this is one document per request or not. Why 5K docs? Best, Erick On Tue, Oct 29, 2013 at 2:54 AM, Elran Dvir elr...@checkpoint.com wrote: Hi all, I am using Solr 4.4 with multiple cores. One core (called template) is my routing core. When I run http://127.0.0.1:8983/solr/template/select?rows=5000&q=*:*&shards=127.0.0.1:8983/solr/core1, it consistently takes about 7s. When I run http://127.0.0.1:8983/solr/core1/select?rows=5000&q=*:*, it consistently takes about 40ms. I profiled the distributed query. This is the distributed query process (I hope the terms
Basic query process question with fl=id
Hi, Any distributed lookup is basically composed of two stages: the first collects all the matching documents from every shard, and the second fetches additional information about specific ids (i.e. stored fields, termVectors). This can be seen in the logs of each shard (isShard=true), where the first request logs the number of hits received for the query by the specific shard and the second contains the ids field (ids=...) for the additional fetch. At the end of both I get the total QTime of the query and the total number of hits. My question is about the case where only ids are requested (fl=id). This query should make only one request against a shard, while it actually does both of them. It looks like the response builder has to go through these two stages no matter what kind of query it is. My questions: 1. Is it normal that the response builder has to go through both stages? 2. Does the first request get internal Lucene docIds or the actual uniqueKey ids? 3. In a query as above (fl=id), where is the id read from? Is it fetched from the stored file, or the docValues file if it exists? Because if fetched from the stored file, a high rows param (say 1000 in my case) would need 1000 lookups, which could badly hurt performance. Thanks, Manuel
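The two stages described above can be put into a toy model: stage 1 gathers per-shard top docs by score and merges them globally; stage 2 goes back only for the winners (the ids=... request). Shard contents and scores below are invented, and real Solr merges by the full sort spec, not a bare score.

```python
# Toy model of a two-stage distributed lookup: per-shard top-N collection,
# global merge, then an id fetch restricted to the merged winners.

def distributed_search(shards, rows):
    # Stage 1: each shard returns its top `rows` (score, shard, docId) hits.
    stage1 = []
    for name, docs in shards.items():
        top = sorted(docs.items(), key=lambda kv: -kv[1])[:rows]
        stage1 += [(score, name, doc_id) for doc_id, score in top]
    merged = sorted(stage1, reverse=True)[:rows]
    # Stage 2: fetch ids (or stored fields) only for the global winners.
    return [f"{shard}:{doc_id}" for _, shard, doc_id in merged]

hits = distributed_search(
    {"shard1": {1: 0.9, 2: 0.2}, "shard2": {1: 0.7, 3: 0.4}}, rows=2)
```

The point of the question in this thread is whether stage 2 could be skipped entirely when fl=id: in this model it obviously could, since stage 1 already identifies the winners.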
Re: Profiling Solr Lucene for query
I tried my last proposition; editing the clusterstate.json to add a dummy frontend shard seems to work. I made sure the ranges were not overlapping. Doesn't that resolve the SolrCloud issue as specified above?
Re: Profiling Solr Lucene for query
Would adding a dummy shard instead of a dummy collection resolve the situation? E.g. editing clusterstate.json from a ZooKeeper client and adding a shard with a 0-range so no docs are routed to this core. This core would be on a separate server and act as the collection gateway.
Re: Profiling Solr Lucene for query
Dmitry - currently we don't have such a frontend; creating one sounds like a good idea. And yes, we do query all 36 shards on every query. Mikhail - I do think 1 minute is enough data, as during this exact minute I had a single query running (that took a qtime of 1 minute). I wanted to isolate these hard queries. I repeated this profiling a few times. I think I will take the termInterval from 128 down to 32 and check the results. I'm currently using NRTCachingDirectoryFactory. On Mon, Sep 9, 2013 at 11:29 PM, Dmitry Kan solrexp...@gmail.com wrote: Hi Manuel, The frontend solr instance is the one that does not have its own index and is doing the merging of the results. Is this the case? If yes, are all 36 shards always queried? Dmitry On Mon, Sep 9, 2013 at 10:11 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Hi Dmitry, I have solr 4.3 and every query is distributed and merged back for ranking purposes. What do you mean by frontend solr? On Mon, Sep 9, 2013 at 2:12 PM, Dmitry Kan solrexp...@gmail.com wrote: are you querying your shards via a frontend solr? We have noticed that querying becomes much faster if results merging can be avoided. Dmitry On Sun, Sep 8, 2013 at 6:56 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Hello all, Looking at the 10% slowest queries, I get very bad performance (~60 sec per query). These queries have lots of conditions on my main field (more than a hundred), including phrase queries, and rows=1000. I do return only id's though. I can quite firmly say that this bad performance is due to a slow storage issue (which is beyond my control for now). Despite this I want to improve my performances.
As taught in school, I started profiling these queries, and the data of a ~1 minute profile is located here: http://picpaste.com/pics/IMG_20130908_132441-ZyrfXeTY.1378637843.jpg Main observation: most of the time I wait for readVInt, whose stack trace (2 out of 2 thread dumps) is: catalina-exec-3870 - Thread t@6615 java.lang.Thread.State: RUNNABLE at org.apache.lucene.store.DataInput.readVInt(DataInput.java:108) at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock(BlockTreeTermsReader.java:2357) at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BlockTreeTermsReader.java:1745) at org.apache.lucene.index.TermContext.build(TermContext.java:95) at org.apache.lucene.search.PhraseQuery$PhraseWeight.<init>(PhraseQuery.java:221) at org.apache.lucene.search.PhraseQuery.createWeight(PhraseQuery.java:326) at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183) at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384) at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183) at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384) at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183) at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384) at org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:675) at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297) So I do actually wait for IO as expected, but I might be page faulting too many times while looking for the TermBlocks (tim file), i.e. locating the term. As I am reindexing now, would it be useful to lower the termInterval (default 128)? As the FSTs (tip files) are small (a few 10-100 MB) so there are no memory contentions, could I lower this param down to 8, for example?
The benefit of lowering the term interval would be to force the FST into memory (JVM - thanks to the NRTCachingDirectory), as I do not control the term dictionary file (OS caching loads an average of 6% of it). General configs: solr 4.3, 36 shards, each with a few million docs. These 36 servers (each server has 2 replicas) are running virtual, 16GB memory each (4GB for JVM, 12GB remain for OS caching), consuming 260GB of disk mounted for the index files.
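To put rough numbers on the termInterval tradeoff discussed above: the term index keeps approximately one entry per interval, so lowering the interval multiplies the index size proportionally. A back-of-the-envelope sketch, ignoring FST prefix compression (which makes the real growth smaller):

```python
# Rough model: one indexed term every `term_interval` terms (ceiling
# division), so going from 128 to 8 grows the term index about 16x.

def term_index_entries(num_terms: int, term_interval: int) -> int:
    return -(-num_terms // term_interval)  # ceil(num_terms / term_interval)

default = term_index_entries(10_000_000, 128)
aggressive = term_index_entries(10_000_000, 8)
```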
Re: Expunge deleting using excessive transient disk space
I can only agree with the 50% free space recommendation. Unfortunately I do not have that at the current time; I'm standing on 10% free disk (out of 300GB on each server). I'm aware it is very low. Does it seem reasonable to adapt the current merge policy (or write a new one) so that it frees the transient disk space after every merge instead of waiting for all of them to finish? Where can I get such an answer (from the people who wrote the code)? Thanks On Sun, Sep 8, 2013 at 9:30 PM, Erick Erickson erickerick...@gmail.com wrote: Right, but you should have at least as much free space as your total index size, and I don't see the total index size (but I'm just glancing). I'm not entirely sure you can precisely calculate the maximum free space you have relative to the amount needed for merging; some of the people who wrote that code can probably tell you more. I'd _really_ try to get more disk space. The amount of engineer time spent trying to tune this is way more expensive than a disk... Best, Erick On Sun, Sep 8, 2013 at 11:51 AM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Hi, In order to delete part of my index I ran a delete-by-query that intends to erase 15% of the docs. I added these params to the solrconfig.xml: <mergePolicy class="org.apache.lucene.index.TieredMergePolicy"> <int name="maxMergeAtOnce">2</int> <int name="maxMergeAtOnceExplicit">2</int> <double name="maxMergedSegmentMB">5000.0</double> <double name="reclaimDeletesWeight">10.0</double> <double name="segmentsPerTier">15.0</double> </mergePolicy> The extra params were added in order to promote merging of old segments but with a restriction on the transient disk that can be used (as I have only 15GB per shard). This procedure failed with a "no space left on device" exception, although proper calculations show that these params should cause no usage excess of the transient free disk space I have.
Looking at the infostream I can see that the first merges do succeed, but older segments are kept referenced and thus cannot be deleted until all the merging is done. Is there any way of overcoming this?
Expunge deleting using excessive transient disk space
Hi, In order to delete part of my index I ran a delete-by-query that intends to erase 15% of the docs. I added these params to the solrconfig.xml: <mergePolicy class="org.apache.lucene.index.TieredMergePolicy"> <int name="maxMergeAtOnce">2</int> <int name="maxMergeAtOnceExplicit">2</int> <double name="maxMergedSegmentMB">5000.0</double> <double name="reclaimDeletesWeight">10.0</double> <double name="segmentsPerTier">15.0</double> </mergePolicy> The extra params were added in order to promote merging of old segments but with a restriction on the transient disk that can be used (as I have only 15GB per shard). This procedure failed with a "no space left on device" exception, although proper calculations show that these params should cause no usage excess of the transient free disk space I have. Looking at the infostream I can see that the first merges do succeed, but older segments are kept referenced and thus cannot be deleted until all the merging is done. Is there any way of overcoming this?
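A rough model of why the transient space blows up: while a merge runs, the input segments and the new segment coexist on disk, and if old segments stay referenced (by an open reader or a pending chain of merges) until everything finishes, the per-merge overheads accumulate instead of being reclaimed one at a time. A hedged back-of-the-envelope sketch, not an exact formula:

```python
# Sketch: each merge writes roughly sum(inputs) MB of new data before the
# inputs can be deleted. If deletion is deferred until all merges complete
# (the referenced-segments situation described above), overheads add up.

def transient_overhead_mb(merge_batches) -> float:
    return sum(sum(batch) for batch in merge_batches)

# three 2-way merges of 5000 MB segments, all held until the end
overhead = transient_overhead_mb([[5000, 5000]] * 3)
```

Under this model, with only 15GB of transient space per shard, a handful of held 5000 MB merges is already enough to hit "no space left on device".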
Profiling Solr Lucene for query
Hello all, Looking at the 10% slowest queries, I get very bad performance (~60 sec per query). These queries have lots of conditions on my main field (more than a hundred), including phrase queries, and rows=1000. I do return only ids though. I can quite firmly say this bad performance is due to a slow storage issue (beyond my control for now). Despite this I want to improve my performance. As taught in school, I started profiling these queries; the data of a ~1 minute profile is located here: http://picpaste.com/pics/IMG_20130908_132441-ZyrfXeTY.1378637843.jpg

Main observation: most of the time I wait for readVInt, whose stacktrace (2 out of 2 thread dumps) is:

catalina-exec-3870 - Thread t@6615
java.lang.Thread.State: RUNNABLE
at org.apache.lucene.store.DataInput.readVInt(DataInput.java:108)
at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnumFrame.loadBlock(BlockTreeTermsReader.java:2357)
at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BlockTreeTermsReader.java:1745)
at org.apache.lucene.index.TermContext.build(TermContext.java:95)
at org.apache.lucene.search.PhraseQuery$PhraseWeight.<init>(PhraseQuery.java:221)
at org.apache.lucene.search.PhraseQuery.createWeight(PhraseQuery.java:326)
at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
at org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:675)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)

So I do actually wait for IO as expected, but I might be page faulting too many times while looking for the TermBlocks (tim file), i.e. locating the term. As I reindex now, would it be useful to lower the termInterval (default 128)? As the FSTs (tip files) are small (a few 10-100 MB) there is no memory contention; could I lower this param to 8, for example? The benefit of lowering the term interval would be to force the FST into memory (JVM - thanks to the NRTCachingDirectory), as I do not control the term dictionary file (OS caching loads an average of 6% of it).

General configs: Solr 4.3, 36 shards, each with a few million docs. These 36 servers (each server has 2 replicas) run virtual, 16GB memory each (4GB for the JVM, 12GB left for OS caching), consuming 260GB of disk mounted for the index files.
Wrong leader election leads to shard removal
Hello, My solr cluster runs on RH Linux with a tomcat7 servlet container. numShards=40, replicationFactor=2, 40 servers, each with 2 replicas. Solr 4.3. For experimental reasons I split my cluster into 2 sub-clusters, each containing a single replica of each shard. When connecting these sub-clusters back, the sync failed (more than 100 docs indexed per shard) so a replication process started on sub-cluster #2. Due to the transient storage needed for the replication process, I removed all the index from sub-cluster #2 before connecting it back, then I connected sub-cluster #2's servers in 3-4 bulks to avoid high disk load. The first bulk replications worked well, but after a while an internal script pkilled all the solr instances, some while replicating. After starting the servlet back up I discovered the disaster - on some of the replicas that were in a replicating stage there was a wrong zookeeper leader election - good-state replicas (sub-cluster 1) replicated from empty replicas (sub-cluster 2), ending up removing all documents in these shards!! These are the logs from solr-prod32 (sub-cluster #2 - bad state) - shard1_replica1 is elected leader although it was not the leader before the replication process (and shouldn't have the higher version number): 2013-08-13 13:39:15.838 [INFO ] org.apache.solr.cloud.ShardLeaderElectionContext Enough replicas found to continue.
2013-08-13 13:39:15.838 [INFO ] org.apache.solr.cloud.ShardLeaderElectionContext I may be the new leader - try and sync
2013-08-13 13:39:15.839 [INFO ] org.apache.solr.cloud.SyncStrategy Sync replicas to http://solr-prod32:5050/solr/raw_shard1_replica1/
2013-08-13 13:39:15.841 [INFO ] org.apache.solr.client.solrj.impl.HttpClientUtil Creating new http client, config:maxConnectionsPerHost=20&maxConnections=1&connTimeout=3&socketTimeout=3&retry=false
2013-08-13 13:39:15.844 [INFO ] org.apache.solr.update.PeerSync PeerSync: core=raw_shard1_replica1 url=http://solr-prod32:8080/solr START replicas=[http://solr-prod02:5080/solr/raw_shard1_replica2/] nUpdates=100
2013-08-13 13:39:15.847 [INFO ] org.apache.solr.update.PeerSync PeerSync: core=raw_shard1_replica1 url=http://solr-prod32:8080/solr DONE. We have no versions. sync failed.
2013-08-13 13:39:15.847 [INFO ] org.apache.solr.cloud.SyncStrategy Leader's attempt to sync with shard failed, moving to the next candidate
2013-08-13 13:39:15.847 [INFO ] org.apache.solr.cloud.ShardLeaderElectionContext We failed sync, but we have no versions - we can't sync in that case - we were active before, so become leader anyway
2013-08-13 13:39:15.847 [INFO ] org.apache.solr.cloud.ShardLeaderElectionContext I am the new leader: http://solr-prod32:8080/solr/raw_shard1_replica1/
2013-08-13 13:39:15.847 [INFO ] org.apache.solr.common.cloud.SolrZkClient makePath: /collections/raw/leaders/shard1
2013-08-13 13:39:17.423 [INFO ] org.apache.solr.common.cloud.ZkStateReader A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating...
(live nodes size: 40)

While in solr-prod02 (sub-cluster #1 - good state) I get:

2013-08-13 13:39:15.671 [INFO ] org.apache.solr.cloud.ZkController publishing core=raw_shard1_replica2 state=down
2013-08-13 13:39:15.671 [INFO ] org.apache.solr.cloud.ZkController numShards not found on descriptor - reading it from system property
2013-08-13 13:39:15.673 [INFO ] org.apache.solr.core.CoreContainer registering core: raw_shard1_replica2
2013-08-13 13:39:15.673 [INFO ] org.apache.solr.cloud.ZkController Register replica - core:raw_shard1_replica2 address: http://solr-prod02:8080/solr collection:raw shard:shard1
2013-08-13 13:39:17.423 [INFO ] org.apache.solr.common.cloud.ZkStateReader A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 40)
2013-08-13 13:39:17.480 [INFO ] org.apache.solr.cloud.ZkController We are http://solr-prod02:8080/solr/raw_shard1_replica2/ and leader is http://solr-prod32:8080/solr/raw_shard1_replica1/
2013-08-13 13:39:17.481 [INFO ] org.apache.solr.cloud.ZkController No LogReplay needed for core=raw_shard1_replica2
2013-08-13 13:39:17.481 [INFO ] org.apache.solr.cloud.ZkController Core needs to recover:raw_shard1_replica2
2013-08-13 13:39:17.481 [INFO ] org.apache.solr.update.DefaultSolrCoreState Running recovery - first canceling any ongoing recovery
2013-08-13 13:39:17.485 [INFO ] org.apache.solr.common.cloud.ZkStateReader Updating cloud state from ZooKeeper...
2013-08-13 13:39:17.485 [INFO ] org.apache.solr.cloud.RecoveryStrategy Starting recovery process. core=raw_shard1_replica2

Why was the leader elected wrongly?? Thanks
Re: Wrong leader election leads to shard removal
Does this sound like the scenario that happened: by removing the index dir from replica 2 I also removed the tlog, from which the zookeeper extracts the version of the two replicas and decides which one should be elected leader. As replica 2 had no tlog, the zk had no way to compare the 2 registered replicas, so it just picked one of the replicas arbitrarily to lead, resulting in electing empty replicas. How does the zookeeper compare the 2 tlogs to know which one is more recent? Does it not rely on the version number shown in the admin UI?

On Wed, Aug 14, 2013 at 11:00 AM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Hello, My solr cluster runs on RH Linux with a tomcat7 servlet container. numShards=40, replicationFactor=2, 40 servers, each with 2 replicas. Solr 4.3. For experimental reasons I split my cluster into 2 sub-clusters, each containing a single replica of each shard. When connecting these sub-clusters back, the sync failed (more than 100 docs indexed per shard) so a replication process started on sub-cluster #2. Due to the transient storage needed for the replication process, I removed all the index from sub-cluster #2 before connecting it back, then I connected sub-cluster #2's servers in 3-4 bulks to avoid high disk load. The first bulk replications worked well, but after a while an internal script pkilled all the solr instances, some while replicating. After starting the servlet back up I discovered the disaster - on some of the replicas that were in a replicating stage there was a wrong zookeeper leader election - good-state replicas (sub-cluster 1) replicated from empty replicas (sub-cluster 2), ending up removing all documents in these shards!!
These are the logs from solr-prod32 (sub-cluster #2 - bad state) - shard1_replica1 is elected leader although it was not the leader before the replication process (and shouldn't have the higher version number):

2013-08-13 13:39:15.838 [INFO ] org.apache.solr.cloud.ShardLeaderElectionContext Enough replicas found to continue.
2013-08-13 13:39:15.838 [INFO ] org.apache.solr.cloud.ShardLeaderElectionContext I may be the new leader - try and sync
2013-08-13 13:39:15.839 [INFO ] org.apache.solr.cloud.SyncStrategy Sync replicas to http://solr-prod32:5050/solr/raw_shard1_replica1/
2013-08-13 13:39:15.841 [INFO ] org.apache.solr.client.solrj.impl.HttpClientUtil Creating new http client, config:maxConnectionsPerHost=20&maxConnections=1&connTimeout=3&socketTimeout=3&retry=false
2013-08-13 13:39:15.844 [INFO ] org.apache.solr.update.PeerSync PeerSync: core=raw_shard1_replica1 url=http://solr-prod32:8080/solr START replicas=[http://solr-prod02:5080/solr/raw_shard1_replica2/] nUpdates=100
2013-08-13 13:39:15.847 [INFO ] org.apache.solr.update.PeerSync PeerSync: core=raw_shard1_replica1 url=http://solr-prod32:8080/solr DONE. We have no versions. sync failed.
2013-08-13 13:39:15.847 [INFO ] org.apache.solr.cloud.SyncStrategy Leader's attempt to sync with shard failed, moving to the next candidate
2013-08-13 13:39:15.847 [INFO ] org.apache.solr.cloud.ShardLeaderElectionContext We failed sync, but we have no versions - we can't sync in that case - we were active before, so become leader anyway
2013-08-13 13:39:15.847 [INFO ] org.apache.solr.cloud.ShardLeaderElectionContext I am the new leader: http://solr-prod32:8080/solr/raw_shard1_replica1/
2013-08-13 13:39:15.847 [INFO ] org.apache.solr.common.cloud.SolrZkClient makePath: /collections/raw/leaders/shard1
2013-08-13 13:39:17.423 [INFO ] org.apache.solr.common.cloud.ZkStateReader A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating...
(live nodes size: 40)

While in solr-prod02 (sub-cluster #1 - good state) I get:

2013-08-13 13:39:15.671 [INFO ] org.apache.solr.cloud.ZkController publishing core=raw_shard1_replica2 state=down
2013-08-13 13:39:15.671 [INFO ] org.apache.solr.cloud.ZkController numShards not found on descriptor - reading it from system property
2013-08-13 13:39:15.673 [INFO ] org.apache.solr.core.CoreContainer registering core: raw_shard1_replica2
2013-08-13 13:39:15.673 [INFO ] org.apache.solr.cloud.ZkController Register replica - core:raw_shard1_replica2 address: http://solr-prod02:8080/solr collection:raw shard:shard1
2013-08-13 13:39:17.423 [INFO ] org.apache.solr.common.cloud.ZkStateReader A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 40)
2013-08-13 13:39:17.480 [INFO ] org.apache.solr.cloud.ZkController We are http://solr-prod02:8080/solr/raw_shard1_replica2/ and leader is http://solr-prod32:8080/solr/raw_shard1_replica1/
2013-08-13 13:39:17.481 [INFO ] org.apache.solr.cloud.ZkController No LogReplay needed for core=raw_shard1_replica2
2013-08-13 13:39:17.481 [INFO ] org.apache.solr.cloud.ZkController Core needs
Merged segment warmer Solr 4.4
Hi, I have a machine with slow storage and insufficient RAM to hold the whole index. This causes the first queries (~5000) to be very slow (they read from disk and my CPU is mostly in iowait); after that, reads from the index become very fast and come mainly from memory, as the OS cache has cached the most used parts of the index. My concern is about new segments that are committed to disk, either merged segments or newly formed segments. My first thought was to deal with the linux caching policy (to favor caching of index files over the least frequently used uninverted files) so as to bring about the right OS caching without having to explicitly query the index for this to happen. Secondly, I thought of initiating a newSearcher event listener that queries docs inserted since the last hard commit. A new ability of Solr 4.4 (SOLR-4761) is to configure a mergedSegmentWarmer - how does this component work and is it good for my use case? Are there any other ideas for dealing with this use case? What would be your proposal as the most effective way to handle it?
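For reference, the SOLR-4761 hook is configured in solrconfig.xml's indexConfig. A minimal sketch (the warmer class is Lucene's simple built-in one; whether this helps the OS-cache side of the problem, rather than just the searcher-visible data structures, is exactly the open question above):

```xml
<!-- solrconfig.xml: warm newly merged segments before a searcher sees them,
     so the first queries against them avoid cold reads. -->
<indexConfig>
  <mergedSegmentWarmer class="org.apache.lucene.index.SimpleMergedSegmentWarmer"/>
</indexConfig>
```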
Re: SolrEntityProcessor gets slower and slower
Mingfeng - This issue gets tougher as the number of shards rises; you can read Erick Erickson's post: http://grokbase.com/t/lucene/solr-user/131p75p833/how-distributed-queries-works. If you have 100M docs I guess you are running into this issue. The common way to deal with it is by filtering on a value that returns fewer results per query, such as a creation_date field, changing the field's range on every query. For your data import use-case you might want to generate your data-import.xml with different entities, each one for another creation_date range. Thus no need for deep paging. Another option is using http://wiki.apache.org/solr/CommonQueryParameters#pageDoc_and_pageScore. Implementing it in a multi-sharded environment is, to my knowledge, not possible, as all your scores are 1.0 and thus results are ranked by shard (according to the internal [docId] of each shard). Caching all the query results in each shard (by raising the queryResultWindow) should help, wouldn't it? Best, Manu On Mon, Jun 10, 2013 at 8:56 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: SolrEntityProcessor is fine for small amounts of data but not useful for such a large index. The problem is that deep paging in search results is expensive. As the start value for a query increases, so does the cost of the query. You are much better off just re-indexing the data. On Mon, Jun 10, 2013 at 11:19 PM, Mingfeng Yang mfy...@wisewindow.com wrote: I am trying to migrate 100M documents from a solr index (v3.6) to a solrcloud index (v4.1, 4 shards) by using SolrEntityProcessor.
My data-config.xml is like:

<dataConfig>
  <document>
    <entity name="sep" processor="SolrEntityProcessor"
            url="http://10.64.35.117:8995/solr/"
            query="*:*"
            rows="2000"
            fl="author_class,authorlink,author_location_text,author_text,author,category,date,dimension,entity,id,language,md5_text,op_dimension,opinion_text,query_id,search_source,sentiment,source_domain_text,source_domain,text,textshingle,title,topic,topic_text,url"/>
  </document>
</dataConfig>

Initially, the data import rate is about 1K docs/second, but it eventually decreases to 20 docs/second after running for tens of hours. Last time I tried a data import with SolrEntityProcessor, the transfer rate was as high as 3K docs/second. Anyone have any clues what can cause the slowdown? Thanks, Ming- -- Regards, Shalin Shekhar Mangar.
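Following the range-partitioning suggestion in the reply above, a hedged sketch of what a generated data-config.xml could look like (entity names and date boundaries are hypothetical; bounding each entity's result set with a date range keeps every query's start offset shallow):

```xml
<!-- Hypothetical sketch: one entity per creation-date range so that no single
     query has to page deeply into the source index. -->
<dataConfig>
  <document>
    <entity name="sep_2013_h1" processor="SolrEntityProcessor"
            url="http://10.64.35.117:8995/solr/" rows="2000"
            query="date:[2013-01-01T00:00:00Z TO 2013-07-01T00:00:00Z}"/>
    <entity name="sep_2013_h2" processor="SolrEntityProcessor"
            url="http://10.64.35.117:8995/solr/" rows="2000"
            query="date:[2013-07-01T00:00:00Z TO 2014-01-01T00:00:00Z}"/>
  </document>
</dataConfig>
```

The `[a TO b}` syntax makes each range's upper bound exclusive, so adjacent entities do not overlap.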
Re: Regex in Stopword.xml
Use the pattern replace filter factory:

<filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement=""/>

This will do exactly what you asked for: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PatternReplaceFilterFactory On Mon, Jul 22, 2013 at 12:22 PM, Scatman alan.aron...@sfr.com wrote: Hi, I was looking for a way to put some regular expressions in the StopWord.xml, but it seems that we can only have words in the file. I'm just wondering if there is a feature which will be done in this way, or if someone has a tip it will help me a lot :) Best, Scatman. -- View this message in context: http://lucene.472066.n3.nabble.com/Regex-in-Stopword-xml-tp4079412.html Sent from the Solr - User mailing list archive at Nabble.com.
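In context, a hedged sketch of an analyzer chain using that filter (the field type name and the trailing length filter are additions; the length filter drops tokens that the replacement empties out entirely):

```xml
<fieldType name="text_az_only" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- strip every character outside a-z from each token -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement=""/>
    <!-- remove tokens emptied by the replacement -->
    <filter class="solr.LengthFilterFactory" min="1" max="512"/>
  </analyzer>
</fieldType>
```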
Re: Solr caching clarifications
Great explanation and article. Yes, that buffer for merges seems very small, yet still optimized. That's impressive.
Re: Solr caching clarifications
Alright, thanks Erick. For the question about memory usage of merges, taken from Mike McCandless' blog: "The big thing that stays in RAM is a logical int[] mapping old docIDs to new docIDs, but in more recent versions of Lucene (4.x) we use a much more efficient structure than a simple int[] ... see https://issues.apache.org/jira/browse/LUCENE-2357. How much RAM is required is mostly a function of how many documents (lots of tiny docs use more RAM than fewer huge docs)." A related clarification: as my users are not aware of the fq possibility, I was wondering how to make the best of this filter cache. Would it be efficient to implicitly transform their query into a filter query on fields that are boolean searches (date ranges etc. that do not affect the score of a document)? Is this a good practice? Is there any plugin for a query parser that does it? Inline. On Thu, Jul 11, 2013 at 8:36 AM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Hello, As a result of frequent java OOM exceptions, I am trying to investigate the solr jvm memory heap usage. Please correct me if I am mistaken; this is my understanding of the usages for the heap (per replica on a solr instance): 1. Buffers for indexing - bounded by ramBufferSize. 2. Solr caches. 3. Segment merges. 4. Miscellaneous - buffers for tlogs, servlet overhead etc. Particularly I'm concerned by Solr caches and segment merges. 1. How much memory (bytes per doc) do filterCaches (bitDocSet) and queryResultCaches (DocList) consume? I understand it is related to the skip spaces between doc ids that match (so it's not saved as a bitmap). But basically, is every id saved as a java int? Different beasts. filterCache consumes, essentially, maxDoc/8 bytes (you can get the maxDoc number from your Solr admin page), plus some overhead for storing the fq text, but that's usually not much. This is for each entry, up to size. queryResultCache is usually trivial unless you've configured it extravagantly.
It's the query string length + queryResultWindowSize integers per entry (queryResultWindowSize is from solrconfig.xml). 2. queryResultMaxDocsCached - does (for example) = 100 mean that any query resulting in more than 100 docs will not be cached (at all) in the queryResultCache? Or does it have to do with the documentCache? It's just a limit on the queryResultCache entry size as far as I can tell. But again this cache is relatively small; I'd be surprised if it used significant resources. 3. documentCache - the wiki says it should be greater than max_results*concurrent_queries. Max results is just the number of rows displayed (rows-start), right? Not the queryResultWindow. Yes. This is a cache (I think) for the _contents_ of the documents you'll be returning, to be manipulated by various components during the life of the query. 4. lazyFieldLoading=true - when querying for ids only (fl=id), will this cache be used (at the expense of evicting docs that were already loaded with stored fields)? Not sure, but I don't think this will contribute much to memory pressure. This is about how many fields are loaded to get a single value from a doc in the results list, and since one usually works with 20 or so docs, this is a small amount of memory. 5. How large is the heap used by merges? Assuming we have a merge of 10 segments of 500MB each (half inverted files - *.pos, *.doc etc., half non-inverted files - *.fdt, *.tvd), how much heap should be left unused for this merge? Again, I don't think this is much of a memory consumer, although I confess I don't know the internals. Merging is mostly about I/O. Thanks in advance, Manu But take a look at the admin page; you can see how much memory the various caches are using by looking at the plugins/stats section. Best, Erick
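A back-of-the-envelope check of the maxDoc/8 figure above, written as a hedged solrconfig.xml fragment (the cache class is a standard Solr option; the numbers in the comment are hypothetical):

```xml
<!-- Hypothetical sizing: with maxDoc = 10,000,000, each fully populated entry
     is a bitset of maxDoc/8 = 1.25 MB, so size="512" can grow to ~640 MB. -->
<filterCache class="solr.FastLRUCache" size="512" initialSize="128" autowarmCount="32"/>
```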
Solr caching clarifications
Hello, As a result of frequent java OOM exceptions, I am trying to investigate the solr jvm memory heap usage. Please correct me if I am mistaken; this is my understanding of the usages for the heap (per replica on a solr instance): 1. Buffers for indexing - bounded by ramBufferSize. 2. Solr caches. 3. Segment merges. 4. Miscellaneous - buffers for tlogs, servlet overhead etc. Particularly I'm concerned by Solr caches and segment merges. 1. How much memory (bytes per doc) do filterCaches (bitDocSet) and queryResultCaches (DocList) consume? I understand it is related to the skip spaces between doc ids that match (so it's not saved as a bitmap). But basically, is every id saved as a java int? 2. queryResultMaxDocsCached - does (for example) = 100 mean that any query resulting in more than 100 docs will not be cached (at all) in the queryResultCache? Or does it have to do with the documentCache? 3. documentCache - the wiki says it should be greater than max_results*concurrent_queries. Max results is just the number of rows displayed (rows-start), right? Not the queryResultWindow. 4. lazyFieldLoading=true - when querying for ids only (fl=id), will this cache be used (at the expense of evicting docs that were already loaded with stored fields)? 5. How large is the heap used by merges? Assuming we have a merge of 10 segments of 500MB each (half inverted files - *.pos, *.doc etc., half non-inverted files - *.fdt, *.tvd), how much heap should be left unused for this merge? Thanks in advance, Manu
Common practice for free text field
My schema contains about a hundred fields of various types (int, strings, plain text, emails). I was wondering what the common practice is for searching free text over the index. Assuming there are no boosts related to field matching, these are the options I see:

1. Index and query an all_fields copyField (source=*)
   1. advantages - only one query flow against a single index.
   2. disadvantages - the tokenizing is not necessarily adapted to every kind of field; this requires more storage and memory.
2. Field aliasing (f.myalias.qf=realfield)
   1. advantages - the opposite of the above.
   2. disadvantages - a single query term would query 100 different fields. A multi-term query might be a serious performance issue.

Any common practices?
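For option 1, a hedged schema.xml sketch (the field and type names are illustrative, not taken from the original schema):

```xml
<!-- Catch-all field for free-text search; indexed but not stored to save space. -->
<field name="all_fields" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="*" dest="all_fields"/>
```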
Re: Common practice for free text field
By field aliasing I meant something like f.all_fields.qf=*_txt+*_s+*_int, which would sum up to 100 fields. On Wed, Jun 26, 2013 at 12:00 AM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: My schema contains about a hundred fields of various types (int, strings, plain text, emails). I was wondering what the common practice is for searching free text over the index. Assuming there are no boosts related to field matching, these are the options I see: 1. Index and query an all_fields copyField (source=*): advantages - only one query flow against a single index; disadvantages - the tokenizing is not necessarily adapted to every kind of field, and it requires more storage and memory. 2. Field aliasing (f.myalias.qf=realfield): advantages - the opposite of the above; disadvantages - a single query term would query 100 different fields, and a multi-term query might be a serious performance issue. Any common practices?
Parallel queries on a single core
Hello all, Assuming I have a single shard with a single core, how do I run multi-threaded queries on Solr 4.x? Specifically, if one user sends a heavy query (a legitimate wildcard query lasting 10 sec), what happens to all other users querying during this period? If the answer is that simultaneous queries (say 2) run multi-threaded, a single CPU would switch between those two query threads, and in the case of 2 CPUs each CPU would run its own thread. But the latter case gives no advantage to repFactor > 1 performance-wise, as it's close to the same as a single replica running with 1 CPU. So I am a bit confused about this. Thanks, Manu
Avoiding OOM fatal crash
Hello again, After a heavy query on my index (returning 100K docs in a single query) my JVM heap floods and I get a Java OOM exception, and after that my GC cannot collect anything (GC overhead limit exceeded) as these memory chunks are not disposable. I want to be able to afford queries like this; my concern is that this case provokes a total Solr crash, returning a 503 Internal Server Error while trying to *index*. Is there any way to separate these two logics? I'm fine with solr not being able to return any response after this OOM, but I don't see the justification for the query flooding the JVM's internal (bounded) buffers for writes. Thanks, Manuel
Re: Avoiding OOM fatal crash
One of my users requested it; they are less aware of what's allowed, and I don't want to block them a priori for long specific requests (there are other params that might end up OOMing me). I thought of the timeAllowed restriction, but even this solution cannot guarantee that the JVM heap would not get flooded during that delay (for example, if everything is already cached and my RAM IOs are very fast). On Mon, Jun 17, 2013 at 11:47 PM, Walter Underwood wun...@wunderwood.org wrote: Don't request 100K docs in a single query. Fetch them in smaller batches. wunder On Jun 17, 2013, at 1:44 PM, Manuel Le Normand wrote: Hello again, After a heavy query on my index (returning 100K docs in a single query) my JVM heap floods and I get a Java OOM exception, and after that my GC cannot collect anything (GC overhead limit exceeded) as these memory chunks are not disposable. I want to be able to afford queries like this; my concern is that this case provokes a total Solr crash, returning a 503 Internal Server Error while trying to *index*. Is there any way to separate these two logics? I'm fine with solr not being able to return any response after this OOM, but I don't see the justification for the query flooding the JVM's internal (bounded) buffers for writes. Thanks, Manuel
Re: Parallel queries on a single core
Yes, that answers the first part of my question, thanks. So say N (equally heavy) queries against N CPUs would run simultaneously, right? Previous postings suggest a high qps rate can be addressed performance-wise by a high replicationFactor. But what's the benefit (performance-wise) compared to having a single replica served by many CPUs? On Tue, Jun 18, 2013 at 12:14 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: If I understand your question correctly - what happens with Solr and N parallel queries is not much different from what happens with N processes running in the OS - they all get a slice of the CPU time to do their work. Not sure if that answers your question...? Otis -- Solr ElasticSearch Support http://sematext.com/ On Mon, Jun 17, 2013 at 4:32 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Hello all, Assuming I have a single shard with a single core, how do I run multi-threaded queries on Solr 4.x? Specifically, if one user sends a heavy query (a legitimate wildcard query lasting 10 sec), what happens to all other users querying during this period? If the answer is that simultaneous queries (say 2) run multi-threaded, a single CPU would switch between those two query threads, and in the case of 2 CPUs each CPU would run its own thread. But the latter case gives no advantage to repFactor > 1 performance-wise, as it's close to the same as a single replica running with 1 CPU. So I am a bit confused about this. Thanks, Manu
Re: Avoiding OOM fatal crash
Unfortunately my organisation is too big to control or teach every employee what the limits are, and they can vary (many facets - how much is OK? asking for too many fields combined with too many rows, etc.). Don't you think it is preferable to reserve the maxBufferSize in the JVM heap for indexing only? On Tue, Jun 18, 2013 at 12:11 AM, Walter Underwood wun...@wunderwood.org wrote: Make them aware of what is required. Solr is not designed to return huge requests. If you need to do this, you will need to run the JVM with a big enough heap to build the request. You are getting OOM because the JVM does not have enough memory to build a response with 100K documents. wunder On Jun 17, 2013, at 1:57 PM, Manuel Le Normand wrote: One of my users requested it; they are less aware of what's allowed, and I don't want to block them a priori for long specific requests (there are other params that might end up OOMing me). I thought of the timeAllowed restriction, but even this solution cannot guarantee that the JVM heap would not get flooded during that delay (for example, if everything is already cached and my RAM IOs are very fast). On Mon, Jun 17, 2013 at 11:47 PM, Walter Underwood wun...@wunderwood.org wrote: Don't request 100K docs in a single query. Fetch them in smaller batches. wunder On Jun 17, 2013, at 1:44 PM, Manuel Le Normand wrote: Hello again, After a heavy query on my index (returning 100K docs in a single query) my JVM heap floods and I get a Java OOM exception, and after that my GC cannot collect anything (GC overhead limit exceeded) as these memory chunks are not disposable. I want to be able to afford queries like this; my concern is that this case provokes a total Solr crash, returning a 503 Internal Server Error while trying to *index*. Is there any way to separate these two logics? I'm fine with solr not being able to return any response after this OOM, but I don't see the justification for the query flooding the JVM's internal (bounded) buffers for writes. Thanks, Manuel
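One defensive option consistent with the advice in this thread is to set request defaults in the handler config; a hedged solrconfig.xml sketch (the concrete values are arbitrary, and note that defaults only guard careless requests - a client can still override them explicitly):

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <defaults>
    <int name="rows">10</int>
    <!-- return partial results after 10s of search time -->
    <int name="timeAllowed">10000</int>
  </defaults>
</requestHandler>
```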
Re: Exceptions on startup shutdown for solr 4.3 on Tomcat 7
OK! I will eventually check whether it's an ACE issue and will upload the stack trace in case something else is throwing these exceptions... Thanks meanwhile. On Mon, May 13, 2013 at 12:11 AM, Shawn Heisey s...@elyograg.org wrote: On 5/12/2013 2:37 PM, Manuel Le Normand wrote: The upgrade from 4.2.1 to 4.3 on Tomcat 7 didn't go successfully, and I get many exceptions I didn't see in the earlier version. The services on the different servers are up, and I can access the admin UI, create collections etc., but service startup and shutdown seem quite buggy. I tried resetting earlier configs but got back to the same situation. This situation happens even on instances without any cores. On startup I get: org.apache.catalina.core.StandardWrapperValve invoke SEVERE: Servlet.service() for servlet [default] in context with path [/solr] threw exception java.lang.IllegalStateException: Cannot call sendError() after the response has been committed at org.apache.catalina.connector.ResponseFacade.sendError(ResponseFacade.java:451) at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:692) Things like this are not usually a problem in Solr; it's likely to be in the tomcat settings. It's always possible that it might be a problem in Solr, but it's not likely. The following question/answer page may provide some insight. The settings mentioned in the answer on this page are likely just tomcat settings. http://forums.adobe.com/thread/1042921 To get more specific answers here, we'll need more info from your logs - ideally an entire fresh log. The best thing to do would be to shut tomcat down, move or delete your existing log, then start it back up. Once a new log is created that shows the problem, copy the entire file and make it available on the Internet. If it's relatively small (100k or so), use a paste website (pastie.org or your favorite). If it's pretty big, use a file sharing site like dropbox.
If you need to sanitize your log to remove identifying info, do a consistent search/replace with a harmless string - don't delete entire lines, or it will be difficult to tell what's happening. Thanks, Shawn
Too many unique terms
Hi there, Looking at one of my shards (about 1M docs) I see a lot of unique terms, more than 8M, which is a significant part of my total term count. These are very likely useless terms: binaries or other meaningless numbers that come with a few of my docs. I am totally fine with deleting them so that these terms become unsearchable. Thinking about it, I get that: 1. It is impossible to know a priori whether a term is unique, so I cannot add them to my stop words. 2. I take a performance hit because my cached chunks contain useless data, and I'm short on memory. Assuming a constant index, is there a way of deleting all terms that are unique from at least the dictionary (tim and tip) files? Will I get a significant query-time performance increase? Does anybody know a class of regexes that identifies meaningless terms that I can add to my updateProcessor? Thanks, Manu
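As a starting point for the "class of regexes" question, here is a hedged sketch of junk-term heuristics one might wire into an update processor or analysis chain. The patterns and thresholds below are guesses to illustrate the idea, not a vetted list; they would need tuning against the actual term dictionary.

```python
import re

# Heuristic patterns for "meaningless" tokens: bare numbers, long hex-like
# blobs, and base64-ish strings (binary leftovers). Thresholds are
# assumptions -- tune them against your own term dictionary.
JUNK_PATTERNS = [
    re.compile(r"^\d+$"),                  # bare numbers
    re.compile(r"^[0-9a-f]{16,}$"),        # long hex strings
    re.compile(r"^[A-Za-z0-9+/=]{25,}$"),  # base64-ish blobs
]

def is_junk(term):
    """True if the token matches any junk heuristic."""
    return any(p.match(term) for p in JUNK_PATTERNS)

terms = ["solr", "123456", "deadbeefdeadbeef",
         "QWxhZGRpbjpvcGVuIHNlc2FtZQ==", "query"]
kept = [t for t in terms if not is_junk(t)]
print(kept)  # ['solr', 'query']
```

Filtering such tokens at index time (rather than trying to purge the .tim/.tip files afterwards) keeps them out of the dictionary on the next full reindex.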
Query specific replica
Hello, Since I replicated my shards (I have 2 cores per shard now), I see a noticeable degradation in qTime. I assume it happens because my memory now has to be split between twice as many cores as before. In my low-qps use-case, I use replicas as shard backups only (in case one of my servers goes down) and not for the ability to serve parallel requests. In this case performance decreases because both cores of the shard are active. I was wondering whether it is possible to query the same core on every request, instead of load balancing between the different replicas, so that the second replica would start serving requests only if the leader replica goes down. Cheers, Manu
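One workaround worth checking (a sketch under assumptions, not a confirmed recipe for this Solr version): the `shards` request parameter lets you name the exact cores to query, bypassing SolrCloud's replica load balancing. Pipe-separated entries within one shard are alternatives Solr may balance across, so listing a single core per shard pins the query to it. All host and core names below are made up.

```python
# Hedged sketch: build a `shards` param that names one specific core per
# shard, so the aggregator queries those cores instead of load balancing.
# Hostnames and core names here are hypothetical.
PINNED = [
    "host1:8983/solr/shard1_replica1",
    "host1:8983/solr/shard2_replica1",
]

def shards_param(cores):
    """Comma-separated core list; '|'-separated entries within one shard
    are alternatives Solr balances across, so list one core to pin it."""
    return ",".join(cores)

print(shards_param(PINNED))
# host1:8983/solr/shard1_replica1,host1:8983/solr/shard2_replica1
```

The downside is that automatic failover to the other replica is lost unless the client detects the failure and rewrites the param itself.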
Re: solr-cloud performance decrease day by day
This can happen for various reasons. Can you recreate the situation, i.e. after restarting the servlet or server, does it start with a good qTime and degrade from that point? How fast does this happen? Start by monitoring the JVM process, with Oracle VisualVM for example. Watch for frequent garbage collections, unreasonable memory peaks, or a growing number of threads. Then monitor your system to see whether disk IO latency or disk usage increases over time, the disk write queue explodes, CPU load becomes heavier, or network usage exceeds its limit. If you can recreate the decrease and monitor well, one of the above parameters should pop up. Fixing it after defining the problem will be easier. Good day, Manu

On Apr 19, 2013 10:26 AM, qibaoyuan qibaoy...@gmail.com wrote:
Updating clusterstate from the zookeeper
Hello, After creating a distributed collection on several different servers I sometimes have to deal with failing servers (cores appear not available = grey) or failing cores (down / unable to recover = brown / red). When I wish to delete such an erroneous collection (through the collections API), only the green nodes get erased, leaving a meaningless unavailable collection in clusterstate.json. Is there any way to edit clusterstate.json explicitly? If not, how do I update it so that a collection like the above gets deleted? Cheers, Manu
Re: What are the pros and cons Having More Replica at SolrCloud
On the query side, another downside I see is that, for a given memory pool, you'd have to share it with more cores, because every replica uses its own cache. This is true for the inner Solr caching (the JVM's heap) and for OS caching as well. Adding a replicated core creates a new data set (index) that will be accessed while queried. If your replication adds a core of shard1 on a server that holds only shard2, the OS cache and Solr caches have to share the RAM between totally different memory regions (as files and query results for different shards are different), so the cost is clear. In the second case, if you add a replicated core to a server that already contains shard1, I'm not sure. There might be benefits if the JVM handled its caches per shard rather than per core, but the OS cache would differentiate between the different replicas of the same index and try to hold both sets of index files in memory. So if you're short on memory, or your queries are alike (high hit ratio), you may make better use of your RAM by not splitting it across many replicas. Cheers, Manu

On Fri, Apr 19, 2013 at 3:08 AM, Timothy Potter thelabd...@gmail.com wrote: re: more replicas - pro: you can scale your query processing workload because you have more nodes available to service queries, e.g. with 1,000 QPS sent to Solr with 5 replicas, each is only processing roughly 200 QPS. If you need to scale up to 10K QPS, then add more replicas to distribute the increased workload. con: additional overhead (mostly network I/O) when indexing; the shard leader has to send N additional requests per update, where N is the number of replicas per shard. This seems minor unless you have many replicas per shard. I can't think of any cons of having more replicas on the query side. As for your other question, when the leader receives an update request, it forwards it to all replicas in the active or recovering state in parallel and waits for their responses before responding to the client.
All replicas must accept the update for it to be considered successful, i.e. all replicas and the leader must be in agreement on the status of a request. This is why you hear people referring to Solr as favoring consistency over write-availability. If you have 10 active replicas for a shard, then all 10 must accept the update or it fails; there's no concept of tunable consistency on a write in Solr. Failed / offline replicas are obviously ignored, and they will sync up with the leader once they are back online. Cheers, Tim

On Thu, Apr 18, 2013 at 4:48 PM, Furkan KAMACI furkankam...@gmail.com wrote: What are the pros and cons of having more replicas in SolrCloud? There is also a point that I want to understand: when a request comes to a leader, does it forward it to a replica? And if it does, does the replica work in parallel with the other replicas of the same leader to build up the index?
Re: Slow qTime for distributed search
Hi, We have different working hours, sorry for the reply delay. Your assumed numbers are right, about 25-30KB per doc, giving a total of 15GB per shard; there are two shards per server (+2 slaves that should normally do no work). An average query has about 30 conditions (OR and AND mixed), most of them textual, a small part on dateTime fields. They are only simple queries (no facets, filters etc.), as the set is taken from the actual query log of my enterprise, which works with an old search engine. As we said, if the shards in collection1 and collection2 have the same number of docs each (and the same RAM and CPU per shard), it is apparently not a slow-IO issue, right? So the fact that my index is not fully cached doesn't seem to be the bottleneck. Moreover, I do store the fields, but my query set requests only the ids and rarely snippets, so I'd assume that the extra RAM I'd give the OS wouldn't make any difference, as these *.fdt files don't need to get cached. The conclusion I come to is that the merging is the problem, and the only way to outsmart it is to distribute across far fewer shards, meaning I'd get back to a few million docs per shard, which is roughly linearly slower with the number of docs per shard. That should improve, though, if I give much more RAM per server. I'll try tweaking my schema a bit and making better use of the Solr caches (filter queries, for example), but something tells me the problem might be elsewhere. My main clue is that merging seems a simple CPU task, and tests show that even with a small number of responses it takes a long time (whereas the merging task on few docs should clearly be very short).

On Wed, Apr 10, 2013 at 2:50 AM, Shawn Heisey s...@elyograg.org wrote: On 4/9/2013 3:50 PM, Furkan KAMACI wrote: Hi Shawn; You say that: "... your documents are about 50KB each. That would translate to an index that's at least 25GB." I know we cannot say an exact size, but what is the approximate ratio of document size to index size according to your experience? If you store the fields, that is the actual size plus a small amount of overhead. Starting with Solr 4.1, stored fields are compressed. I believe that it uses LZ4 compression. Some people store all fields, some people store only a few or one - an ID field. The size of stored fields does have an impact on how much OS disk cache you need, but not as much as the other parts of an index. It's been my experience that termvectors take up almost as much space as stored data for the same fields, and sometimes more. Starting with Solr 4.2, termvectors are also compressed. Adding docValues (new in 4.2) to the schema will also make the index larger. The requirements here are similar to stored fields. I do not know whether this data gets compressed, but I don't think it does. As for the indexed data, this is where I am less clear about the storage ratios, but I think you can count on it needing almost as much space as the original data. If the schema uses types or filters that produce a lot of information, the indexed data might be larger than the original input. Examples of data explosions in a schema: trie fields with a non-zero precisionStep, the edgengram filter, the shingle filter. Thanks, Shawn
Re: Slow qTime for distributed search
Thanks for replying. My config: - 40 dedicated servers, dual-core each - Running a Tomcat servlet container on Linux - 12GB RAM per server, split half and half between the OS and Solr - Complex queries (up to 30 conditions on different fields), 1 qps rate. Sharding my index was done for two reasons, based on tests with 2 servers (4 shards): 1. As the index grew above a few million docs, qTime rose greatly, while sharding the index into smaller pieces (about 0.5M docs) gave way better results, so I bound every shard to 0.5M docs. 2. Tests showed I was CPU-bound during queries. As I have a low qps rate (emphasis: lower than expected qTime) and as a query runs single-threaded on each shard, it made sense to dedicate a CPU to each shard. For the same number of docs per shard I do expect a rise in total qTime for these reasons: 1. The response has to wait for the slowest shard. 2. Merging the responses from 40 different shards takes time. What I understand from your explanation is that it's the merging that takes time, and as qTime ends only after the second retrieval phase, the qTime on each shard will be longer. Meaning that during a significant proportion of the first query phase (right after the [id, score] pairs are retrieved), all CPUs are idle except the one running the response-merger thread. I thought of the merge as a simple sort of [id, score] pairs, far simpler than an additional 300 ms of CPU time. Why would a RAM increase improve my performance, if the bottleneck is the response merge (a CPU resource)? Thanks in advance, Manu

On Mon, Apr 8, 2013 at 10:19 PM, Shawn Heisey s...@elyograg.org wrote: On 4/8/2013 12:19 PM, Manuel Le Normand wrote: It seems that sharding my collection into many shards slowed things down unreasonably, and I'm trying to investigate why. First, I created collection1 - a 4 shards * replicationFactor=1 collection on 2 servers. Second, I created collection2 - a 48 shards * replicationFactor=2 collection on 24 servers, keeping the same config and the same number of documents per shard.
The primary reason to use shards is index size - when your index is so big that a single index cannot give you reasonable performance. There are also sometimes performance gains when you break a smaller index into shards, but there is a limit. Going from 2 shards to 3 shards will have more of an impact than going from 8 shards to 9 shards. At some point, adding shards makes things slower, not faster, because of the extra work required to combine multiple queries into one result response. There is no reasonable way to predict when that will happen. Observations showed the following: 1. Total qTime for the same query set is 5 times higher in collection2 (150 ms - 700 ms). 2. Adding the shards.info=true param to queries on collection2 shows that each shard is much slower than each shard was in collection1 (about 4 times slower). 3. Querying only specific shards of collection2 (by adding the shards=shard1,shard2,...,shard12 param) gave me a much better qTime per shard (only 2 times higher than in collection1). 4. I have a low qps rate, thus I don't suspect the replication factor of being the major cause of this. 5. The avg. CPU load on servers during querying was much higher in collection1 than in collection2, and I didn't catch any other bottleneck. A distributed query actually consists of up to two queries per shard. The first query just requests the uniqueKey field, not the entire document. If you are sorting the results, then the sort field(s) are also requested; otherwise the only additional information requested is the relevance score. The results are compiled into a set of unique keys, then a second query is sent to the proper shards requesting the specific documents. Q: 1. Why does the number of shards affect the qTime of each shard? 2. How can I reduce the qTime of each shard back down? With more shards, it takes longer for the first phase to compile the results, so the second phase (document retrieval) gets delayed, and the QTime goes up.
One way to reduce the total time is to reduce the number of shards. You haven't said anything about how complex your queries are, your index size(s), or how much RAM you have on each server and how it is allocated. Can you provide this information? Getting good performance out of Solr requires plenty of RAM in your OS disk cache. Query times of 150 to 700 milliseconds seem very high, which could be due to query complexity or a lack of server resources (especially RAM), or possibly both. Thanks, Shawn
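The two-phase flow Shawn describes can be sketched as a toy model: phase one collects per-shard [id, score] lists and merges them into a global top-k, then phase two fetches the stored documents only for the winners. All scores and ids below are invented for illustration.

```python
import heapq

# Toy model of a distributed query's first phase. Each shard returns
# (score, id) pairs already sorted by score descending; the aggregator
# merges them into one global top-k. Data here is made up.
shard_hits = {
    "shard1": [(0.9, "a"), (0.5, "c")],
    "shard2": [(0.8, "b"), (0.4, "d")],
}

def merge_top_k(shard_hits, k):
    """Merge per-shard sorted hit lists into a global top-k id list."""
    merged = heapq.merge(*shard_hits.values(), key=lambda h: -h[0])
    return [doc_id for _, doc_id in list(merged)[:k]]

top = merge_top_k(shard_hits, 3)
print(top)  # ['a', 'b', 'c']
# Phase two would then request stored fields for only these ids from
# the shards that own them.
```

With 48 shards instead of 4, the aggregator has 48 sorted lists to pull from before phase two can even start, which is one reason per-shard QTime appears higher in the larger collection.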
Re: Slow qTime for distributed search
After taking a look at what I wrote earlier, I will try to rephrase more clearly. It seems that sharding my collection into many shards slowed things down unreasonably, and I'm trying to investigate why. First, I created collection1 - a 4 shards * replicationFactor=1 collection on 2 servers. Second, I created collection2 - a 48 shards * replicationFactor=2 collection on 24 servers, keeping the same config and the same number of documents per shard. Observations showed the following: 1. Total qTime for the same query set is 5 times higher in collection2 (150 ms - 700 ms). 2. Adding the shards.info=true param to queries on collection2 shows that each shard is much slower than each shard was in collection1 (about 4 times slower). 3. Querying only specific shards of collection2 (by adding the shards=shard1,shard2,...,shard12 param) gave me a much better qTime per shard (only 2 times higher than in collection1). 4. I have a low qps rate, thus I don't suspect the replication factor of being the major cause of this. 5. The avg. CPU load on servers during querying was much higher in collection1 than in collection2, and I didn't catch any other bottleneck. Q: 1. Why does the number of shards affect the qTime of each shard? 2. How can I reduce the qTime of each shard back down? Thanks, Manu
Slow qTime for distributed search
Hello, After performing a benchmark session at small scale I moved to full scale on 16 quad-core servers. Observations at small scale gave me an excellent qTime (about 150 ms) with up to 2 servers, showing my searching thread was mainly CPU-bound. My query set is not faceted. Growing to full scale (with the same config, schema and number of docs per shard) I sharded my collection into 48 shards and added a replica for each. Since then I have a major performance deterioration: my qTime went up to 700 ms. The servers have a much smaller load, and the network does not show any difficulties. I understand that merging the responses and waiting for the slowest shard's response should increase my small-scale qTime, so I checked shards.info=true and observed that each shard was taking much longer, while when restricting the query to specific shards (shards=shard1,shard2,...,shard12) I get much better results for each shard's qTime and the total qTime. Keeping the same config, how come the number of shards affects the qTime of each shard? How can I overcome this issue? Thanks, Manu
Re: Is Solr more CPU bound or IO bound?
Your question is typically use-case dependent; the bottleneck will change from user to user. These are the two main issues that will affect the answer: 1. How you index: what is your indexing rate (how many docs a day)? How big is a typical document? How many documents do you plan on indexing in total? Do you store fields? Calculate their term vectors? 2. What your retrieval process looks like: what query rate do you expect? Are there common queries (taking advantage of the cache)? How complex are the queries (faceted / highlighted / filtered / how many conditions, NRT)? Do you plan to retrieve stored fields or only ids? After answering all that, there's an iterative game between hardware configuration and software configuration (how you split your shards, use your caches, tune your merges and flushes etc.) that will also affect the IO- vs CPU-bound answer. In my use-case, for example, the indexing part is IO-bound, but as my indexing rate is far below the rate my machines could initially provide, it didn't affect my hardware spec. After fine-tuning my configuration I discovered my retrieval process was CPU-bound and directly affected my avg response time, while the IO rate given my cache usage was quite low. Try describing your use case in more detail along the above questions so we'd be able to give you guidelines. Best, Manu

On Mon, Mar 18, 2013 at 3:55 AM, David Parks davidpark...@yahoo.com wrote: I'm spec'ing out some hardware for a first go at our production Solr instance, but I haven't spent enough time load testing it yet. What I want to ask is how IO intensive Solr is vs. CPU intensive, typically. Specifically, I'm considering whether to dual-purpose the Solr servers to run Solr and another CPU-only application we have.
I know Solr uses a fair amount of CPU, but if it also is very disk intensive it might be a net benefit to have more instances running Solr and share the CPU resources with the other app than to run Solr separate from the other CPU app that wouldn't otherwise use the disk. Thoughts on this? Thanks, David
Re: Optimization storage issue
Hi Tim - thanks for the answer. As to your assumption: my documents are about 50KB each in the index, but after two weeks of updating without removing I have about 40 percent unused (deleted) docs in my index, and that has an impact on query performance. 1) My incentive for optimizing rather than merging was to take advantage of the engine's dead hours - local night hours that have a low qps rate. Thus I would control the hours during which these operations occur, and the merging and query threads wouldn't have to compete for the same resources - correct me if I'm mistaken. 2) Using the expungeDeletes attribute might be an interesting option, as the segments containing deleted docs should be only a few, created in the same time range (a month before); but in my case I have a few deleted docs even in new segments, for various reasons. If I use this suggested commit and all my segments contain deleted docs, wouldn't it amount to an optimize? Is there an option to restrict expungeDeletes to segments with more than N deleted docs, so I could avoid the pseudo-optimize? Manu

On Sat, Mar 2, 2013 at 8:54 PM, Timothy Potter thelabd...@gmail.com wrote: Hi Manuel, If you search "optimize" on this mailing list, you'll see that one of the common suggestions is to avoid optimizing and fine-tune segment merging instead. So to begin, take a look at your solrconfig.xml and find out what your merge policy and mergeFactor are set to (note: they may be commented out, which implies segment merging is still enabled with the default settings). You can experiment with changing the mergeFactor. Based on your description of adding and removing a few thousand documents each day, I'm going to assume your documents are very large, otherwise I can't see how you'd ever notice an impact on query performance. Is my assumption about the document size correct? One thing you can try is to set the expungeDeletes attribute to true when you commit, i.e. <commit expungeDeletes="true"/>.
This triggers Solr to merge any segments with deletes. Lastly, I'm not sure about your specific questions related to optimization, but I think it's worth trying the suggestions above and avoiding optimizations altogether. I'm pretty sure the answer to #1 is no, and for #2 it optimizes independently. Cheers, Tim

On Sat, Mar 2, 2013 at 10:24 AM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: My use-case is a quasi-monthly changing index. Every day I index a few thousand docs and erase a similar number of older documents, while a few documents last in the index forever (about 20% of my index). After a few experiments, I found that leaving the older documents in the index (mostly in the *.tim file) slows down my avg qTime significantly, and got to the conclusion that I need to optimize the index once every few days to get rid of the older documents. Optimization requires about 2 times the index storage. As I have many shards with one replica each, and the optimization occurs simultaneously on all of them, I need twice the amount of storage of my initial index size, while half of it is used very infrequently (optimization takes about an hour). 1) Is there a possibility of using a storage pool for all shards, so every shard uses the spare storage in series, forcing the optimizations to run one after another? In this case all the storage I'd use would be (total index storage + one shard's storage) instead of twice the total index storage. 2) When I run an optimization on a replicated core, does it copy from its leader or does it optimize independently? Thanks, Manu
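The expungeDeletes commit Tim suggests can also be issued as plain request parameters on the update handler, which is convenient from a nightly cron job. A minimal sketch, assuming a local Solr with a collection named collection1 (both placeholders):

```python
# Hedged sketch: build the URL for a commit with expungeDeletes=true,
# the request-parameter equivalent of posting <commit expungeDeletes="true"/>.
# The host and collection name are assumptions.
from urllib.parse import urlencode

UPDATE_URL = "http://localhost:8983/solr/collection1/update"

def expunge_deletes_request():
    """URL that asks Solr to commit and merge away segments with deletes."""
    params = {"commit": "true", "expungeDeletes": "true"}
    return UPDATE_URL + "?" + urlencode(params)

print(expunge_deletes_request())
```

Scheduling this during the low-qps night hours gives much of the benefit Manuel wants from optimize, without rewriting segments that contain no deletes.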
Re: Threads running while querying
Yes, I made a single-threaded script which sends a query as a POST request to the shard's URL, gets back the response and posts the next query. How can that matter? Manuel

On Wednesday, February 20, 2013, Erick Erickson wrote: Silly question perhaps, but are you feeding queries at Solr with a single thread? Because Solr uses multiple threads to search AFAIK. Best, Erick

On Wed, Feb 20, 2013 at 4:01 AM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: More to it, I do see 75 more threads under the Tomcat 6 process, but only a single one is working while querying.

On Wednesday, February 20, 2013, Manuel Le Normand wrote: Hello, I created a single collection on a Linux server with 8M docs, Solr 4.1. While making performance tests, I see that my quad-core server makes full use of a single core while the 3 others are idle. Is there a possibility of making a single-sharded collection available for multi-threaded querying? P.S.: I'm not indexing while querying.
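Erick's point is that a single-threaded client keeps one query in flight, and thus one core busy, at a time; firing queries concurrently lets the server work on several at once. A minimal sketch with a thread pool, where `send_query` is a stand-in for a real HTTP POST to the shard's /select handler (no real Solr call is made here):

```python
from concurrent.futures import ThreadPoolExecutor

# send_query is a hypothetical stand-in for posting one query to Solr
# and returning its parsed response.
def send_query(q):
    return {"query": q, "status": 0}  # pretend response

queries = ["foo", "bar", "baz", "qux"]

# Submit queries concurrently instead of one after another; with 4 workers,
# up to 4 queries are in flight at once, so the server can use more cores.
with ThreadPoolExecutor(max_workers=4) as pool:
    responses = list(pool.map(send_query, queries))

print([r["query"] for r in responses])  # ['foo', 'bar', 'baz', 'qux']
```

Note this increases concurrency across queries; a single query against a single shard still runs single-threaded on the server side, as Manuel observed.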