Re: Solr performance issues
Thanks all. I have the same index with a slightly different schema and 200M documents, installed on 3 r3.xlarge (30GB RAM, and 600GB General Purpose SSD). The index size is about 1.5TB, with many updates every 5 minutes, complex queries and faceting, and a response time of 100ms that is acceptable for us. Toke Eskildsen, Is the index updated while you are searching? *No* Do you do any faceting or other heavy processing as part of a search? *No* How many hits does a search typically have and how many documents are returned? *The test was for QTime only, with no documents returned and the number of hits varying from 50,000 to 50,000,000.* How many concurrent searches do you need to support? How fast should the response time be? *Maybe 100 concurrent searches at 100ms, with facets.* Would splitting the shard into two shards on the same node, so that every shard sits on a single EBS volume, be better than using LVM? Thanks On Mon, Dec 29, 2014 at 2:00 AM, Toke Eskildsen t...@statsbiblioteket.dk wrote: Mahmoud Almokadem [prog.mahm...@gmail.com] wrote: We've installed a cluster of one collection of 350M documents on 3 r3.2xlarge (60GB RAM) Amazon servers. The size of the index on each shard is about 1.1TB, and the maximum volume size on Amazon is 1TB, so we added 2 General Purpose SSD EBS volumes (1x1TB + 1x500GB) on each instance, then created a 1.5TB logical volume using LVM to fit our index. Your search speed will be limited by the slowest storage in your group, which would be your 500GB EBS. The General Purpose SSD option means (as far as I can read at http://aws.amazon.com/ebs/details/#piops) a baseline of 3 IOPS/GB = 1500 IOPS, with bursts of 3000 IOPS. Unfortunately they do not say anything about latency. For comparison, I checked the system logs from a local test with our 21TB / 7 billion documents index. It used ~27,000 IOPS during the test, with mean search time a bit below 1 second. That was with ~100GB RAM for disk cache, which is about ½% of index size. The test was with simple term queries (1-3 terms) and some faceting. Back of the envelope: 27,000 IOPS for 21TB is ~1300 IOPS/TB. Your indexes are 1.1TB, so 1.1*1300 IOPS ~= 1400 IOPS. All else being equal (which is never the case), getting 1-3 second response times for a 1.1TB index, when one link in the storage chain is capped at a few thousand IOPS, you are using networked storage and you have little RAM for caching, does not seem unrealistic. If possible, you could try temporarily boosting performance of the EBS, to see if raw IO is the bottleneck. The response time is about 1 and 3 seconds for simple queries (1 token). Is the index updated while you are searching? Do you do any faceting or other heavy processing as part of a search? How many hits does a search typically have and how many documents are returned? How many concurrent searches do you need to support? How fast should the response time be? - Toke Eskildsen
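[Editor's note: a minimal sketch of the back-of-the-envelope arithmetic above, assuming the gp2 baseline of 3 IOPS per provisioned GB from the AWS page cited in the thread; the figures are approximations, not measurements from this setup.]

// Rough IOPS estimate following Toke's numbers.
public class IopsEstimate {
    public static void main(String[] args) {
        double observedIops = 27000;                   // measured on the 21TB / 7 billion doc test index
        double testIndexTb = 21.0;
        double iopsPerTb = observedIops / testIndexTb; // ~1300 IOPS per TB of index
        double shardTb = 1.1;                          // size of one shard in the problem setup
        double neededIops = iopsPerTb * shardTb;       // ~1400 IOPS
        double gp2Baseline = 3 * 500;                  // 3 IOPS/GB baseline on the 500GB gp2 volume = 1500 IOPS
        System.out.printf("need ~%.0f IOPS, 500GB gp2 baseline is %.0f IOPS (burst 3000)%n",
                neededIops, gp2Baseline);
    }
}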
Re: SolrCloud Paging on large indexes
On 12/23/2014 04:07 PM, Toke Eskildsen wrote: The beauty of the cursor is that it is has little to no overhead, relative to a standard top-X sorted search. A standard search uses a sliding window over the full result set, as does a cursor-search. Same amount of work. It is just a question of limits for the window. That is very good to hear. Thanks. Nobody will hit next 499 times, but a lot of our users skip to the last page quite often. Maybe I should make *that* as hard as possible. Hmm. Issue a search with sort in reverse order, then reverse the returned list of documents? Sneaky. I like it. But in the end we're simply getting rid of the last-button. Solves a lot of issues. If have a billion search results, you might as well refine your criteria! - Bram
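[Editor's note: for readers who want to try the cursor mentioned above, a minimal SolrJ sketch of cursor-based deep paging. It assumes Solr/SolrJ 4.7 or later and a uniqueKey field named id; the collection URL is illustrative, not from the thread.]

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorPaging {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(100);
        q.setSort(SolrQuery.SortClause.asc("id")); // a cursor requires a sort that includes the uniqueKey
        String cursor = CursorMarkParams.CURSOR_MARK_START;
        while (true) {
            q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
            QueryResponse rsp = server.query(q);
            String next = rsp.getNextCursorMark();
            // ... process rsp.getResults() for this page ...
            if (cursor.equals(next)) break;        // cursor did not advance: no more results
            cursor = next;
        }
        server.shutdown();
    }
}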
How large is your solr index?
Hi folks, I'm trying to get a feel for how large Solr can grow without slowing down too much. We're looking into a use-case with up to 100 billion documents (SolrCloud), and we're a little afraid that we'll end up requiring 100 servers to pull it off. The largest index we currently have is ~2 billion documents in a single Solr instance. Documents are smallish (5k each) and we have ~50 fields in the schema, with an index size of about 2TB. Performance is mostly OK. Cold searchers take a while, but most queries are alright after warming up. I wish I could provide more statistics, but I only have very limited access to the data (...banks...). I'd be very grateful to anyone sharing statistics, especially on the larger end of the spectrum -- with or without SolrCloud. Thanks, - Bram
Re: Loading data to FieldValueCache
On Fri, Dec 26, 2014 at 12:26 PM, Erick Erickson erickerick...@gmail.com wrote: I don't know the complete algorithm, but if the number of docs that satisfy the fq is small enough, then just the internal Lucene doc IDs are stored rather than a bitset. If smaller than maxDoc/64 ids are collected, a sorted int set is used instead of a bitset. Also, the enum method can skip caching for the smaller terms: facet.enum.cache.minDf=100 might be good for general purpose. Or set the value really high to not use the filter cache at all. -Yonik
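[Editor's note: to make the knobs above concrete, a small SolrJ sketch of an enum-method facet request with the cache threshold; the field name "category" and the collection URL are illustrative, not from the thread.]

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class EnumFacetExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("*:*");
        q.setFacet(true);
        q.addFacetField("category");            // hypothetical facet field
        q.set("facet.method", "enum");          // enumerate terms, one filterCache lookup per term
        q.set("facet.enum.cache.minDf", "100"); // terms matching fewer than 100 docs bypass the filterCache
        System.out.println(server.query(q).getFacetField("category").getValues());
        server.shutdown();
    }
}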
Highlighting do not show for some solr results
Hello, I turned on highlighting and some records do not have highlight text (see the attached screenshot). Does anyone know why this is happening and how I can fix it? Here is the querystring I am using: wt=json&json.wrf=?&indent=true&hl=true&hl.fl=title,content&hl.tag.pre=<em>&hl.tag.post=</em>&hl.snippets=2. Thanks
Re: Highlighting do not show for some solr results
two things: 1 attachments rarely make it through the e-mail system, you have to put things like screenshots out on different servers and provide a link. 2 I did see the attachment in my moderator role and it's not clear what your problem really is. I'm _guessing_ that your complaint is that the top few returns are just the file names, there's no text. In that case, you're probably matching some other field than text but highlighting on the text field. Do you perhaps have your request handler configured to use edismax and are searching across multiple fields? Best, Erick On Mon, Dec 29, 2014 at 8:14 AM, Volel, Andre avo...@bklynlibrary.org wrote: Hello, I turned on highlighting and some records do not have highlight text (See image below): Does anyone know why this is happening and how I can fix it? Here is the querystring I am using “ wt=jsonjson.wrf=?indent=truehl=truehl.fl=title,contenthl.tag.pre=emhl.tag.post=/emhl.snippets=2 ”. Thanks
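[Editor's note: if the field mismatch Erick describes is the cause, querying the highlighted field directly, or setting hl.requireFieldMatch, should make the difference visible. A SolrJ sketch; the query term, collection URL and field names are illustrative except for title/content, which come from the original mail.]

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class HighlightCheck {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("content:library");    // query the same field you highlight on
        q.setHighlight(true);
        q.setParam("hl.fl", "title,content");
        q.setParam("hl.requireFieldMatch", "true");         // only highlight terms from fields that matched
        q.setHighlightSnippets(2);
        System.out.println(server.query(q).getHighlighting());
        server.shutdown();
    }
}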
Re: How large is your solr index?
When you say 2B docs on a single Solr instance, are you talking only one shard? Because if you are, you're very close to the absolute upper limit of a shard, internally the doc id is an int or 2^31. 2^31 + 1 will cause all sorts of problems. But yeah, your 100B documents are going to use up a lot of servers... Best, Erick On Mon, Dec 29, 2014 at 7:24 AM, Bram Van Dam bram.van...@intix.eu wrote: Hi folks, I'm trying to get a feel of how large Solr can grow without slowing down too much. We're looking into a use-case with up to 100 billion documents (SolrCloud), and we're a little afraid that we'll end up requiring 100 servers to pull it off. The largest index we currently have is ~2billion documents in a single Solr instance. Documents are smallish (5k each) and we have ~50 fields in the schema, with an index size of about 2TB. Performance is mostly OK. Cold searchers take a while, but most queries are alright after warming up. I wish I could provide more statistics, but I only have very limited access to the data (...banks...). I'd very grateful to anyone sharing statistics, especially on the larger end of the spectrum -- with or without SolrCloud. Thanks, - Bram
Re: Loading data to FieldValueCache
bq: There will be no updates to my index. So, no worries about ageing out or garbage collection This is irrelevant to aging out filterCache entries, this is purely query time. bq: Each having 64 GB of RAM, out of which I am allocating 45 GB to Solr. It's usually a mistake to give Solr so much ram relative to the OS, see Uwe's excellent blog here: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html That said, you know your system best. And the fact that you have so many shards may well mean that memory considerations aren't relevant. Personally, though, I think you've massively over-sharded your collection and are incurring significant overhead, but again you know your requirements much better than I do. Best, Erick On Mon, Dec 29, 2014 at 7:43 AM, Yonik Seeley yo...@heliosearch.com wrote: On Fri, Dec 26, 2014 at 12:26 PM, Erick Erickson erickerick...@gmail.com wrote: I don't know the complete algorithm, but if the number of docs that satisfy the fq is small enough, then just the internal Lucene doc IDs are stored rather than a bitset. If smaller than maxDoc/64 ids are collected, a sorted int set is used instead of a bitset. Also, the enum method can skip caching for the smaller terms: facet.enum.cache.minDf=100 might be good for general purpose. Or set the value really high to not use the filter cache at all. -Yonik
Re: Solr performance issues
On 12/29/2014 2:36 AM, Mahmoud Almokadem wrote: I've the same index with a bit different schema and 200M documents, installed on 3 r3.xlarge (30GB RAM, and 600 General Purpose SSD). The size of index is about 1.5TB, have many updates every 5 minutes, complex queries and faceting with response time of 100ms that is acceptable for us. Toke Eskildsen, Is the index updated while you are searching? *No* Do you do any faceting or other heavy processing as part of a search? *No* How many hits does a search typically have and how many documents are returned? *The test for QTime only with no documents returned and No. of hits varying from 50,000 to 50,000,000.* How many concurrent searches do you need to support? How fast should the response time be? *May be 100 concurrent searches with 100ms with facets.* Does splitting the shard to two shards on the same node so every shard will be on a single EBS Volume better than using LVM? The basic problem is simply that the system has so little memory that it must read large amounts of data from the disk when it does a query. There is not enough RAM to cache the important parts of the index. RAM is much faster than disk, even SSD. Typical consumer-grade DDR3-1600 memory has a data transfer rate of about 12800 megabytes per second. If it's ECC memory (which I would say is a requirement) then the transfer rate is probably a little bit slower than that. Figuring 9 bits for every byte gets us about 11377 MB/s. That's only an estimate, and it could be wrong in either direction, but I'll go ahead and use it. http://en.wikipedia.org/wiki/DDR3_SDRAM#JEDEC_standard_modules If your SSD is SATA, the transfer rate will be limited to approximately 600MB/s -- the 6 gigabit per second transfer rate of the newest SATA standard. That makes memory about 18 times as fast as SATA SSD. I saw one PCI express SSD that claimed a transfer rate of 2900 MB/s. Even that is only about one fourth of the estimated speed of DDR3-1600 with ECC. I don't know what interface technology Amazon uses for their SSD volumes, but I would bet on it being the cheaper version, which would mean SATA. The networking between the EC2 instance and the EBS storage is unknown to me and may be a further bottleneck. http://ocz.com/enterprise/z-drive-4500/specifications Bottom line -- you need a lot more memory. Speeding up the disk may *help* ... but it will not replace that simple requirement. With EC2 as the platform, you may need more instances and more shards. Your 200 million document index that works well with only 90GB of total memory ... that's surprising to me. That means that the important parts of that index *do* fit in memory ... but if the index gets much larger, performance is likely to drop off sharply. Thanks, Shawn
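[Editor's note: Shawn's bandwidth comparison written out as a sketch; the DDR3-1600, SATA and PCIe figures are the ones quoted in the mail above.]

public class BandwidthCompare {
    public static void main(String[] args) {
        double ddr3 = 12800.0;           // MB/s, DDR3-1600 peak transfer rate
        double ecc = ddr3 * 8.0 / 9.0;   // ~11377 MB/s if 1 bit in 9 is spent on ECC
        double sataSsd = 600.0;          // MB/s, SATA 6Gb/s ceiling
        double pcieSsd = 2900.0;         // MB/s, the PCIe SSD example
        System.out.printf("ECC RAM ~%.0f MB/s, roughly %.0fx SATA SSD and %.1fx PCIe SSD%n",
                ecc, ecc / sataSsd, ecc / pcieSsd);
    }
}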
Re: Solr performance issues
Thanks Shawn. What do you mean with important parts of index? and how to calculate their size? Thanks, Mahmoud Sent from my iPhone On Dec 29, 2014, at 8:19 PM, Shawn Heisey apa...@elyograg.org wrote: On 12/29/2014 2:36 AM, Mahmoud Almokadem wrote: I've the same index with a bit different schema and 200M documents, installed on 3 r3.xlarge (30GB RAM, and 600 General Purpose SSD). The size of index is about 1.5TB, have many updates every 5 minutes, complex queries and faceting with response time of 100ms that is acceptable for us. Toke Eskildsen, Is the index updated while you are searching? *No* Do you do any faceting or other heavy processing as part of a search? *No* How many hits does a search typically have and how many documents are returned? *The test for QTime only with no documents returned and No. of hits varying from 50,000 to 50,000,000.* How many concurrent searches do you need to support? How fast should the response time be? *May be 100 concurrent searches with 100ms with facets.* Does splitting the shard to two shards on the same node so every shard will be on a single EBS Volume better than using LVM? The basic problem is simply that the system has so little memory that it must read large amounts of data from the disk when it does a query. There is not enough RAM to cache the important parts of the index. RAM is much faster than disk, even SSD. Typical consumer-grade DDR3-1600 memory has a data transfer rate of about 12800 megabytes per second. If it's ECC memory (which I would say is a requirement) then the transfer rate is probably a little bit slower than that. Figuring 9 bits for every byte gets us about 11377 MB/s. That's only an estimate, and it could be wrong in either direction, but I'll go ahead and use it. http://en.wikipedia.org/wiki/DDR3_SDRAM#JEDEC_standard_modules If your SSD is SATA, the transfer rate will be limited to approximately 600MB/s -- the 6 gigabit per second transfer rate of the newest SATA standard. That makes memory about 18 times as fast as SATA SSD. I saw one PCI express SSD that claimed a transfer rate of 2900 MB/s. Even that is only about one fourth of the estimated speed of DDR3-1600 with ECC. I don't know what interface technology Amazon uses for their SSD volumes, but I would bet on it being the cheaper version, which would mean SATA. The networking between the EC2 instance and the EBS storage is unknown to me and may be a further bottleneck. http://ocz.com/enterprise/z-drive-4500/specifications Bottom line -- you need a lot more memory. Speeding up the disk may *help* ... but it will not replace that simple requirement. With EC2 as the platform, you may need more instances and more shards. Your 200 million document index that works well with only 90GB of total memory ... that's surprising to me. That means that the important parts of that index *do* fit in memory ... but if the index gets much larger, performance is likely to drop off sharply. Thanks, Shawn
Re: How large is your solr index?
Like all things it really depends on your use case. We have 160B documents in our largest SolrCloud and doing a *:* to get that count takes ~13-14 seconds. Doing a text:happy query only takes ~3.5-3.6 seconds cold, subsequent queries for the same terms take 500ms. We have a little over 3TB of RAM in the cluster which is around 1/10th size on disk which are fast SSDs (rated 300K IOPS per machine), but more importantly we are using 12-13 large machines rather than dozens or hundreds of small machines, and if your use case is primarily full text search you probably could get away with even fewer machines depending on query patterns. We run several JVMs per machine and many shards per JVM, but are careful to order shards so that queries get dispersed across multiple JVMs across multiple machines wherever possible. Facets over high cardinality fields are going to be painful. We currently programmatically limit the range to around 1/12th or 1/13th of the data set for facet queries, but plan on evaluating Heliosearch (initial results didn't look promising) and Toke's sparse faceting patch (SOLR-5894) to help out there. If any given JVM goes OOM that also becomes a rough time operationally. If your indexing rate spikes past what your sharding strategy can handle, that sucks too. There could be more support / ease of use enhancements for moving shards across SolrClouds, moving shards across physically nodes within a SolrCloud, and snapshot/restore of a SolrCloud, but there has also been a lot of recent work in these areas that are starting to provide the underlying infrastructure for more advanced shard management. I think there are more people getting into the space of 100B documents but I only ran into or discovered a handful during my time at Lucene/Solr Revolution this November. The majority of large scale SolrCloud users seem to have many collections (collections per logical user) rather than many documents in one/few collections. Regards, --Ralph On Mon Dec 29 2014 at 11:55:41 AM Erick Erickson erickerick...@gmail.com wrote: When you say 2B docs on a single Solr instance, are you talking only one shard? Because if you are, you're very close to the absolute upper limit of a shard, internally the doc id is an int or 2^31. 2^31 + 1 will cause all sorts of problems. But yeah, your 100B documents are going to use up a lot of servers... Best, Erick On Mon, Dec 29, 2014 at 7:24 AM, Bram Van Dam bram.van...@intix.eu wrote: Hi folks, I'm trying to get a feel of how large Solr can grow without slowing down too much. We're looking into a use-case with up to 100 billion documents (SolrCloud), and we're a little afraid that we'll end up requiring 100 servers to pull it off. The largest index we currently have is ~2billion documents in a single Solr instance. Documents are smallish (5k each) and we have ~50 fields in the schema, with an index size of about 2TB. Performance is mostly OK. Cold searchers take a while, but most queries are alright after warming up. I wish I could provide more statistics, but I only have very limited access to the data (...banks...). I'd very grateful to anyone sharing statistics, especially on the larger end of the spectrum -- with or without SolrCloud. Thanks, - Bram
[ANNOUNCE] Apache Solr 4.10.3 released
December 2014, Apache Solr™ 4.10.3 available The Lucene PMC is pleased to announce the release of Apache Solr 4.10.3 Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites. Solr 4.10.3 is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html Solr 4.10.3 includes 21 bug fixes, as well as Lucene 4.10.3 and its 12 bug fixes. This release fixes the following security vulnerability that has affected Solr since the Solr 4.0 Alpha release. CVE-2014-3628: Stored XSS vulnerability in Solr Admin UI. Information disclosure: The Solr Admin UI Plugin / Stats page does not escape data values which allows an attacker to execute javascript by executing a query that will be stored and displayed via the 'fieldvaluecache' object. See the CHANGES.txt file included with the release for a full list of changes and further details. Please report any feedback to the mailing lists (http://lucene.apache.org/solr/discussion.html) Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access. Happy Holidays, Mark Miller http://www.about.me/markrmiller
Re: Solr performance issues
On 12/29/2014 12:07 PM, Mahmoud Almokadem wrote: What do you mean with important parts of index? and how to calculate their size? I have no formal education in what's important when it comes to doing a query, but I can make some educated guesses. Starting with this as a reference: http://lucene.apache.org/core/4_10_0/core/org/apache/lucene/codecs/lucene410/package-summary.html#file-names I would guess that the segment info (.si) files and the term index (*.tip) files would be supremely important to *always* have in memory, and they are fairly small. Next would be the term dictionary (*.tim) files. The term dictionary is pretty big, and would be very important for fast queries. Frequencies, positions, and norms may also be important, depending on exactly what kind of query you have. Frequencies and positions are quite large. Frequencies are critical for relevence ranking (the default sort by score), and positions are important for phrase queries. Position data may also be used by relevance ranking, but I am not familiar enough with it to say for sure. If you have docvalues defined, then *.dvm and *.dvd files would be used for facets and sorting on those specific fields. The *.dvd files can be very big, depending on your schema. The *.fdx and *.fdt files become important when actually retrieving results after the matching documents have been determined. The stored data is compressed, so additional CPU power is required to uncompress that data before it is sent to the client. Stored data may be large or small, depending on your schema. Stored data does not directly affect search speed, but if memory space is limited, every block of stored data that gets retrieved will result in some other part of the index being removed from the OS disk cache, which means that it might need to be re-read from the disk on the next query. Thanks, Shawn
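[Editor's note: one rough way to see how the file types Shawn describes break down for a given core is to sum the index files by extension. A minimal sketch; the index path is an assumption, point it at a core's data/index directory.]

import java.io.File;
import java.util.Map;
import java.util.TreeMap;

public class IndexSizeByExtension {
    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : "data/index");
        File[] files = dir.listFiles();
        if (files == null) {
            System.err.println("not a directory: " + dir);
            return;
        }
        Map<String, Long> sizes = new TreeMap<String, Long>();
        for (File f : files) {
            if (!f.isFile()) continue;
            String name = f.getName();
            int dot = name.lastIndexOf('.');
            String ext = dot < 0 ? name : name.substring(dot);   // e.g. ".tim", ".dvd", ".fdt"
            Long prev = sizes.get(ext);
            sizes.put(ext, (prev == null ? 0L : prev) + f.length());
        }
        for (Map.Entry<String, Long> e : sizes.entrySet()) {
            System.out.printf("%-8s %,d bytes%n", e.getKey(), e.getValue());
        }
    }
}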
Re: How large is your solr index?
And that Lucene index document limit includes deleted and updated documents, so even if your actual document count stays under 2^31-1, deleting and updating documents can push the apparent document count over the limit unless you very aggressively merge segments to expunge deleted documents. -- Jack Krupansky -- Jack Krupansky On Mon, Dec 29, 2014 at 12:54 PM, Erick Erickson erickerick...@gmail.com wrote: When you say 2B docs on a single Solr instance, are you talking only one shard? Because if you are, you're very close to the absolute upper limit of a shard, internally the doc id is an int or 2^31. 2^31 + 1 will cause all sorts of problems. But yeah, your 100B documents are going to use up a lot of servers... Best, Erick On Mon, Dec 29, 2014 at 7:24 AM, Bram Van Dam bram.van...@intix.eu wrote: Hi folks, I'm trying to get a feel of how large Solr can grow without slowing down too much. We're looking into a use-case with up to 100 billion documents (SolrCloud), and we're a little afraid that we'll end up requiring 100 servers to pull it off. The largest index we currently have is ~2billion documents in a single Solr instance. Documents are smallish (5k each) and we have ~50 fields in the schema, with an index size of about 2TB. Performance is mostly OK. Cold searchers take a while, but most queries are alright after warming up. I wish I could provide more statistics, but I only have very limited access to the data (...banks...). I'd very grateful to anyone sharing statistics, especially on the larger end of the spectrum -- with or without SolrCloud. Thanks, - Bram
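[Editor's note: if deleted documents are what pushes maxDoc toward the limit, an explicit merge is the blunt instrument Jack alludes to. A hedged SolrJ sketch; the URL is illustrative, and merging a multi-TB shard down to one segment is expensive, so treat this as illustration rather than advice.]

import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class ExpungeDeletes {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        // Merging down to one segment rewrites the index and drops deleted documents,
        // so maxDoc falls back to numDocs.
        server.optimize(true, true, 1);
        server.shutdown();
    }
}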
RE: How large is your solr index?
Bram Van Dam [bram.van...@intix.eu] wrote: I'm trying to get a feel of how large Solr can grow without slowing down too much. We're looking into a use-case with up to 100 billion documents (SolrCloud), and we're a little afraid that we'll end up requiring 100 servers to pull it off. One recurring theme on this list is that it is very hard to compare indexes. Even if the data structure happens to be the same, performance will vary drastically depending on the types of queries and the processing requested. That being said, I acknowledge that it helps to have stories to get a feel for what can be done. A second caveat is that I find it an exercise in futility to talk about scale without an idea of expected response times as well as the expected number of concurrent users. If you are just doing some nightly batch processing, you could probably run your (scaling up from your description) 100TB index off spinning drives on a couple of boxes. If you expect to be hammered with millions of requests per day, you would have to put a zero or two behind that number. End of sermon. At Lucene/Solr Revolution 2014, Grant Ingersoll also asked for user stories and pointed to https://wiki.apache.org/solr/SolrUseCases - sadly it has not caught on. The only entry is for our (State and University Library, Denmark) setup with 21TB / 7 billion documents on a single machine. To follow my own advice, I can elaborate that we have 1-3 concurrent users and a design goal of median response times below 2 seconds for faceted search. I guess that is at the larger end of the spectrum for pure size, but at the very low end for usage. - Toke Eskildsen
Re: WordDelimiter filter, expanding to multiple words, unexpected results
Okay, some months later I've come back to this with an isolated reproduction case. Thanks very much for any advice or debugging help you can give. The WordDelimiter filter is making a mixed-case query NOT match the single-case source, when it ought to. I am on Solr 4.3 (sorry, that's what we run; let me know if it makes no sense to debug here, and I need to install and try to reproduce on a more recent version). I have an index that includes ONE document (deleted and reindexed after index change), with content in only one field (text) other than 'id', and that content is one word: delalain. My analysis (both index and query, I don't have different ones) for the 'text' field is simply:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="1" splitOnCaseChange="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>

I am querying simply with e.g. /select?defType=lucene&q=text%3Adelalain

Querying for delalain finds this document, as expected. Querying for DELALAIN finds this document, as expected (note the ICUFoldingFilterFactory). However, querying for deLALAIN does not find this document, which is unexpected.

INDEX analysis of the source, delalain, ends in this in the index, which seems pretty straightforward, so I'll only bother pasting in the final index analysis:

  text      delalain
  raw_bytes [64 65 6c 61 6c 61 69 6e]
  position  1
  start     0
  end       8
  type      ALPHANUM
  script    Latin

QUERY analysis of the problematic query, deLALAIN, looks like this:

  ICUT   text      deLALAIN
         raw_bytes [64 65 4c 41 4c 41 49 4e]
         start 0, end 8, type ALPHANUM, script Latin, position 1

  WDF    text      de        LALAIN               deLALAIN
         raw_bytes [64 65]   [4c 41 4c 41 49 4e]  [64 65 4c 41 4c 41 49 4e]
         start     0         2                    0
         end       2         8                    8
         type      ALPHANUM  ALPHANUM             ALPHANUM
         position  1         2                    2
         script    Common    Common               Common

  ICUFF  text      de        lalain               delalain
         raw_bytes [64 65]   [6c 61 6c 61 69 6e]  [64 65 6c 61 6c 61 69 6e]
         position  1         2                    2
         start     0         2                    0
         end       2         8                    8
         type      ALPHANUM  ALPHANUM             ALPHANUM
         script    Common    Common               Common

It's obviously the WordDelimiterFilter that is messing things up -- but how/why, and is it a bug? It wants to search for both "de lalain" as a phrase, as well as alternately "delalain" as one word -- that's the intended, supported point of the WDF with this configuration, right? And it should work? The problem is that it is not successfully matching delalain as one word -- so, how to figure out why not and what to do about it?

Previously, Erick and Diego asked for the info from debug=query, so here is that as well:

<lst name="debug">
  <str name="rawquerystring">text:deLALAIN</str>
  <str name="querystring">text:deLALAIN</str>
  <str name="parsedquery">MultiPhraseQuery(text:"de (lalain delalain)")</str>
  <str name="parsedquery_toString">text:"de (lalain delalain)"</str>
  <str name="QParser">LuceneQParser</str>
</lst>

Hmm, that does not quite look right; if I interpret it correctly, it's looking for "de" followed by either "lalain" or "delalain". I.e., it would match "de delalain"? But that's not right at all. So, what's gone wrong? Something with WDF configured to generateWords/catenateWords/splitOnCaseChange? Is it a bug? (And if it's a bug, one that might be fixed in a more recent Solr?) Thanks! Jonathan On 9/3/14 7:15 PM, Erick Erickson wrote: Jonathan: If at all possible, delete your collection/data directory (the whole directory, including data) between runs after you've changed your schema (at least any of your analysis that pertains to indexing). 
Mixing old and new schema definitions can add to the confusion! Good luck! Erick On Wed, Sep 3, 2014 at 8:48 AM, Jonathan Rochkind rochk...@jhu.edu wrote: Thanks Erick and Diego. Yes, I noticed in my last message I'm not actually using defaults, not sure why I chose non-defaults originally. I still need to find time to make a smaller isolation/reproduction case, I'm getting confusing results that suggest some other part of my field def may be pertinent. I'll come back when I've done that (hopefully next week), and include the _parsed_ from debug=query then. Thanks! Jonathan On 9/2/14 4:26 PM, Erick Erickson wrote: What happens if you append
Re: WordDelimiter filter, expanding to multiple words, unexpected results
WDF is powerful, but it is not magic. In general, the indexed data is expected to be clean while the query might be sloppy. You need to separate the index and query analyzers and they need to respect that distinction - the index analyzer would index as you have indicated, indexing both the unitary term and the multi-term phrase, while the query analyzer would NOT do the split on case, so that the query could be a unitary term (possibly with mixed case, but that would not split the term) or could be a two-word phrase. -- Jack Krupansky -- Jack Krupansky On Mon, Dec 29, 2014 at 5:12 PM, Jonathan Rochkind rochk...@jhu.edu wrote: Okay, some months later I've come back to this with an isolated reproduction case. Thanks very much for any advice or debugging help you can give. The WordDelimiter filter is making a mixed-case query NOT match the single-case source, when it ought to. I am in Solr 4.3 (sorry, that's what we run; let me know if it makes no sense to debug here, and I need to install and try to reproduce on a more recent version). I have an index that includes ONE document (deleted and reindexed after index change), with content in only one field (text) other than 'id', and that content is one word: delalain. My analysis (both index and query, I don't have different ones) for the 'text' field is simply: fieldType name=text class=solr.TextField positionIncrementGap=100 autoGeneratePhraseQueries=true analyzer tokenizer class=solr.ICUTokenizerFactory / filter class=solr.WordDelimiterFilterFactory generateWordParts=1 catenateWords=1 splitOnCaseChange=1/ filter class=solr.ICUFoldingFilterFactory / /analyzer /fieldType I am querying simply with eg /select?defType=luceneq=text%3Adelalain Querying for delalain finds this document, as expected. Querying for DELALAIN finds this document, as expected (note the ICUFoldingFactory). However, querying for deLALAIN does not find this document, which is unexpected. INDEX analysis of the source, delalain, ends in this in the index, which seems pretty straightforward, so I'll only bother pasting in the final index analysis: ## textdelalain raw_bytes [64 65 6c 61 6c 61 69 6e] position1 start 0 end 8 typeALPHANUM script Latin ### QUERY analysis of the problematic query, deLALAIN, looks like this: # ICUTtextdeLALAIN raw_bytes [64 65 4c 41 4c 41 49 4e] start 0 end 8 typeALPHANUM script Latin position1 WDF textde LALAIN deLALAIN raw_bytes [64 65] [4c 41 4c 41 49 4e] [64 65 4c 41 4c 41 49 4e] start 0 2 0 end 2 8 8 typeALPHANUM ALPHANUM ALPHANUM position1 2 2 script Common Common Common ICUFF textde lalain delalain raw_bytes [64 65] [6c 61 6c 61 69 6e] [64 65 6c 61 6c 61 69 6e] position1 2 2 start 0 2 0 end 2 8 8 typeALPHANUM ALPHANUM ALPHANUM script Common Common Common ### It's obviously the WordDelimiterFilter that is messing things up -- but how/why, and is it a bug? It wants to search for both de lalain as a phrase, as well as alternately delalain as one word -- that's the intended supported point of the WDF with this configuration, right? And should work? The problem is that is not succesfully matching delalain as one word -- so, how to figure out why not and what to do about it? 
Previously, Erick and Diego asked for the info from debug=query, so here is that as well: lst name=debug str name=rawquerystringtext:deLALAIN/str str name=querystringtext:deLALAIN/str str name=parsedqueryMultiPhraseQuery(text:de (lalain delalain))/str str name=parsedquery_toStringtext:de (lalain delalain)/str str name=QParserLuceneQParser/str /lst Hmm, that does not seem to quite look like neccesarily, if I interpret that correctly, it's looking for de followed by either lalain or delalain. Ie, it would match de delalain? But that's not right at all. So, what's gone wrong? Something with WDF with configuration to generateWords/catenateWords/splitOnCaseChange? Is it a bug? (And if it's a bug, one that might be fixed in a more recent Solr?). Thanks! Jonathan On 9/3/14 7:15 PM, Erick Erickson wrote: Jonathan: If at all possible, delete your collection/data directory (the whole directory, including data) between runs after you've changed your schema (at least any of your analysis that pertains to indexing). Mixing old and new schema definitions can add to the confusion! Good luck! Erick On Wed, Sep 3, 2014 at 8:48 AM, Jonathan Rochkind rochk...@jhu.edu wrote: Thanks Erick and Diego. Yes, I noticed in my
RE: Solr performance issues
Mahmoud Almokadem [prog.mahm...@gmail.com] wrote: I've the same index with a bit different schema and 200M documents, installed on 3 r3.xlarge (30GB RAM, and 600 General Purpose SSD). The size of index is about 1.5TB, have many updates every 5 minutes, complex queries and faceting with response time of 100ms that is acceptable for us. So you have Setup 1: 3 * (30GB RAM + 600GB SSD) for a total of 1.5TB index 200M docs. Acceptable performance. Setup 2: 3 * (60GB RAM + 1TB SSD + 500GB SSD) for a total of 3.3TB 350M docs. Poor performance. The only real difference, besides doubling everything, is the LVM? I understand why you find that to be the culprit, but from what I can read, the overhead should not be anywhere near enough to result in the performance drop you are describing. Could it be that some snapshotting or backup was running when you tested? Splitting your shards and doubling the number of machines, as you suggest, would result in Setup 3: 6 * (60GB RAM + 600GB SSD) for a total of 3.3TB 350M docs. which would be remarkable similar to your setup 1. I think that would be the next logical step, unless you can easily do a temporary boost of your IOPS. BTW: You are getting dangerously close to your storage limits here - it seems that a single large merge could make you run out of space. - Toke Eskildsen
Re: WordDelimiter filter, expanding to multiple words, unexpected results
splitOnCaseChange=1 So, it does not get split during indexing because there is no case change. But does get split during search and now you are looking for partial tokens against a combined single-token in the index. And not matching. The WordDelimiterFilterFactory is more for product IDs that have multitudes of spellings. Your use-case seems to be a lot more of just matching with ignoring case (looking at last email only). Regards, Alex. Sign up for my Solr resources newsletter at http://www.solr-start.com/ On 29 December 2014 at 17:12, Jonathan Rochkind rochk...@jhu.edu wrote: Okay, some months later I've come back to this with an isolated reproduction case. Thanks very much for any advice or debugging help you can give. The WordDelimiter filter is making a mixed-case query NOT match the single-case source, when it ought to. I am in Solr 4.3 (sorry, that's what we run; let me know if it makes no sense to debug here, and I need to install and try to reproduce on a more recent version). I have an index that includes ONE document (deleted and reindexed after index change), with content in only one field (text) other than 'id', and that content is one word: delalain. My analysis (both index and query, I don't have different ones) for the 'text' field is simply: fieldType name=text class=solr.TextField positionIncrementGap=100 autoGeneratePhraseQueries=true analyzer tokenizer class=solr.ICUTokenizerFactory / filter class=solr.WordDelimiterFilterFactory generateWordParts=1 catenateWords=1 splitOnCaseChange=1/ filter class=solr.ICUFoldingFilterFactory / /analyzer /fieldType I am querying simply with eg /select?defType=luceneq=text%3Adelalain Querying for delalain finds this document, as expected. Querying for DELALAIN finds this document, as expected (note the ICUFoldingFactory). However, querying for deLALAIN does not find this document, which is unexpected. INDEX analysis of the source, delalain, ends in this in the index, which seems pretty straightforward, so I'll only bother pasting in the final index analysis: ## textdelalain raw_bytes [64 65 6c 61 6c 61 69 6e] position1 start 0 end 8 typeALPHANUM script Latin ### QUERY analysis of the problematic query, deLALAIN, looks like this: # ICUTtextdeLALAIN raw_bytes [64 65 4c 41 4c 41 49 4e] start 0 end 8 typeALPHANUM script Latin position1 WDF textde LALAIN deLALAIN raw_bytes [64 65] [4c 41 4c 41 49 4e] [64 65 4c 41 4c 41 49 4e] start 0 2 0 end 2 8 8 typeALPHANUM ALPHANUM ALPHANUM position1 2 2 script Common Common Common ICUFF textde lalain delalain raw_bytes [64 65] [6c 61 6c 61 69 6e] [64 65 6c 61 6c 61 69 6e] position1 2 2 start 0 2 0 end 2 8 8 typeALPHANUM ALPHANUM ALPHANUM script Common Common Common ### It's obviously the WordDelimiterFilter that is messing things up -- but how/why, and is it a bug? It wants to search for both de lalain as a phrase, as well as alternately delalain as one word -- that's the intended supported point of the WDF with this configuration, right? And should work? The problem is that is not succesfully matching delalain as one word -- so, how to figure out why not and what to do about it? 
Previously, Erick and Diego asked for the info from debug=query, so here is that as well: lst name=debug str name=rawquerystringtext:deLALAIN/str str name=querystringtext:deLALAIN/str str name=parsedqueryMultiPhraseQuery(text:de (lalain delalain))/str str name=parsedquery_toStringtext:de (lalain delalain)/str str name=QParserLuceneQParser/str /lst Hmm, that does not seem to quite look like neccesarily, if I interpret that correctly, it's looking for de followed by either lalain or delalain. Ie, it would match de delalain? But that's not right at all. So, what's gone wrong? Something with WDF with configuration to generateWords/catenateWords/splitOnCaseChange? Is it a bug? (And if it's a bug, one that might be fixed in a more recent Solr?). Thanks! Jonathan On 9/3/14 7:15 PM, Erick Erickson wrote: Jonathan: If at all possible, delete your collection/data directory (the whole directory, including data) between runs after you've changed your schema (at least any of your analysis that pertains to indexing). Mixing old and new schema definitions can add to the confusion! Good luck! Erick On Wed, Sep 3, 2014 at 8:48 AM, Jonathan Rochkind rochk...@jhu.edu wrote: Thanks Erick and Diego. Yes, I noticed in my last message I'm not actually using
Re: WordDelimiter filter, expanding to multiple words, unexpected results
On 12/29/14 5:24 PM, Jack Krupansky wrote: WDF is powerful, but it is not magic. In general, the indexed data is expected to be clean while the query might be sloppy. You need to separate the index and query analyzers and they need to respect that distinction I do not understand what separate query/index analysis you are suggesting to accomplish what I wanted. I understand the WDF, like all software, is not magic, of course. But I thought this was an intended use case of the WDF, with those settings: A mixedCase query would match mixedCase in the index; and the same query mixedCase would also match two separate words mixed Case in index. (Case insensitively since I apply an ICUFoldingFilter on top of that). Was I wrong, is this not an intended thing for the WDF to do? Or do I just have the wrong configuration options for it to do it? Or is it a bug? When I started this thread a few months ago, I think Erick Erickson agreed this was an intended use case for the WDF, but maybe I explained it poorly. Erick if you're around and want to at least confirm whether WDF is supposed to do this in your understanding, that would be great! Jonathan
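[Editor's note: to see outside of Solr exactly what the filter emits for deLALAIN, here is a small sketch against the Lucene 4.x analyzers-common jar (constructor signatures changed in later Lucene versions, so this is tied to the 4.x line Jonathan is running). It should reproduce the analysis-page output quoted earlier: de at position 1, LALAIN and delalain-catenated at position 2.]

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.KeywordTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class WdfRepro {
    public static void main(String[] args) throws Exception {
        Tokenizer tok = new KeywordTokenizer(new StringReader("deLALAIN"));
        int flags = WordDelimiterFilter.GENERATE_WORD_PARTS
                | WordDelimiterFilter.CATENATE_WORDS
                | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE;
        TokenStream ts = new WordDelimiterFilter(tok, flags, null); // null = no protected words
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        PositionIncrementAttribute posIncr = ts.addAttribute(PositionIncrementAttribute.class);
        ts.reset();
        int pos = 0;
        while (ts.incrementToken()) {
            pos += posIncr.getPositionIncrement();
            System.out.println(pos + "\t" + term); // expect: de at 1, LALAIN and deLALAIN both at 2
        }
        ts.end();
        ts.close();
    }
}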
no replication using commitWithin via curl?
Hi, We've noticed that when we send deletes to our SolrCloud cluster via curl with the param commitWithin=1 specified, the deletes are applied and are visible to the leader node, but aren't replicated to other nodes. The problem can be worked around by issuing an explicit (hard) commit. Is this expected behaviour? Can anyone shed light on what is going on here? Thanks, -Brendan
Re: WordDelimiter filter, expanding to multiple words, unexpected results
Jonathan: Well, it works if you set splitOnCaseChange=0 in just the query part of the analysis chain. I probably mislead you a bit months ago, WDFF is intended for this case iff you expect the case change to generate _tokens_ that are individually meaningful.. And unfortunately significant in one case will be not-significant in others. So what kinds of things do you want WDFF to handle? Case changes? Letter/non-letter transitions? All of the above? Best, Erick On Mon, Dec 29, 2014 at 3:07 PM, Jonathan Rochkind rochk...@jhu.edu wrote: On 12/29/14 5:24 PM, Jack Krupansky wrote: WDF is powerful, but it is not magic. In general, the indexed data is expected to be clean while the query might be sloppy. You need to separate the index and query analyzers and they need to respect that distinction I do not understand what separate query/index analysis you are suggesting to accomplish what I wanted. I understand the WDF, like all software, is not magic, of course. But I thought this was an intended use case of the WDF, with those settings: A mixedCase query would match mixedCase in the index; and the same query mixedCase would also match two separate words mixed Case in index. (Case insensitively since I apply an ICUFoldingFilter on top of that). Was I wrong, is this not an intended thing for the WDF to do? Or do I just have the wrong configuration options for it to do it? Or is it a bug? When I started this thread a few months ago, I think Erick Erickson agreed this was an intended use case for the WDF, but maybe I explained it poorly. Erick if you're around and want to at least confirm whether WDF is supposed to do this in your understanding, that would be great! Jonathan
Re: WordDelimiter filter, expanding to multiple words, unexpected results
On 29 December 2014 at 18:07, Jonathan Rochkind rochk...@jhu.edu wrote: I do not understand what separate query/index analysis you are suggesting to accomplish what I wanted. I am sure you do know that, but just in case. At the moment, you have only one analyzer chain, so it applies at both index and query time. You can split those and have separate treatment during indexing and during search. Useful with synonyms, etc. The example schema has both versions shown. But I would start by just removing splitOnCaseChange attribute and reindexing. I don't think that flag means what you want it to mean. Regards, Alex. Sign up for my Solr resources newsletter at http://www.solr-start.com/
Re: How to implement multi-set in a Solr schema.
Thanks Jack, in order to not affect query time, what options are available to handle this at index time? So that I group all the similar books at index time by placing them in some kind of a set, and retrieve all the contents of the set at query time if any one of them matches the query. On Dec 29, 2014 12:49 AM, Jack Krupansky jack.krupan...@gmail.com wrote: You can also use group.query or group.func to group documents matching a query or unique values of a function query. For the latter you could implement an NLP algorithm. -- Jack Krupansky On Sun, Dec 28, 2014 at 5:56 PM, Meraj A. Khan mera...@gmail.com wrote: Thanks Aman, the thing is the bookName field values are not exactly identical, but nearly identical, so at the time of indexing I need to figure out which other book name field this is similar to using NLP techniques and then put it in the appropriate bag, so that at retrieval time I only retrieve all the elements from that bag if any one of the elements matches the search query. Thanks. On Dec 28, 2014 1:54 PM, Aman Tandon amantandon...@gmail.com wrote: Hi, You can use grouping in Solr. You can do this via query or via solrconfig.xml.

A) via query: http://localhost:8983?your_query_params&group=true&group.field=bookName
You can limit the size of each group (how many documents you want to show); suppose you want to show 5 documents per group on this bookName field, then you can specify the parameter group.limit=5.

B) via solrconfig:
<str name="group">true</str>
<str name="group.field">bookName</str>
<str name="group.ngroups">true</str>
<str name="group.truncate">true</str>

With Regards Aman Tandon On Sun, Dec 28, 2014 at 10:29 PM, S.L simpleliving...@gmail.com wrote: Hi All, I have a use case where I need to group documents that have the same value in a field called bookName, meaning if there are multiple documents with the same bookName value and the user input is searched by a query on bookName, I need to be able to group all the documents with the same bookName together, so that I could display them as a group in the UI. What kind of support does Solr provide for such a scenario, and how should I look at changing my schema.xml, which has bookName as a single-valued text field? Thanks.
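[Editor's note: the same grouped request issued from SolrJ, as a sketch only; the query value, collection URL and extra parameters are illustrative, the field name bookName comes from the thread.]

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class GroupByBookName {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("bookName:\"moby dick\""); // illustrative query
        q.set("group", true);
        q.set("group.field", "bookName");
        q.set("group.limit", 5);       // documents returned per group
        q.set("group.ngroups", true);  // also return the number of groups
        System.out.println(server.query(q).getGroupResponse().getValues());
        server.shutdown();
    }
}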
Re: no replication using commitWithin via curl?
I've confirmed this also happens with deletes via SolrJ with commitWithin - the document is deleted from the leader but the delete is not replicated to other nodes. Document updates are replicated fine. Any help in debugging this behaviour would be much appreciated. Cheers, -Brendan On 30 December 2014 at 10:11, Brendan Humphreys bren...@canva.com wrote: Hi, We've noticed that when we send deletes to our SolrCloud cluster via curl with the param commitWithin=1 specified, the deletes are applied and are visible to the leader node, but aren't replicated to other nodes. The problem can be worked around by issuing an explicit (hard) commit. Is this expected behaviour? Can anyone shed light on what is going on here? Thanks, -Brendan
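[Editor's note: the shape of the failing call and the workaround from the thread, as a SolrJ sketch; the zk hosts, collection name, document id and the 1000 ms value are illustrative.]

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;

public class DeleteWithCommitWithin {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("collection1");
        UpdateRequest del = new UpdateRequest();
        del.deleteById("doc-42");    // illustrative id
        del.setCommitWithin(1000);   // the delete that, per the thread, only shows up on the leader
        del.process(server);
        // Workaround from the thread: force a hard commit so every replica applies the delete.
        server.commit();
        server.shutdown();
    }
}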
poor performance when connecting to CloudSolrServer(zkHosts) using solrJ
hi, I set up a SolrCloud and coded a simple solrJ program to query solr data as below, but it takes about 40 seconds to new a CloudSolrServer instance; less than 100 milliseconds would be acceptable. What is going on when newing a CloudSolrServer, and how can I fix this issue?

String zkHost = "bicenter1.dcc:2181,datanode2.dcc:2181";
String defaultCollection = "hdfsCollection";
long startms = System.currentTimeMillis();
CloudSolrServer server = new CloudSolrServer(zkHost);
server.setDefaultCollection(defaultCollection);
server.setZkConnectTimeout(3000);
server.setZkClientTimeout(6000);
long endms = System.currentTimeMillis();
System.out.println(endms - startms);

ModifiableSolrParams params = new ModifiableSolrParams();
params.set("q", "id:*hbase*");
params.set("sort", "price desc");
params.set("start", 0);
params.set("rows", 10);
try {
    QueryResponse response = server.query(params);
    SolrDocumentList results = response.getResults();
    for (SolrDocument doc : results) {
        String rowkey = doc.getFieldValue("id").toString();
    }
} catch (SolrServerException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
server.shutdown();

thanks for any responses. jan
---
Disclaimer
1. This e-mail may contain confidential and/or privileged information from Digital China and is intended solely for the attention and use of the person(s) named above. If you are not the intended recipient (or have received this e-mail in error), please notify the sender immediately and destroy this e-mail. Any unauthorized copying, disclosure or distribution of the material in this email is strictly forbidden.
2. The content provided in this e-mail can not be guaranteed and assured to be accurate, appropriate for all, and complete by Digital China, and Digital China can not be held responsible for any error or negligence derived therefrom.
3. The internet communications through this e-mail can not be guaranteed or assured to be error or virus-free, and the sender does not accept liability for any errors, omissions or damages arising therefrom.
Re: no replication using commitWithin via curl?
On 12/29/2014 4:11 PM, Brendan Humphreys wrote: We've noticed that when we send deletes to our SolrCloud cluster via curl with the param commitWithin=1 specified, the deletes are applied and are visible to the leader node, but aren't replicated to other nodes. The problem can be worked around by issuing an explicit (hard) commit. Is this expected behaviour? Can anyone shed light on what is going on here? Another of your messages mentions 4.10.2, which should have the fix for a similar problem reported with a much earlier version, fixed in 4.6.1. https://issues.apache.org/jira/browse/SOLR-5658 There's some confusion around another problem introduced by SOLR-5658 -- SOLR-5762 -- but if you use the latest version, that shouldn't be a problem. If you are running 4.10.2, perhaps SOLR-5658 has come back, or maybe you have multiple versions of the solr jars on your classpath? Thanks, Shawn
Re: poor performance when connecting to CloudSolrServer(zkHosts) using solrJ
On 12/29/2014 6:52 PM, zhangjia...@dcits.com wrote: I setups a SolrCloud, and code a simple solrJ program to query solr data as below, but it takes about 40 seconds to new CloudSolrServer instance,less than 100 miliseconds is acceptable. what is going on when new CloudSolrServer? and how to fix this issue? String zkHost = bicenter1.dcc:2181,datanode2.dcc:2181; String defaultCollection = hdfsCollection; long startms=System.currentTimeMillis(); CloudSolrServer server = new CloudSolrServer(zkHost); server.setDefaultCollection(defaultCollection); server.setZkConnectTimeout(3000); server.setZkClientTimeout(6000); long endms=System.currentTimeMillis(); System.out.println(endms-startms); ModifiableSolrParams params = new ModifiableSolrParams(); params.set(q, id:*hbase*); params.set(sort, price desc); params.set(start, 0); params.set(rows, 10); try { QueryResponse response=server.query(params); SolrDocumentList results = response.getResults(); for (SolrDocument doc:results) { String rowkey=doc.getFieldValue(id).toString(); } } catch (SolrServerException e) { // TODO Auto-generated catch block e.printStackTrace(); } server.shutdown(); The only part of the constructor for CloudSolrServer that I cannot easily look at is the part that creates the httpclient, because ultimately that calls code outside of Solr, in the HttpComponents project. Everything that I *can* see is code that should happen extremely quickly, and the httpclient creation code is something that I have used myself and never had any noticeable delay. The constructor for CloudSolrServer does *NOT* contact zookeeper or Solr, it merely sets up the instance. Nothing is contacted until a request is made. I examined the CloudSolrServer code from branch_5x. I tried out your code (with SolrJ 4.6.0 against a SolrCloud 4.2.1 cluster). Although the query itself encountered an exception in zookeeper (probably from the version discrepancy between Solr and SolrJ), the elapsed time printed out from the CloudSolrServer initialization was 240 milliseconds on the first run, 60 milliseconds on a second run, and 64 milliseconds on a third run. Those are all MUCH less than the 1000 milliseconds that would represent one second, and incredibly less than the 40,000 milliseconds that would represent 40 seconds. Side issue: I hope that you have more than two zookeeper servers in your ensemble. A two-node zookeeper ensemble is actually *less* reliable than a single node, because a failure of EITHER of those two nodes will result in a loss of quorum. Three nodes is the minimum required for a redundant zookeeper ensemble. Thanks, Shawn
Re: How large is your solr index?
On 12/29/2014 2:30 PM, Toke Eskildsen wrote: At Lucene/Solr Revolution 2014, Grant Ingersoll also asked for user stories and pointed to https://wiki.apache.org/solr/SolrUseCases - sadly it has not caught on. The only entry is for our (State and University Library, Denmark) setup with 21TB / 7 billion documents on a single machine. To follow my own advice, I can elaborate that we have 1-3 concurrent users and a design goal of median response times below 2 seconds for faceted search. I guess that is at the larger end at the spectrum for pure size, but at the very low end for usage. Off-Topic tangent: I believe it would be useful to organize a session at Lucene Revolution, possibly more interactive than a straight presentation, where users with very large indexes are encouraged to attend. The point of this session would be to exchange war stories, configuration requirements, hardware requirements, and observations. Bringing people with similar goals together to discuss their solutions should be beneficial. The discussions could pinpoint areas where Solr and SolrCloud are weak on scalability, and hopefully lead to issues in Jira and fixes for those problems. Better documentation for extreme scaling is also a possible outcome. Another idea, not sure if it would be good as an alternate idea or supplemental, is a less formal gathering, perhaps over a meal or three. My index is hardly large enough to mention, but I would be interested in attending such a gathering to learn more about the topic. Thanks, Shawn
Re: no replication using commitWithin via curl?
Thanks for the reply Shawn. Yes, I am using 4.10.2 - I should have mentioned that in my original post. I can confirm there are not multiple versions of solr in the classpath; our SolrCloud nodes are built programmatically in AWS using the download package of a specific Solr version as a starting point. I should add that document adds/updates are visible on all nodes very quickly. It's only the deletes that are problematic. Reloading the core on a node brings it into alignment with the leader. I'll dig into the JIRAs you linked to see if there are any hints as to what's going on. Cheers, -Brendan On 30 December 2014 at 12:57, Shawn Heisey apa...@elyograg.org wrote: On 12/29/2014 4:11 PM, Brendan Humphreys wrote: We've noticed that when we send deletes to our SolrCloud cluster via curl with the param commitWithin=1 specified, the deletes are applied and are visible to the leader node, but aren't replicated to other nodes. The problem can be worked around by issuing an explicit (hard) commit. Is this expected behaviour? Can anyone shed light on what is going on here? Another of your messages mentions 4.10.2, which should have the fix for a similar problem reported with a much earlier version, fixed in 4.6.1. https://issues.apache.org/jira/browse/SOLR-5658 There's some confusion around another problem introduced by SOLR-5658 -- SOLR-5762 -- but if you use the latest version, that shouldn't be a problem. If you are running 4.10.2, perhaps SOLR-5658 has come back, or maybe you have multiple versions of the solr jars on your classpath? Thanks, Shawn
Re: How large is your solr index?
On 29 December 2014 at 21:42, Shawn Heisey apa...@elyograg.org wrote: I believe it would be useful to organize a session at Lucene Revolution, possibly more interactive than a straight presentation, where users with very large indexes are encouraged to attend. The point of this session would be to exchange war stories, configuration requirements, hardware requirements, and observations. +1 And have a scribe to take notes with whom to follow-up later :-) And interview separately for Solr podcast too. Regards, Alex. Sign up for my Solr resources newsletter at http://www.solr-start.com/