Oh, P.S. Solr is a great search engine, but it's certainly not the perfect answer to all problems. Mayhap you've hit on a case where it isn't the best solution....
On Sat, Jul 6, 2013 at 8:22 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> What does this have to do with removing the int32 limit? It's still the
> same problem: if you have a high start parameter, you hit the "deep
> paging" issue that's part of Solr.
>
> I know there has been work on this (you'll have to search the JIRAs). The
> basic idea is that you pass along enough information that you don't have
> to keep an enormous sorted list to get to the 500,000th document. I don't
> know whether it's committed, and it has its own problems if the index
> changes out from underneath it.
>
> But if you have a relatively static index, you can increase your
> queryResultCache. That cache just stores the query and the doc IDs of the
> results, and should make paging much faster at the expense, of course, of
> memory.
>
> Although the streaming idea is interesting, I admit I haven't played with
> it yet.
>
> Best,
> Erick
>
> On Fri, Jul 5, 2013 at 4:28 PM, Valery Giner <valgi...@research.att.com> wrote:
>
>> Erick,
>>
>> We did not have any RAM problems, but the following official limitation
>> alone makes life too miserable for us to use the shards:
>>
>> "Makes it more inefficient to use a high "start" parameter. For example,
>> if you request start=500000&rows=25 on an index with 500,000+ docs per
>> shard, this will currently result in 500,000 records getting sent over
>> the network from the shard to the coordinating Solr instance. If you had
>> a single-shard index, in contrast, only 25 records would ever get sent
>> over the network. (Granted, setting start this high is not something
>> many people need to do.)"
>> http://wiki.apache.org/solr/DistributedSearch
>>
>> Reading millions of documents as the result of a query is a "normal" use
>> case for us, not a "design defect". Subdividing the "large" indexes into
>> smaller ones seems too ugly to use as a way to scale up.
>> This turns Solr from a perfect solution for us into something
>> unacceptable for such cases.
>>
>> I wonder whether anyone else has similar use cases/problems with
>> sharding.
>>
>> Thanks,
>> Val
>>
>> On 05/03/2013 12:10 PM, Erick Erickson wrote:
>>
>>> My off-the-cuff thought is that there are significant costs to trying
>>> to do this that would be paid by 99.999% of the setups out there. Also,
>>> you'll usually run into other issues (RAM, etc.) long before you come
>>> anywhere close to 2^31 docs.
>>>
>>> Lucene/Solr often allocates int[maxDoc] for various operations. When
>>> maxDoc approaches 2^31, memory goes through the roof. Now consider
>>> allocating longs instead...
>>>
>>> Which is a long way of saying that I don't really think anyone's going
>>> to be working on this any time soon, especially when SolrCloud removes
>>> a LOT of the pain/complexity (from a user perspective, anyway) of going
>>> to a sharded setup...
>>>
>>> FWIW,
>>> Erick
>>>
>>> On Thu, May 2, 2013 at 1:17 PM, Valery Giner <valgi...@research.att.com> wrote:
>>>
>>>> Otis,
>>>>
>>>> The documents themselves are relatively small: tens of fields, only a
>>>> few of them up to a hundred bytes.
>>>> Linux servers with relatively large RAM (256 GB).
>>>> Minutes on the searches are fine for our purposes; adding a few tens
>>>> of millions of records in tens of minutes is also fine.
>>>> We had to do some simple tricks to keep indexing up to speed, but
>>>> nothing too fancy.
>>>> Moving to sharding adds a layer of complexity which we don't really
>>>> need, given the above... and adding complexity may result in lower
>>>> reliability :)
>>>>
>>>> Thanks,
>>>> Val
>>>>
>>>> On 05/02/2013 03:41 PM, Otis Gospodnetic wrote:
>>>>
>>>>> Val,
>>>>>
>>>>> Haven't seen this mentioned in a while...
>>>>>
>>>>> I'm curious... what sort of index, queries, hardware, and latency
>>>>> requirements do you have?
>>>>>
>>>>> Otis
>>>>> Solr & ElasticSearch Support
>>>>> http://sematext.com/
>>>>>
>>>>> On May 1, 2013 4:36 PM, "Valery Giner" <valgi...@research.att.com> wrote:
>>>>>
>>>>>> Dear Solr Developers,
>>>>>>
>>>>>> I've been unable to find an answer to the question in the subject
>>>>>> line of this e-mail, except a vague one.
>>>>>>
>>>>>> We need to be able to index 2bln+ documents. We were doing well
>>>>>> without sharding until the number of docs hit the limit (2bln+). The
>>>>>> performance was satisfactory for queries, updates, and indexing of
>>>>>> new documents.
>>>>>>
>>>>>> That is, except for the need to get around the int32 limit, we don't
>>>>>> really have a need to set up distributed Solr.
>>>>>>
>>>>>> I wonder whether someone on the Solr team could tell us when, and in
>>>>>> what version of Solr, we could expect the limit to be removed.
>>>>>>
>>>>>> I hope this question may be of interest to someone else :)
>>>>>>
>>>>>> --
>>>>>> Thanks,
>>>>>> Val
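The distributed deep-paging cost that the wiki excerpt above describes can be sketched in a few lines. This is a toy simulation, not Solr code: the shard contents and scores are made up, and the point is only the arithmetic — to serve `start=6&rows=2`, a coordinator merging three shards must pull `3 * (6 + 2)` candidate records over the network to return 2.

```python
import heapq

def distributed_page(shards, start, rows):
    """Naive coordinator-side merge, as on the DistributedSearch wiki page:
    to return `rows` docs starting at `start`, every shard must send its
    top (start + rows) entries, and the coordinator merges them all."""
    per_shard = start + rows          # each shard returns this many entries
    candidates = []
    for shard in shards:
        # each entry is (score, doc_id); a shard ships its top start+rows docs
        candidates.extend(sorted(shard, reverse=True)[:per_shard])
    transferred = len(candidates)     # records sent over the network
    merged = heapq.nlargest(per_shard, candidates)
    return merged[start:start + rows], transferred

# three toy shards with 10 scored docs each
shards = [[(1.0 / (i + 1), f"s{s}d{i}") for i in range(10)] for s in range(3)]

page, transferred = distributed_page(shards, start=6, rows=2)
print(len(page), transferred)   # 2 docs returned, 3 * (6 + 2) = 24 shipped
```

With `start=500000` and 500,000+ docs per shard, `transferred` grows to 500,000+ per shard, which is exactly the behavior Val is objecting to; a single-shard index would only ever ship the final 25.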
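The deep-paging work Erick alludes to ("pass enough information that you don't have to keep an enormous sorted list") amounts to keyset-style cursors: the client hands back the sort values of the last document it saw, and each shard returns only the next `rows` docs past that mark. The sketch below is a hypothetical illustration of the idea, not Solr's actual implementation:

```python
def cursor_page(index, after, rows):
    """index: list of (score, doc_id) tuples sorted descending.
    `after` is the (score, doc_id) of the last doc on the previous page,
    or None for the first page. Nothing ever materializes `start` docs."""
    if after is None:
        return index[:rows]
    # next `rows` docs strictly after the cursor position in sort order
    return [d for d in index if d < after][:rows]

index = sorted(((1.0 / (i + 1), f"doc{i}") for i in range(10)), reverse=True)

page1 = cursor_page(index, None, 3)
page2 = cursor_page(index, page1[-1], 3)
print([d for _, d in page1])   # ['doc0', 'doc1', 'doc2']
print([d for _, d in page2])   # ['doc3', 'doc4', 'doc5']
```

The caveat Erick raises applies here too: if the index changes between requests, a cursor can skip or repeat documents, since it is relative to sort values rather than absolute positions.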
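Erick's point about `int[maxDoc]` allocations is easy to put in numbers. The byte widths below are just Java's primitive sizes (4-byte `int`, 8-byte `long`); everything else follows from the 2^31 limit itself:

```python
# Back-of-envelope cost of lifting the 2^31 document limit:
# Lucene/Solr frequently allocates one int slot per document (int[maxDoc]).
max_doc = 2**31                    # current signed-int32 document ceiling

int_array_bytes = max_doc * 4      # a Java int is 4 bytes
long_array_bytes = max_doc * 8     # moving to long doubles every such array

print(int_array_bytes // 2**30)    # 8  -> ~8 GiB for one int[maxDoc]
print(long_array_bytes // 2**30)   # 16 -> ~16 GiB for one long[maxDoc]
```

And that is per array, per structure that indexes by docid, which is why "memory goes through the roof" well before the limit is reached.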