Oh, P.S. Solr is a great search engine, but it's certainly not the perfect answer to all problems. Mayhap you've hit on a case where it isn't the best solution....
On Sat, Jul 6, 2013 at 8:22 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> What does this have to do with removing the int32 limit? It's still the
> same problem: if you have a high start parameter, you hit the "deep
> paging" issue that's part of Solr.
>
> I know there has been work on this (you'll have to search the JIRAs). The
> basic idea is that you pass along enough information that you don't have
> to keep an enormous sorted list to get to the 500,000th document. I don't
> know whether it's committed, and it has its own problems if the index
> changes out from underneath it.
>
> But if you have a relatively static index, you can increase your
> queryResultCache. That cache just stores the query and the doc IDs of the
> results, and should make paging much faster at the expense, of course, of
> memory.
>
> Although the streaming idea is interesting, I admit I haven't played with
> it yet.
>
> Best,
> Erick
>
> On Fri, Jul 5, 2013 at 4:28 PM, Valery Giner <valgi...@research.att.com> wrote:
>
>> Erick,
>>
>> We did not have any RAM problems, but the following official limitation
>> alone makes life too miserable for us to use the shards:
>>
>> "Makes it more inefficient to use a high "start" parameter. For example,
>> if you request start=500000&rows=25 on an index with 500,000+ docs per
>> shard, this will currently result in 500,000 records getting sent over
>> the network from the shard to the coordinating Solr instance. If you had
>> a single-shard index, in contrast, only 25 records would ever get sent
>> over the network. (Granted, setting start this high is not something
>> many people need to do.)"
>> http://wiki.apache.org/solr/DistributedSearch
>>
>> Reading millions of documents as the result of a query is a "normal" use
>> case for us, not a "design defect". Subdividing the "large" indexes into
>> smaller ones seems too ugly to use as a way to scale up.
>> This turns Solr from a perfect solution for us into something
>> unacceptable for such cases.
>>
>> I wonder whether anyone else has similar use cases/problems with
>> sharding.
>>
>> Thanks,
>> Val
>>
>> On 05/03/2013 12:10 PM, Erick Erickson wrote:
>>
>>> My off-the-cuff thought is that there are significant costs to trying
>>> to do this that would be paid by 99.999% of the setups out there. Also,
>>> you'll usually run into other issues (RAM, etc.) long before you come
>>> anywhere close to 2^31 docs.
>>>
>>> Lucene/Solr often allocates int[maxDoc] for various operations. When
>>> maxDoc approaches 2^31, memory goes through the roof. Now consider
>>> allocating longs instead...
>>>
>>> Which is a long way of saying that I don't really think anyone's going
>>> to be working on this any time soon, especially when SolrCloud removes
>>> a LOT of the pain/complexity (from a user perspective, anyway) of going
>>> to a sharded setup...
>>>
>>> FWIW,
>>> Erick
>>>
>>> On Thu, May 2, 2013 at 1:17 PM, Valery Giner <valgi...@research.att.com> wrote:
>>>
>>>> Otis,
>>>>
>>>> The documents themselves are relatively small: tens of fields, only a
>>>> few of them up to a hundred bytes.
>>>> Linux servers with relatively large RAM (256 GB).
>>>> Minutes on the searches are fine for our purposes; adding a few tens
>>>> of millions of records in tens of minutes is also fine.
>>>> We had to do some simple tricks to keep indexing up to speed, but
>>>> nothing too fancy.
>>>> Moving to sharding adds a layer of complexity which we don't really
>>>> need, given the above... and adding complexity may result in lower
>>>> reliability :)
>>>>
>>>> Thanks,
>>>> Val
>>>>
>>>> On 05/02/2013 03:41 PM, Otis Gospodnetic wrote:
>>>>
>>>>> Val,
>>>>>
>>>>> Haven't seen this mentioned in a while...
>>>>>
>>>>> I'm curious... what sort of index, queries, hardware, and latency
>>>>> requirements do you have?
>>>>>
>>>>> Otis
>>>>> Solr & ElasticSearch Support
>>>>> http://sematext.com/
>>>>>
>>>>> On May 1, 2013 4:36 PM, "Valery Giner" <valgi...@research.att.com> wrote:
>>>>>
>>>>>> Dear Solr Developers,
>>>>>>
>>>>>> I've been unable to find an answer to the question in the subject
>>>>>> line of this e-mail, except a vague one.
>>>>>>
>>>>>> We need to be able to index 2bln+ documents. We were doing well
>>>>>> without sharding until the number of docs hit the limit (2bln+). The
>>>>>> performance was satisfactory for queries, updates, and indexing of
>>>>>> new documents.
>>>>>>
>>>>>> That is, except for the need to get around the int32 limit, we don't
>>>>>> really have a need to set up distributed Solr.
>>>>>>
>>>>>> I wonder whether someone on the Solr team could tell us when, and in
>>>>>> what version of Solr, we could expect the limit to be removed.
>>>>>>
>>>>>> I hope this question may be of interest to someone else :)
>>>>>>
>>>>>> --
>>>>>> Thanks,
>>>>>> Val
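The distributed deep-paging cost that the wiki excerpt above describes can be sketched in a few lines. This is a toy simulation, not Solr code: the shard contents and scores are made up, and the point is only the arithmetic — to serve `start=6&rows=2`, a coordinator merging three shards must pull `3 * (6 + 2)` candidate records over the network to return 2.

```python
import heapq

def distributed_page(shards, start, rows):
    """Naive coordinator-side merge, as on the DistributedSearch wiki page:
    to return `rows` docs starting at `start`, every shard must send its
    top (start + rows) entries, and the coordinator merges them all."""
    per_shard = start + rows          # each shard returns this many entries
    candidates = []
    for shard in shards:
        # each entry is (score, doc_id); a shard ships its top start+rows docs
        candidates.extend(sorted(shard, reverse=True)[:per_shard])
    transferred = len(candidates)     # records sent over the network
    merged = heapq.nlargest(per_shard, candidates)
    return merged[start:start + rows], transferred

# three toy shards with 10 scored docs each
shards = [[(1.0 / (i + 1), f"s{s}d{i}") for i in range(10)] for s in range(3)]

page, transferred = distributed_page(shards, start=6, rows=2)
print(len(page), transferred)   # 2 docs returned, 3 * (6 + 2) = 24 shipped
```

With `start=500000` and 500,000+ docs per shard, `transferred` grows to 500,000+ per shard, which is exactly the behavior Val is objecting to; a single-shard index would only ever ship the final 25.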
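The deep-paging work Erick alludes to ("pass enough information that you don't have to keep an enormous sorted list") amounts to keyset-style cursors: the client hands back the sort values of the last document it saw, and each shard returns only the next `rows` docs past that mark. The sketch below is a hypothetical illustration of the idea, not Solr's actual implementation:

```python
def cursor_page(index, after, rows):
    """index: list of (score, doc_id) tuples sorted descending.
    `after` is the (score, doc_id) of the last doc on the previous page,
    or None for the first page. Nothing ever materializes `start` docs."""
    if after is None:
        return index[:rows]
    # next `rows` docs strictly after the cursor position in sort order
    return [d for d in index if d < after][:rows]

index = sorted(((1.0 / (i + 1), f"doc{i}") for i in range(10)), reverse=True)

page1 = cursor_page(index, None, 3)
page2 = cursor_page(index, page1[-1], 3)
print([d for _, d in page1])   # ['doc0', 'doc1', 'doc2']
print([d for _, d in page2])   # ['doc3', 'doc4', 'doc5']
```

The caveat Erick raises applies here too: if the index changes between requests, a cursor can skip or repeat documents, since it is relative to sort values rather than absolute positions.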
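Erick's point about `int[maxDoc]` allocations is easy to put in numbers. The byte widths below are just Java's primitive sizes (4-byte `int`, 8-byte `long`); everything else follows from the 2^31 limit itself:

```python
# Back-of-envelope cost of lifting the 2^31 document limit:
# Lucene/Solr frequently allocates one int slot per document (int[maxDoc]).
max_doc = 2**31                    # current signed-int32 document ceiling

int_array_bytes = max_doc * 4      # a Java int is 4 bytes
long_array_bytes = max_doc * 8     # moving to long doubles every such array

print(int_array_bytes // 2**30)    # 8  -> ~8 GiB for one int[maxDoc]
print(long_array_bytes // 2**30)   # 16 -> ~16 GiB for one long[maxDoc]
```

And that is per array, per structure that indexes by docid, which is why "memory goes through the roof" well before the limit is reached.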