The Salesforce book is 2800 pages of PDF, last I looked. What can you do with a field that big? Can you get all of the snippets?
On Tue, Jun 7, 2011 at 5:33 PM, Fuad Efendi <f...@efendi.ca> wrote:
> Hi Otis,
>
> I am recalling the "pagination" feature; it is still unresolved (with the
> default scoring implementation): even with small documents, retrieving
> search results 1 to 10 can take 0 milliseconds, while retrieving results
> 100,000 to 100,010 can take a few minutes (I saw this with the trunk
> version 6 months ago, with very small documents, 100 million docs total).
> It is advisable to restrict search results to the top 1,000 in any case
> (as Google does)...
>
> I believe things can go wrong; yes, most plain text extracted from books
> should be about 2 KB per page, 500 pages => 1,000,000 bytes (or double
> that for UTF-8).
>
> Theoretically, it doesn't make any sense to index a BIG document
> containing every term in the dictionary without any term-frequency
> calculations, but even with them... I can't imagine we should index
> thousands of docs where each one is just a (different) version of the
> whole Wikipedia; that has to be wrong design...
>
> OK, use case: index a single HUGE document. What will we do? Create an
> index with _the only_ document? Then every search returns the same result
> (or nothing)? Paginate it; split it into pages. I am pragmatic...
>
> Fuad
>
>
> On 11-06-07 8:04 PM, "Otis Gospodnetic" <otis_gospodne...@yahoo.com> wrote:
>
>> Hi,
>>
>>> I think the question is strange... Maybe you are wondering about
>>> possible OOM exceptions?
>>
>> No, that's an easier one. I was more wondering whether with 400 MB
>> fields (indexed, not stored) it becomes incredibly slow to:
>> * analyze
>> * commit / write to disk
>> * search
>>
>>> I think we can pass to Lucene a single document containing a
>>> comma-separated list of "term, term, ..." (a few billion times)...
>>> Except for "stored" and "TermVectorComponent"...


--
Lance Norskog
goks...@gmail.com
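
The deep-paging slowdown Fuad describes is inherent to offset paging: to
return hits 100,000 to 100,010, the searcher must first collect and rank the
top 100,010 hits in a priority queue, so the cost grows with the offset
rather than the page size. Below is a minimal Lucene sketch of both the
problem and the cursor-style workaround via searchAfter (an API added to
Lucene after this thread; the index path and query are placeholders):

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MatchAllDocsQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;
    import java.nio.file.Paths;

    public class DeepPaging {
        public static void main(String[] args) throws Exception {
            try (DirectoryReader reader = DirectoryReader.open(
                    FSDirectory.open(Paths.get("/path/to/index")))) {
                IndexSearcher searcher = new IndexSearcher(reader);
                Query query = new MatchAllDocsQuery();

                // Offset paging: showing hits 100,000..100,009 forces the
                // searcher to collect the top 100,010 hits and slice the tail.
                TopDocs deepPage = searcher.search(query, 100_010);

                // Cursor paging: resume from the last hit of the previous
                // page, so each request only collects the next 10 hits.
                ScoreDoc last = deepPage.scoreDocs[deepPage.scoreDocs.length - 1];
                TopDocs nextPage = searcher.searchAfter(last, query, 10);
            }
        }
    }

Capping results at the top 1,000, as Fuad suggests, sidesteps the problem
entirely.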
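
Fuad's "split it into pages" advice also answers Lance's snippet question:
if each page is indexed as its own document, hits and highlighted snippets
come back at page granularity instead of one enormous match. A rough sketch,
assuming a fixed 2 KB page size and hypothetical field names (book, page,
body):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;

    public class PageSplitter {
        // Index one huge text as many small per-page documents.
        static void indexByPage(IndexWriter writer, String bookId, String text)
                throws Exception {
            final int PAGE_CHARS = 2_000;  // roughly 2 KB of plain text per "page"
            for (int i = 0, page = 1; i < text.length(); i += PAGE_CHARS, page++) {
                String chunk = text.substring(i,
                        Math.min(i + PAGE_CHARS, text.length()));
                Document doc = new Document();
                doc.add(new StringField("book", bookId, Field.Store.YES)); // key back to the book
                doc.add(new StoredField("page", page));                    // page number for display
                doc.add(new TextField("body", chunk, Field.Store.YES));    // searchable page text
                writer.addDocument(doc);
            }
        }
    }

A real splitter would break on page or paragraph boundaries rather than a
fixed character count, so no token is cut in half.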
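
On Otis's 400 MB indexed-but-not-stored case: analysis at least need not
hold the whole value in memory, because Lucene can tokenize a field from a
Reader and stream through it. A sketch, with the file path as a placeholder
(Reader-valued fields are always unstored):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import java.io.Reader;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class HugeFieldIndexer {
        // Stream a huge field value into the analyzer instead of
        // materializing it as one 400 MB String.
        static void indexHugeField(IndexWriter writer) throws Exception {
            Reader content = Files.newBufferedReader(Paths.get("/path/to/huge.txt"));
            Document doc = new Document();
            doc.add(new TextField("body", content)); // indexed, tokenized, not stored
            writer.addDocument(doc);
        }
    }

Commit and merge are another story: a single 400 MB field still yields
millions of postings for one document, so flush and merge times grow with it
even if memory stays flat.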