Oh, a doc count over 100M is a very different thing than a doc count of about 1M. In your original message you said "I tried creating an index with 1M documents, each with 100 unique terms in a field." If you instead have 100M documents, your use case is a couple of orders of magnitude larger than mine.

It also occurs to me that while I have around 3 million documents, and probably up to 50 million or so unique values in the multi-valued faceted field -- each document only has 3-10 values, not 100 each. So that may also be a difference that affects the faceting algorithm to your detriment, not sure.

Prior to Solr 1.4, it was pretty much impossible to facet over 1 million+ unique values at all; now it works wonderfully in many use cases, but you may have found one that's still too much for it.

I'm also curious why you'd want to facet over an n-grammed field to begin with -- that's definitely not an ordinary use case. Perhaps there is some way to do what you need without faceting? But you probably know what you're doing.

Jonathan

On 3/16/2011 2:25 PM, Dmitry Kan wrote:
Hi Jonathan,

Thanks for sharing the useful bits. Each shard has 16G of heap. Unless I'm doing something fundamentally wrong in the Solr configuration, I have to admit that counting ngrams up to trigrams across a shard's whole document set is a pretty intensive task, as each ngram can occur anywhere in the index and Solr most probably doesn't precompute its cumulative count. I'll try querying with facet.method=fc, thanks for that.
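For what it's worth, this is roughly the parameter set I have in mind for trying facet.method=fc against a single shard. The field name and facet.limit here are hypothetical placeholders, not values from this thread -- just a sketch of how the request would be assembled:

```python
from urllib.parse import urlencode

# Sketch of a facet request using facet.method=fc. "trigram_field" and
# the limit of 100 are hypothetical placeholders for illustration only.
params = {
    "q": "*:*",
    "rows": "0",                     # only facet counts wanted, no docs
    "facet": "true",
    "facet.field": "trigram_field",  # hypothetical field name
    "facet.method": "fc",
    "facet.limit": "100",
}
query_string = urlencode(params)
print(query_string)
```

The resulting query string would then be appended to the shard's /select URL.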

By the way, the trigrams are defined like this:

<fieldType name="shingle_text_trigram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
  </analyzer>
</fieldType>
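To illustrate why the term count blows up with that analyzer chain: with maxShingleSize="3" and outputUnigrams="true", every run of n words yields unigrams plus 2- and 3-word shingles. This is not Solr's implementation, just a sketch of which tokens come out (ignoring the interleaved position order Solr actually uses):

```python
# Sketch of the token set produced by ShingleFilterFactory with
# maxShingleSize=3 and outputUnigrams=true. Token emission order in
# real Solr differs; only the set of tokens is illustrated here.
def shingles(tokens, max_size=3, output_unigrams=True):
    min_size = 1 if output_unigrams else 2
    out = []
    for size in range(min_size, max_size + 1):
        for i in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[i:i + size]))
    return out

# 4 input words become 9 indexed terms: 4 unigrams + 3 bigrams + 2 trigrams.
print(shingles("the quick brown fox".split()))
```

So each document roughly triples its term count, which is where the 100M+ unique facet values come from.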

As for sharding -- I decided to go with it when the index size approached half a terabyte and the doc count went over 100M; I thought it would help us scale better. I also maintain a good level of caching, and so far faceting over normal string fields (no ngrams) has performed really well (around 1 sec).


On Wed, Mar 16, 2011 at 6:23 PM, Jonathan Rochkind <rochk...@jhu.edu <mailto:rochk...@jhu.edu>> wrote:

    Ah, wait, you're doing sharding?  Yeah, I am NOT doing sharding,
    so that could explain our different experiences.  Sharding
    definitely seems to have trade-offs, making some things faster and
    other things slower. So far I've managed to avoid it, in the
    interest of keeping things simpler and easier to understand (for
    me, the developer/Solr manager), thinking that sharding is also a
    somewhat less mature feature.

    With only 1M documents, are you sure you need sharding at all?
    You could still use replication to "scale out" for query volume;
    sharding seems more about scaling for the number of documents (or
    total bytes) in your index.  1M documents is not very large for
    Solr, in general.

    Jonathan


    On 3/16/2011 11:51 AM, Toke Eskildsen wrote:

        On Wed, 2011-03-16 at 13:05 +0100, Dmitry Kan wrote:

            Hello guys. We are using shard'ed solr 1.4 for heavy
            faceted search over the
            trigrams field with about 1 million of entries in the
            result set and more
            than 100 million of entries to facet on in the index.
            Currently the faceted
            search is very slow, taking about 5 minutes per query.

        I tried creating an index with 1M documents, each with 100
        unique terms
        in a field. A search for "*:*" with a facet request for the
        first 1M
        entries in the field took about 20 seconds for the first call
        and about
        1-1½ seconds for each subsequent call. This was with Solr
        trunk. The
        complexity of my setup is no doubt a lot simpler and lighter
        than yours,
        but 5 minutes sounds excessive.

        My guess is that your performance problem is due to the
        merging process.
        Could you try measuring the performance of a direct request to
        a single
        shard? If that is satisfactory, going to the cloud would not
        solve your
        problem. If you really need 1M entries in your result set, you
        would be
        better off investigating whether your index can be in a single
        instance.




--
Regards,

Dmitry Kan
