Oh, a doc count over 100M is a very different thing than a doc count of about 1M. In your original message you said "I tried creating an index with 1M documents, each with 100 unique terms in a field." If you instead have 100M documents, your use is a couple of orders of magnitude larger than mine.
It also occurs to me that while I have around 3 million documents, and probably up to 50 million or so unique values in the multi-valued faceted field -- each document only has 3-10 values, not 100 each. So that may also be a difference that affects the faceting algorithm to your detriment; I'm not sure.
Prior to Solr 1.4, it was pretty much impossible to facet over 1 million+ unique values at all; now it works wonderfully in many use cases, but you may have found one that's still too much for it.
It also raises my curiosity as to why you'd want to facet over an n-grammed field to begin with; that's definitely not an ordinary use case. Perhaps there is some way to do what you need without faceting? But you probably know what you're doing.
Jonathan
On 3/16/2011 2:25 PM, Dmitry Kan wrote:
Hi Jonathan,
Thanks for sharing those useful bits. Each shard has 16G of heap. Unless I am doing something fundamentally wrong in the Solr configuration, I have to admit that counting n-grams up to trigrams across a shard's whole document set is a pretty intensive task, as each n-gram can occur anywhere in the index and Solr most probably doesn't precompute its cumulative count. I'll try querying with facet.method=fc, thanks for that.
By the way, the trigrams are defined like this:
<fieldType name="shingle_text_trigram" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
            outputUnigrams="true"/>
  </analyzer>
</fieldType>
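To make the term explosion concrete, here is a small Python sketch of roughly what this analyzer chain emits per field value: lowercase letter-run tokenization followed by shingles up to size 3 with unigrams kept. This is an approximation of the Lucene behavior for illustration, not Solr's actual implementation.

```python
import re

def shingles(text, max_size=3, output_unigrams=True):
    """Approximate LowerCaseTokenizer + ShingleFilter(maxShingleSize=3,
    outputUnigrams=true): per start position, emit the unigram, then
    shingles of size 2..max_size. Illustrative only, not Solr code."""
    tokens = re.findall(r"[a-z]+", text.lower())
    out = []
    for i in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[i])
        for n in range(2, max_size + 1):
            if i + n <= len(tokens):
                out.append(" ".join(tokens[i:i + n]))
    return out

print(shingles("Heavy Faceted Search"))
# ['heavy', 'heavy faceted', 'heavy faceted search',
#  'faceted', 'faceted search', 'search']
```

A three-token value already yields six terms; over 100M documents this is how the faceted field can reach hundreds of millions of unique entries.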
For the sharding -- I decided to go with it when the index size approached half a terabyte and the doc count went over 100M; I thought it would help us scale better. I also maintain a good level of caching, and so far faceting over normal string fields (no n-grams) has performed really well (around 1 sec).
On Wed, Mar 16, 2011 at 6:23 PM, Jonathan Rochkind <rochk...@jhu.edu
<mailto:rochk...@jhu.edu>> wrote:
Ah, wait, you're doing sharding? Yeah, I am NOT doing sharding, so that could explain our different experiences. Sharding definitely has trade-offs: it makes some things faster and other things slower. So far I've managed to avoid it, in the interest of keeping things simpler and easier to understand (for me, the developer/Solr manager), thinking that sharding is also a somewhat less mature feature.
With only 1M documents.... are you sure you need sharding at all? You could still use replication to "scale out" for volume; sharding seems more about scaling for the number of documents (or total bytes) in your index. 1M documents is not very large for Solr, in general.
Jonathan
On 3/16/2011 11:51 AM, Toke Eskildsen wrote:
On Wed, 2011-03-16 at 13:05 +0100, Dmitry Kan wrote:
Hello guys. We are using sharded Solr 1.4 for heavy faceted search over the trigrams field, with about 1 million entries in the result set and more than 100 million entries to facet on in the index. Currently the faceted search is very slow, taking about 5 minutes per query.
I tried creating an index with 1M documents, each with 100 unique terms in a field. A search for "*:*" with a facet request for the first 1M entries in the field took about 20 seconds for the first call and about 1-1½ seconds for each subsequent call. This was with Solr trunk. My setup is no doubt a lot simpler and lighter than yours, but 5 minutes sounds excessive.
My guess is that your performance problem is due to the merging process. Could you try measuring the performance of a direct request to a single shard? If that is satisfactory, going to the cloud would not solve your problem. If you really need 1M entries in your result set, you would be better off investigating whether your index can be served by a single instance.
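A direct single-shard measurement along these lines can be done by adding distrib=false to the request, which skips the distributed fan-out and merge. The sketch below only builds such a URL; the shard host and field name are placeholders, not details from this thread.

```python
# Sketch of the suggested single-shard measurement: query one shard directly
# with distrib=false so no distributed merge happens. Host and field name
# are placeholders for illustration.
from urllib.parse import urlencode

params = {
    "q": "*:*",
    "distrib": "false",                     # do not fan out to other shards
    "facet": "true",
    "facet.field": "shingle_text_trigram",  # assumed field name
    "facet.limit": 10,
}
shard_url = "http://shard1:8983/solr/select?" + urlencode(params)
print(shard_url)
```

Timing this request against each shard separately, and comparing with the distributed query, would show how much of the 5 minutes is spent in the merge step.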
--
Regards,
Dmitry Kan