Oh, a doc count over 100M is a very different thing than a doc count of about 1M. In your original message you said "I tried creating an index with 1M documents, each with 100 unique terms in a field." If you instead have 100M documents, your use is a couple of orders of magnitude larger than mine.
It also occurs to me that while I have around 3 million documents, and probably up to 50 million or so unique values in the multi-valued faceted field -- each document only has 3-10 values, not 100 each. So that may also be a difference that affects the faceting algorithm to your detriment; I'm not sure.
Prior to Solr 1.4, it was pretty much impossible to facet over 1 million+ unique values at all; now it works wonderfully in many use cases, but you may have found one that's still too much for it.
It also raises my curiosity as to why you'd want to facet over an n-grammed field to begin with; that's definitely not an ordinary use case. Perhaps there is some way to do what you need without faceting? But you probably know what you're doing.
Jonathan
On 3/16/2011 2:25 PM, Dmitry Kan wrote:
Hi Jonathan,
Thanks for sharing those useful bits. Each shard has 16G of heap. Unless I am doing something fundamentally wrong in the Solr configuration, I have to admit that counting n-grams up to trigrams across a shard's whole document set is a pretty intensive task, as each n-gram can occur anywhere in the index and Solr most probably doesn't precompute its cumulative count. I'll try querying with facet.method=fc, thanks for that.
By the way, the trigrams are defined like this:
<fieldType name="shingle_text_trigram" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
            outputUnigrams="true"/>
  </analyzer>
</fieldType>
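To make the term explosion concrete, here is a small Python sketch of roughly what this analyzer chain emits per field value: lowercase letter-run tokenization followed by shingles up to size 3 with unigrams kept. This is an approximation of the Lucene behavior for illustration, not Solr's actual implementation.

```python
import re

def shingles(text, max_size=3, output_unigrams=True):
    """Approximate LowerCaseTokenizer + ShingleFilter(maxShingleSize=3,
    outputUnigrams=true): per start position, emit the unigram, then
    shingles of size 2..max_size. Illustrative only, not Solr code."""
    tokens = re.findall(r"[a-z]+", text.lower())
    out = []
    for i in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[i])
        for n in range(2, max_size + 1):
            if i + n <= len(tokens):
                out.append(" ".join(tokens[i:i + n]))
    return out

print(shingles("Heavy Faceted Search"))
# ['heavy', 'heavy faceted', 'heavy faceted search',
#  'faceted', 'faceted search', 'search']
```

A three-token value already yields six terms; over 100M documents this is how the faceted field can reach hundreds of millions of unique entries.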
For the sharding -- I decided to go with it when the index size approached half a terabyte and the doc count went over 100M; I thought it would help us scale better. I also maintain a good level of caching, and so far faceting over normal string fields (no n-grams) has performed really well (around 1 sec).
On Wed, Mar 16, 2011 at 6:23 PM, Jonathan Rochkind <rochk...@jhu.edu
<mailto:rochk...@jhu.edu>> wrote:
Ah, wait, you're doing sharding? Yeah, I am NOT doing sharding, so that could explain our different experiences. Sharding definitely has trade-offs: it makes some things faster and other things slower. So far I've managed to avoid it, in the interest of keeping things simpler and easier to understand (for me, the developer/Solr manager), thinking that sharding is also a somewhat less mature feature.
With only 1M documents.... are you sure you need sharding at all? You could still use replication to "scale out" for volume; sharding seems more about scaling for the number of documents (or total bytes) in your index. 1M documents is not very large for Solr, in general.
Jonathan
On 3/16/2011 11:51 AM, Toke Eskildsen wrote:
On Wed, 2011-03-16 at 13:05 +0100, Dmitry Kan wrote:
Hello guys. We are using sharded Solr 1.4 for heavy faceted search over the trigrams field, with about 1 million entries in the result set and more than 100 million entries to facet on in the index. Currently the faceted search is very slow, taking about 5 minutes per query.
I tried creating an index with 1M documents, each with 100 unique terms in a field. A search for "*:*" with a facet request for the first 1M entries in the field took about 20 seconds for the first call and about 1-1½ seconds for each subsequent call. This was with Solr trunk. My setup is no doubt a lot simpler and lighter than yours, but 5 minutes sounds excessive.
My guess is that your performance problem is due to the merging process. Could you try measuring the performance of a direct request to a single shard? If that is satisfactory, going to the cloud would not solve your problem. If you really need 1M entries in your result set, you would be better off investigating whether your index can be served by a single instance.
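A direct single-shard measurement along these lines can be done by adding distrib=false to the request, which skips the distributed fan-out and merge. The sketch below only builds such a URL; the shard host and field name are placeholders, not details from this thread.

```python
# Sketch of the suggested single-shard measurement: query one shard directly
# with distrib=false so no distributed merge happens. Host and field name
# are placeholders for illustration.
from urllib.parse import urlencode

params = {
    "q": "*:*",
    "distrib": "false",                     # do not fan out to other shards
    "facet": "true",
    "facet.field": "shingle_text_trigram",  # assumed field name
    "facet.limit": 10,
}
shard_url = "http://shard1:8983/solr/select?" + urlencode(params)
print(shard_url)
```

Timing this request against each shard separately, and comparing with the distributed query, would show how much of the 5 minutes is spent in the merge step.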
--
Regards,
Dmitry Kan