On 1/31/2013 12:47 PM, Mou wrote:
To clarify, the third shard is used to store recently added/updated
data. The two main big cores take very long to replicate (when a full
replication is required), so the third one helps us return newly
indexed documents quickly. It gets deleted every hour, after we
replicate the other two cores with the last hour's new/changed data.
This third core is very small.
I use this approach. My entire index is 74 million documents, but all
new data is added to a shard that only contains about 400K documents.
The other six shards have over 12 million documents each and take up
about 22GB of disk space. It takes two servers to house one complete
copy of my index.
Index updates happen once a minute. Because most delete/reinsert
activity happens on recently added content and all new content gets
added only to the small shard, the large shards can run for many minutes
without seeing commits.
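Querying a layout like this is just a matter of listing every core in the
shards parameter. A minimal sketch, with made-up host and core names (two
big cores plus the small one holding fresh data):

```shell
# Hypothetical host/core names -- substitute your own deployment.
BIG1="host1:8983/solr/big1"
BIG2="host2:8983/solr/big2"
FRESH="host3:8983/solr/fresh"

# One distributed query hits both big cores plus the small core that
# holds the most recent adds/updates.
URL="http://${BIG1}/select?q=*:*&shards=${BIG1},${BIG2},${FRESH}"
echo "$URL"
# curl "$URL"   # run this against a live Solr install
```

Any of the cores can serve as the entry point; the shards list is what
determines which indexes actually get searched.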
As you said, with that big index and distributed queries, searches were too
slow. So we tried to use the filterCache to speed up the queries. The
filterCache was big, as we have thousands of different filters. Other
caches were not very helpful, because queries are not repetitive and there
is heavy add/update activity on the index. So we had to use a bigger heap
size. With that big heap, GC pauses were horrible, so we moved to the Zing
JVM. Zing is now using 134GB of heap and does not have those big pauses,
but it also does not leave much memory for the OS.
I am now testing with a small heap, a small filterCache (just the basic
filters), and a lot of memory left available for the OS disk cache. If
that does not work, I am thinking of breaking my index down into smaller
pieces.
I hope it works for you! With this approach, the first queries will
probably still be pretty slow, but as the data gets cached, things
should speed up.
You can pre-cache the important parts of your index with a command like
the following, run from the index directory:
cat `ls | egrep -v "(\.fd|\.tv)"` > /dev/null
That command will read all the index files except the stored fields
(.fdx, .fdt) and term vectors (.tvx, .tvd, .tvf), which puts them in the
OS disk cache. Before trying that command, you would want to find out
how much disk space those files take, to make sure they will all fit in
RAM. It is usually a bad idea to schedule this operation in cron;
rereading the whole index on a timer just generates extra I/O and can
evict data that is actually being used.
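If you want to preview what that egrep pattern keeps before pointing it
at real index files, here is a self-contained check (the segment file
names below are made up for illustration):

```shell
# Hypothetical Lucene segment file names: stored fields (.fd*) and
# term vectors (.tv*) are filtered out, everything else is kept.
printf '%s\n' _0.fdt _0.fdx _0.tim _0.doc _0.tvd _0.tvx _0.pos \
  | egrep -v "(\.fd|\.tv)"
# prints: _0.tim, _0.doc, _0.pos (one per line)
```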
Thanks,
Shawn