On 1/31/2013 12:47 PM, Mou wrote:
To clarify, the third shard is used to store recently added/updated data. The
two main big cores take a very long time to replicate (when a full
replication is required), so the third one helps us return newly indexed
documents quickly. It gets deleted every hour, after we replicate the other
two cores with the last hour's new/changed data. This third core is very
small.

I use this approach. My entire index is 74 million documents, but all new data is added to a shard that only contains about 400K documents. The other six shards each contain over 12 million documents and occupy about 22GB of disk space. It takes two servers to house one complete copy of my index.

Index updates happen once a minute. Because most delete/reinsert activity happens on recently added content and all new content gets added only to the small shard, the large shards can run for many minutes without seeing commits.
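
In case it is useful, a distributed query against a layout like this just lists every shard, the small one included, in the shards parameter. The hostnames and core names here are invented for illustration:

curl "http://idx1:8983/solr/live/select?q=test&shards=idx1:8983/solr/s1,idx1:8983/solr/s2,idx1:8983/solr/s3,idx2:8983/solr/s4,idx2:8983/solr/s5,idx2:8983/solr/s6,idx2:8983/solr/inc"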

As you said, with that big index and distributed queries, searches were too
slow, so we tried to use the filterCache to speed up the queries. The
filterCache was big because we have thousands of different filters. Other
caches were not very helpful, as queries are not repetitive and there is
heavy add/update activity on the index. So we had to use a bigger heap size.
Now, with that big heap, GC pauses were horrible, so we moved to the Zing
JVM. Zing is now using 134GB of heap and does not have those big pauses, but
it also does not leave much memory for the OS.

I am now testing with a small heap, a small filterCache (just the basic
filters), and a lot of memory left available for the OS disk cache. If that
does not work, I am thinking of breaking my index down into smaller pieces.

I hope it works for you! With this approach, the first queries will probably still be pretty slow, but as the data gets cached, things should speed up.
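
If you do end up on a small heap, the startup is nothing exotic. Something like this for the example Jetty setup, where the heap size and collector choice are only illustrations to tune for your own hardware:

java -Xms4g -Xmx4g -XX:+UseConcMarkSweepGC -jar start.jar

The important part is that whatever -Xmx you pick leaves most of the machine's RAM untouched, so the OS can use it for the disk cache.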

You can pre-cache the important parts of your index by running a command like the following in the index directory.

cat `ls | egrep -v "(\.fd|\.tv)"` > /dev/null

That command will read all the index files except the stored fields (.fdx, .fdt) and term vectors (.tvx, .tvd, .tvf), which pulls them into the OS disk cache. Before trying it, you would want to find out how much disk space those files take up, to make sure they will all fit in RAM. You might even schedule it in cron so the files stay cached as the index changes.
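
If it helps, this will total up exactly the files that the cat command would read (run it in the same index directory):

du -ch `ls | egrep -v "(\.fd|\.tv)"` | tail -n 1

Compare that number against your free memory (free -m will show it) before warming the cache.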

Thanks,
Shawn
