Hello,

I'm writing a program that runs a Lucene search across about 12 index
directories, all of which are stored in HDFS. It works like this:
1. We build about 12 index directories with the normal Lucene indexing
functionality, each about 100 MB in size.
2. We store these 12 index directories on Hadoop HDFS; the cluster is
made up of one namenode and five datanodes, 6 machines in total.
3. We then run the Lucene search over these 12 index directories as a
MapReduce job, structured as follows (a simplified sketch of this
layout is given right after the list):
    Map Procedure: the 12 index directories are split among
numOfMapTasks map tasks; for example, if numOfMapTasks=3, each map
task gets 4 index directories and stores them in an intermediate
result.
    Combine Procedure: for each intermediate result, the combiner runs
the real Lucene search over the index directories it contains and
stores the hits back in the intermediate result.
    Reduce Procedure: the reducer merges the hits from all the
intermediate results into the final search result.
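
For concreteness, here is a much simplified sketch of that layout,
assuming Lucene 3.x and the old org.apache.hadoop.mapred API. The class
names, the "contents" field, the hard-coded query term, and the
assumption that each index directory has already been copied to local
disk are all placeholders for illustration, not my actual code:

import java.io.File;
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class DistributedSearch {

  // Map: the job input is a text file listing the 12 index directories,
  // one per line; each map task simply forwards the paths it was given
  // under one constant key so all hits end up at the same reducer.
  public static class IndexDirMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text indexDirPath,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      out.collect(new Text("hits"), indexDirPath);
    }
  }

  // Combine: run the real Lucene search against each index directory in
  // the map output (assumed already copied to local disk) and replace
  // the paths with "docId <tab> score" hit records.
  public static class LocalSearchCombiner extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> indexDirs,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      while (indexDirs.hasNext()) {
        File localDir = new File(indexDirs.next().toString());
        IndexSearcher searcher = new IndexSearcher(FSDirectory.open(localDir));
        TopDocs top = searcher.search(
            new TermQuery(new Term("contents", "hadoop")), 10);
        for (ScoreDoc hit : top.scoreDocs) {
          out.collect(key, new Text(hit.doc + "\t" + hit.score));
        }
        searcher.close();
      }
    }
  }

  // Reduce: all hit records arrive under the same key; just write them
  // out (a real job would re-sort them by score and keep the top N).
  public static class HitMergeReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> hits,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      while (hits.hasNext()) {
        out.collect(key, hits.next());
      }
    }
  }
}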

But with this implementation I have a performance problem: whatever
values I set numOfMapTasks and numOfReduceTasks to (for example
numOfMapTasks=12, numOfReduceTasks=5), a simple search takes about 28
seconds, which is obviously unacceptable.
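
For reference, the task counts are set in the driver roughly like this
(again only a sketch; the real driver also sets the mapper, combiner
and reducer classes, the input/output paths and formats):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SearchDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SearchDriver.class);
    conf.setJobName("lucene-search");
    // setNumMapTasks is only a hint; the InputFormat's splits decide
    // how many map tasks actually run.
    conf.setNumMapTasks(12);
    conf.setNumReduceTasks(5);
    // ... setMapperClass / setCombinerClass / setReducerClass,
    //     input/output paths and key/value classes go here ...
    JobClient.runJob(conf);
  }
}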
So I'm not sure whether my map/reduce procedure is wrong or whether I
picked bad numbers of map and reduce tasks, and more generally where
the overhead of the MapReduce procedure comes from. Any suggestions
would be appreciated.
Thanks.
