NoSuchMethodError while running FileInputFormat.setInputPaths

2008-06-08 Thread novice user
Hi, I am getting the error below while running someone's code that uses hadoop-17 and calls FileInputFormat.setInputPaths to set the input paths for the job. The exact error is: java.lang.NoSuchMethodError: org.apache.hadoop.mapred.FileInputFormat.setInputP

Re: Maximum number of files in hadoop

2008-06-08 Thread Dhruba Borthakur
The maximum number of files in HDFS depends on the amount of memory available to the namenode. Each file object and each block object takes about 150 bytes of memory. Thus, if you have 10 million files and each file has one block, you would need about 3GB of memory for the namenode.
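The rule of thumb above can be checked with a back-of-the-envelope calculation. This is a ballpark sketch only: a real namenode heap also holds directory objects, leases, and JVM overhead beyond the per-object minimum.

```python
BYTES_PER_OBJECT = 150  # approximate cost of one file or block object


def namenode_memory_bytes(num_files, blocks_per_file=1):
    # One namenode object per file plus one per block.
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


# 10 million single-block files -> 20 million objects -> ~3 GB.
print(namenode_memory_bytes(10_000_000) / 1e9, "GB")  # → 3.0 GB
```

Because every file and block is tracked in the namenode's heap, many small files exhaust memory far sooner than the same data stored in fewer, larger files.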

compute document frequency with hadoop-streaming

2008-06-08 Thread xinfan meng
In Hadoop Streaming, we accept input from stdin. If we want to compute the document frequency of words, the simplest way is to output words as keys and file names as values. Then how can we get the input file name passed to this MapReduce job? Thanks. -- Best Wishes Meng Xinfan(蒙新泛) Institute of Com
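One way to get at the file name: Hadoop Streaming exports job configuration properties to the mapper's environment with dots replaced by underscores, so the current split's path is available as the `map_input_file` environment variable (the exact property name varies across Hadoop versions). A minimal mapper sketch along those lines:

```python
import os
import sys


def emit_pairs(lines, input_file):
    # Emit (word, filename) pairs; a reducer can then count the
    # distinct file names per word to get its document frequency.
    return [(word, input_file)
            for line in lines
            for word in line.split()]


if __name__ == "__main__":
    # Streaming passes job configuration through the environment,
    # with dots replaced by underscores; "map_input_file" holds the
    # path of the split currently being processed.
    input_file = os.environ.get("map_input_file", "unknown")
    for word, fname in emit_pairs(sys.stdin, input_file):
        print(f"{word}\t{fname}")
```

Run as the `-mapper` of a streaming job; the reducer would then count unique file names per word key.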

Initial Execution Issue

2008-06-08 Thread Ravi Shankar (Google)
Dear all, I am a newbie who started using Hadoop yesterday. I am on Windows XP, and the following is the output of the grep example, which I executed exactly per the instructions in the QuickStart guide: $ bin/hadoop jar hadoop-0.17.0-examples.jar grp input output 'dfs[a-z.]+' cygpath: cannot

Re: In memory Map Reduce

2008-06-08 Thread Martin Jaggi
Are there any statistics available to monitor what percentage of the pairs remains in memory, and what percentage is written to disk? Or what are these exceptional cases that you mention? Hadoop goes to some lengths to make sure that things can stay in memory as much as possible. Ther

Scalable Custom Subscriptions

2008-06-08 Thread Jason Rutherglen
The in-memory-optimized Hadoop implementation sounds like it would be useful for a realtime, scalable subscription system. The example I'm interested in testing uses Lucene MemoryIndex to execute millions of queries for client notification. Where the Hadoop map is a serialized MemoryIndex

Re: Hadoop topology.script.file.name Form

2008-06-08 Thread Yang Chen
Rack Awareness: Typically, large Hadoop clusters are arranged in *racks*, and network traffic between nodes within the same rack is much more desirable than network traffic across racks. In addition, the Namenode tries to place replicas of a block on multiple racks for improved fault toleranc

RE: Hadoop topology.script.file.name Form

2008-06-08 Thread Devaraj Das
Hi Iver, The implementation of the script depends on your setup. The main thing is that it should accept a bunch of IP addresses and DNS names and give back the rackID for each. There is a one-to-one correspondence between what you pass in and what you get back. For getting the rac
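As a concrete illustration of that contract, here is a hypothetical topology script in Python (the subnet-to-rack mapping is invented; replace it with your site's actual layout or a hostname lookup table). Hadoop invokes the script with a batch of addresses as arguments and reads back one rack ID per argument, in order:

```python
import sys

# Hypothetical mapping from IP prefix to rack ID; adapt to your
# own network layout (or key on DNS names instead).
RACK_BY_PREFIX = {
    "10.1.1.": "/rack1",
    "10.1.2.": "/rack2",
}


def rack_for(host):
    # Resolve one IP address or hostname to a rack ID, keeping the
    # one-to-one correspondence between inputs and outputs.
    for prefix, rack in RACK_BY_PREFIX.items():
        if host.startswith(prefix):
            return rack
    return "/default-rack"


if __name__ == "__main__":
    # Hadoop passes several addresses per invocation and expects
    # one rack name back per line, in the same order.
    for host in sys.argv[1:]:
        print(rack_for(host))
```

Point topology.script.file.name at an executable wrapper for this script; unknown hosts fall back to /default-rack so the namenode always gets an answer for every input.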