Processing files lying in a directory structure

2009-06-04 Thread akhil1988
Hi! I am working on applying the WordCount example to the entire Wikipedia dump. The entire English Wikipedia is around 200GB, which I have stored in HDFS on a cluster to which I have access. The problem: the Wikipedia dump contains many directories (it has a very big directory structure) containing
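
One way to feed a nested dump like this to the old mapred API is to walk the directory tree yourself and register every plain file as an input path. A minimal sketch; the class name and the entry path used below are placeholders, not anything from the thread:

    import java.io.IOException;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class RecursiveInputs {
        // Walk the tree rooted at dir and register every plain file as job input.
        static void addInputsRecursively(FileSystem fs, Path dir, JobConf conf)
                throws IOException {
            for (FileStatus stat : fs.listStatus(dir)) {
                if (stat.isDir()) {
                    addInputsRecursively(fs, stat.getPath(), conf); // descend into subdirs
                } else {
                    FileInputFormat.addInputPath(conf, stat.getPath());
                }
            }
        }
    }

Called before submission, e.g. addInputsRecursively(FileSystem.get(conf), new Path("/user/akhil1988/wikipedia"), conf), where the wikipedia path is an assumption.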

Giving a classpath in the hadoop jar command, i.e. while executing a MapReduce job

2009-06-05 Thread akhil1988
I wish to give the path of a jar file as an argument when executing the hadoop jar ... command, as my mapper uses that jar file for its operation. I found that the -libjars option can be used, but for me it is not working; it is giving an exception. Can anyone tell me how to use the libjars generic command
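
-libjars is only honored when the job parses its arguments through GenericOptionsParser, which ToolRunner does automatically. A sketch of the usual wiring; the job class name and setup details are placeholders:

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MyJob extends Configured implements Tool {
        public int run(String[] args) throws Exception {
            // getConf() already reflects -libjars and the other generic options.
            JobConf conf = new JobConf(getConf(), MyJob.class);
            // ... set mapper class and input/output paths from args ...
            JobClient.runJob(conf);
            return 0;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new MyJob(), args));
        }
    }

The generic options must come before the application arguments, e.g. hadoop jar myjob.jar MyJob -libjars /path/to/dependency.jar input output.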

Implementing Client-Server architecture using MapReduce

2009-06-07 Thread akhil1988
Hi All, I am porting a machine learning application to Hadoop using MapReduce. The architecture of the application goes like this: 1. Run a number of server processes which take around 2-3 minutes to start and then remain as daemons waiting for a client to call for a connection. During the
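
One pattern for this kind of expensive per-task setup is to start (or connect to) the daemon once in the mapper's configure() hook, so every record processed by the task reuses it. A sketch only; ServerProcess is a hypothetical wrapper around the daemon, not a Hadoop class:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class DaemonBackedMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        private ServerProcess server; // hypothetical handle to the daemon

        @Override
        public void configure(JobConf job) {
            // Pay the 2-3 minute startup cost once per task, not once per record.
            server = ServerProcess.startOrConnect();
        }

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            output.collect(value, new Text(server.query(value.toString())));
        }

        @Override
        public void close() throws IOException {
            server.shutdown(); // hypothetical; release the daemon when the task ends
        }
    }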

Re: Implementing Client-Server architecture using MapReduce

2009-06-08 Thread akhil1988
Can anyone help me with this issue? I have an account on the cluster and I cannot go and start a server process on each tasktracker. Akhil akhil1988 wrote: Hi All, I am porting a machine learning application to Hadoop using MapReduce. The architecture of the application goes

Neither OOM Java Heap Space nor GC Overhead Limit Exceeded

2009-06-16 Thread akhil1988
Hi All, I am running my mapred program in local mode by setting mapred.job.tracker to local so that I can debug my code. The mapred program is a direct port of my original sequential code. There is no reduce phase. Basically, I have just put my program in the map class. My program
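
A minimal sketch of a local-mode debug configuration with that property; the wrapper class is mine:

    import org.apache.hadoop.mapred.JobConf;

    public class LocalDebug {
        public static JobConf localConf(Class<?> jobClass) {
            JobConf conf = new JobConf(jobClass);
            conf.set("mapred.job.tracker", "local"); // run the whole job in one local JVM
            conf.set("fs.default.name", "file:///"); // read and write the local filesystem
            return conf;
        }
    }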

Re: Neither OOM Java Heap Space nor GC Overhead Limit Exceeded

2009-06-16 Thread akhil1988
(JobShell.java:68) akhil1988 wrote: Thank you Jason for your reply. My Map class is an inner class and it is a static class. Here is the structure of my code: public class NerTagger { public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text>
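
Filled out as a compilable skeleton of the quoted structure; the map body is not shown in the original message:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class NerTagger {
        // Static nested class, as described in the post.
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, Text> {
            public void map(LongWritable key, Text value,
                            OutputCollector<Text, Text> output, Reporter reporter)
                    throws IOException {
                // ... tagging logic not shown in the original message ...
            }
        }
    }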

Re: Neither OOM Java Heap Space nor GC Overhead Limit Exceeded

2009-06-17 Thread akhil1988
DistributedCache.addCacheFile(new URI("/home/akhil1988/Ner/OriginalNer/Data/"), conf); Data is a directory which contains some text as well as some binary files. In the statement Parameters.readConfigAndLoadExternalData("Config/allLayer1.config"); I can see (in the output messages) that it is able to read

Re: Neither OOM Java Heap Space nor GC Overhead Limit Exceeded

2009-06-17 Thread akhil1988
DistributedCache.addCacheFile(new URI("/home/akhil1988/Ner/OriginalNer/Data/"), conf); DistributedCache.addCacheFile(new URI("/home/akhil1988/Ner/OriginalNer/Config/"), conf); DistributedCache.createSymlink(conf); The program still executes up to the same point as before and then terminates. That means
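
For comparison, DistributedCache file entries are meant to name individual files; directories are what the archive variant is for. A sketch of the per-file pattern, reusing the poster's Config path (the concrete file name inside it is an assumption):

    import java.io.IOException;
    import java.net.URI;
    import java.net.URISyntaxException;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheFileSetup {
        // Submission side: distribute one concrete file, not a directory.
        static void cacheConfigFile(JobConf conf) throws URISyntaxException {
            DistributedCache.addCacheFile(
                    new URI("/home/akhil1988/Ner/OriginalNer/Config/allLayer1.config"), conf);
            DistributedCache.createSymlink(conf); // expose it in the task's working dir
        }

        // Task side: look up the localized copies.
        static Path[] localCopies(JobConf conf) throws IOException {
            return DistributedCache.getLocalCacheFiles(conf);
        }
    }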

Re: Neither OOM Java Heap Space nor GC Overhead Limit Exceeded

2009-06-18 Thread akhil1988
that I would like to ask you is: can we use DistributedCache for transferring directories to the local cache of the tasks? Thanks, Akhil akhil1988 wrote: Hi Jason! Thanks for going with me to solve my problem. To restate things and make it easier to understand: I am working

Strange Exception

2009-06-22 Thread akhil1988
wordcount_classes_dir.jar org.uiuc.upcrc.extClasses.WordCount /home/akhil1988/input /home/akhil1988/output JO 09/06/22 19:19:01 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. org.apache.hadoop.ipc.RemoteException: java.io.FileNotFoundException

Re: Strange Exception

2009-06-23 Thread akhil1988
it is only created at cluster start time. On Mon, Jun 22, 2009 at 6:19 PM, akhil1988 akhilan...@gmail.com wrote: Hi All! I have been running Hadoop jobs through my user account on a cluster for a while now. But now I am getting this strange exception when I try to execute a job

Using addCacheArchive

2009-06-25 Thread akhil1988
Hi All! I want a directory to be present in the local working directory of the task, for which I am using the following statements: DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"), conf); DistributedCache.createSymlink(conf); Here Config is a directory which I have zipped
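
A sketch of the archive flow, assuming Config.zip has already been copied into HDFS; the #Config fragment names the symlink that shows up in the task's working directory:

    import java.net.URI;
    import java.net.URISyntaxException;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheArchiveSetup {
        static void cacheConfigDir(JobConf conf) throws URISyntaxException {
            // The archive is fetched and unpacked on each tasktracker; with the
            // symlink enabled, the unpacked directory appears as ./Config.
            DistributedCache.addCacheArchive(
                    new URI("/home/akhil1988/Config.zip#Config"), conf);
            DistributedCache.createSymlink(conf);
        }
    }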

Re: Using addCacheArchive

2009-06-25 Thread akhil1988
Please ask any questions if I have not been clear above about the problem I am facing. Thanks, Akhil akhil1988 wrote: Hi All! I want a directory to be present in the local working directory of the task, for which I am using the following statements: DistributedCache.addCacheArchive(new URI

Re: Using addCacheArchive

2009-06-25 Thread akhil1988
Thanks Amareshwari for your reply! The file Config.zip is in HDFS; if it were not, the error would have been reported by the jobtracker itself while executing the statement: DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"), conf); But I get the error in the map
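
A quick pre-submission check that the archive really is where the URI says, assuming the default filesystem is the cluster's HDFS:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class CheckCachePath {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf();
            FileSystem fs = FileSystem.get(conf); // the configured default filesystem
            Path p = new Path("/home/akhil1988/Config.zip");
            System.out.println(p + " exists: " + fs.exists(p));
        }
    }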

Re: Using addCacheArchive

2009-06-25 Thread akhil1988
Yes, my HDFS paths are of the form /home/user-name/, and I have used these in DistributedCache's addCacheFile method successfully. Thanks, Akhil Amareshwari Sriramadasu wrote: Is your HDFS path /home/akhil1988/Config.zip? Usually an HDFS path is of the form /user/akhil1988/Config.zip. Just

Re: Using addCacheArchive

2009-06-26 Thread akhil1988
akhil1988 akhilan...@gmail.com wrote: Please ask any questions if I have not been clear above about the problem I am facing. Thanks, Akhil akhil1988 wrote: Hi All! I want a directory to be present in the local working directory of the task, for which I am using the following statements

Archives not getting unarchived at tasktrackers

2009-06-27 Thread akhil1988
Hi All, I am using DistributedCache.addCacheArchive() to distribute a tar file to the tasktrackers, using the following statement: DistributedCache.addCacheArchive(new URI("/home/akhil1988/sample.tar"), conf); According to the documentation it should get unarchived at the tasktrackers
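
One way to see what actually arrived on a tasktracker is to dump the localized archive paths from inside the task. A sketch; the class name is a placeholder:

    import java.io.IOException;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;

    public class ArchiveProbe extends MapReduceBase {
        @Override
        public void configure(JobConf job) {
            try {
                // Local paths of the unpacked archives on this tasktracker.
                Path[] archives = DistributedCache.getLocalCacheArchives(job);
                for (Path p : archives) {
                    System.err.println("localized archive: " + p); // lands in the task logs
                }
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }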