RE: Accessing files in Hadoop 2.7.2 Distributed Cache
Hi, Siddharth. I was also a bit frustrated at what I found to be scant documentation on how to use the distributed cache in Hadoop 2. The DistributedCache class itself was deprecated in Hadoop 2, but there don't appear to be very clear instructions on the alternative.

I think it's actually much simpler to work with files in the distributed cache in Hadoop 2. The new way is to add files (or archives) to the cache via the Job object:

    job.addCacheFile(uriForYourFile);
    job.addCacheArchive(uriForYourArchive);

The cool part is that, if you set up your URI so that it has a "#yourFileReference" at the end, then Hadoop will set up a symbolic link named "yourFileReference" in your job's working directory, which you can use to get at the file or archive. So, it's as if the file or archive is in the working directory. That obviates the need to even work with the DistributedCache class in your Mapper or Reducer, since you can just work with the file (or path, using NIO) directly.

Hope that helps.

-Jeff

From: Siddharth Dawar [mailto:siddharthdawa...@gmail.com]
Sent: Tuesday, June 07, 2016 4:06 AM
To: user@hadoop.apache.org
Subject: Accessing files in Hadoop 2.7.2 Distributed Cache

Hi,

I want to use the distributed cache to allow my mappers to access data in Hadoop 2.7.2. In main, I'm using the following code:

    String hdfs_path = "hdfs://localhost:9000/bloomfilter";
    InputStream in = new BufferedInputStream(
        new FileInputStream("/home/siddharth/Desktop/data/bloom_filter"));
    Configuration conf = new Configuration();
    fs = FileSystem.get(java.net.URI.create(hdfs_path), conf);
    OutputStream out = fs.create(new Path(hdfs_path));

    // Copy file from local to HDFS
    IOUtils.copyBytes(in, out, 4096, true);
    System.out.println(hdfs_path + " copied to HDFS");

    DistributedCache.addCacheFile(new Path(hdfs_path).toUri(), conf2);

The above code adds a file present on my local file system to HDFS and adds it to the distributed cache.
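To make the symlink mechanic Jeff describes concrete: the link name Hadoop uses is just the fragment part of the cache URI. Below is a minimal, Hadoop-free sketch (the hdfs:// path and the "bloom" fragment are invented for the demo) showing how java.net.URI carries the fragment:

```java
import java.net.URI;

public class CacheFragmentDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical cache entry; the text after '#' is what Hadoop
        // would use as the symlink name in the task's working directory.
        URI cacheUri = new URI("hdfs://localhost:9000/bloomfilter#bloom");
        System.out.println(cacheUri.getPath());     // prints /bloomfilter
        System.out.println(cacheUri.getFragment()); // prints bloom
    }
}
```

With a URI like this passed to job.addCacheFile(), the task should be able to open the file simply by the short name "bloom" in its working directory.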
However, in my mapper code, when I try to access the file stored in the distributed cache, the Path[] variable p gets a null value:

    public void configure(JobConf conf) {
        this.conf = conf;
        try {
            Path[] p = DistributedCache.getLocalCacheFiles(conf);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

Even when I tried to access the distributed cache from the following code in my mapper, the code returns the error that the bloomfilter file doesn't exist:

    strm = new DataInputStream(new FileInputStream("bloomfilter"));
    // Read into our Bloom filter.
    filter.readFields(strm);
    strm.close();

However, I read somewhere that if we add a file to the distributed cache, we can access it directly by its name. Can you please help me out?
Can the Backup Node be deployed when dfs.http.policy is HTTPS_ONLY?
Hello,

I'm attempting to deploy a Backup Node [1] on a dev cluster where we specify that all HTTP communication must happen over SSL (dfs.http.policy = HTTPS_ONLY). The Backup Node fails to start with this exception:

    2016-06-07 14:01:01,243 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
    java.lang.NullPointerException
        at org.apache.hadoop.net.NetUtils.getHostPortString(NetUtils.java:651)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.setRegistration(NameNode.java:617)
        at org.apache.hadoop.hdfs.server.namenode.BackupNode.registerWith(BackupNode.java:366)
        at org.apache.hadoop.hdfs.server.namenode.BackupNode.initialize(BackupNode.java:162)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:838)
        at org.apache.hadoop.hdfs.server.namenode.BackupNode.<init>(BackupNode.java:89)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1519)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1606)

Earlier in the startup logs for the Backup Node, I see this line:

    2016-06-07 14:01:00,736 INFO org.apache.hadoop.hdfs.DFSUtil: Starting Web-server for hdfs at: https://0.0.0.0:50470

My understanding of this is that instead of running at the URL specified by dfs.namenode.backup.http-address, the Backup Node's web server is running at the URL specified by dfs.namenode.https-address. This seems consistent with the behavior of DFSUtil#httpServerTemplateForNNAndJN. Then, because the web server is not running at the expected HTTP URL, the call to getHttpAddress() in NameNode#setRegistration finds a null value for the httpAddress variable in NameNodeHttpServer.

When I set dfs.http.policy to HTTP_AND_HTTPS, the Backup Node happily starts up with its web server running at the URL specified by dfs.namenode.backup.http-address.

So my question is: Is it possible to deploy a Backup Node on a cluster where all HTTP communication must happen over SSL, and if so, how can I fix my configuration?
If not, are there plans to support HTTPS for the Backup Node, and would it make sense for me to file a ticket? (I couldn't find an existing one.)

Thanks,
Steve

[1] https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Backup_Node
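For anyone hitting the same wall, the workaround Steve describes amounts to something like the following hdfs-site.xml fragment. This is only a sketch of the HTTP_AND_HTTPS fallback, not a fix for strict HTTPS_ONLY clusters; the address shown is the stock default for dfs.namenode.backup.http-address.

```xml
<!-- hdfs-site.xml: lets the Backup Node register, at the cost of
     also serving plain HTTP (not acceptable on HTTPS_ONLY clusters) -->
<property>
  <name>dfs.http.policy</name>
  <value>HTTP_AND_HTTPS</value>
</property>
<property>
  <name>dfs.namenode.backup.http-address</name>
  <value>0.0.0.0:50105</value>
</property>
```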
Re: Usage of data node to run on commodity hardware
Hi Krishna!

I don't see why you couldn't start Hadoop in this configuration. Performance would obviously be suspect. Maybe by configuring your network topology script, you could even improve the performance.

Most mobiles use ARM processors. I know some cool people ran Hadoop v1 on Raspberry Pis (also ARM), but I don't know if Hadoop's performance-optimized native code has been run successfully on ARM. (Hadoop will use the native binaries if they are available; otherwise it falls back on Java implementations.)

HTH
Ravi

On Mon, Jun 6, 2016 at 7:41 PM, Krishna <ramakrishna.srinivas.mur...@gmail.com> wrote:
> Hi All,
>
> I am new to Hadoop, and I am thinking of a requirement; I don't know whether
> it is feasible or not. I want to run Hadoop in a non-cluster environment,
> meaning I want to run it on commodity hardware. I have one desktop machine
> with a higher CPU and memory configuration, and I have close to 20 laptops,
> all connected to the same network through wired or wireless connections. I
> want to use the desktop machine as the name node and the 20 laptops as data
> nodes. Will that be possible?
>
> To extend the question: is there any requirement for a data node in terms of
> system configuration? Nowadays mobiles also come with good RAM and CPUs; can
> we use mobiles as data nodes, provided Java is installed on the mobile?
>
> Thanks
> Ramakrishna S
Re: How to share files amongst multiple jobs using Distributed Cache in Hadoop 2.7.2
If you use an instance of the Job class, you can add files to the distributed cache like this:

    Job job = Job.getInstance(conf);
    job.addCacheFile(fileUri);

Sent from my iPhone

> On Jun 7, 2016, at 5:17 AM, Siddharth Dawar wrote:
>
> Hi,
>
> I wrote a program which creates Map-Reduce jobs in an iterative fashion as
> follows:
>
> while (true) {
>     JobConf conf2 = new JobConf(getConf(), graphMining.class);
>
>     conf2.setJobName("sid");
>     conf2.setMapperClass(mapperMiner.class);
>     conf2.setReducerClass(reducerMiner.class);
>
>     conf2.setInputFormat(SequenceFileInputFormat.class);
>     conf2.setOutputFormat(SequenceFileOutputFormat.class);
>     conf2.setOutputValueClass(BytesWritable.class);
>
>     conf2.setMapOutputKeyClass(Text.class);
>     conf2.setMapOutputValueClass(MapWritable.class);
>     conf2.setOutputKeyClass(Text.class);
>
>     conf2.setNumMapTasks(Integer.parseInt(args[3]));
>     conf2.setNumReduceTasks(Integer.parseInt(args[4]));
>     FileInputFormat.addInputPath(conf2, new Path(input));
>     FileOutputFormat.setOutputPath(conf2, new Path(output));
>
>     RunningJob job = JobClient.runJob(conf2);
> }
>
> Now, I want the first job which gets created to write something to the
> distributed cache, and the jobs which get created after the first job to
> read from the distributed cache.
>
> I came to know that the DistributedCache.addCacheFile() method is
> deprecated, so the documentation suggests using the Job.addCacheFile()
> method specific to each job.
>
> But I am unable to get a handle on the currently running job, as
> JobClient.runJob(conf2) submits the job internally.
>
> How can I make the content written by the first job in this while loop
> available via the distributed cache to the other jobs which get created in
> later iterations of the while loop?
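Building on that reply, the loop could be restructured around the new Job API so that each iteration owns a Job handle it can add cache files to before submission. This is only an untested sketch: it reuses the class and variable names from the original post (graphMining, mapperMiner, reducerMiner, input, output), assumes a hypothetical shared HDFS path written by the first job, and will not compile without Hadoop 2.x on the classpath.

```java
// Sketch only: assumed shared location the first job writes to.
Path shared = new Path("hdfs://localhost:9000/shared/firstJobOutput");
boolean firstIteration = true;

while (true) {
    Job job = Job.getInstance(getConf(), "sid");
    job.setJarByClass(graphMining.class);
    job.setMapperClass(mapperMiner.class);
    job.setReducerClass(reducerMiner.class);

    if (!firstIteration) {
        // Later iterations read what the first job wrote.
        job.addCacheFile(shared.toUri());
    }

    FileInputFormat.addInputPath(job, new Path(input));
    FileOutputFormat.setOutputPath(job, new Path(output));
    job.waitForCompletion(true);  // blocking, like JobClient.runJob
    firstIteration = false;
}
```

The key difference from the JobClient pattern is that the Job object exists before submission, so there is a handle to call addCacheFile() on.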
How to share files amongst multiple jobs using Distributed Cache in Hadoop 2.7.2
Hi,

I wrote a program which creates Map-Reduce jobs in an iterative fashion as follows:

    while (true) {
        JobConf conf2 = new JobConf(getConf(), graphMining.class);

        conf2.setJobName("sid");
        conf2.setMapperClass(mapperMiner.class);
        conf2.setReducerClass(reducerMiner.class);

        conf2.setInputFormat(SequenceFileInputFormat.class);
        conf2.setOutputFormat(SequenceFileOutputFormat.class);
        conf2.setOutputValueClass(BytesWritable.class);

        conf2.setMapOutputKeyClass(Text.class);
        conf2.setMapOutputValueClass(MapWritable.class);
        conf2.setOutputKeyClass(Text.class);

        conf2.setNumMapTasks(Integer.parseInt(args[3]));
        conf2.setNumReduceTasks(Integer.parseInt(args[4]));
        FileInputFormat.addInputPath(conf2, new Path(input));
        FileOutputFormat.setOutputPath(conf2, new Path(output));

        RunningJob job = JobClient.runJob(conf2);
    }

Now, I want the first job which gets created to write something to the distributed cache, and the jobs which get created after the first job to read from the distributed cache.

I came to know that the DistributedCache.addCacheFile() method is deprecated, so the documentation suggests using the Job.addCacheFile() method specific to each job.

But I am unable to get a handle on the currently running job, as JobClient.runJob(conf2) submits the job internally.

How can I make the content written by the first job in this while loop available via the distributed cache to the other jobs which get created in later iterations of the while loop?
Accessing files in Hadoop 2.7.2 Distributed Cache
Hi,

I want to use the distributed cache to allow my mappers to access data in Hadoop 2.7.2. In main, I'm using the following code:

    String hdfs_path = "hdfs://localhost:9000/bloomfilter";
    InputStream in = new BufferedInputStream(
        new FileInputStream("/home/siddharth/Desktop/data/bloom_filter"));
    Configuration conf = new Configuration();
    fs = FileSystem.get(java.net.URI.create(hdfs_path), conf);
    OutputStream out = fs.create(new Path(hdfs_path));

    // Copy file from local to HDFS
    IOUtils.copyBytes(in, out, 4096, true);
    System.out.println(hdfs_path + " copied to HDFS");

    DistributedCache.addCacheFile(new Path(hdfs_path).toUri(), conf2);

The above code adds a file present on my local file system to HDFS and adds it to the distributed cache.

However, in my mapper code, when I try to access the file stored in the distributed cache, the Path[] variable p gets a null value:

    public void configure(JobConf conf) {
        this.conf = conf;
        try {
            Path[] p = DistributedCache.getLocalCacheFiles(conf);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

Even when I tried to access the distributed cache from the following code in my mapper, the code returns the error that the bloomfilter file doesn't exist:

    strm = new DataInputStream(new FileInputStream("bloomfilter"));
    // Read into our Bloom filter.
    filter.readFields(strm);
    strm.close();

However, I read somewhere that if we add a file to the distributed cache, we can access it directly by its name. Can you please help me out?
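The "access it directly by its name" behavior relies on the NodeManager creating a symlink in the task's working directory pointing at the localized copy of the cache file. Here is a Hadoop-free sketch of that mechanic using only java.nio; the file and link names are invented for the demo and do not come from any Hadoop API:

```java
import java.nio.file.*;

public class SymlinkNameDemo {
    public static void main(String[] args) throws Exception {
        // Stand-in for the file the framework localizes from HDFS.
        Path localized = Files.createTempFile("bloom_filter", ".bin");
        Files.write(localized, "filter-bytes".getBytes());

        // Stand-in for the symlink Hadoop creates from the URI fragment.
        Path link = Paths.get("bloomfilter");
        Files.deleteIfExists(link);
        Files.createSymbolicLink(link, localized);

        // Task code can now open the file by the short name alone.
        System.out.println(new String(Files.readAllBytes(link)));

        Files.delete(link);
        Files.delete(localized);
    }
}
```

If the symlink was never created (for example, because the cache URI had no "#name" fragment, or the file was added to the wrong Configuration object), opening the short name fails exactly the way described above.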