RE: Accessing files in Hadoop 2.7.2 Distributed Cache

2016-06-07 Thread Guttadauro, Jeff
Hi, Siddharth.

I was also a bit frustrated at what I found to be scant documentation on how to 
use the distributed cache in Hadoop 2.  The DistributedCache class itself was 
deprecated in Hadoop 2, but there don’t appear to be very clear instructions on 
the alternative.  I think it’s actually much simpler to work with files on the 
distributed cache in Hadoop 2.  The new way is to add files to the cache (or 
cacheArchive) via the Job object:

job.addCacheFile(uriForYourFile);
job.addCacheArchive(uriForYourArchive);

The cool part is that, if you set up your URI so that it has a 
“#yourFileReference” at the end, then Hadoop will set up a symbolic link named 
“yourFileReference” in your job’s working directory, which you can use to get 
at the file or archive.  So, it's as if the file or archive were in the working 
directory.  That obviates the need to work with the DistributedCache class in your 
Mapper or Reducer at all, since you can just work with the file (or a java.nio 
path) directly.
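
For example, here's a minimal, self-contained sketch of that pattern (the class 
names, HDFS path, and file contents below are just placeholders, not anything from 
your code):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheExample {

    public static class CacheMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void setup(Context context) throws IOException {
            // Because the cache URI below ends in "#bloomfilter", Hadoop creates a
            // symlink named "bloomfilter" in the task's working directory, so plain
            // file I/O is all that's needed here.
            try (BufferedReader reader = new BufferedReader(new FileReader("bloomfilter"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // ... build whatever in-memory structure you need from the cached file ...
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // ... per-record logic that uses the structure built in setup() ...
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "cache-example");
        job.setJarByClass(CacheExample.class);
        job.setMapperClass(CacheMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Placeholder HDFS location; the "#bloomfilter" fragment names the symlink.
        job.addCacheFile(new URI("hdfs://localhost:9000/bloomfilter#bloomfilter"));
        // ... set input/output formats and paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}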

Hope that helps.
-Jeff
From: Siddharth Dawar [mailto:siddharthdawa...@gmail.com]
Sent: Tuesday, June 07, 2016 4:06 AM
To: user@hadoop.apache.org
Subject: Accessing files in Hadoop 2.7.2 Distributed Cache

Hi,
I want to use the distributed cache to allow my mappers to access data in 
Hadoop 2.7.2. In main, I'm using the command

String hdfs_path="hdfs://localhost:9000/bloomfilter";

InputStream in = new BufferedInputStream(new 
FileInputStream("/home/siddharth/Desktop/data/bloom_filter"));

Configuration conf = new Configuration();

fs = FileSystem.get(java.net.URI.create(hdfs_path), conf);

OutputStream out = fs.create(new Path(hdfs_path));



//Copy file from local to HDFS

IOUtils.copyBytes(in, out, 4096, true);



System.out.println(hdfs_path + " copied to HDFS");

DistributedCache.addCacheFile(new Path(hdfs_path).toUri(), conf2);



The above code adds a file present on my local file system to HDFS and adds it 
to the distributed cache.


However, in my mapper code, when I try to access the file stored in the 
distributed cache, the Path[] p variable gets a null value.

public void configure(JobConf conf)
{
    this.conf = conf;
    try {
        Path[] p = DistributedCache.getLocalCacheFiles(conf);
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}

Even when I tried to access the distributed cache with the following code in my 
mapper, the code returns an error saying the bloomfilter file doesn't exist:

strm = new DataInputStream(new FileInputStream("bloomfilter"));

// Read into our Bloom filter.

filter.readFields(strm);

strm.close();

However, I read somewhere that if we add a file to the distributed cache, we can 
access it directly by its name.

Can you please help me out ?



Can the Backup Node be deployed when dfs.http.policy is HTTPS_ONLY?

2016-06-07 Thread Steven Rand
Hello,

I'm attempting to deploy a Backup Node [1] on a dev cluster where we
specify that all HTTP communication must happen over SSL (dfs.http.policy =
HTTPS_ONLY). The Backup Node fails to start with this exception:

2016-06-07 14:01:01,243 ERROR
org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
java.lang.NullPointerException
at org.apache.hadoop.net.NetUtils.getHostPortString(NetUtils.java:651)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.setRegistration(NameNode.java:617)
at
org.apache.hadoop.hdfs.server.namenode.BackupNode.registerWith(BackupNode.java:366)
at
org.apache.hadoop.hdfs.server.namenode.BackupNode.initialize(BackupNode.java:162)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:838)
at
org.apache.hadoop.hdfs.server.namenode.BackupNode.<init>(BackupNode.java:89)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1519)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1606)

Earlier in the startup logs for the backup node, I see this line:

2016-06-07 14:01:00,736 INFO org.apache.hadoop.hdfs.DFSUtil: Starting
Web-server for hdfs at: https://0.0.0.0:50470

My understanding of this is that instead of running at the URL specified by
dfs.namenode.backup.http-address, the Backup Node's web server is running
at the URL specified by dfs.namenode.https-address. This seems consistent
with the behavior of DFSUtil#httpServerTemplateForNNAndJN. Then, because
the web server is not running at the expected HTTP URL, the call to
getHttpAddress() in NameNode#setRegistration finds a null value for the
httpAddress variable in NameNodeHttpServer.

When I set dfs.http.policy to HTTP_AND_HTTPS, the Backup Node happily
starts up with its web server running at the URL specified by
dfs.namenode.backup.http-address. So my question is: Is it possible to
deploy a Backup Node on a cluster where all HTTP communication must happen
over SSL, and if so, how can I fix my configurations? If not, are there
plans to support HTTPS for the Backup Node, and would it make sense for me
to file a ticket (I couldn't find an existing one)?
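
For reference, here's a sketch of the relevant hdfs-site.xml properties (the values 
shown are just the defaults, for illustration; not my exact configs):

<!-- With HTTPS_ONLY here, the Backup Node fails as described above;
     HTTP_AND_HTTPS lets it start. -->
<property>
  <name>dfs.http.policy</name>
  <value>HTTP_AND_HTTPS</value>
</property>
<property>
  <name>dfs.namenode.backup.http-address</name>
  <value>0.0.0.0:50105</value>
</property>
<property>
  <name>dfs.namenode.https-address</name>
  <value>0.0.0.0:50470</value>
</property>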

Thanks,
Steve


[1]
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Backup_Node


Unsubscribe

2016-06-07 Thread Whit Waldo



Re: Usage of data node to run on commodity hardware

2016-06-07 Thread Ravi Prakash
Hi Krishna!

I don't see why you couldn't start Hadoop in this configuration.
Performance would obviously be suspect. Maybe by configuring your network
topology script, you could even improve the performance.

Most mobiles use ARM processors. I know some cool people ran Hadoop v1 on
Raspberry Pis (also ARM), but I don't know if Hadoop's performance-optimized
native code has been run successfully on ARM. (Hadoop will use the native
binaries if they are available, and otherwise fall back on Java
implementations.)

HTH
Ravi

On Mon, Jun 6, 2016 at 7:41 PM, Krishna <
ramakrishna.srinivas.mur...@gmail.com> wrote:

> Hi All,
>
> I am new to Hadoop, and I am thinking of a requirement but don't know whether it
> is feasible or not. I want to run Hadoop in a non-cluster environment, meaning on
> commodity hardware. I have one desktop machine with a higher CPU and memory
> configuration, and I have close to 20 laptops, all connected to the same network
> through wired or wireless connections. I want to use the desktop machine as the
> name node and the 20 laptops as data nodes. Will that be possible?
>
> Extending this: is there any requirement for a data node in terms of system
> configuration? Nowadays mobiles also come with good RAM and CPU; can we use
> mobiles as data nodes, provided Java is installed on them?
>
> Thanks
> Ramakrishna S
>


[no subject]

2016-06-07 Thread Anit Alexander
unsubscribe


Unsubscribe

2016-06-07 Thread Dhanashri Desai



Re: How to share files amongst multiple jobs using Distributed Cache in Hadoop 2.7.2

2016-06-07 Thread Arun Natva
If you use an instance of the Job class, you can add files to the distributed cache 
like this:
Job job = Job.getInstance(conf);
job.addCacheFile(filepath);
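
For the iterative case in your quoted message below, here is a rough sketch of how 
the loop could look with the Job API (assuming the mapper/reducer are ported to the 
org.apache.hadoop.mapreduce API; the class and variable names here are placeholders):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {

    // Each iteration builds its own Job, registers the shared HDFS file with the
    // cache, and blocks on waitForCompletion(), so you always hold the job handle.
    public static void run(Configuration conf, String input, String outputBase,
                           String sharedHdfsFile) throws Exception {
        int iteration = 0;
        while (iteration < 10) {  // placeholder stopping condition
            Job job = Job.getInstance(conf, "iteration-" + iteration);
            job.setJarByClass(IterativeDriver.class);
            // Set your mapper/reducer, formats, and key/value classes here,
            // as in the quoted code below.
            if (iteration > 0) {
                // File written to HDFS by an earlier iteration, shared via the cache;
                // the "#shared" fragment names the symlink in the task working directory.
                job.addCacheFile(new URI(sharedHdfsFile + "#shared"));
            }
            FileInputFormat.addInputPath(job, new Path(input));
            FileOutputFormat.setOutputPath(job, new Path(outputBase + "-" + iteration));
            if (!job.waitForCompletion(true)) {
                throw new RuntimeException("Iteration " + iteration + " failed");
            }
            iteration++;
        }
    }
}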


Sent from my iPhone

> On Jun 7, 2016, at 5:17 AM, Siddharth Dawar  
> wrote:
> 
> Hi,
> 
> I wrote a program which creates Map-Reduce jobs in an iterative fashion as 
> follows:
> 
> 
> while (true) 
> {
> JobConf conf2  = new JobConf(getConf(),graphMining.class);
> 
> conf2.setJobName("sid");
> conf2.setMapperClass(mapperMiner.class);
> conf2.setReducerClass(reducerMiner.class);
> 
> conf2.setInputFormat(SequenceFileInputFormat.class);
> conf2.setOutputFormat(SequenceFileOutputFormat.class);
> conf2.setOutputValueClass(BytesWritable.class);
> 
> conf2.setMapOutputKeyClass(Text.class);
> conf2.setMapOutputValueClass(MapWritable.class);
> conf2.setOutputKeyClass(Text.class);
> 
> conf2.setNumMapTasks(Integer.parseInt(args[3]));
> conf2.setNumReduceTasks(Integer.parseInt(args[4]));
> FileInputFormat.addInputPath(conf2, new Path(input));
> FileOutputFormat.setOutputPath(conf2, new Path(output));
> RunningJob job = JobClient.runJob(conf2);
> }
> 
> Now, I want the first Job which gets created to write something in the 
> distributed cache and the jobs which get created after the first job to read 
> from the distributed cache. 
> 
> I came to know that the DistributedCache.addCacheFile() method is 
> deprecated, and the documentation suggests using the Job.addCacheFile() method 
> specific to each job.
> 
> But I am unable to get a handle on the currently running job, as 
> JobClient.runJob(conf2) submits the job internally.
> 
> 
> How can I make the content written by the first job in this while loop 
> available via the distributed cache to the jobs created in later 
> iterations of the loop?
> 


How to share files amongst multiple jobs using Distributed Cache in Hadoop 2.7.2

2016-06-07 Thread Siddharth Dawar
Hi,

I wrote a program which creates Map-Reduce jobs in an iterative fashion as
follows:


while (true)
{
    JobConf conf2 = new JobConf(getConf(), graphMining.class);

    conf2.setJobName("sid");
    conf2.setMapperClass(mapperMiner.class);
    conf2.setReducerClass(reducerMiner.class);

    conf2.setInputFormat(SequenceFileInputFormat.class);
    conf2.setOutputFormat(SequenceFileOutputFormat.class);
    conf2.setOutputValueClass(BytesWritable.class);

    conf2.setMapOutputKeyClass(Text.class);
    conf2.setMapOutputValueClass(MapWritable.class);
    conf2.setOutputKeyClass(Text.class);

    conf2.setNumMapTasks(Integer.parseInt(args[3]));
    conf2.setNumReduceTasks(Integer.parseInt(args[4]));
    FileInputFormat.addInputPath(conf2, new Path(input));
    FileOutputFormat.setOutputPath(conf2, new Path(output));

    RunningJob job = JobClient.runJob(conf2);
}


Now, I want the first Job which gets created to write something in the
distributed cache and the jobs which get created after the first job to
read from the distributed cache.

I came to know that the DistributedCache.addCacheFile() method is
deprecated, and the documentation suggests using the Job.addCacheFile() method
specific to each job.

But I am unable to get a handle on the currently running job, as
JobClient.runJob(conf2) submits the job internally.


How can I make the content written by the first job in this while loop
available via the distributed cache to the jobs created in later
iterations of the loop?


Accessing files in Hadoop 2.7.2 Distributed Cache

2016-06-07 Thread Siddharth Dawar
Hi,

I want to use the distributed cache to allow my mappers to access data in
Hadoop 2.7.2. In main, I'm using the command

String hdfs_path = "hdfs://localhost:9000/bloomfilter";
InputStream in = new BufferedInputStream(new FileInputStream("/home/siddharth/Desktop/data/bloom_filter"));
Configuration conf = new Configuration();
fs = FileSystem.get(java.net.URI.create(hdfs_path), conf);
OutputStream out = fs.create(new Path(hdfs_path));

// Copy file from local to HDFS
IOUtils.copyBytes(in, out, 4096, true);

System.out.println(hdfs_path + " copied to HDFS");

DistributedCache.addCacheFile(new Path(hdfs_path).toUri(), conf2);


The above code adds a file present on my local file system to HDFS and
adds it to the distributed cache.


However, in my mapper code, when I try to access the file stored in the
distributed cache, the Path[] p variable gets a null value.


public void configure(JobConf conf) {
    this.conf = conf;
    try {
        Path[] p = DistributedCache.getLocalCacheFiles(conf);
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}

Even when I tried to access the distributed cache with the following code in my
mapper, the code returns an error saying the bloomfilter file doesn't exist:

strm = new DataInputStream(new FileInputStream("bloomfilter"));
// Read into our Bloom filter.
filter.readFields(strm);
strm.close();

However, I read somewhere that if we add a file to the distributed cache, we can
access it directly by its name.

Can you please help me out ?