RE: Hadoop Usecase

2011-12-07 Thread Ravi teja ch n v
Hi Shreya,

We had a similar question and some discussions; this mailing thread may help
you:

http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201112.mbox/%3cCALH6cCNSXQye8F6geJiUDu+30Q4==EOd1pmU+rzpj50_evC5=w...@mail.gmail.com%3e

Regards,
Ravi Teja

From: shreya@cognizant.com [shreya@cognizant.com]
Sent: 07 December 2011 13:47:02
To: common-user@hadoop.apache.org
Subject: Hadoop Usecase

Hi,

I am trying to implement the following use case. Is that possible in
Hadoop, or would I have to use Hive or HBase?

Data comes in at hourly, 15-minute, 10-minute and 5-minute intervals.
The goals are:
1. Compare incoming data with the data stored in the existing system
2. Identify the incremental changes
3. Classify each change as Insert, Update or Delete
4. Load the incremental changes into the target system

Is this possible, and would it be efficient, if Hadoop has to be used?
Please advise.

Regards,
Shreya
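
For what it's worth, goals 1-3 above map naturally onto a reduce-side comparison in plain MapReduce: the mappers tag each record with its source (existing snapshot vs. incoming batch), and a reducer classifies each key. The sketch below is only an illustration under assumed formats -- tab-separated key/value records, an incoming feed that is a full snapshot, and made-up class names:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Values arrive tagged by the mappers as "old<TAB>value" (existing system)
// or "new<TAB>value" (incoming data), grouped by the record's primary key.
public class ChangeClassifierReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context ctx)
      throws IOException, InterruptedException {
    String oldVal = null;
    String newVal = null;
    for (Text v : values) {
      String s = v.toString();
      if (s.startsWith("old\t")) {
        oldVal = s.substring(4);
      } else if (s.startsWith("new\t")) {
        newVal = s.substring(4);
      }
    }
    if (oldVal == null && newVal != null) {
      ctx.write(key, new Text("INSERT\t" + newVal));   // only in the incoming batch
    } else if (oldVal != null && newVal == null) {
      ctx.write(key, new Text("DELETE\t" + oldVal));   // missing from the incoming batch
    } else if (oldVal != null && !oldVal.equals(newVal)) {
      ctx.write(key, new Text("UPDATE\t" + newVal));   // present in both but changed
    }
    // identical records produce no output, so the job emits only the deltas
  }
}

The emitted INSERT/UPDATE/DELETE records would then feed whatever loads the target system (goal 4).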


RE: Running a job continuously

2011-12-05 Thread Ravi teja ch n v
Hi Burak,

"Bejoy Ks, I have a continuous inflow of data, but I think I need a near
real-time system."

Just to add to Bejoy's point:
with Oozie, you can specify a data dependency for running your job.
When a specific amount of data has arrived, you can configure Oozie to run your job.
I think this will satisfy your requirement.

Regards,
Ravi Teja


From: burakkk [burak.isi...@gmail.com]
Sent: 06 December 2011 04:03:59
To: mapreduce-u...@hadoop.apache.org
Cc: common-user@hadoop.apache.org
Subject: Re: Running a job continuously

Athanasios Papaoikonomou, a cron job isn't useful for me, because I want to
execute the MR job with the same algorithm while different files arrive at
different velocities.

Both Storm and Facebook's Hadoop are designed for that, but I want to use the
Apache distribution.

Bejoy Ks, I have a continuous inflow of data, but I think I need a near
real-time system.

Mike Spreitzer, both the output and the input are continuous. The output isn't
relevant to the input. All I want is for every incoming file to be processed
by the same job and the same algorithm.
For example, think about the wordcount problem. When you want to run
wordcount, you implement something like this:
http://wiki.apache.org/hadoop/WordCount

But when the program reaches job.waitForCompletion(true);, the job eventually
ends. When you want to make it continuous, what would you do in Hadoop without
other tools?
One more thing: assume that the input file's name is filename_timestamp
(e.g. filename_20111206_0030).

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);   // the driver blocks here and exits once this single job finishes
}
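
One tool-free pattern for the "continuous" part of the question is a driver that keeps polling an input directory and resubmits the same job for each new batch of timestamped files. The sketch below is only an illustration: the paths, the one-minute polling interval, and the deletion of consumed files are all assumptions, and the mapper/reducer settings elided in the comment are exactly those of the snippet above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ContinuousWordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path incoming = new Path(args[0]);   // e.g. where filename_20111206_0030-style files land
    Path outBase  = new Path(args[1]);   // one output dir is created per processed batch

    while (true) {                       // poll forever; shutdown handling is omitted
      FileStatus[] batch = fs.listStatus(incoming);
      if (batch == null || batch.length == 0) {
        Thread.sleep(60 * 1000L);        // nothing new yet, wait a minute and look again
        continue;
      }
      Job job = new Job(conf, "wordcount");
      job.setJarByClass(ContinuousWordCountDriver.class);
      // configure mapper, reducer, key/value and format classes exactly as in
      // the WordCount driver quoted above
      for (FileStatus f : batch) {
        FileInputFormat.addInputPath(job, f.getPath());   // only the files seen in this poll
      }
      FileOutputFormat.setOutputPath(job, new Path(outBase, "run_" + System.currentTimeMillis()));
      job.waitForCompletion(true);       // blocks for this batch; the loop then picks up the next one
      for (FileStatus f : batch) {
        fs.delete(f.getPath(), false);   // or move the consumed files to an archive dir instead
      }
    }
  }
}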

On Mon, Dec 5, 2011 at 11:19 PM, Bejoy Ks bejoy.had...@gmail.com wrote:

 Burak
 If you have a continuous inflow of data, you can choose Flume to aggregate
 the files into larger sequence files (or similar) if they are small, and push
 that data onto HDFS once you have a substantial chunk (roughly equal to the
 HDFS block size). Based on your SLAs, you then need to schedule your jobs
 using Oozie or a simple shell script. In very simple terms:
 - push input data (could be from a Flume collector) into a staging HDFS dir
 - before triggering the job (hadoop jar), copy the input from staging to the
   main input dir
 - execute the job
 - archive the input and output into archive dirs (or any other dirs); the
   output archive dir can then be the source of the output data
 - delete the output dir and empty the input dir

 Hope it helps!...

 Regards
 Bejoy.K.S
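
A rough sketch of the directory shuffle Bejoy describes, using the HDFS FileSystem API; the paths and the timestamped archive directory are assumptions, and the job itself would still be launched separately (hadoop jar, Oozie, or a shell script):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StagingShuffle {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path staging = new Path("/data/staging");                       // e.g. Flume collector writes here
    Path input   = new Path("/data/input");                         // the job reads from here
    Path archive = new Path("/data/archive/" + System.currentTimeMillis());

    // move staged files into the main input dir before triggering the job
    fs.mkdirs(input);
    for (FileStatus f : fs.listStatus(staging)) {
      fs.rename(f.getPath(), new Path(input, f.getPath().getName()));
    }

    // ... run the job here (hadoop jar / Oozie / Job.waitForCompletion) ...

    // afterwards, archive the consumed input so the next cycle starts clean
    fs.mkdirs(archive.getParent());
    fs.rename(input, archive);
  }
}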

 On Tue, Dec 6, 2011 at 2:19 AM, burakkk burak.isi...@gmail.com wrote:

 Hi everyone,
 I want to run an MR job continuously, because I have streaming data and I
 try to analyze it all the time with my own algorithm. For example, say you
 want to solve the wordcount problem. It's the simplest one :) If you have
 multiple files and new files keep arriving, how do you handle it? You could
 execute an MR job per file, but you would have to do it repeatedly. So what
 do you think?

 Thanks
 Best regards...

 --
 BURAK ISIKLI | http://burakisikli.wordpress.com





--
BURAK ISIKLI | http://burakisikli.wordpress.com


RE: HDFS Explained as Comics

2011-12-01 Thread Ravi teja ch n v
That's indeed a great piece of work, Maneesh... Waiting for the MapReduce comic :)

Regards,
Ravi Teja

From: Dieter Plaetinck [dieter.plaeti...@intec.ugent.be]
Sent: 01 December 2011 15:11:36
To: common-user@hadoop.apache.org
Subject: Re: HDFS Explained as Comics

Very clear. The comic format indeed works quite well.
I never considered comics a serious (professional) way to get something
explained efficiently, but this shows people should think twice before they
start writing their next documentation.

One question though: if a DN has a corrupted block, why does the NN only
remove the bad DN from the block's list, and not the block from the DN's list?
(Also, does it really store the data in 2 separate tables? This looks to me
like 2 different views of the same data.)

Dieter

On Thu, 1 Dec 2011 08:53:31 +0100
Alexander C.H. Lorenz wget.n...@googlemail.com wrote:

 Hi all,

 very cool comic!

 Thanks,
  Alex

 On Wed, Nov 30, 2011 at 11:58 PM, Abhishek Pratap Singh
 manu.i...@gmail.com
  wrote:

  Hi,
 
  This is indeed a good way to explain; most of the improvements have
  already been discussed. Waiting for the sequel of this comic.
 
  Regards,
  Abhishek
 
  On Wed, Nov 30, 2011 at 1:55 PM, maneesh varshney
  mvarsh...@gmail.com
  wrote:
 
   Hi Matthew

   I agree with both you and Prashant. The strip needs to be modified to
   explain that these can be default values that can be optionally overridden
   (which I will fix in the next iteration).

   However, from the 'understanding the concepts of HDFS' point of view, I
   still think that block size and replication factor are the real strengths
   of HDFS, and learners must be exposed to them so that they get to see how
   HDFS is significantly different from conventional file systems.

   On a personal note: thanks for the first part of your message :)

   -Maneesh
  
  
   On Wed, Nov 30, 2011 at 1:36 PM, GOEKE, MATTHEW (AG/1000) 
   matthew.go...@monsanto.com wrote:
  
    Maneesh,

    Firstly, I love the comic :)

    Secondly, I am inclined to agree with Prashant on this latest point.
    While one code path could take us through the user defining command-line
    overrides (e.g. hadoop fs -D blah -put foo bar), I think it might confuse
    a person new to Hadoop. The most common flow would be using
    admin-determined values from hdfs-site, and the only thing that would
    need to change is that the conversation happens between client / server
    and not user / client.

    Matt
   
-----Original Message-----
From: Prashant Kommireddi [mailto:prash1...@gmail.com]
Sent: Wednesday, November 30, 2011 3:28 PM
To: common-user@hadoop.apache.org
Subject: Re: HDFS Explained as Comics
   
    Sure, it's just a case of how readers interpret it:

      1. The client is required to specify block size and replication factor
         each time.
      2. The client does not need to worry about it, since an admin has set
         the properties in the default configuration files.

    A client would not be allowed to override the default configs if they are
    set final (well, there are ways to go around that as well, as you suggest,
    by using create() :)

    The information is great and helpful. I just want to make sure a beginner
    who wants to write a WordCount in MapReduce does not worry about
    specifying block size and replication factor in his code.

    Thanks,
    Prashant
   
On Wed, Nov 30, 2011 at 1:18 PM, maneesh varshney
mvarsh...@gmail.com
wrote:
   
 Hi Prashant

 Others may correct me if I am wrong here..

 The client (org.apache.hadoop.hdfs.DFSClient) has knowledge of block size
 and replication factor. In the source code, I see the following in the
 DFSClient constructor:

     defaultBlockSize = conf.getLong("dfs.block.size", DEFAULT_BLOCK_SIZE);
     defaultReplication = (short) conf.getInt("dfs.replication", 3);

 My understanding is that the client considers the following chain for the
 values:
 1. Manual values (the long-form constructor; when a user provides these
    values)
 2. Configuration file values (these are cluster-level defaults:
    dfs.block.size and dfs.replication)
 3. Finally, the hardcoded values (DEFAULT_BLOCK_SIZE and 3)

 Moreover, in org.apache.hadoop.hdfs.protocol.ClientProtocol the API to
 create a file is

     void create(..., short replication, long blocksize);

 I presume it means that the client already has knowledge of these values
 and passes them to the NameNode when creating a new file.

 Hope that helps.

 thanks
 -Maneesh
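
To make the "long form" concrete, here is a small illustrative sketch (the path and values are made up) of a client overriding replication and block size for one file through the public FileSystem API; with the short create(Path) form, the cluster defaults from the configuration apply instead:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithOverrides {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // picks up dfs.block.size / dfs.replication defaults
    FileSystem fs = FileSystem.get(conf);

    // Long-form create: per-file replication and block size, overriding the
    // cluster defaults (unless the admin has marked those properties final).
    FSDataOutputStream out = fs.create(
        new Path("/user/demo/sample.txt"),             // made-up path
        true,                                          // overwrite if it already exists
        conf.getInt("io.file.buffer.size", 4096),      // buffer size
        (short) 2,                                     // replication factor for this file only
        64L * 1024 * 1024);                            // 64 MB block size for this file only
    out.writeBytes("hello hdfs\n");
    out.close();
  }
}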

 On Wed, Nov 30, 2011 at 1:04 PM, Prashant Kommireddi 
prash1...@gmail.com
 wrote:

  Thanks Maneesh.
 
  Quick question, does a client 

RE: Hadoop Metrics

2011-11-22 Thread Ravi teja ch n v
Hi Paolo,

If you are using versions later than 23.0, then you can refer to the Hadoop
Definitive Guide, 2nd Edition. The metrics in the latest versions have changed
a little bit, and documentation for Next Gen MapReduce is still awaited.

Regards,
Ravi Teja


From: Paolo Rodeghiero [paolo@gmail.com]
Sent: 18 November 2011 00:05:07
To: common-user@hadoop.apache.org
Subject: Hadoop Metrics

Hi,
I'm developing on top of Hadoop for my Master's thesis, which aims to
connect theoretical MapReduce modelling to more practical ground.
Part of the project is about analyzing the impact of some parameters
on Hadoop internals.

To do so, I need a deeper understanding of the metrics produced.
Is there some sort of documentation about what each metric exactly means?
I looked for it but was unable to find it.
Should I ask on the dev list?

Cheers,
Paolo