RE: Hadoop Usecase
Hi Shreya,

We had a similar question and some discussions; this mailing thread may help you.
http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201112.mbox/%3cCALH6cCNSXQye8F6geJiUDu+30Q4==EOd1pmU+rzpj50_evC5=w...@mail.gmail.com%3e

Regards,
Ravi Teja

From: shreya@cognizant.com [shreya@cognizant.com]
Sent: 07 December 2011 13:47:02
To: common-user@hadoop.apache.org
Subject: Hadoop Usecase

Hi,

I am trying to implement the following use case. Is that possible in Hadoop, or would I have to use Hive or HBase?

Data comes in at hourly, 15-minute, 10-minute, and 5-minute intervals. The goals are:
1. Compare incoming data with the data stored in the existing system
2. Identify incremental changes
3. Classify each change as: Insert, Update, Delete
4. Load the incremental changes into the target system

Is this possible or efficient if Hadoop has to be used? Please advise.

Regards,
Shreya
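For what it's worth, goals 1-3 map naturally onto a reduce-side comparison of the two datasets. Below is a minimal sketch, assuming each record is a line of the form key<TAB>value and that the existing snapshot and the incoming batch sit in two HDFS directories; all class names and paths here are hypothetical, not anything from the thread:

    // Hypothetical sketch: classify incoming records against an existing
    // snapshot as INSERT / UPDATE / DELETE via a reduce-side comparison.
    // Assumes both inputs are "key<TAB>value" text and that the two input
    // directories are named "existing" and "incoming".
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DeltaDetector {

      public static class TagMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text line, Context ctx)
            throws IOException, InterruptedException {
          // Tag each record with its source, so the reducer can tell the
          // existing snapshot apart from the incoming batch.
          String dir = ((FileSplit) ctx.getInputSplit()).getPath().getParent().getName();
          String tag = dir.equals("existing") ? "OLD" : "NEW";
          String[] parts = line.toString().split("\t", 2);
          if (parts.length == 2) {
            ctx.write(new Text(parts[0]), new Text(tag + "\t" + parts[1]));
          }
        }
      }

      public static class DiffReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
          String oldVal = null, newVal = null;
          for (Text v : values) {
            String[] parts = v.toString().split("\t", 2);
            if (parts[0].equals("OLD")) oldVal = parts[1]; else newVal = parts[1];
          }
          if (oldVal == null)              ctx.write(key, new Text("INSERT\t" + newVal));
          else if (newVal == null)         ctx.write(key, new Text("DELETE"));
          else if (!oldVal.equals(newVal)) ctx.write(key, new Text("UPDATE\t" + newVal));
          // Identical records produce no output; only the delta is emitted.
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "delta-detector");
        job.setJarByClass(DeltaDetector.class);
        job.setMapperClass(TagMapper.class);
        job.setReducerClass(DiffReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("/data/existing"));
        FileInputFormat.addInputPath(job, new Path("/data/incoming"));
        FileOutputFormat.setOutputPath(job, new Path("/data/delta"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The same classification can also be expressed as a full outer join in Hive, which may be less code if the data is already in tables.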
RE: Running a job continuously
Hi Burak,

Just to add to Bejoy's point: with Oozie, you can specify a data dependency for running your job. When a specific amount of data is in, you can configure Oozie to run your job. I think this will satisfy your requirement.

Regards,
Ravi Teja

From: burakkk [burak.isi...@gmail.com]
Sent: 06 December 2011 04:03:59
To: mapreduce-u...@hadoop.apache.org
Cc: common-user@hadoop.apache.org
Subject: Re: Running a job continuously

Athanasios Papaoikonomou, a cron job isn't useful for me, because I want to execute the MR job with the same algorithm, but different files arrive at different velocities. Both Storm and Facebook's Hadoop are designed for that, but I want to use the Apache distribution.

Bejoy Ks, I have a continuous inflow of data, but I think I need a near real-time system.

Mike Spreitzer, both output and input are continuous. The output isn't relevant to the input. All I want is for all incoming files to be processed by the same job and the same algorithm. For example, think about the wordcount problem. When you want to run wordcount, you implement this: http://wiki.apache.org/hadoop/WordCount But when the program reaches job.waitForCompletion(true), the job ends. When you want to make it continuous, what do you do in Hadoop without other tools? One more thing: assume that the input file's name is filename_timestamp (filename_20111206_0030).

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
}

On Mon, Dec 5, 2011 at 11:19 PM, Bejoy Ks bejoy.had...@gmail.com wrote:

Burak
If you have a continuous inflow of data, you can choose Flume to aggregate the files into larger sequence files or so if they are small, and push the data onto HDFS once you have a substantial chunk (equal to the HDFS block size). Based on your SLAs, you need to schedule your jobs using Oozie or a simple shell script. In very simple terms:
- push input data (could be from a Flume collector) into a staging hdfs dir
- before triggering the job (hadoop jar), copy the input from staging to the main input dir
- execute the job
- archive the input and output into archive dirs (any other dirs)
- the output archive dir could be the source of output data
- delete the output dir and empty the input dir
Hope it helps!...
Regards
Bejoy.K.S

On Tue, Dec 6, 2011 at 2:19 AM, burakkk burak.isi...@gmail.com wrote:

Hi everyone,
I want to run an MR job continuously, because I have streaming data and I try to analyze it all the time in my way (algorithm). For example, say you want to solve the wordcount problem; it's the simplest one :) If you have some files and new files keep arriving, how do you handle it? You could execute an MR job per file, but you would have to do it repeatedly. So what do you think?
Thanks
Best regards...

--
BURAK ISIKLI | http://burakisikli.wordpress.com
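Putting Bejoy's steps and the WordCount skeleton together: below is a minimal sketch of the polling approach as a plain Java driver loop, without Oozie. The directory layout, the assumption that those directories already exist, and the one-minute polling interval are all hypothetical, not anything from the thread:

    // Hypothetical driver that re-runs WordCount whenever new files land in
    // a staging dir: move staged files to the input dir, run the job into a
    // fresh output dir, archive the processed input, then poll again.
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ContinuousWordCount {

      public static class Map extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context ctx)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            ctx.write(word, ONE);
          }
        }
      }

      public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get();
          ctx.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path staging = new Path("/wordcount/staging");  // flume or clients drop files here
        Path input   = new Path("/wordcount/input");
        Path archive = new Path("/wordcount/archive");

        while (true) {
          FileStatus[] ready = fs.listStatus(staging);
          if (ready != null && ready.length > 0) {
            // Move staged files into the job's input dir, so files arriving
            // mid-run are left for the next iteration.
            for (FileStatus f : ready) {
              fs.rename(f.getPath(), new Path(input, f.getPath().getName()));
            }
            // A fresh timestamped output dir per run, since MapReduce
            // refuses to overwrite an existing output dir.
            Path output = new Path("/wordcount/output/" + System.currentTimeMillis());
            Job job = new Job(conf, "wordcount");
            job.setJarByClass(ContinuousWordCount.class);
            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, input);
            FileOutputFormat.setOutputPath(job, output);
            if (job.waitForCompletion(true)) {
              // Archive the processed input so the next run starts clean.
              for (FileStatus f : fs.listStatus(input)) {
                fs.rename(f.getPath(), new Path(archive, f.getPath().getName()));
              }
            }
          }
          Thread.sleep(60 * 1000);  // poll every minute; tune to your SLA
        }
      }
    }

Oozie's coordinator gives you the same effect declaratively (trigger on data availability), which is why it is usually preferred over a hand-rolled loop in production.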
RE: HDFS Explained as Comics
That's indeed a great piece of work, Maneesh... Waiting for the mapreduce comic :)

Regards,
Ravi Teja

From: Dieter Plaetinck [dieter.plaeti...@intec.ugent.be]
Sent: 01 December 2011 15:11:36
To: common-user@hadoop.apache.org
Subject: Re: HDFS Explained as Comics

Very clear. The comic format works quite well indeed. I never considered comics a serious (professional) way to get something explained efficiently, but this shows people should think twice before they start writing their next documentation.

One question though: if a DN has a corrupted block, why does the NN only remove the bad DN from the block's list, and not the block from the DN's list? (Also, does it really store the data in 2 separate tables? These look to me like 2 different views of the same data.)

Dieter

On Thu, 1 Dec 2011 08:53:31 +0100, Alexander C.H. Lorenz wget.n...@googlemail.com wrote:

Hi all,
Very cool comic!
Thanks,
Alex

On Wed, Nov 30, 2011 at 11:58 PM, Abhishek Pratap Singh manu.i...@gmail.com wrote:

Hi,
This is indeed a good way to explain. Most of the improvements have already been discussed; waiting for a sequel of this comic.
Regards,
Abhishek

On Wed, Nov 30, 2011 at 1:55 PM, maneesh varshney mvarsh...@gmail.com wrote:

Hi Matthew
I agree with both you and Prashant. The strip needs to be modified to explain that these are default values that can be optionally overridden (which I will fix in the next iteration). However, from the 'understanding concepts of HDFS' point of view, I still think that block size and replication factor are the real strengths of HDFS, and learners must be exposed to them so that they see how HDFS differs significantly from conventional file systems.
On a personal note: thanks for the first part of your message :)
-Maneesh

On Wed, Nov 30, 2011 at 1:36 PM, GOEKE, MATTHEW (AG/1000) matthew.go...@monsanto.com wrote:

Maneesh,
Firstly, I love the comic :)
Secondly, I am inclined to agree with Prashant on this latest point. While one code path could take us through the user defining command-line overrides (e.g. hadoop fs -D blah -put foo bar), I think it might confuse a person new to Hadoop. The most common flow would be using admin-determined values from hdfs-site, and the only thing that would need to change is that the conversation happens between client / server and not user / client.
Matt

-----Original Message-----
From: Prashant Kommireddi [mailto:prash1...@gmail.com]
Sent: Wednesday, November 30, 2011 3:28 PM
To: common-user@hadoop.apache.org
Subject: Re: HDFS Explained as Comics

Sure, it's just a case of how readers interpret it:
1. The client is required to specify block size and replication factor each time, or
2. The client does not need to worry about it, since an admin has set the properties in the default configuration files.
A client would not be allowed to override the default configs if they are marked final (though there are ways to get around that as well, as you suggest, by using create()).
The information is great and helpful. I just want to make sure a beginner who wants to write a WordCount in MapReduce does not worry about specifying 'block size' and 'replication factor' in his code.
Thanks,
Prashant

On Wed, Nov 30, 2011 at 1:18 PM, maneesh varshney mvarsh...@gmail.com wrote:

Hi Prashant
Others may correct me if I am wrong here...
The client (org.apache.hadoop.hdfs.DFSClient) has knowledge of block size and replication factor. In the source code, I see the following in the DFSClient constructor:

defaultBlockSize = conf.getLong("dfs.block.size", DEFAULT_BLOCK_SIZE);
defaultReplication = (short) conf.getInt("dfs.replication", 3);

My understanding is that the client considers the following chain for the values:
1. Manual values (the long-form constructor, when a user provides these values)
2. Configuration file values (these are cluster-level defaults: dfs.block.size and dfs.replication)
3. Finally, the hardcoded values (DEFAULT_BLOCK_SIZE and 3)
Moreover, in org.apache.hadoop.hdfs.protocol.ClientProtocol the API to create a file is

void create(..., short replication, long blocksize);

I presume this means the client already has knowledge of these values and passes them to the NameNode when creating a new file.
Hope that helps.
thanks
-Maneesh

On Wed, Nov 30, 2011 at 1:04 PM, Prashant Kommireddi prash1...@gmail.com wrote:

Thanks Maneesh. Quick question, does a client
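To make the chain Maneesh describes concrete, here is a small client-side sketch; the paths and values are only illustrative. The short create() overload falls back to the configured (or hardcoded) defaults, while the long-form overload passes explicit replication and block size straight through to the NameNode for that file alone:

    // Illustrative only: the two create() paths a client can take.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CreateWithOverrides {
      public static void main(String[] args) throws Exception {
        // Picks up dfs.block.size / dfs.replication from hdfs-site.xml if set.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Cases 2/3: rely on the configured or hardcoded defaults.
        FSDataOutputStream a = fs.create(new Path("/tmp/with-defaults.txt"));
        a.close();

        // Case 1: the long-form call overrides the defaults for this file only.
        FSDataOutputStream b = fs.create(new Path("/tmp/with-overrides.txt"),
            true,             // overwrite
            4096,             // buffer size
            (short) 2,        // replication factor
            32 * 1024 * 1024  // block size: 32 MB
        );
        b.close();
      }
    }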
RE: Hadoop Metrics
Hi Paolo,

If you are using a version later than 0.23, you can refer to the Hadoop Definitive Guide, 2nd Edition. The metrics in the latest versions have changed a little, and documentation for Next Gen MapReduce is still awaited.

Regards,
Ravi Teja

From: Paolo Rodeghiero [paolo@gmail.com]
Sent: 18 November 2011 00:05:07
To: common-user@hadoop.apache.org
Subject: Hadoop Metrics

Hi,
I'm developing on top of Hadoop for my Master's thesis, which aims to connect theoretical MapReduce modelling to more practical ground. Part of the project is about analyzing the impact of some parameters on Hadoop internals. To do so, I need a deep understanding of the metrics produced. Is there any sort of documentation about what each metric exactly is? I looked for it but was unable to find it. Should I ask the dev list instead?

Cheers,
Paolo