Re: namenode restart

2012-06-07 Thread Abhishek Pratap Singh
Hi Rita,

What kind of clients do you have? Please make sure that no job is running,
and especially that no data is being copied, while you do the restart.
Also, are you restarting by using stop-dfs.sh and start-dfs.sh?

Regards,
Abhishek

On Thu, Jun 7, 2012 at 3:29 AM, Rita rmorgan...@gmail.com wrote:

 I am running Hadoop 0.22 and I need to restart the namenode so that my new
 rack configuration takes effect. I am thinking of doing a quick stop and
 start of the namenode, but what will happen to the current clients? Can
 they tolerate a 30-second hiccup by retrying?

 --
 --- Get your facts first, then you can distort them as you please.--



Re: Ideal file size

2012-06-07 Thread Abhishek Pratap Singh
Almost all the answers are already provided in this post. My 2 cents: try to
have file sizes that are a multiple of the block size, so that fewer mappers
are needed during processing and job performance is better.
You can also merge files in HDFS later on for processing (a rough sketch
follows below).
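
A minimal sketch of such a merge using FileUtil.copyMerge; the paths here are
made-up examples, and the call simply concatenates every file under the source
directory into one target file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeSmallFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Concatenate every file under /data/incoming into /data/merged/part-0,
    // deleting the small source files once the merge succeeds.
    FileUtil.copyMerge(fs, new Path("/data/incoming"),
                       fs, new Path("/data/merged/part-0"),
                       true,  // delete the source files
                       conf,
                       null); // no separator string between files
  }
}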

Regards,
Abhishek



On Thu, Jun 7, 2012 at 7:29 AM, M. C. Srivas mcsri...@gmail.com wrote:

 On Wed, Jun 6, 2012 at 10:14 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

  On Wed, Jun 6, 2012 at 9:48 AM, M. C. Srivas mcsri...@gmail.com wrote:
 
   There are many more factors to consider than just the size of the file. How
   long can you wait before you *have to* process the data? 5 minutes? 5 hours?
   5 days? If you want good timeliness, you need to roll over faster. The
   longer you wait:

   1. the lesser the load on the NN,
   2. but the poorer the timeliness,
   3. and the larger the chance of lost data (i.e., the data is not saved
      until the file is closed and rolled over, unless you want to sync()
      after every write; a minimal sync() sketch follows below).
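
   A minimal sketch of that sync-per-write idea from point 3, using the
   FSDataOutputStream API of that era (the path is a made-up example):

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.FSDataOutputStream;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;

   public class SyncingWriter {
     public static void main(String[] args) throws Exception {
       FileSystem fs = FileSystem.get(new Configuration());
       FSDataOutputStream out = fs.create(new Path("/logs/events.log"));
       out.writeBytes("one event record\n");
       out.sync();  // push the bytes to the datanodes so they survive a writer crash
       out.close(); // closing finalizes the file when it is rolled over
     }
   }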
  
   To begin with, I was going to use Flume and specify a rollover file size. I
   understand the above parameters; I just want to ensure that too many small
   files don't cause problems on the NameNode. For instance, there would be
   times when we get GBs of data in an hour and at times only a few hundred MB.
   From what Harsh, Edward and you have described, it doesn't cause issues with
   the NameNode, but rather an increase in processing times if there are too
   many small files. Looks like I need to find that balance.
 
   It would also be interesting to see how others solve this problem when not
   using Flume.
 


 They use NFS with MapR.

 Any and all log rotators (like the one in log4j) simply work over NFS,
 and MapR does not have a NN, so the problems with small files or the number
 of files do not exist.



 
 
  
  
   On Wed, Jun 6, 2012 at 7:00 AM, Mohit Anchlia mohitanch...@gmail.com
   wrote:
  
    We have a continuous flow of data into a sequence file. I am wondering
    what the ideal file size would be before the file gets rolled over. I know
    too many small files are not good, but could someone tell me what the
    ideal size would be such that it doesn't overload the NameNode?
   
  
 



Re: dfs.replication factor for MR jobs

2012-05-17 Thread Abhishek Pratap Singh
Hi Aishwarya,

The temporary output of the mappers is used by the reducers, and the number of
reduce tasks that receive data is based on the output keys of the mappers. It
has nothing to do with the replication factor. It is writing to three nodes
because at least three keys have been generated by the mappers, and their
reducers have been assigned to three different nodes.
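
If the goal is only that the job's final output in HDFS be written with a
single replica, a minimal sketch of setting that from the client/job side
(property name as in 0.20-era configs; the intermediate map output lives on
local disk, not in HDFS, so it is not governed by this property):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SingleReplicaJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Ask HDFS to keep only one replica of the files this job writes.
    // (Cluster-wide, the same property would go in hdfs-site.xml.)
    conf.setInt("dfs.replication", 1);
    Job job = new Job(conf, "replication-experiment");
    // ... configure mapper, reducer, input and output paths as usual,
    // then call job.waitForCompletion(true).
    System.out.println("Requested replication: "
        + job.getConfiguration().get("dfs.replication"));
  }
}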

Regards,
Abhishek

On Thu, May 17, 2012 at 2:06 PM, Aishwarya Venkataraman 
avenk...@cs.ucsd.edu wrote:

 Hello,

 I have a 4-node cluster. One namenode and 3 other datanodes. I want to
 explicitly set the dfs.replication factor to 1 inorder to run some
 experiments. I tried setting this via the hdfs-site.xml file and via
 the command line as well (hadoop dfs -setrep -R -w 1 /). But I have a
 feeling that the replication factor that hdfs is seeing is 3. It seems
 to be writing the temporary mapper outputs to all the 3 datanodes. Is
 this the default configuration for MR jobs ? If no, how can I set this
 to 1 ?

 Thanks,
 Aishwarya



Re: is hadoop suitable for us?

2012-05-17 Thread Abhishek Pratap Singh
Hi,

To your question of whether Hadoop can be used without HDFS: the answer is yes.
Hadoop can be used with other kinds of distributed file systems.
But I am not able to understand the problem statement clearly enough to give my
point of view.
Are you processing text files and saving the results in a distributed database?
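
For what it's worth, a minimal sketch of pointing Hadoop at a shared POSIX
mount instead of HDFS (the mount path is a made-up assumption; this gives up
data locality, so it only makes sense when the storage really is shared and
visible at the same path on every node):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SharedStorageCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Use the local/shared file system instead of HDFS.
    conf.set("fs.default.name", "file:///");
    FileSystem fs = FileSystem.get(conf);
    System.out.println("Working against: " + fs.getUri());
    System.out.println("Data dir exists? " + fs.exists(new Path("/shared/data")));
  }
}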

Regards,
Abhishek

On Thu, May 17, 2012 at 1:46 PM, Pierre Antoine Du Bois De Naurois 
pad...@gmail.com wrote:

 We want to distribute the processing of text files, run large machine
 learning tasks, have a distributed database since we have a big amount of
 data, and so on.

 The problem is that each VM can hold up to 2TB of data (a limitation of the
 VM), and we have 20TB of data. So we have to distribute the processing, the
 database, etc. But all of that data will live in one huge shared central file
 system.

 We heard about myHadoop, but we are not sure how that is any different from
 Hadoop.

 Can we run hadoop/mapreduce without using HDFS? Is that an option?

 best,
 PA


 2012/5/17 Mathias Herberts mathias.herbe...@gmail.com

  Hadoop does not perform well with shared storage and vms.
 
  The question should be asked first regarding what you're trying to achieve,
  not about your infra.
  On May 17, 2012 10:39 PM, Pierre Antoine Du Bois De Naurois 
  pad...@gmail.com wrote:
 
   Hello,
  
    We have about 50 VMs and we want to distribute processing across them.
    However, these VMs share a huge data storage system, and thus their virtual
    HDDs are all located on the same machine. Would Hadoop be useful for such a
    configuration? Could we use Hadoop without HDFS, so that we can retrieve
    and store everything in the same storage?
  
   Thanks,
   PA
  
 



Re: Resource underutilization / final reduce tasks only uses half of cluster ( tasktracker map/reduce slots )

2012-05-14 Thread Abhishek Pratap Singh
Hi JD,

The number of reduce tasks that do useful work depends upon the keys emitted
once all the mappers are done. If every record has the same key, then all the
data will go to one node; similarly, utilization of all the nodes in the
cluster will depend upon the number of distinct keys reaching the reduce tasks.
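
A minimal sketch of why that is, using the default HashPartitioner: every
occurrence of the same key hashes to the same partition, so one hot key keeps
a single reducer busy no matter how many reduce slots exist (the key string
here is a made-up example):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class PartitionDemo {
  public static void main(String[] args) {
    HashPartitioner<Text, IntWritable> partitioner =
        new HashPartitioner<Text, IntWritable>();
    int numReduceTasks = 10;
    // Both records carry the same key, so both land in the same partition
    // and therefore go to the same reducer.
    System.out.println(partitioner.getPartition(
        new Text("hot-key"), new IntWritable(1), numReduceTasks));
    System.out.println(partitioner.getPartition(
        new Text("hot-key"), new IntWritable(2), numReduceTasks));
  }
}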


Regards,
Abhishek

On Fri, May 11, 2012 at 4:57 PM, Jeremy Davis
jda...@upstreamsoftware.comwrote:


 I see mapred.tasktracker.reduce.tasks.maximum and
 mapred.tasktracker.map.tasks.maximum, but I'm wondering if there isn't
 another tuning parameter I need to look at.

 I can tune the task tracker so that when I have many jobs running, with
 many simultaneous maps and reduces I utilize 95% of cpu and memory.

 Inevitably, though, I end up with a huge final reduce task that only uses
 half of my cluster, because I have reserved the other half for mapping.

 Is there a way around this problem?

 Seems like there should also be a maximum number of reducers conditional
 on no Map tasks running.

 -JD


Re: Moving files from JBoss server to HDFS

2012-05-14 Thread Abhishek Pratap Singh
What about using Flume to consume the files and send them to HDFS? It depends
upon what security implications you are concerned about.
It will be easier to advise if you can answer the questions below.

Is the JBoss server process pushing files to HDFS, or is some other process
pushing them?
Is the intermediate server doing some security check on the data or on user
access? What processing is done on the intermediate server? If none, how is it
different from sending the files directly to HDFS?
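
If the pull model is chosen, a minimal sketch of the final copy step on the
intermediate server, using the FileSystem API (the NameNode address and paths
are made-up examples; the files are assumed to have been fetched from the
JBoss servers already, e.g. via rsync or scp):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PushToHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode:9000"); // assumed address
    FileSystem fs = FileSystem.get(conf);
    // Copy a locally staged file into HDFS.
    fs.copyFromLocalFile(new Path("/staging/app.log.2012-05-12"),
                         new Path("/data/logs/app.log.2012-05-12"));
    fs.close();
  }
}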

Regards,
Abhishek

On Sat, May 12, 2012 at 3:56 AM, samir das mohapatra 
samir.help...@gmail.com wrote:

 Hi financeturd financet...@yahoo.com,
    From my point of view, the second approach, like the one below, is the
 good one:

 {Separate server} -- {JBoss server}
 and then
 {Separate server} -- HDFS

 thanks
   samir

 On Sat, May 12, 2012 at 6:00 AM, financeturd financeturd 
 financet...@yahoo.com wrote:

  Hello,
 
  We have a large number of
  custom-generated files (not just web logs) that we need to move from our
  JBoss servers to HDFS.  Our first implementation ran a cron job every 5
  minutes to move our files from the output directory to HDFS.
 
  Is this recommended?  We are being told by our IT team that our JBoss
  servers should not have access to HDFS for security reasons.  The files
  must be sucked to HDFS by other servers that do not accept traffic
  from the outside.  In essence, they are asking for a layer of
  indirection.  Instead of:
  {JBoss server} -- {HDFS}
  it's being requested that it look like:
  {Separate server} -- {JBoss server}
  and then
  {Separate server} -- HDFS
 
 
  While I understand in principle what is being said, the security of
 having
  processes on JBoss servers writing files to HDFS doesn't seem any worse
  than having Tomcat servers access a central database, which they do.
 
  Can anyone comment on what a recommended approach would be?  Should our
  JBoss servers push their data to HDFS or should the data be pulled by
  another server and then placed into HDFS?
 
  Thank you!
  FT



Re: Number of Reduce Tasks

2012-05-14 Thread Abhishek Pratap Singh
AFAIK, the number of reducers that actually receive data depends upon the keys
generated after the mappers are done. Maybe the join is resulting in only one
key.

Regards,
Abhishek
On Sun, May 13, 2012 at 10:48 PM, anwar shaikh anwardsha...@gmail.comwrote:

 Hi Everybody,

 I am executing a MapReduce job to execute JOIN operation using
 org.apache.hadoop.contrib.utils.join

 Four files are given as Input.

 I think there are four Map Jobs running (based on the line marked in red ).

 I have also set the number of reducers to 10 using job.setNumReduceTasks(10).

 But only one reduce task is performed (line marked in blue).

 So, Please can you suggest how can I increase the number of reducers ?

 Below are some of the last lines from the log.


 -
 12/05/14 10:32:46 INFO mapred.Task: Task '*attempt_local_0001_m_03_0*'
 done.
 12/05/14 10:32:46 INFO mapred.LocalJobRunner:
 12/05/14 10:32:46 INFO mapred.Merger: Merging 4 sorted segments
 12/05/14 10:32:46 INFO mapred.Merger: Down to the last merge-pass, with 4
 segments left of total size: 8018 bytes
 12/05/14 10:32:46 INFO mapred.LocalJobRunner:
 12/05/14 10:32:46 INFO datajoin.job: key: 1 this.largestNumOfValues: 48
 12/05/14 10:32:46 INFO mapred.Task: Task:attempt_local_0001_r_00_0 is
 done. And is in the process of commiting
 12/05/14 10:32:46 INFO mapred.LocalJobRunner:
 12/05/14 10:32:46 INFO mapred.Task: Task attempt_local_0001_r_00_0 is
 allowed to commit now
 12/05/14 10:32:46 INFO mapred.FileOutputCommitter: Saved output of task
 'attempt_local_0001_r_00_0' to
 file:/home/anwar/workspace/JoinLZOPfiles/OutLarge
 12/05/14 10:32:49 INFO mapred.LocalJobRunner: actuallyCollectedCount 86
 collectedCount 86
 groupCount 25
   reduce
 12/05/14 10:32:49 INFO mapred.Task: Task
 '*attempt_local_0001_r_00_0'*done.
 12/05/14 10:32:50 INFO mapred.JobClient:  map 100% reduce 100%
 12/05/14 10:32:50 INFO mapred.JobClient: Job complete: job_local_0001
 12/05/14 10:32:50 INFO mapred.JobClient: Counters: 17
 12/05/14 10:32:50 INFO mapred.JobClient:   File Input Format Counters
 12/05/14 10:32:50 INFO mapred.JobClient: Bytes Read=1666
 12/05/14 10:32:50 INFO mapred.JobClient:   File Output Format Counters
 12/05/14 10:32:50 INFO mapred.JobClient: Bytes Written=2421
 12/05/14 10:32:50 INFO mapred.JobClient:   FileSystemCounters
 12/05/14 10:32:50 INFO mapred.JobClient: FILE_BYTES_READ=22890
 12/05/14 10:32:50 INFO mapred.JobClient: FILE_BYTES_WRITTEN=194702
 12/05/14 10:32:50 INFO mapred.JobClient:   Map-Reduce Framework
 12/05/14 10:32:50 INFO mapred.JobClient: Map output materialized
 bytes=8034
 12/05/14 10:32:50 INFO mapred.JobClient: Map input records=106
 12/05/14 10:32:50 INFO mapred.JobClient: Reduce shuffle bytes=0
 12/05/14 10:32:50 INFO mapred.JobClient: Spilled Records=212
 12/05/14 10:32:50 INFO mapred.JobClient: Map output bytes=7798
 12/05/14 10:32:50 INFO mapred.JobClient: Map input bytes=1666
 12/05/14 10:32:50 INFO mapred.JobClient: SPLIT_RAW_BYTES=472
 12/05/14 10:32:50 INFO mapred.JobClient: Combine input records=0
 12/05/14 10:32:50 INFO mapred.JobClient: Reduce input records=106
 12/05/14 10:32:50 INFO mapred.JobClient: Reduce input groups=25
 12/05/14 10:32:50 INFO mapred.JobClient: Combine output records=0
 12/05/14 10:32:50 INFO mapred.JobClient: Reduce output records=86
 12/05/14 10:32:50 INFO mapred.JobClient: Map output records=106

 --
 Mr. Anwar Shaikh
 Delhi Technological University, Delhi
 +91 92 50 77 12 44



Re: Multiple data centre in Hadoop

2012-04-17 Thread Abhishek Pratap Singh
Thanks Bobby, I am looking for something like this. Now the question is what
the best strategy is for doing hot/hot or hot/warm.
I need to consider CPU and network bandwidth, and also need to decide from
which layer this replication should start.

Regards,
Abhishek

On Mon, Apr 16, 2012 at 7:08 AM, Robert Evans ev...@yahoo-inc.com wrote:

 Hi Abhishek,

 Manu is correct about High Availability within a single colo.  I realize
 that in some cases you have to have fail over between colos.  I am not
 aware of any turn key solution for things like that, but generally what you
 want to do is to run two clusters, one in each colo, either hot/hot or
 hot/warm, and I have seen both depending on how quickly you need to fail
 over.  In hot/hot the input data is replicated to both clusters and the
  same software is run on both.  In this case, though, you have to be fairly
  sure that your processing is deterministic, or the results could be
  slightly different (i.e., no generating of random ids).  In hot/warm the
 data is replicated from one colo to the other at defined checkpoints.  The
 data is only processed on one of the grids, but if that colo goes down the
 other one can take up the processing from where ever the last checkpoint
 was.
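
  Not a turnkey solution, but a minimal sketch of the hot/warm checkpoint copy
  between two clusters using the FileSystem API (NameNode addresses and paths
  are made-up; for large data sets the distcp tool, which does this as a
  parallel MR job, is the usual choice):

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.FileUtil;
  import org.apache.hadoop.fs.Path;

  public class CheckpointCopy {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem primary = FileSystem.get(URI.create("hdfs://nn-colo-a:9000"), conf);
      FileSystem standby = FileSystem.get(URI.create("hdfs://nn-colo-b:9000"), conf);
      Path checkpoint = new Path("/data/checkpoints/2012-04-16");
      // Copy the checkpointed directory from the hot colo to the warm one.
      FileUtil.copy(primary, checkpoint, standby, checkpoint,
                    false,  // keep the source
                    conf);
    }
  }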

 I hope that helps.

 --Bobby

 On 4/12/12 5:07 AM, Manu S manupk...@gmail.com wrote:

 Hi Abhishek,

 1. Use multiple directories for dfs.name.dir & dfs.data.dir etc.
    * Recommendation: write to two local directories on different physical
      volumes, and to an NFS-mounted directory
      - Data will be preserved even in the event of a total failure of the
        NameNode machines
    * Recommendation: soft-mount the NFS directory
      - If the NFS mount goes offline, this will not cause the NameNode to fail

 2. Rack awareness

 https://issues.apache.org/jira/secure/attachment/12345251/Rack_aware_HDFS_proposal.pdf

 On Thu, Apr 12, 2012 at 2:18 AM, Abhishek Pratap Singh
 manu.i...@gmail.comwrote:

  Thanks Robert.
  Is there a best practice or design than can address the High Availability
  to certain extent?
 
  ~Abhishek
 
  On Wed, Apr 11, 2012 at 12:32 PM, Robert Evans ev...@yahoo-inc.com
  wrote:
 
   No it does not. Sorry
  
  
   On 4/11/12 1:44 PM, Abhishek Pratap Singh manu.i...@gmail.com
 wrote:
  
   Hi All,
  
   Just wanted if hadoop supports more than one data centre. This is
  basically
   for DR purposes and High Availability where one centre goes down other
  can
   bring up.
  
  
   Regards,
   Abhishek
  
  
 



 --
  Thanks & Regards

  Manu S
  SI Engineer - OpenSource & HPC
  Wipro Infotech
  Mob: +91 8861302855  Skype: manuspkd
 www.opensourcetalk.co.in




Re: Multiple data centre in Hadoop

2012-04-11 Thread Abhishek Pratap Singh
Thanks Robert.
Is there a best practice or design that can address the High Availability
to a certain extent?

~Abhishek

On Wed, Apr 11, 2012 at 12:32 PM, Robert Evans ev...@yahoo-inc.com wrote:

 No it does not. Sorry


 On 4/11/12 1:44 PM, Abhishek Pratap Singh manu.i...@gmail.com wrote:

 Hi All,

 Just wanted to know if Hadoop supports more than one data centre. This is
 basically for DR purposes and High Availability, where if one centre goes
 down the other can be brought up.


 Regards,
 Abhishek




Zero Byte file in HDFS

2012-03-26 Thread Abhishek Pratap Singh
Hi All,

I was just going through an implementation scenario for avoiding or deleting
zero-byte files in HDFS. I am using Hive partitioned tables where the data in
each partition comes from an INSERT OVERWRITE command using a SELECT from a
few other tables.
Sometimes 0-byte files are generated in those partitions, and over time the
number of these files in HDFS will increase enormously, decreasing the
performance of Hadoop jobs on that table / folder. I am looking for the best
way to avoid generating the zero-byte files, or to delete them.

I can think of a few ways to implement this:

1) Programmatically, using the FileSystem object to find and delete the
zero-byte files (a rough sketch follows below).
2) Using a combination of hadoop fs and Linux commands to identify the
zero-byte files and delete them.
3) LazyOutputFormat (applicable to Hadoop-based custom jobs).
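
A minimal sketch of option 1, assuming the table or partition directory is
passed as the first argument:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ZeroByteCleaner {
  // Recursively delete empty files under the given directory.
  public static void clean(FileSystem fs, Path dir) throws Exception {
    for (FileStatus status : fs.listStatus(dir)) {
      if (status.isDir()) {
        clean(fs, status.getPath());
      } else if (status.getLen() == 0) {
        fs.delete(status.getPath(), false); // false = not recursive
      }
    }
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    clean(fs, new Path(args[0])); // e.g. the Hive table/partition directory
  }
}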

Kindly guide on efficient ways to achieve the same.

Regards,
Abhishek


Re: Zero Byte file in HDFS

2012-03-26 Thread Abhishek Pratap Singh
Thanks Bejoy for this input. Will this merging combine all the small files up
to at least the block size before the very first mappers of the Hive job?
Well, I will explore this; my interest in deleting the zero-byte files from
HDFS comes from reducing the cost of bookkeeping these files in the system. The
metadata for each file in HDFS occupies roughly 150 bytes of NameNode memory,
so thousands or millions of zero-byte files in HDFS will cost a lot.

Regards,
Abhishek


On Mon, Mar 26, 2012 at 2:27 PM, Bejoy KS bejoy.had...@gmail.com wrote:

 Hi Abhishek,
   I can propose a better solution: enable merge in Hive, so that the smaller
 files would be merged up to at least the HDFS block size (your choice), which
 would also benefit subsequent Hive jobs on the same data.

 Regards
 Bejoy KS

 Sent from handheld, please excuse typos.

 -Original Message-
 From: Abhishek Pratap Singh manu.i...@gmail.com
 Date: Mon, 26 Mar 2012 14:20:18
 To: common-user@hadoop.apache.org; u...@hive.apache.org
 Reply-To: common-user@hadoop.apache.org
 Subject: Zero Byte file in HDFS

 Hi All,

 I was just going through the implementation scenario of avoiding or
 deleting Zero byte file in HDFS. I m using Hive partition table where the
 data in partition come from INSERT OVERWRITE command using the SELECT from
 few other tables.
 Sometimes 0 Byte files are being generated in those partitions and during
 the course of time the amount of these files in the HDFS will increase
 enormously, decreasing the performance of hadoop job on that table /
 folder. I m looking for best way to avoid generation or deleting the zero
 byte file.

 I can think of few ways to implement this

 1) Programmatically using the Filesystem object and cleaning the zero byte
 file.
 2) Using Hadoop fs and Linux command combination to identify the zero byte
 file and delete it.
 3) LazyOutputFormat (Applicable in Hadoop based custom jobs).

 Kindly guide on efficient ways to achieve the same.

 Regards,
 Abhishek




Re: Zero Byte file in HDFS

2012-03-26 Thread Abhishek Pratap Singh
This sounds good as long as the output of the select query has at least one
row. But in my case it can be zero rows.

Thanks,
Abhishek

On Mon, Mar 26, 2012 at 2:48 PM, Bejoy KS bejoy.had...@gmail.com wrote:

 Hi Abhishek,
 Merging happens as the last stage of Hive jobs. Say your Hive query is
 translated into n MR jobs; when you enable merge you can set a size threshold
 (usually the block size). So after the n MR jobs there will be a map-only job,
 automatically triggered by Hive, which merges the smaller files into larger
 ones. The intermediate output files are not retained by Hive, and hence only
 the final, large enough files remain in HDFS.

 Regards
 Bejoy KS

 Sent from handheld, please excuse typos.
 --
  From: Abhishek Pratap Singh manu.i...@gmail.com
  Date: Mon, 26 Mar 2012 14:41:53 -0700
  To: common-user@hadoop.apache.org; bejoy.had...@gmail.com
  Cc: u...@hive.apache.org
  Subject: Re: Zero Byte file in HDFS

 Thanks Bejoy for this input, Does this merging will combine all the small
 files to least block size for the very first mappers of the hive job?
 Well i ll explore on this, my interest on deleting the zero byte files
 from HDFS comes from reducing the cost of Bookkeeping these files in
 system. The metadata of any files in HDFS occupy approx 150kb space,
 considering thousand or millions of zero byte file in HDFS will cost a lot.

 Regards,
 Abhishek


 On Mon, Mar 26, 2012 at 2:27 PM, Bejoy KS bejoy.had...@gmail.com wrote:

 Hi Abshikek
   I can propose a better solution. Enable merge in hive. So that the
 smaller files would be merged to at lest the hdfs block size(your choice)
 and would benefit subsequent hive jobs on the same.

 Regards
 Bejoy KS

 Sent from handheld, please excuse typos.

 -Original Message-
 From: Abhishek Pratap Singh manu.i...@gmail.com
 Date: Mon, 26 Mar 2012 14:20:18
 To: common-user@hadoop.apache.org; u...@hive.apache.org
 Reply-To: common-user@hadoop.apache.org
 Subject: Zero Byte file in HDFS

 Hi All,

 I was just going through the implementation scenario of avoiding or
 deleting Zero byte file in HDFS. I m using Hive partition table where the
 data in partition come from INSERT OVERWRITE command using the SELECT from
 few other tables.
 Sometimes 0 Byte files are being generated in those partitions and during
 the course of time the amount of these files in the HDFS will increase
 enormously, decreasing the performance of hadoop job on that table /
 folder. I m looking for best way to avoid generation or deleting the zero
 byte file.

 I can think of few ways to implement this

 1) Programmatically using the Filesystem object and cleaning the zero byte
 file.
 2) Using Hadoop fs and Linux command combination to identify the zero byte
 file and delete it.
 3) LazyOutputFormat (Applicable in Hadoop based custom jobs).

 Kindly guide on efficient ways to achieve the same.

 Regards,
 Abhishek





Regarding pointers for LZO compression in Hive and Hadoop

2011-12-14 Thread Abhishek Pratap Singh
Hi,

I am looking for some useful docs on enabling LZO on a Hadoop cluster. I tried
a few of the blogs, but somehow it's not working.
Here is my requirement:

I have Hadoop 0.20.2 and Hive 0.6, and some tables with 1.5 TB of data. I want
to compress them using LZO and enable LZO in Hive as well as in Hadoop.
Let me know if you have any useful docs or pointers for the same.
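
In case it helps others reading the archive, a rough sketch of the job-side
codec settings, assuming the third-party hadoop-lzo library
(com.hadoop.compression.lzo) and its native libraries are installed on every
node; cluster-wide, the same properties normally go in core-site.xml:

import org.apache.hadoop.conf.Configuration;

public class LzoConfSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Register the LZO codecs alongside the built-in ones. The class names
    // come from the hadoop-lzo project and are assumed to be on the classpath.
    conf.set("io.compression.codecs",
        "org.apache.hadoop.io.compress.DefaultCodec,"
        + "org.apache.hadoop.io.compress.GzipCodec,"
        + "com.hadoop.compression.lzo.LzoCodec,"
        + "com.hadoop.compression.lzo.LzopCodec");
    conf.set("io.compression.codec.lzo.class",
        "com.hadoop.compression.lzo.LzoCodec");
    System.out.println(conf.get("io.compression.codecs"));
  }
}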


Regards,
Abhishek


Re: Running a job continuously

2011-12-05 Thread Abhishek Pratap Singh
Hi Burak,

Hadoop's model is very different: it is job based, or in simpler words a kind
of batch model, where a MapReduce job is executed on a batch of data that is
already present.
For your requirement, the word count example doesn't make sense if the file is
being written to continuously.
However, word count per hour or per minute does make sense as a MapReduce
program.
I second what Bejoy has mentioned, which is to use Flume, aggregate the data,
and then run MapReduce.
Hadoop can give you near real time by using Flume with MapReduce, where you
can run a MapReduce job on the Flume-dumped data every few minutes.
A second option is to see if your problem can be solved by a Flume decorator
itself, for a real-time experience.

Regards,
Abhishek

On Mon, Dec 5, 2011 at 2:33 PM, burakkk burak.isi...@gmail.com wrote:

 Athanasios Papaoikonomou, a cron job isn't useful for me, because I want to
 execute the MR job with the same algorithm but the different files have
 different velocities.

 Both Storm and Facebook's Hadoop are designed for that, but I want to use the
 Apache distribution.

 Bejoy Ks, I have a continuous inflow of data, and I think I need a near
 real-time system.

 Mike Spreitzer, both output and input are continuous. The output isn't
 relevant to the input. All I want is for all the incoming files to be
 processed by the same job and the same algorithm.
 For example, think about the wordcount problem. When you want to run
 wordcount, you implement this:
 http://wiki.apache.org/hadoop/WordCount

 But when the program reaches the line job.waitForCompletion(true);, the job
 will eventually end. When you want to make it run continuously, what will you
 do in Hadoop without other tools?
 One more thing: assume that the input file's name is
 filename_timestamp (filename_20111206_0030).

 public static void main(String[] args) throws Exception {
     Configuration conf = new Configuration();
     Job job = new Job(conf, "wordcount");
     job.setOutputKeyClass(Text.class);
     job.setOutputValueClass(IntWritable.class);
     job.setMapperClass(Map.class);
     job.setReducerClass(Reduce.class);
     job.setInputFormatClass(TextInputFormat.class);
     job.setOutputFormatClass(TextOutputFormat.class);
     FileInputFormat.addInputPath(job, new Path(args[0]));
     FileOutputFormat.setOutputPath(job, new Path(args[1]));
     job.waitForCompletion(true);
 }

 On Mon, Dec 5, 2011 at 11:19 PM, Bejoy Ks bejoy.had...@gmail.com wrote:

  Burak
  If you have a continuous inflow of data, you can choose Flume to aggregate
  the files into larger sequence files or the like if they are small, and push
  that data onto HDFS when you have a substantial chunk of data (equal to the
  HDFS block size). Based on your SLAs, you need to schedule your jobs using
  Oozie or a simple shell script. In very simple terms (a rough sketch of the
  copy/rotate steps follows below):
  - push input data (could be from a Flume collector) into a staging HDFS dir
  - before triggering the job (hadoop jar), copy the input from staging to the
  main input dir
  - execute the job
  - archive the input and output into archive dirs (or any other dirs);
  the output archive dir could be the source of the output data
  - delete the output dir and empty the input dir
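
  A minimal sketch of those staging steps with the FileSystem API (directory
  names are made-up examples; launching the job itself is left as a comment):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class RotateStaging {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      Path staging = new Path("/flume/staging");
      Path input = new Path("/wordcount/input");
      Path archive = new Path("/wordcount/archive/" + System.currentTimeMillis());
      // Move whatever has been collected so far into the job's input dir.
      for (FileStatus f : fs.listStatus(staging)) {
        fs.rename(f.getPath(), new Path(input, f.getPath().getName()));
      }
      // ... launch the MR job here (hadoop jar / Job.waitForCompletion) ...
      // Afterwards, archive the processed input and start with an empty dir.
      fs.rename(input, archive);
      fs.mkdirs(input);
    }
  }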
 
  Hope it helps!...
 
  Regards
  Bejoy.K.S
 
  On Tue, Dec 6, 2011 at 2:19 AM, burakkk burak.isi...@gmail.com wrote:
 
  Hi everyone,
  I want to run a MR job continuously. Because i have streaming data and i
  try to analyze it all the time in my way(algorithm). For example you
 want
  to solve wordcount problem. It's the simplest one :) If you have some
  multiple files and the new files are keep going, how do you handle it?
  You could execute a MR job per one file but you have to do it repeatly.
 So
  what do you think?
 
  Thanks
  Best regards...
 
  --
 
  BURAK ISIKLI | http://burakisikli.wordpress.com
 
 
 


 --

 BURAK ISIKLI | http://burakisikli.wordpress.com



Re: HDFS Explained as Comics

2011-11-30 Thread Abhishek Pratap Singh
Hi,

This is indeed a good way to explain; most of the improvements have already
been discussed. Waiting for the sequel of this comic.

Regards,
Abhishek

On Wed, Nov 30, 2011 at 1:55 PM, maneesh varshney mvarsh...@gmail.comwrote:

 Hi Matthew

 I agree with both you and Prashant. The strip needs to be modified to
 explain that these can be default values that can be optionally overridden
 (which I will fix in the next iteration).

 However, from the 'understanding concepts of HDFS' point of view, I still
 think that block size and replication factors are the real strengths of
 HDFS, and the learners must be exposed to them so that they get to see how
 hdfs is significantly different from conventional file systems.

 On personal note: thanks for the first part of your message :)

 -Maneesh


 On Wed, Nov 30, 2011 at 1:36 PM, GOEKE, MATTHEW (AG/1000) 
 matthew.go...@monsanto.com wrote:

  Maneesh,
 
  Firstly, I love the comic :)
 
   Secondly, I am inclined to agree with Prashant on this latest point. While
   one code path could take us through the user defining command-line
   overrides (e.g. hadoop fs -D blah -put foo bar), I think it might confuse a
   person new to Hadoop. The most common flow would be using admin-determined
   values from hdfs-site, and the only thing that would need to change is that
   the conversation happens between client / server and not user / client.
 
  Matt
 
  -Original Message-
  From: Prashant Kommireddi [mailto:prash1...@gmail.com]
  Sent: Wednesday, November 30, 2011 3:28 PM
  To: common-user@hadoop.apache.org
  Subject: Re: HDFS Explained as Comics
 
   Sure, it's just a case of how readers interpret it:

     1. The client is required to specify the block size and replication
     factor each time.
     2. The client does not need to worry about it, since an admin has set the
     properties in the default configuration files.

   A client would not be allowed to override the default configs if they are
   set final (well, there are ways to go around that as well, as you suggest,
   by using create() :)

   The information is great and helpful. I just want to make sure a beginner
   who wants to write a WordCount in MapReduce does not worry about specifying
   block size and replication factor in his code.
 
  Thanks,
  Prashant
 
  On Wed, Nov 30, 2011 at 1:18 PM, maneesh varshney mvarsh...@gmail.com
  wrote:
 
   Hi Prashant
  
   Others may correct me if I am wrong here..
  
    The client (org.apache.hadoop.hdfs.DFSClient) has knowledge of the block
    size and replication factor. In the source code, I see the following in
    the DFSClient constructor:

      defaultBlockSize = conf.getLong("dfs.block.size", DEFAULT_BLOCK_SIZE);

      defaultReplication = (short) conf.getInt("dfs.replication", 3);
  
    My understanding is that the client considers the following chain for the
    values:
    1. Manual values (the long-form constructor, when a user provides these
    values)
    2. Configuration file values (these are cluster-level defaults:
    dfs.block.size and dfs.replication)
    3. Finally, the hardcoded values (DEFAULT_BLOCK_SIZE and 3)

    Moreover, in the org.apache.hadoop.hdfs.protocol.ClientProtocol the API to
    create a file is
      void create(..., short replication, long blocksize);

    I presume this means that the client already has knowledge of these values
    and passes them to the NameNode when creating a new file.
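
    A minimal sketch of a client overriding both values for a single file
    through the long create() form on FileSystem (the path and values are
    made-up examples):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PerFileOverride {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Override the cluster defaults for this one file only:
        // 2 replicas and a 128 MB block size.
        FSDataOutputStream out = fs.create(new Path("/tmp/demo.bin"),
            true,                 // overwrite if it exists
            4096,                 // io buffer size
            (short) 2,            // replication factor
            128 * 1024 * 1024L);  // block size in bytes
        out.writeBytes("hello\n");
        out.close();
      }
    }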
  
   Hope that helps.
  
   thanks
   -Maneesh
  
   On Wed, Nov 30, 2011 at 1:04 PM, Prashant Kommireddi 
  prash1...@gmail.com
   wrote:
  
Thanks Maneesh.
   
 Quick question: does a client really need to know the block size and
 replication factor? A lot of the time the client has no control over these
 (they are set at the cluster level).
   
-Prashant Kommireddi
   
On Wed, Nov 30, 2011 at 12:51 PM, Dejan Menges 
 dejan.men...@gmail.com
wrote:
   
 Hi Maneesh,

  Thanks a lot for this! Just distributed it over the team and comments are
  great :)

 Best regards,
 Dejan

 On Wed, Nov 30, 2011 at 9:28 PM, maneesh varshney 
  mvarsh...@gmail.com
 wrote:

  For your reading pleasure!
 
   PDF 3.3MB uploaded at (the mailing list has a cap of 1MB attachments):

  https://docs.google.com/open?id=0B-zw6KHOtbT4MmRkZWJjYzEtYjI3Ni00NTFjLWE0OGItYTU5OGMxYjc0N2M1
 
 
   Appreciate if you can spare some time to peruse this little experiment of
   mine to use Comics as a medium to explain computer science topics. This
   particular issue explains the protocols and internals of HDFS.

   I am eager to hear your opinions on the usefulness of this visual medium to
   teach complex protocols and algorithms.
 
   [My personal motivations: I have always found text descriptions to be too
   verbose, as a lot of effort is spent putting the concepts in proper
   time-space context (which can be easily avoided in a visual 

Capacity Planning using Dfsadmin command

2011-11-22 Thread Abhishek Pratap Singh
Hi,



I have a query about Hadoop; this is very important for capacity planning
and requesting new hardware.

The Hadoop command hadoop dfsadmin -report depicts the DFS usage and the DFS
remaining. I have checked that the DFS usage does not match the size reported
by hadoop dfs -dus on the root or user directory. The difference is large:
dfsadmin shows the occupied space as 5TB, but dus shows it as 3.5TB.

Does anyone know why more occupied space is shown by dfsadmin?



Regards,

Abhishek