RE: Experience with Hadoop in production

2012-02-24 Thread GOEKE, MATTHEW (AG/1000)
I would add that it also depends on how thoroughly you have vetted your use 
cases. If you have already ironed out how ad-hoc access works, Kerberos vs. 
firewall and network segmentation, how code submission works, procedures for 
various operational issues, backup of your data, etc. (the list is a couple 
hundred bullets long at minimum...) on your current cluster, then there might be 
little need for that support. However, if you are still hoping to figure that stuff 
out then you could potentially be in a world of hurt when you attempt the 
transition with just your own staff. It also helps to have that outside advice 
in certain situations to resolve cross-department conflicts when it comes to 
how the cluster will be implemented :)

Matt

-Original Message-
From: Mike Lyon [mailto:mike.l...@gmail.com] 
Sent: Thursday, February 23, 2012 2:33 PM
To: common-user@hadoop.apache.org
Subject: Re: Experience with Hadoop in production

Just be sure you have that corporate card available 24x7 when you need
to call support ;)

Sent from my iPhone

On Feb 23, 2012, at 10:30, Serge Blazhievsky
serge.blazhiyevs...@nice.com wrote:

 What I have often seen companies do is use the free version of
 a commercial vendor's distribution and only buy their support if there are major
 problems that they cannot solve on their own.


 That way you will get a free distribution and the insurance that you have
 support if something goes wrong.


 Serge

 On 2/23/12 10:42 AM, Jamack, Peter pjam...@consilium1.com wrote:

 A lot of it depends on your staff and their experiences.
 Maybe they don't have Hadoop experience, but if they were involved with large
 databases, data warehouses, etc. they can utilize their skills & experience
 and provide a lot of help.
 If you have linux admins, system admins, and network admins with years of
 experience, they will be a goldmine. At the other end, database
 developers who know SQL, programmers who know Java, and so on can really
 help staff up your 'big data' team. Having a few people who know ETL would
 be great too.

 The biggest problem I've run into seems to be how big the Hadoop
 project/team is or is not. Sometimes it's just an 'experimental'
 department and therefore half the people are only 25-50 percent available
 to help out.  And if they aren't really that knowledgeable about Hadoop,
 it tends to be one of those 'not enough time in the day' scenarios.  And
 the few people dedicated to the Hadoop project(s) will get the brunt of
 the work.

 It's like any ecosystem.  To do it right, you might need system/network
 admins, a storage person who actually knows how to set up the proper storage
 architecture, maybe a security expert, a few programmers, and a few data
 people.  If you're combining analytics, that's another group.  Of course,
 most companies outside the Googles and Facebooks of the world will only have a
 few people dedicated to Hadoop.  Which means you need somebody who knows
 storage, knows networking, knows linux, knows how to be a system admin,
 knows security, and maybe other things (e.g. if you have a firewall issue,
 somebody needs to figure out ways to make it work through or around it), and
 then you need some programmers who either know MapReduce or can pretty much
 figure it out because they've done java for years.

 Peter J

 On 2/23/12 10:17 AM, Pavel Frolov pfro...@gmail.com wrote:

 Hi,

 We are going into 24x7 production soon and we are considering whether we
 need vendor support or not.  We use a free vendor distribution of Cluster
 Provisioning + Hadoop + HBase and looked at their Enterprise version but
 it
 is very expensive for the value it provides (additional functionality +
support), given that we've already ironed out many of our performance and 
 tuning issues on our own and with generous help from the community (e.g.
 all of you).

 So, I wanted to run it through the community to see if anybody can share
 their experience of running a Hadoop cluster (50+ nodes with Apache
 releases or Vendor distributions) in production, with in-house support
only, and how difficult it was.  How many people were involved, etc.?

 Regards,
 Pavel



RE: Large server recommendations

2011-12-15 Thread GOEKE, MATTHEW (AG/1000)
Dale,

Talking solely about hadoop core you will only need to run 4 daemons on that 
machine: Namenode, Jobtracker, Datanode and Tasktracker. There is no reason to 
run multiple of any of them as the tasktracker will spawn multiple child jvms, 
which is where you will get your task parallelism. When you set your 
mapred.tasktracker.map.tasks.maximum and 
mapred.tasktracker.reduce.tasks.maximum configurations you will limit the upper 
bound of the child jvm creation, but this needs to be configured based on job 
profile (I don't know much about Mahout but traditionally I set up the clusters 
as 2:1 mappers to reducers until the profile proves otherwise). If you look at 
blogs / archives you will see that you can assign 1 child task per *logical* 
core (e.g. a hyper-threaded core), and to be safe you will want 1 child JVM per 
*physical* core, so you can divvy it up based on that recommendation.
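
For illustration only, the slot math for a 48-core box at that 2:1 ratio might 
look like the sketch below (the core counts and daemon headroom are assumptions, 
not tuned recommendations):

    // Rough slot math for a single 48-core box, following the 2:1 guidance above.
    public class SlotMath {
      public static void main(String[] args) {
        int physicalCores = 48;                            // assumption: one large server
        int reservedCores = 3;                             // headroom for the NN/JT/DN/TT daemons
        int workerCores   = physicalCores - reservedCores; // 45 cores left for child JVMs
        int mapSlotMax    = (workerCores * 2) / 3;         // 30 -> mapred.tasktracker.map.tasks.maximum
        int reduceSlotMax = workerCores / 3;               // 15 -> mapred.tasktracker.reduce.tasks.maximum
        System.out.println(mapSlotMax + " map slots, " + reduceSlotMax + " reduce slots");
      }
    }

Those two maximums would then go into mapred-site.xml on the node itself rather 
than into job code.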

To summarize the above: if you are sharing the same IO pipe / box then there is 
no reason to have multiple daemons running because you are not really gaining 
anything from that level of granularity. Others might disagree based on 
virtualization but in your case I would say save yourself the headache and keep 
it simple.

Matt

-Original Message-
From: Dale McDiarmid [mailto:d...@ravn.co.uk] 
Sent: Thursday, December 15, 2011 1:50 PM
To: common-user@hadoop.apache.org
Subject: Large server recommendations

Hi all
New to the community and to using hadoop, and I was looking for some advice as 
to optimal configurations on very large servers.  I have a single server 
with 48 cores and 512GB of RAM and am looking to perform an LDA analysis 
using Mahout across approx 180 million documents.  I have configured my 
namenode and job tracker.  My questions are primarily around the optimal 
number of tasktrackers and data nodes.  I have had no issues configuring 
multiple datanodes, each of which could potentially utilise its own 
disk location (the underlying disk is SAN - solid state).

However, from my reading the typical architecture for hadoop is a larger 
number of smaller nodes with a single tasktracker on each host.  Could 
someone please clarify the following:

1. Can multiple task trackers be run on a single host? If so, how is 
this configured as it doesn't seem possible to control the host:port.

2. Can i confirm mapred.map.tasks and mapred.reduce.tasks are JobTracker 
parameters? The recommendation for these settings seems to relate to 
the number of task trackers.  In my architecture, i have potentially 
only 1 if a single task tracker can only be configured on each host.  
What should i set these values to therefore considering the box spec?

3. I noticed the parameters mapred.tasktracker.map.tasks.maximum and 
mapred.tasktracker.reduce.tasks.maximum - do these control the number of 
JVM processes spawned to handle the respective steps? Is a tasktracker 
with 48 configured equivalent to 48 task trackers with a value of 1 
configured for these values?

4. Benefits of a large number of datanodes on a single large server? I 
can see value where the host has multiple IO interfaces and disk sets to 
avoid IO contention. In my case, however, a SAN negates this.  Are there 
still benefits of multiple datanodes outside of resiliency and potential 
increase of data transfer i.e. assuming a single data node is limited 
and single threaded?

5. Any other thoughts/recommended settings?

Thanks
Dale


RE: Large server recommendations

2011-12-15 Thread GOEKE, MATTHEW (AG/1000)
mapred.map.tasks is a suggestion to the engine and there is really no reason to 
define it, as it will be driven by the block-level partitioning of your files 
(e.g. if you have a file that is 30 blocks then it will by default spawn 30 map 
tasks). As for mapred.reduce.tasks, just set it to whatever you set your 
mapred.tasktracker.reduce.tasks.maximum to (the reasoning being that you are running all of 
this on a single tasktracker, so these two should essentially line up).

By now I should be able to answer whether those are JT-level vs TT-level 
parameters, but I have heard one thing and personally experienced another so I 
will leave that answer up to someone who can confirm 100%. Either way I would 
recommend that your JT and TT sites not deviate from each other for clarity, but 
you can change mapred.reduce.tasks at the app level, so if you have something 
that needs a global sort order you can invoke it with mapred.reduce.tasks=1 using 
a job-level conf.
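
As a rough sketch of that job-level override (old mapred API; the class name and 
paths are placeholders, not something from this thread):

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class SortDriver extends Configured implements Tool {
      public int run(String[] args) throws Exception {
        JobConf job = new JobConf(getConf(), SortDriver.class); // any -D overrides land in getConf()
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        JobClient.runJob(job);                                  // identity map/reduce by default
        return 0;
      }
      public static void main(String[] args) throws Exception {
        // e.g. hadoop jar myjob.jar SortDriver -D mapred.reduce.tasks=1 /in /out
        System.exit(ToolRunner.run(new SortDriver(), args));
      }
    }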

Matt

From: Dale McDiarmid [mailto:d...@ravn.co.uk]
Sent: Thursday, December 15, 2011 3:58 PM
To: common-user@hadoop.apache.org
Cc: GOEKE, MATTHEW [AG/1000]
Subject: Re: Large server recommendations

thanks matt,
Assuming therefore i run a single tasktracker and have 48 cores available.  
Based on your recommendation of 2:1 mappers to reducer threads i will be 
assigning:


mapred.tasktracker.map.tasks.maximum=30

mapred.tasktracker.reduce.tasks.maximum=15
This brings me onto my question:

Can i confirm whether mapred.map.tasks and mapred.reduce.tasks are JobTracker 
parameters? The recommendation for these settings seems to relate to the 
number of task trackers. In my architecture, i have potentially only 1 if a 
single task tracker can only be configured on each host. What should i set 
these values to therefore considering the box spec?

I have read:

mapred.map.tasks = 10x the number of task trackers
mapred.reduce.tasks = 2x the number of task trackers

Given i have a single task tracker with multiple concurrent processes, does 
this equate to:

mapred.map.tasks = 300?
mapred.reduce.tasks = 30?

Some reasoning behind these values would be appreciated...


I appreciate this is a little simplified and we will need to profile. Just 
looking for a sensible starting position.
Thanks
Dale



RE: HDFS Explained as Comics

2011-11-30 Thread GOEKE, MATTHEW (AG/1000)
Maneesh,

Firstly, I love the comic :)

Secondly, I am inclined to agree with Prashant on this latest point. While one 
code path could take us through the user defining command-line overrides (e.g. 
hadoop fs -D blah -put foo bar), I think it might confuse a person new to 
Hadoop. The most common flow would be using admin-determined values from 
hdfs-site.xml, and the only thing that would need to change is that the conversation 
happens between client / server and not user / client.

Matt

-Original Message-
From: Prashant Kommireddi [mailto:prash1...@gmail.com] 
Sent: Wednesday, November 30, 2011 3:28 PM
To: common-user@hadoop.apache.org
Subject: Re: HDFS Explained as Comics

Sure, it's just a case of how readers interpret it.

   1. Client is required to specify block size and replication factor each
   time
   2. Client does not need to worry about it since an admin has set the
   properties in default configuration files

A client would not be allowed to override the default configs if they are
set final (well, there are ways to get around it, as you suggest, by
using create() :)

The information is great and helpful. Just want to make sure a beginner who
wants to write a WordCount in MapReduce does not worry about specifying
block size and replication factor in his code.

Thanks,
Prashant

On Wed, Nov 30, 2011 at 1:18 PM, maneesh varshney mvarsh...@gmail.com wrote:

 Hi Prashant

 Others may correct me if I am wrong here..

 The client (org.apache.hadoop.hdfs.DFSClient) has a knowledge of block size
 and replication factor. In the source code, I see the following in the
 DFSClient constructor:

 defaultBlockSize = conf.getLong("dfs.block.size", DEFAULT_BLOCK_SIZE);

 defaultReplication = (short) conf.getInt("dfs.replication", 3);

 My understanding is that the client considers the following chain for the
 values:
 1. Manual values (the long form constructor; when a user provides these
 values)
 2. Configuration file values (these are cluster level defaults:
 dfs.block.size and dfs.replication)
 3. Finally, the hardcoded values (DEFAULT_BLOCK_SIZE and 3)

 Moreover, in org.apache.hadoop.hdfs.protocol.ClientProtocol the API to
 create a file is
 void create(..., short replication, long blocksize);

 I presume it means that the client already has knowledge of these values
 and passes them to the NameNode when creating a new file.
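
 As a hedged illustration of that chain (not from the comic; the path and the
 override values below are made up), a client can override the cluster defaults
 per file via FileSystem.create():

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CreateWithOverrides {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up dfs.block.size / dfs.replication defaults
        FileSystem fs = FileSystem.get(conf);
        short replication = 2;                      // example per-file override
        long blockSize = 128L * 1024 * 1024;        // example per-file override: 128 MB
        FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"),
            true /* overwrite */, 4096 /* buffer size */, replication, blockSize);
        out.writeUTF("hello");
        out.close();
      }
    }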

 Hope that helps.

 thanks
 -Maneesh

 On Wed, Nov 30, 2011 at 1:04 PM, Prashant Kommireddi prash1...@gmail.com
 wrote:

  Thanks Maneesh.
 
  Quick question, does a client really need to know Block size and
  replication factor - A lot of times client has no control over these (set
  at cluster level)
 
  -Prashant Kommireddi
 
  On Wed, Nov 30, 2011 at 12:51 PM, Dejan Menges dejan.men...@gmail.com
  wrote:
 
   Hi Maneesh,
  
   Thanks a lot for this! Just distributed it over the team and comments
 are
   great :)
  
   Best regards,
   Dejan
  
   On Wed, Nov 30, 2011 at 9:28 PM, maneesh varshney mvarsh...@gmail.com
   wrote:
  
For your reading pleasure!
   
PDF 3.3MB uploaded at (the mailing list has a cap of 1MB
 attachments):
   
   
  
 
 https://docs.google.com/open?id=0B-zw6KHOtbT4MmRkZWJjYzEtYjI3Ni00NTFjLWE0OGItYTU5OGMxYjc0N2M1
   
   
Appreciate if you can spare some time to peruse this little
 experiment
  of
mine to use Comics as a medium to explain computer science topics.
 This
particular issue explains the protocols and internals of HDFS.
   
I am eager to hear your opinions on the usefulness of this visual
  medium
   to
teach complex protocols and algorithms.
   
[My personal motivations: I have always found text descriptions to be
  too
 verbose as a lot of effort is spent putting the concepts in proper
   time-space
context (which can be easily avoided in a visual medium); sequence
   diagrams
are unwieldy for non-trivial protocols, and they do not explain
  concepts;
and finally, animations/videos happen too fast and do not offer
self-paced learning experience.]
   
All forms of criticisms, comments (and encouragements) welcome :)
   
Thanks
Maneesh
   
  
 


RE: No HADOOP COMMON HOME set.

2011-11-17 Thread GOEKE, MATTHEW (AG/1000)
Jay,

Did you download stable (0.20.203.X) or 0.23? From what I can tell, after 
looking in the tarball for 0.23, it is a different setup than 0.20 (e.g. 
hadoop-env.sh doesn't exist anymore and is replaced by yarn-env.sh) and the 
documentation you referenced below is for setting up 0.20.

I would suggest you go back and download stable and then the setup 
documentation you are following will make a lot more sense :)

Matt

-Original Message-
From: Jay Vyas [mailto:jayunit...@gmail.com] 
Sent: Thursday, November 17, 2011 2:07 PM
To: common-user@hadoop.apache.org
Subject: No HADOOP COMMON HOME set.

Hi guys : I followed the exact directions on the hadoop installation guide
for pseudo-distributed mode
here
http://hadoop.apache.org/common/docs/current/single_node_setup.html#Configuration

However, I get that several environment variables are not set (for
example, HADOOP_COMMON_HOME is not set).

Also, hadoop reported that HADOOP CONF was not set, as well.

I'm wondering whether there is a resource on how to set the environment
variables needed to run hadoop?

Thanks.

-- 
Jay Vyas
MMSB/UCHC

RE: updated example

2011-10-11 Thread GOEKE, MATTHEW (AG/1000)
The old API is still fully usable in 0.20.204.
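
For example, with the old API the input format is still set on the JobConf 
(sketch only; the driver class and the KeyValueTextInputFormat choice are 
placeholders for whatever format you need, e.g. the StreamInputFormat mentioned 
below for the XML case):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.KeyValueTextInputFormat;

    public class OldApiInputFormatExample {
      public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(OldApiInputFormatExample.class);
        job.setInputFormat(KeyValueTextInputFormat.class); // old-API equivalent of Job.setInputFormatClass()
        job.setOutputKeyClass(Text.class);                 // match what this input format emits
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        JobClient.runJob(job);                             // identity map/reduce: passes key/value pairs through
      }
    }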

Matt

-Original Message-
From: Jignesh Patel [mailto:jign...@websoft.com] 
Sent: Tuesday, October 11, 2011 12:17 PM
To: common-user@hadoop.apache.org
Subject: Re: updated example

That means the old API is not integrated in 0.20.204.0??

When do you expect the release of 0.20.205?

-Jignesh
On Oct 11, 2011, at 12:32 PM, Tom White wrote:

 JobConf and the old API are no longer deprecated in the forthcoming
 0.20.205 release, so you can continue to use it without issue.
 
 The equivalent in the new API is setInputFormatClass() on
 org.apache.hadoop.mapreduce.Job.
 
 Cheers,
 Tom
 
 On Tue, Oct 11, 2011 at 9:18 AM, Keith Thompson kthom...@binghamton.edu 
 wrote:
 I see that the JobConf class used in the WordCount tutorial is deprecated
 in favor of the Configuration class.  I am wanting to change the file input format
 (to the StreamInputFormat for XML as in Hadoop: The Definitive Guide pp.
 212-213) but I don't see a setInputFormat method in the Configuration class
 as there was in the JobConf class.  Is there an updated example using the
 non-deprecated classes and methods?  I have searched but not found one.
 
 Regards,
 Keith
 



RE: Learning curve after MapReduce and HDFS

2011-09-30 Thread GOEKE, MATTHEW (AG/1000)
Are you learning for the sake of experimenting or are there functional 
requirements driving you to dive into this space?

*If you are learning for the sake of adding new tools to your portfolio: look 
into high-level overviews of each of the projects and review architecture 
solutions that use them. Focus on how they interact and target the ones that pique 
your curiosity the most.

*If you are learning the ecosystem to fulfill some customer requirements then 
just learn the pieces as you need them. Compare the high level differences 
between the sub projects and let the requirements drive which pieces you focus 
on.

There are plenty of training videos out there (for free) that go over quite a 
few of the pieces. I recently came across 
https://www.db2university.com/courses/auth/openid/login.php which has a basic 
set of reference materials that reviews a few of the sub-projects within the 
ecosystem, with included labs. The Yahoo developer network and Cloudera also have 
some great resources as well.

Any one of us could point you in a certain direction but it is all a matter of 
opinion. Compare your needs with each of the sub projects and that should 
filter the list down to a manageable size.

Matt
-Original Message-
From: Varad Meru [mailto:meru.va...@gmail.com] 
Sent: Friday, September 30, 2011 11:19 AM
To: common-user@hadoop.apache.org; Varad Meru
Subject: Learning curve after MapReduce and HDFS

Hi all,

I have been working with Hadoop core, Hadoop HDFS and Hadoop MapReduce for the 
past 8 months. 

Now I want to learn other projects under Apache Hadoop such as Pig, Hive, HBase 
...

Can you suggest a learning path to learn about the Hadoop ecosystem in a 
structured manner?
I am confused between so many alternatives, such as 
Hive vs Jaql vs Pig
HBase vs Hypertable vs Cassandra
and many other projects which are similar to each other.

Thanks in advance,
Varad


---
Varad Meru
Software Engineer
Persistent Systems and Solutions Ltd. 


RE: dump configuration

2011-09-28 Thread GOEKE, MATTHEW (AG/1000)
You could always check the web-ui job history for that particular run, open the 
job.xml, and search for what the value of that parameter was at runtime.
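
If a quick programmatic check is easier than digging through the web-ui, something 
along these lines also works (the resource path below is an assumption about where 
your edited file lives):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    public class PrintEffectiveConf {
      public static void main(String[] args) {
        Configuration conf = new Configuration();  // loads core-default.xml / core-site.xml from the classpath
        conf.addResource(new Path("/etc/hadoop/conf/mapred-site.xml")); // assumed location of the edited file
        System.out.println("mapred.user.jobconf.limit = "
            + conf.get("mapred.user.jobconf.limit", "<not set>"));
      }
    }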

Matt

-Original Message-
From: patrick sang [mailto:silvianhad...@gmail.com] 
Sent: Wednesday, September 28, 2011 4:00 PM
To: common-user@hadoop.apache.org
Subject: dump configuration

Hi hadoopers,

I was looking for a way to dump the hadoop configuration in order to check if what
i have just changed
in mapred-site.xml has really kicked in.

Found that HADOOP-6184
https://issues.apache.org/jira/browse/HADOOP-6184 is exactly what i
want, but the thing is I am running CDH3u0 which is
0.20.2 based.

I wonder if anyone here has a magic way to dump the hadoop configuration;
it doesn't need to be json,
as long as i can check if what i changed in the configuration file has really
kicked in.

PS, the property i changed is mapred.user.jobconf.limit

-P


RE: Temporary Files to be sent to DistributedCache

2011-09-27 Thread GOEKE, MATTHEW (AG/1000)
The simplest route I can think of is to ingest the data directly into HDFS 
using Sqoop if there is a driver currently made for your database. At that 
point it would be relatively simple just to read directly from HDFS in your MR 
code. 
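
A rough sketch of the "read it straight off HDFS" half (new-API mapper; the 
side-data path, types and matching logic are all placeholders):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
      private final Set<String> lookup = new HashSet<String>();

      @Override
      protected void setup(Context context) throws IOException {
        // Load the Sqoop-imported reference data once per mapper, straight from HDFS.
        FileSystem fs = FileSystem.get(context.getConfiguration());
        Path side = new Path("/data/reference/part-m-00000");   // placeholder path
        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(side)));
        String line;
        while ((line = in.readLine()) != null) {
          lookup.add(line);
        }
        in.close();
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        if (lookup.contains(value.toString())) {
          context.write(value, new Text("matched"));   // illustrative only
        }
      }
    }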

Matt

-Original Message-
From: lessonz [mailto:less...@q.com] 
Sent: Tuesday, September 27, 2011 4:48 PM
To: common-user@hadoop.apache.org
Subject: Temporary Files to be sent to DistributedCache

I have a need to write information retrieved from a database to a series of
files that need to be made available to my mappers. Because each mapper
needs access to all of these files, I want to put them in the
DistributedCache. Is there a preferred method to writing new information to
the DistributedCache? I can use Java's File.createTempFile(String prefix,
String suffix), but that uses the system default temporary folder. While
that should usually work, I'd rather have a method that doesn't depend on
writing to the local file system before copying files to the
DistributedCache. As I'm extremely new to Hadoop, I hope I'm not missing
something obvious.

Thank you for your time.


RE: Environment consideration for a research on scheduling

2011-09-23 Thread GOEKE, MATTHEW (AG/1000)
If you are starting from scratch with no prior Hadoop install experience I 
would configure stand-alone, migrate to pseudo-distributed and then to fully 
distributed, verifying functionality at each step by doing a simple word count 
run. Also, if you don't mind using the CDH distribution then SCM / their rpms 
will greatly simplify both the bin installs as well as the user creation.

Your VM route will most likely work but I can imagine the amount of hiccups 
during migration from that to the real cluster will not make it worth your time.

Matt 

-Original Message-
From: Merto Mertek [mailto:masmer...@gmail.com] 
Sent: Friday, September 23, 2011 10:00 AM
To: common-user@hadoop.apache.org
Subject: Environment consideration for a research on scheduling

Hi,
in the first phase we are planning to establish a small cluster with a few
commodity computers (each 1GB RAM, 200GB disk, ..). The cluster would run ubuntu server
10.10 and a hadoop build from the branch 0.20.204 (i had some issues with
version 0.20.203 with missing libraries:
http://hadoop-common.472056.n3.nabble.com/Development-enviroment-problems-eclipse-hadoop-0-20-203-td3186022.html#a3188567).
Would you suggest any other version?

In the second phase we are planning to analyse, test and modify some of the
hadoop schedulers.

Now I am interested in what is the best way to deploy ubuntu and hadoop to these
few machines. I was thinking of configuring the system in a local VM and then
converting it to each physical machine, but probably this is not the best
option. If you know of any other way please share..

Thank you!


Question regarding Oozie and Hive replication / backup

2011-09-22 Thread GOEKE, MATTHEW (AG/1000)
I would like to have a robust setup for anything residing on our edge nodes, 
which is where these two daemons will be, and I was curious if anyone had any 
suggestions around how to replicate / keep an active clone of the metadata for 
these components. We already use DRBD and a vip to get around this issue for 
our master nodes and I know this would work for the edge nodes but I wanted to 
make sure I wasn't overlooking any options.

Hive: Currently tinkering with the built-in db but evaluating whether to go 
with a dedicated MySQL or PostgreSQL instance so suggestions can reference 
either solution.

Thanks,
Matt

Hadoop RPC and general serialization question

2011-09-22 Thread GOEKE, MATTHEW (AG/1000)
I was reviewing a video from Hadoop Summit 2011[1] where Arun Murthy mentioned 
that MRv2 was moving towards protocol buffers as the wire format but I feel 
like this is contrary to an Avro presentation that Doug Cutting did back in 
Hadoop World '09[2]. I haven't stayed up to date with the Jira for MRv2 but is 
there a disagreement between contributors as to which format will be the de 
facto standard going forward and if so what are the biggest points of 
contention? The only reason I bring this up is I am trying to integrate a 
serialization framework into our best practices and, while I am currently 
working towards Avro, this disconnect caused a little concern.

Matt

*1 - http://www.youtube.com/watch?v=2FpO7w6X41I
*2 - http://www.cloudera.com/videos/hw09_next_steps_for_hadoop

RE: risks of using Hadoop

2011-09-21 Thread GOEKE, MATTHEW (AG/1000)
I would completely agree with Mike's comments with one addition: Hadoop centers 
around how to manipulate the flow of data in a way to make the framework work 
for your specific problem. There are recipes for common problems but depending 
on your domain that might solve only 30-40% of your use cases. It should take 
little to no time for a good java dev to understand how to make an MR program. 
It will take significantly more time for that java dev to understand the domain 
and Hadoop well enough to consistently write *good* MR programs. Mike listed 
some great ways to cut down on that curve, but you really want someone who has 
not only an affinity for code but who can also apply critical thinking to how 
you should pipeline your data. If you plan on using it purely with Pig/Hive 
abstractions on top then this can be negated significantly.

Some might disagree but that is my $0.02
Matt 

-Original Message-
From: Michael Segel [mailto:michael_se...@hotmail.com] 
Sent: Wednesday, September 21, 2011 12:48 PM
To: common-user@hadoop.apache.org
Subject: RE: risks of using Hadoop


Kobina

The points 1 and 2 are definitely real risks. SPOF is not.

As I pointed out in my mini-rant to Tom, your end users / developers 
who use the cluster can do more harm to your cluster than a SPOF machine 
failure.

I don't know what one would consider a 'long learning curve'. With the adoption 
of any new technology, you're talking at least 3-6 months based on the 
individual and the overall complexity of the environment. 

Take anyone who is a strong developer, put them through Cloudera's training, 
plus some play time, and you've shortened the learning curve.
The better the java developer, the easier it is for them to pick up Hadoop.

I would also suggest taking the approach of hiring a senior person who can 
cross train and mentor your staff. This too will shorten the runway.

HTH

-Mike


 Date: Wed, 21 Sep 2011 17:02:45 +0100
 Subject: Re: risks of using Hadoop
 From: kobina.kwa...@gmail.com
 To: common-user@hadoop.apache.org
 
 Jignesh,
 
 Will your point 2 still be valid if we hire very experienced Java
 programmers?
 
 Kobina.
 
 On 20 September 2011 21:07, Jignesh Patel jign...@websoft.com wrote:
 
 
  @Kobina
  1. Lack of skill set
  2. Longer learning curve
  3. Single point of failure
 
 
  @Uma
  I am curious to know about .20.2 is that stable? Is it same as the one you
  mention in your email(Federation changes), If I need scaled nameNode and
  append support, which version I should choose.
 
  Regarding Single point of failure, I believe Hortonworks(a.k.a Yahoo) is
  updating the Hadoop API. When that will be integrated with Hadoop.
 
  If I need
 
 
  -Jignesh
 
  On Sep 17, 2011, at 12:08 AM, Uma Maheswara Rao G 72686 wrote:
 
   Hi Kobina,
  
    Some experiences which may be helpful for you with respect to DFS.
   
    1. Selecting the correct version.
   I would recommend using a 0.20X version. This is a pretty stable version
   and all other organizations prefer it. It is well tested as well.
    Don't go for the 21 version. That version is not stable; it is a risk.
   
    2. You should perform thorough tests with your customer operations.
 (of course you will do this :-))
   
    3. The 0.20x versions have the problem of a SPOF.
  If the NameNode goes down you will lose the data. One way of recovering is
   by using the secondaryNameNode. You can recover the data up to the last
   checkpoint, but here manual intervention is required.
    In the latest trunk the SPOF will be addressed by HDFS-1623.
   
    4. 0.20x NameNodes can not scale. Federation changes are included in the latest
   versions (i think in 22). This may not be a problem for your cluster,
   but please consider this aspect as well.
   
    5. Please select the hadoop version depending on your security
   requirements. There are versions available for security as well in 0.20X.
   
    6. If you plan to use HBase, it requires append support. 20Append has the
   support for append. The 0.20.205 release will also have append support but is not
   yet released. Choose your version carefully to avoid sudden surprises.
  
  
  
   Regards,
   Uma
   - Original Message -
   From: Kobina Kwarko kobina.kwa...@gmail.com
   Date: Saturday, September 17, 2011 3:42 am
   Subject: Re: risks of using Hadoop
   To: common-user@hadoop.apache.org
  
    We are planning to use Hadoop in my organisation for quality of
    services analysis out of CDR records from mobile operators. We are
    thinking of having
    a small cluster of maybe 10 - 15 nodes and I'm preparing the
    proposal. My
    office requires that i provide some risk analysis in the proposal.
  
   thank you.
  
   On 16 September 2011 20:34, Uma Maheswara Rao G 72686
   mahesw...@huawei.comwrote:
  
   Hello,
  
   First of all where you are planning to use Hadoop?
  
   Regards,
   Uma
   - Original Message -
   From: Kobina Kwarko kobina.kwa...@gmail.com
   Date: Saturday, September 17, 2011 0:41 am
   Subject: risks 

RE: how to set the number of mappers with 0 reducers?

2011-09-20 Thread GOEKE, MATTHEW (AG/1000)
Amusingly this is almost the same question that was asked the other day :)

Quote from Owen O'Malley:
"There isn't currently a way of getting a collated, but unsorted list of 
key/value pairs. For most applications, the in-memory sort is fairly cheap 
relative to the shuffle and other parts of the processing."

If you know that you will be filtering out a significant amount of information, 
to the point where the shuffle will be trivial, then the impact of a reduce phase 
should be minimal using an identity reducer. It is either that or aggregate as 
much data as you feel comfortable with into each split and have 1 file per map. 
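
For what it's worth, a minimal sketch of the identity-reducer route (old mapred 
API; the class name and reducer count are placeholders):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class CollateOnlyDriver {
      public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(CollateOnlyDriver.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setReducerClass(IdentityReducer.class); // pass-through reduce: output is just the collated map output
        job.setNumReduceTasks(100);                 // ~100 output files instead of one tiny file per map
        JobClient.runJob(job);
      }
    }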

How much data/percentage of input are you assuming will be output from each of 
these maps?

Matt

-Original Message-
From: Peng, Wei [mailto:wei.p...@xerox.com] 
Sent: Tuesday, September 20, 2011 10:22 AM
To: common-user@hadoop.apache.org
Subject: RE: how to set the number of mappers with 0 reducers?

Thank you all for the quick reply!!

I think I was wrong. It has nothing to do with the number of mappers
because each input file has size 500M, which is not too small in terms
of 64M per block.

The problem is that the output from each mapper is too small. Is there a
way to combine some mappers' output together? Setting the number of
reducers to 1 might get a very huge file. Can I set the number of
reducers to 100, but skip sorting, shuffling...etc.?

Wei

-Original Message-
From: Soumya Banerjee [mailto:soumya.sbaner...@gmail.com] 
Sent: Tuesday, September 20, 2011 2:06 AM
To: common-user@hadoop.apache.org
Subject: Re: how to set the number of mappers with 0 reducers?.

Hi,

If you want all your map outputs in a single file you can use an
IdentityReducer and set the number of reducers to 1.
This would ensure that all your mapper output goes into the reducer and
it
writes into a single file.

Soumya

On Tue, Sep 20, 2011 at 2:04 PM, Harsh J ha...@cloudera.com wrote:

 Hello Wei!

 On Tue, Sep 20, 2011 at 1:25 PM, Peng, Wei wei.p...@xerox.com wrote:
 (snip)
  However, the output from the mappers results in many small files
(size is
  ~50k, the block size is however 64M, so it wastes a lot of space).
 
  How can I set the number of mappers (say 100)?

 What you're looking for is to 'pack' several files per mapper, if I
 get it right.

 In that case, you need to check out the CombineFileInputFormat. It can
 pack several files per mapper (with some degree of locality).

 Alternatively, pass a list of files (as a text file) as your input,
 and have your Mapper logic read them one by one. This way, if you
 divide 50k filenames over 100 files, you will get 100 mappers as you
 want - but at the cost of losing almost all locality.

  If there is no way to set the number of mappers, the only way to
solve
  it is cat some files together?

 Concatenating is an alternative, if affordable - yes. You can lower
 the file count (down from 50k) this way.

 --
 Harsh J



RE: how to set the number of mappers with 0 reducers?

2011-09-20 Thread GOEKE, MATTHEW (AG/1000)
There is currently no way to disable sort/shuffle. You can do many things to alleviate 
any issues you have with it though, one of which you mentioned below. Is there a 
reason why you are allowing each of your keys to be unique? If it is truly 
because you do not care about the key then just assign keys from an even 
distribution to allow for more aggregation.
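
A hedged sketch of that key-bucketing idea (new-API mapper; the bucket count and 
the types are arbitrary choices, not something from this thread):

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class BucketedKeyMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
      private static final int NUM_BUCKETS = 100;   // arbitrary; roughly the number of reduce groups you want
      private final IntWritable bucket = new IntWritable();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Replace the (unique) natural key with one of NUM_BUCKETS synthetic keys so the
        // shuffle groups records into a bounded number of reduce inputs.
        bucket.set((value.hashCode() & Integer.MAX_VALUE) % NUM_BUCKETS);
        context.write(bucket, value);
      }
    }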

On a side note, what is the actual stack trace you are getting when the 
reducers fail and what is the reducer doing? I think for your use case using a 
reduce phase is the best way to go, as long as the job time meets your SLA, so 
we need to figure out why the job is failing.

Matt

-Original Message-
From: Peng, Wei [mailto:wei.p...@xerox.com] 
Sent: Tuesday, September 20, 2011 10:44 AM
To: common-user@hadoop.apache.org
Subject: RE: how to set the number of mappers with 0 reducers?

The input is 9010 files (each 500MB), and I would estimate the output to
be around 50GB.
My hadoop job failed because of an out-of-memory error (with 66 reducers). I
guess that the key from each mapper output is unique, so the sorting
would be memory-intensive. 
Although I can set another key to reduce the number of unique keys, I am
curious if there is a way to disable sorting/shuffling.

Thanks,
Wei


RE: Using HBase for real time transaction

2011-09-20 Thread GOEKE, MATTHEW (AG/1000)
In order to answer your first question we would need to know what types of data 
you plan on storing and your latency requirements. If it is 
semi-structured/unstructured data then HBase *can* be a great fit, but I have 
seen very few cases where you will want to scrap your RDBMS completely. Most 
organizations that use HBase will still have a need for an RDBMS/MPP solution 
for real-time access to structured data.

Matt 

-Original Message-
From: Jignesh Patel [mailto:jign...@websoft.com] 
Sent: Tuesday, September 20, 2011 4:25 PM
To: common-user@hadoop.apache.org
Subject: Re: Using HBase for real time transaction

Tom,
Let me reword: can HBase be used as a transactional database (i.e. as a 
replacement for mysql)?

The requirement is to have real-time read and write operations. I mean as soon 
as data is written the user should see the data (here the data would be written to 
HBase).

-Jignesh


On Sep 20, 2011, at 5:11 PM, Tom Deutsch wrote:

 Real-time means different things to different people. Can you share your 
 latency requirements from the time the data is generated to when it needs 
 to be consumed, or how you are thinking of using Hbase in the overall 
 flow?
 
 
 Tom Deutsch
 Program Director
 CTO Office: Information Management
 Hadoop Product Manager / Customer Exec
 IBM
 3565 Harbor Blvd
 Costa Mesa, CA 92626-1420
 tdeut...@us.ibm.com
 
 
 
 
 Jignesh Patel jign...@websoft.com 
 09/20/2011 12:57 PM
 Please respond to
 common-user@hadoop.apache.org
 
 
 To
 common-user@hadoop.apache.org
 cc
 
 Subject
 Using HBase for real time transaction
 
 
 
 
 
 
 We are exploring possibility of using HBase for the real time 
 transactions. Is that possible?
 
 -Jignesh
 



RE: phases of Hadoop Jobs

2011-09-19 Thread GOEKE, MATTHEW (AG/1000)
Was the command-line output really ever intended to be *that* verbose? I can 
see how it would be useful, but considering how easy it is to interact with the 
web-ui I can't see why much effort should be put into enhancing it. Even if you 
don't want to see all of the details the web-ui has to offer, it doesn't take 
long to learn how to skim it and get a 10x more accurate reading on your job 
progress.

Matt

-Original Message-
From: Arun C Murthy [mailto:a...@hortonworks.com] 
Sent: Sunday, September 18, 2011 11:27 PM
To: common-user@hadoop.apache.org
Subject: Re: phases of Hadoop Jobs

Agreed.

At least, I believe the new web-ui for MRv2 is (or will be soon) more verbose 
about this.

On Sep 18, 2011, at 9:23 PM, Kai Voigt wrote:

 Hi,
 
 these 0-33-66-100% phases are really confusing to beginners. We see that in 
 our training classes. The output should be more verbose, such as breaking 
 down the phases into separate progress numbers.
 
 Does that make sense?
 
 Am 19.09.2011 um 06:17 schrieb Arun C Murthy:
 
 Nan,
 
 The 'phase' is implicitly understood by the 'progress' (value) made by the 
 map/reduce tasks (see o.a.h.mapred.TaskStatus.Phase).
 
 For example: 
 Reduce: 
 0-33% - Shuffle
 34-66% - Sort (actually, just 'merge', there is no sort in the reduce since 
 all map-outputs are sorted)
 67-100% - Reduce
 
 With 0.23 onwards the Map has phases too:
 0-90% - Map
 91-100% - Final Sort/merge
 
 Now, about starting reduces early - this is done to ensure the shuffle can 
 proceed for completed maps while the rest of the maps run, thereby pipelining 
 shuffle and map completion. There is a 'reduce slowstart' feature to control 
 this - by default, reduces aren't started until 5% of maps are complete. 
 Users can set this higher.
 
 Arun
 
 On Sep 18, 2011, at 7:24 PM, Nan Zhu wrote:
 
 Hi, all
 
 recently I was hit by a question: how is a Hadoop job divided into 2 phases?
 
 In textbooks we are told that MapReduce jobs are divided into 2 phases, map and
 reduce, and that reduce is further divided into 3 stages: shuffle, sort, and
 reduce. But in the Hadoop code I had never thought about this question, and I
 didn't see any member variables in the JobInProgress class that indicate this
 information.
 
 And according to my understanding of the Hadoop source code, the reduce tasks
 are not necessarily started only after all mappers are finished; on the
 contrary, we can see reduce tasks in the shuffle stage while some mappers are
 still running.
 So how can I tell which phase a job is currently in?
 
 Thanks
 -- 
 Nan Zhu
 School of Electronic, Information and Electrical Engineering,229
 Shanghai Jiao Tong University
 800,Dongchuan Road,Shanghai,China
 E-Mail: zhunans...@gmail.com
 
 
 
 -- 
 Kai Voigt
 k...@123.org
 
 
 
 




Hadoop/CDH + Avro

2011-09-13 Thread GOEKE, MATTHEW (AG/1000)
Would anyone happen to be able to share a good reference for Avro integration 
with Hadoop? I can find plenty of material around using Avro by itself but I 
have found little to no documentation on how to implement it as both the 
protocol and as custom key/value types.

Thanks,
Matt


Hadoop multi tier backup

2011-08-30 Thread GOEKE, MATTHEW (AG/1000)
All,

We were discussing how we would back up our data from the various environments 
we will have, and I was hoping someone could chime in with previous experience 
in this. My primary concern about our cluster is that we would like to be able 
to recover anything within the last 60 days, so having full backups both on tape 
and through distcp is preferred.

Our initial thoughts can be seen in the attached jpeg, but in case any of you 
are wary of attachments they are also summarized below:

Prod Cluster --DistCp-- On-site Backup cluster with Fuse mount point running 
NetBackup daemon --NetBackup-- Media Server -- Tape

One of our biggest grey areas so far is how most people accomplish 
incremental backups. Our thought was to tie this into our NetBackup 
configuration, as this can be done for other connectors, but we do not see 
anything for HDFS yet.
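
For the DistCp leg of the flow above, distcp's -update flag gives a crude form of
incremental copy, since it only copies files that are missing or differ at the
target. A rough sketch (the host names and paths are placeholders, not a real
environment):

    hadoop distcp -update hdfs://prod-nn:8020/data hdfs://backup-nn:8020/data

Whether -update's size/checksum notion of "changed" is strong enough for a
60-day recovery window is a separate question; the tape side would still be
handled by NetBackup.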

Thanks,
Matt


RE: Hadoop in process?

2011-08-26 Thread GOEKE, MATTHEW (AG/1000)
It depends on what scope you want your unit tests to operate at. If you are dead 
set on having tests that go as deep as possible, there is a class you might want 
to look into called MiniMRCluster, but you can still cover quite a bit with 
MRUnit and JUnit4/Mockito.
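
For the MRUnit end of that spectrum, a minimal sketch of driving a single map()
call in-process (MyMapper and the key/value types are stand-ins for your own
classes; this assumes the new-API MapDriver from
org.apache.hadoop.mrunit.mapreduce):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class MyMapperTest {

  @Test
  public void emitsOneCountPerToken() throws Exception {
    // Drive a single map() call in-process; no cluster or MiniMRCluster needed.
    MapDriver<LongWritable, Text, Text, IntWritable> driver =
        new MapDriver<LongWritable, Text, Text, IntWritable>(new MyMapper()); // stand-in mapper

    driver.withInput(new LongWritable(0), new Text("hadoop hadoop"))
          .withOutput(new Text("hadoop"), new IntWritable(1))
          .withOutput(new Text("hadoop"), new IntWritable(1))
          .runTest();
  }
}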

Matt

-Original Message-
From: Frank Astier [mailto:fast...@yahoo-inc.com] 
Sent: Friday, August 26, 2011 1:30 PM
To: common-user@hadoop.apache.org
Subject: Hadoop in process?

Hi -

Is there a way I can start HDFS (the namenode) from a Java main and run unit 
tests against that? I need to integrate my Java/HDFS program into unit tests, 
and the unit test machine might not have Hadoop installed. I'm currently 
running the unit tests by hand with hadoop jar ... My unit tests create a bunch 
of (small) files in HDFS and manipulate them. I use the fs API for that. I 
don't have map/reduce jobs (yet!).
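
A minimal sketch of the in-process HDFS approach, assuming the Hadoop test jar
(which contains MiniDFSCluster) is on the unit-test classpath and the 0.20-era
constructor (newer versions use a Builder instead):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class InProcessHdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Start a single-datanode HDFS cluster inside this JVM (uses a temp dir).
    MiniDFSCluster cluster = new MiniDFSCluster(conf, 1, true, null);
    try {
      FileSystem fs = cluster.getFileSystem();
      Path p = new Path("/tmp/unit-test/file1");
      fs.create(p).close();
      System.out.println("exists? " + fs.exists(p));
    } finally {
      cluster.shutdown();
    }
  }
}

In a test suite you would typically start it once in a @BeforeClass, point your
code at cluster.getFileSystem(), and shut it down in @AfterClass.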

Thanks!

Frank



RE: Making sure I understand HADOOP_CLASSPATH

2011-08-22 Thread GOEKE, MATTHEW (AG/1000)
If you are asking how to make those classes available at run time, you can 
either use the -libjars option (which ships the jars to the tasks via the 
distributed cache) or you can just shade those classes into your jar using 
Maven. I have had enough issues in the past with the classpath being flaky that 
I prefer the shading method, but obviously that is not the preferred route.
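
One note on -libjars: it is only honored if the driver runs its arguments
through GenericOptionsParser, which is what ToolRunner does for you. A minimal
sketch (MyDriver is a placeholder for your own driver class):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    // getConf() already reflects generic options such as -libjars and -D.
    // ...build and submit the job here using getConf()...
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner strips the generic options (-libjars, -D, -files, ...) before
    // handing the remaining args to run().
    System.exit(ToolRunner.run(new MyDriver(), args));
  }
}

It would then be invoked along the lines of
"hadoop jar myjob.jar MyDriver -libjars mylib.jar <job args>"; the generic
options have to come before your own arguments.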

Matt

-Original Message-
From: W.P. McNeill [mailto:bill...@gmail.com] 
Sent: Monday, August 22, 2011 1:01 PM
To: common-user@hadoop.apache.org
Subject: Making sure I understand HADOOP_CLASSPATH

What does HADOOP_CLASSPATH set in $HADOOP/conf/hadoop-env.sh do?

This isn't clear to me from documentation and books, so I did some
experimenting. Here's the conclusion I came to: the paths in
HADOOP_CLASSPATH are added to the class path of the Job Client, but they are
not added to the class path of the Task Trackers. Therefore if you put a JAR
called MyJar.jar on the HADOOP_CLASSPATH and don't do anything to make it
available to the Task Trackers as well, calls to MyJar.jar code from the
run() method of your job work, but calls from your Mapper or Reducer will
fail at runtime. Is this correct?

If it is, what is the proper way to make MyJar.jar available to both the Job
Client and the Task Trackers?


RE: hadoop cluster on VM's

2011-08-15 Thread GOEKE, MATTHEW (AG/1000)
Is this just for testing purposes or are you planning on going into production 
with this? If it is the latter then I would STRONGLY advise against it, due to 
how the framework handles I/O. However, if you are just trying to test out a 
distributed daemon setup and get some ops documentation, then have at it :)

Matt

-Original Message-
From: Travis Camechis [mailto:camec...@gmail.com] 
Sent: Monday, August 15, 2011 12:45 PM
To: common-user@hadoop.apache.org
Subject: hadoop cluster on VM's

Is it recommended to install a hadoop cluster on a set of VM's that are all
connected to a SAN?

Thanks,
Travis



RE: hadoop cluster on VM's

2011-08-15 Thread GOEKE, MATTHEW (AG/1000)
I was referring to multiple VMs on a single machine (that you have in house) in 
my previous comment, not EC2. FWIW, I would rather see a single heavy data node 
than partition a single box into multiple machines, unless you are trying to do 
more on that server than just Hadoop. Obviously every person / company has their 
own constraints, but if this box is solely for Hadoop then don't partition it; 
otherwise you will incur a decent loss in possible map/reduce slots.

Matt

-Original Message-
From: Liam Friel [mailto:liam.fr...@gmail.com] 
Sent: Monday, August 15, 2011 3:04 PM
To: common-user@hadoop.apache.org
Subject: Re: hadoop cluster on VM's

On Mon, Aug 15, 2011 at 7:31 PM, GOEKE, MATTHEW (AG/1000) 
matthew.go...@monsanto.com wrote:

 Is this just for testing purposes or are you planning on going into
 production with this? If it is the latter than I would STRONGLY advise to
 not give that a second thought due to how the framework handles I/O. However
 if you are just trying to test out distributed daemon setup and get some ops
 documentation then have at it :)

 Matt

 -Original Message-
 From: Travis Camechis [mailto:camec...@gmail.com]
 Sent: Monday, August 15, 2011 12:45 PM
 To: common-user@hadoop.apache.org
 Subject: hadoop cluster on VM's

 Is it recommended to install a hadoop cluster on a set of VM's that are all
 connected to a SAN?


Could you expand on that? Do you mean multiple VMs on a single server are a
no-no?
Or do you mean running Hadoop on something like Amazon EC2 for production is
also a no-no?
With some pointers to background if the latter please ...

Just for my education. I have run some (test I guess you could call them)
Hadoop clusters on EC2 and it was working OK.
However I didn't have the equivalent pile of physical hardware lying around
to do a comparison ... which I guess is why EC2 is so attractive.

Ta
Liam



Unit testing MR without dependency injection

2011-08-15 Thread GOEKE, MATTHEW (AG/1000)
Does anyone have any code examples for how they persist join data across 
multiple input splits and how they test it? Currently I populate a singleton in 
the setup method of my mapper (along with having JVM reuse turned on for this 
job), but with no way to inject dependencies into the mapper I am having a hard 
time wrapping a unit test around the code. I could add a package-scoped setter 
purely for testing purposes, but that just feels dirty to be honest. Any help is 
greatly appreciated; I have both MRUnit and Mockito at my disposal.

  private BitPackedMarkerMap markerMap =
      BitPackedMarkerMapSingleton.getInstance().getMarkerMap();
  private int numberOfIndividuals = -999;
  private int numberOfAlleles = -999;

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    LongPackedDoubleInteger inputSizes;
    if (markerMap.getSize() == 0) {
      FileInputStream scoresInputStream = null;
      try {
        Path[] cacheFiles =
            DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (cacheFiles != null && cacheFiles.length > 0) {
          scoresInputStream = new FileInputStream(cacheFiles[0].toString());
          inputSizes = markerMap.parse(scoresInputStream);
          numberOfIndividuals = inputSizes.getInt1();
          numberOfAlleles = inputSizes.getInt2();
        }
      } catch (IOException e) {
        System.err.println("Exception reading DistributedCache: " + e);
        throw e;
      } finally {
        if (scoresInputStream != null) {
          scoresInputStream.close();
        }
      }
    }
  }
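
One way to get a test around this without the package-scoped setter (a sketch
only: it leans on the getSize() == 0 guard in setup() above, assumes parse()
accepts any InputStream, and TestFixtures/MarkerMapperTest are hypothetical
names, not existing classes):

import java.io.ByteArrayInputStream;
import org.junit.Before;

public class MarkerMapperTest {

  @Before
  public void seedSingleton() throws Exception {
    // Seed the same singleton the mapper reads, via the existing parse() entry
    // point. Because setup() only hits the DistributedCache when getSize() == 0,
    // a pre-populated map means the mapper never touches the cache in tests.
    byte[] fixture = TestFixtures.smallMarkerFile(); // hypothetical small fixture
    BitPackedMarkerMapSingleton.getInstance().getMarkerMap()
        .parse(new ByteArrayInputStream(fixture));
  }

  // ...then drive map() with MRUnit's MapDriver as usual from here.
}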




Matt



RE: Hadoop--store a sequence file in distributed cache?

2011-08-12 Thread GOEKE, MATTHEW (AG/1000)
Sofia, correct me if I am wrong, but Mike, I think this thread was about using 
the output of a previous job, in this case already in sequence file format, as 
in-memory join data for another job.

Side note: does anyone know what the rule of thumb on file size is when using 
the distributed cache vs. just reading from HDFS (for join data, not binary 
files)? I always thought that having a mapper's setup phase read directly from 
HDFS was asking for trouble and that you should always distribute to each node, 
but I am hearing more and more people say to just read directly from HDFS for 
larger file sizes to avoid the I/O cost of the distributed cache.

Matt

-Original Message-
From: Ian Michael Gumby [mailto:michael_se...@hotmail.com] 
Sent: Friday, August 12, 2011 10:54 AM
To: common-user@hadoop.apache.org
Subject: RE: Hadoop--store a sequence file in distributed cache?


This whole thread doesn't make a lot of sense.

If your first m/r job creates the sequence files, which you then use as input 
files to your second job, you don't need to use distributed cache since the 
output of the first m/r job is going to be in HDFS.
(Dino is correct on that account.)

Sofia replied saying that she needed to open and close the sequence file to 
access the data in each Mapper.map() call. 
Without knowing more about the specific app, Ashook is correct that you could 
read the file in Mapper.setup() and then access it in memory.
Joey is correct that you can put anything in the distributed cache, but you 
don't want to put an HDFS file into the distributed cache. The distributed cache 
is a tool for taking something from your job and distributing it to each task 
node as a local object. It does have a bit of overhead.

A better example is if you're distributing binary objects that you want on each 
node, e.g. a C++ .so file that you want to call from within your Java m/r.

If you're not using all of the data in the sequence file, what about using 
HBase?
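
A minimal sketch of that setup()-time read for a cached sequence file (new-API
mapper; it assumes Text keys/values and the old-style SequenceFile.Reader
constructor, so substitute your real types, and "lookup" is just an illustrative
field name):

// imports: java.util.{Map, HashMap}, org.apache.hadoop.conf.Configuration,
//          org.apache.hadoop.filecache.DistributedCache,
//          org.apache.hadoop.fs.{FileSystem, Path},
//          org.apache.hadoop.io.{SequenceFile, Text}
private final Map<String, String> lookup = new HashMap<String, String>();

@Override
protected void setup(Context context) throws IOException, InterruptedException {
  Configuration conf = context.getConfiguration();
  Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf);
  if (cacheFiles != null && cacheFiles.length > 0) {
    // The cached copy sits on the task's local disk, so read it via the local FS.
    SequenceFile.Reader reader = new SequenceFile.Reader(
        FileSystem.getLocal(conf), new Path(cacheFiles[0].toString()), conf);
    try {
      Text key = new Text();
      Text value = new Text();
      while (reader.next(key, value)) {
        lookup.put(key.toString(), value.toString()); // keep the join data in memory
      }
    } finally {
      reader.close();
    }
  }
}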


 From: ash...@clearedgeit.com
 To: common-user@hadoop.apache.org
 Date: Fri, 12 Aug 2011 09:06:39 -0400
 Subject: RE: Hadoop--store a sequence file in distributed cache?
 
 If you are looking for performance gains, then possibly reading these files 
 once during the setup() call in your Mapper and storing them in some data 
 structure like a Map or a List will give you benefits.  Having to open/close 
 the files during each map call will have a lot of unneeded I/O.  
 
 You have to be conscious of your java heap size though since you are 
 basically storing the files in RAM. If your files are a few MB in size as you 
 said, then it shouldn't be a problem.  If the amount of data you need to 
 store won't fit, consider using HBase as a solution to get access to the data 
 you need.
 
 But as Joey said, you can put whatever you want in the Distributed Cache -- 
 as long as you have a reader for it.  You should have no problems using the 
 SequenceFile.Reader.
 
 -- Adam
 
 
 -Original Message-
 From: Joey Echeverria [mailto:j...@cloudera.com] 
 Sent: Friday, August 12, 2011 6:28 AM
 To: common-user@hadoop.apache.org; Sofia Georgiakaki
 Subject: Re: Hadoop--store a sequence file in distributed cache?
 
 You can use any kind of format for files in the distributed cache, so
 yes you can use sequence files. They should be faster to parse than
 most text formats.
 
 -Joey
 
 On Fri, Aug 12, 2011 at 4:56 AM, Sofia Georgiakaki
 geosofie_...@yahoo.com wrote:
  Thank you for the reply!
  In each map(), I need to open-read-close these files (more than 2 in the 
  general case, and maybe up to 20 or more), in order to make some checks. 
  Considering the huge amount of data in the input, making all these file 
  operations on HDFS will kill the performance!!! So I think it would be 
  better to store these files in distributed Cache, so that the whole process 
  would be more efficient -I guess this is the point of using Distributed 
  Cache in the first place!
 
  My question is, if I can store sequence files in distributed Cache and 
  handle them using e.g. the SequenceFile.Reader class, or if I should only 
  keep regular text files in distributed Cache and handle them using the 
  usual java API.
 
  Thank you very much
  Sofia
 
  PS: The files have small size, a few KB to few MB maximum.
 
 
 
  
  From: Dino Kečo dino.k...@gmail.com
  To: common-user@hadoop.apache.org; Sofia Georgiakaki 
  geosofie_...@yahoo.com
  Sent: Friday, August 12, 2011 11:30 AM
  Subject: Re: Hadoop--store a sequence file in distributed cache?
 
  Hi Sofia,
 
  I assume that output of first job is stored on HDFS. In that case I would
  directly read file from Mappers without using distributed cache. If you put
  file into distributed cache that would add one more copy operation into your
  process.
 
  Thanks,
  dino
 
 
  On Fri, Aug 12, 2011 at 9:53 AM, Sofia Georgiakaki
  geosofie_...@yahoo.comwrote:
 
  Good morning,
 
  I would like to store some files in the 

RE: Question about RAID controllers and hadoop

2011-08-11 Thread GOEKE, MATTHEW (AG/1000)
My assumption would be that having a set of 4 single-disk RAID 0 volumes would 
actually be better than having a controller that allowed pure JBOD of 4 disks, 
due to the cache on the controller. If anyone has personal experience with this 
I would love to see performance numbers, but our infrastructure guy is doing 
tests on exactly this over the next couple of days, so I will pass the results 
along once we have them.

Matt

-Original Message-
From: Bharath Mundlapudi [mailto:bharathw...@yahoo.com] 
Sent: Thursday, August 11, 2011 5:00 PM
To: common-user@hadoop.apache.org
Subject: Re: Question about RAID controllers and hadoop

True, you need a P410 controller. You can create a RAID0 volume for each disk to 
make it behave like JBOD.


-Bharath




From: Koert Kuipers ko...@tresata.com
To: common-user@hadoop.apache.org
Sent: Thursday, August 11, 2011 2:50 PM
Subject: Question about RAID controllers and hadoop

Hello all,
We are considering using low-end HP ProLiant machines (DL160s and DL180s)
for cluster nodes. However, with these machines, if you want more than 4
hard drives then HP puts in a P410 RAID controller. We would configure the
RAID controller to function as JBOD by simply creating multiple RAID
volumes with one disk each. Does anyone have experience with this setup? Is it a
good idea, or am I introducing an I/O bottleneck?
Thanks for your help!
Best, Koert



RE: Giving filename as key to mapper ?

2011-07-15 Thread GOEKE, MATTHEW (AG/1000)
If you have the source downloaded (and if you don't, I would suggest you get it) 
you can do a search for *InputFormat.java and you will have all the references 
you need. Also you might want to check out http://codedemigod.com/blog/?p=120 
or take a look at the books "Hadoop in Action" or "Hadoop: The Definitive 
Guide".
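
For the lighter-weight route Harsh and Bobby describe below (pulling the file
name inside the mapper rather than writing a new InputFormat), a quick sketch of
both APIs, assuming a FileInputFormat-based job:

// New API: inside map() or setup(); the split is a FileSplit for
// FileInputFormat-based jobs (import org.apache.hadoop.mapreduce.lib.input.FileSplit).
String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();

// Old API: inside a configure(JobConf job) implementation.
String fileNameOldApi = job.get("map.input.file");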

Matt

-Original Message-
From: praveenesh kumar [mailto:praveen...@gmail.com] 
Sent: Friday, July 15, 2011 9:42 AM
To: common-user@hadoop.apache.org
Subject: Re: Giving filename as key to mapper ?

I am new to this Hadoop API. Can anyone give me a tutorial or code snippet
on how to write your own input format to do these kinds of things?
Thanks.

On Fri, Jul 15, 2011 at 8:07 PM, Robert Evans ev...@yahoo-inc.com wrote:

 To add to that: if you really want the file name to be the key, instead of just
 calling a different API in your map to get it, you will probably need to write
 your own input format. It should be fairly simple and you can base it on an
 existing input format.

 --Bobby

 On 7/15/11 7:40 AM, Harsh J ha...@cloudera.com wrote:

 You can retrieve the filename in the new API as described here:


 http://search-hadoop.com/m/ZOmmJ1PZJqt1/map+input+filenamesubj=Retrieving+Filename

 In the old API, it's available in the configuration instance of the
 mapper as the key map.input.file. See the table below this section

 http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Task+JVM+Reuse
 for more such goodies.

 On Fri, Jul 15, 2011 at 5:44 PM, praveenesh kumar praveen...@gmail.com
 wrote:
  Hi,
  How can I give filename as key to mapper ?
  I want to know the occurrence of a word in a set of docs, so I want to keep the
  key as the filename. Is it possible to give the input key as the filename in
  the map function?
  Thanks,
  Praveenesh
 



 --
 Harsh J





Issue with MR code not scaling correctly with data sizes

2011-07-14 Thread GOEKE, MATTHEW (AG/1000)
All,

I have an MR program that I feed a list of IDs, and it generates the unique 
comparison set as a result. Example: if I have the list {1,2,3,4,5} then the 
resulting output would be {2x1, 3x2, 3x1, 4x3, 4x2, 4x1, 5x4, 5x3, 5x2, 5x1}, or 
(n^2-n)/2 comparisons. My code works just fine on smaller sets (I can verify 
fewer than 1000 fairly easily) but fails when I try to push the set to 10-20k 
IDs, which is annoying when the end goal is 1-10 million.

The flow of the program is:
1) Partition the IDs evenly, based on amount of output per value, into 
a set of keys equal to the number of reduce slots we currently have
2) Use the distributed cache to push the ID file out to the various 
reducers
3) In the setup of the reducer, populate an int array with the values 
from the ID file in distributed cache
4) Output a comparison only if the current ID from the values iterator 
is greater than the current value from the int array (see the sketch after this list)
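
A rough sketch of step 4 on the reduce side (types are placeholders and "ids" is
the int array populated from the cached ID file in step 3):

@Override
protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
  for (IntWritable value : values) {
    int current = value.get();
    // Pair the incoming ID only with strictly smaller cached IDs so each
    // unordered pair is emitted exactly once across the whole set.
    for (int candidate : ids) {
      if (current > candidate) {
        context.write(new IntWritable(current), new IntWritable(candidate));
      }
    }
  }
}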

I realize that this could be done many other ways, but this will be part of an 
Oozie workflow so it made sense to just do it in MR for now. My issue is that 
when I try the larger ID files it only outputs part of the resulting data set, 
and there are no errors to be found. Part of me thinks that I need to tweak some 
site configuration properties, due to the amount of data spilling to disk, but 
after scanning through all 3 *-site.xml files I am having trouble pinpointing 
anything I think could be causing this. I moved from reading the file from HDFS 
to using the distributed cache for the join read, thinking that might solve my 
problem, but there seems to be something else I am overlooking.

Any advice is greatly appreciated!

Matt



RE: Performance Tunning

2011-06-28 Thread GOEKE, MATTHEW (AG/1000)
Mike,

Somewhat of a tangent, but it is actually very informative to hear that you are 
getting bound by I/O with a 2:1 core-to-disk ratio. Could you share what you 
used to make those calls? We have been using both a local Ganglia daemon as well 
as the Hadoop Ganglia metrics to get an overall look at the cluster, and the 
items of interest, I would assume, would be CPU I/O wait as well as the 
throughput of block operations.

Obviously the disconnect on my side was that I didn't realize you were 
dedicating a physical core per daemon. I am a little surprised that you found 
that necessary, but then again, after seeing some of the metrics from my own 
stress testing, I am noticing that we might be overextending with our config on 
heavy loads. Unfortunately I am working with lower-specced hardware at the 
moment so I don't have the overhead to test that out.

Matt

-Original Message-
From: Michael Segel [mailto:michael_se...@hotmail.com] 
Sent: Tuesday, June 28, 2011 1:31 PM
To: common-user@hadoop.apache.org
Subject: RE: Performance Tunning



Matthew,

I understood that Juan was talking about a 2-socket quad-core box. We run 
boxes with the E5500 (Xeon quad-core) chips. Linux sees these as 16 cores. 
Our data nodes are 32GB RAM with 4 x 2TB SATA. It's a pretty basic configuration. 

What I was saying was that if you dedicate 1 core to each of the TT, DN and RS 
daemons, that's 3 out of the 8 physical cores, leaving you 5 cores or 10 
'hyperthread cores'.
So you could put up 10 m/r slots on the machine. Note that for the main daemons 
(TT, DN, RS) I dedicate a physical core.

Of course your mileage may vary if you're doing non-standard things.
A good starting point is 6 mappers and 4 reducers.
And of course YMMV depending on whether you're using MapR's release or 
Cloudera's, and whether you're running HBase or something else on the cluster.

From our experience, we end up getting disk-I/O bound first, and then 
network or memory becomes the next constraint. The Xeon chipsets are 
really good. 

HTH

-Mike


 From: matthew.go...@monsanto.com
 To: common-user@hadoop.apache.org
 Subject: RE: Performance Tunning
 Date: Tue, 28 Jun 2011 14:46:40 +
 
 Mike,
 
 I'm not really sure I have seen a community consensus around how to handle 
 hyper-threading within Hadoop (although I have seen quite a few articles that 
 discuss it). I was assuming that when Juan mentioned they were 4-core boxes 
 that he meant 4 physical cores and not HT cores. I was more stating that the 
 starting point should be 1 slot per thread (or hyper-threaded core) but 
 obviously reviewing the results from ganglia, or any other monitoring 
 solution, will help you come up with a more concrete configuration based on 
 the load.
 
 My brain might not be working this morning but how did you get the 10 slots 
 again? That seems low for an 8 physical core box but somewhat overextending 
 for a 4 physical core box.
 
 Matt
 
 -Original Message-
 From: im_gu...@hotmail.com [mailto:im_gu...@hotmail.com] On Behalf Of Michel 
 Segel
 Sent: Tuesday, June 28, 2011 7:39 AM
 To: common-user@hadoop.apache.org
 Subject: Re: Performance Tunning
 
 Matt,
 You have 2 threads per core, so your Linux box thinks an 8-core box has 16 
 cores. In my calcs, I tend to take a whole core for each of TT, DN and RS and 
 then a thread per slot, so you end up with 10 slots per node. Of course memory 
 is also a factor.
 
 Note this is only a starting point. You can always tune up. 
 
 Sent from a remote device. Please excuse any typos...
 
 Mike Segel
 
 On Jun 27, 2011, at 11:11 PM, GOEKE, MATTHEW (AG/1000) 
 matthew.go...@monsanto.com wrote:
 
  Per node: 4 cores * 2 processes = 8 slots
  Datanode: 1 slot
  Tasktracker: 1 slot
  
  Therefore max of 6 slots between mappers and reducers.
  
  Below is part of our mapred-site.xml. The thing to keep in mind is that the 
  number of maps is defined by the number of input splits (which is defined 
  by your data), so you only need to worry about setting the maximum number of 
  concurrent processes per node. In this case the properties you want to hone 
  in on are mapred.tasktracker.map.tasks.maximum and 
  mapred.tasktracker.reduce.tasks.maximum. Keep in mind there are a LOT of 
  other tuning improvements that can be made, but they require a strong 
  understanding of your job load.
  
  <configuration>
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>2</value>
    </property>

    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>1</value>
    </property>

    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx512m</value>
    </property>

    <property>
      <name>mapred.compress.map.output</name>
      <value>true</value>
    </property>

    <property>
      <name>mapred.output.compress</name>
      <value>true</value>
    </property>
  </configuration>
  
  

RE: Queue support from HDFS

2011-06-27 Thread GOEKE, MATTHEW (AG/1000)
Saumitra,

Two questions come to mind that could help you narrow down a solution:

1) How quickly do the downstream processes need the transformed data?
Reason: If you can delay the processing for a period of time, enough to 
batch the data into a blob that is a multiple of your block size, then you are 
obviously going to be working more towards the strong suit of vanilla MR.

2) What else will be running on the cluster?
Reason: If the cluster is primarily set up for this use case, then how often 
the job runs and what resources it consumes only need to be optimized if it 
can't keep up with the incoming data. If not, you could always set up a 
separate pool for this in the FairScheduler and allow it a certain amount of 
overhead on the cluster when these events are being generated.

Outside of the fact that you would have a lot of small files on the cluster 
(which can be resolved by running a nightly job to blob them together and then 
delete the originals), I am not sure I would be too concerned about at least 
trying out this method. It would be helpful to know the size and type of data 
coming in, as well as what type of operation you are looking to do, if you would 
like a more concrete suggestion. Log data is a prime example of this type of 
workflow, and there are many suggestions out there as well as projects that 
attempt to address it (e.g. Chukwa). 

HTH,
Matt

-Original Message-
From: saumitra.shahap...@gmail.com [mailto:saumitra.shahap...@gmail.com] On 
Behalf Of Saumitra Shahapure
Sent: Friday, June 24, 2011 12:12 PM
To: common-user@hadoop.apache.org
Subject: Queue support from HDFS

Hi,

Is a queue-like structure supported on top of HDFS, where a stream of data is
processed as it is generated?
Specifically, I will have a stream of data coming in, and a data-independent
operation needs to be applied to it (so only a map function; the reducer is the
identity).
I wish to distribute the data among nodes using HDFS and start processing it as
it arrives, preferably in a single MR job.

I agree that it can be done by starting a new MR job for each batch of data,
but is starting many MR jobs frequently for small data chunks a good idea?
(Consider that a new batch arrives every few seconds and processing one batch
takes a few minutes.)

Thanks,
-- 
Saumitra S. Shahapure



RE: Performance Tunning

2011-06-27 Thread GOEKE, MATTHEW (AG/1000)
If you are running default configurations then you are only getting 2 mappers 
and 1 reducer per node. The rule of thumb I have gone by (and which is backed up 
by the Definitive Guide) is 2 processes per core, so after the tasktracker and 
datanode that leaves 6 slots. How you break it up from there is your call, but I 
would suggest either 4 mappers / 2 reducers or 5 mappers / 1 reducer.

Check out the below configs for details on what you are *most likely* running 
currently:
http://hadoop.apache.org/common/docs/r0.20.2/mapred-default.html
http://hadoop.apache.org/common/docs/r0.20.2/hdfs-default.html
http://hadoop.apache.org/common/docs/r0.20.2/core-default.html

HTH,
Matt

-Original Message-
From: Juan P. [mailto:gordoslo...@gmail.com] 
Sent: Monday, June 27, 2011 2:50 PM
To: common-user@hadoop.apache.org
Subject: Performance Tunning

I'm trying to run a MapReduce task against a cluster of 4 DataNodes with 4
cores each.
My input data is 4GB in size and it's split into 100MB files. Current
configuration is default so block size is 64MB.

If I understand it correctly Hadoop should be running 64 Mappers to process
the data.

I'm running a simple data counting MapReduce and it's taking about 30mins to
complete. This seems like way too much, doesn't it?
Is there any tuning you guys would recommend to try and see an improvement
in performance?

Thanks,
Pony



RE: Why I cannot see live nodes in a LAN-based cluster setup?

2011-06-27 Thread GOEKE, MATTHEW (AG/1000)
Did you make sure to define the datanode/tasktracker in the slaves file in your 
conf directory and push that to both machines? Also have you checked the logs 
on either to see if there are any errors?

Matt

-Original Message-
From: Jingwei Lu [mailto:j...@ucsd.edu] 
Sent: Monday, June 27, 2011 3:24 PM
To: HADOOP MLIST
Subject: Why I cannot see live nodes in a LAN-based cluster setup?

Hi Everyone:

I am quite new to Hadoop. I am attempting to set up Hadoop locally on
two machines connected by LAN. Both of them pass the single-node test.
However, I failed at the two-node cluster setup in both of the 2 cases below:

1) set one as a dedicated namenode and the other as a dedicated datanode
2) set one as both name- and data-node, and the other as just a datanode

I launch start-dfs.sh on the namenode. Since I have all the ssh issues
cleared, I can always observe the daemon start up on every datanode.
However, the web UI at http://(URI of namenode):50070 shows only 0 live
nodes for (1) and 1 live node for (2), which is the same as the output of
the command-line hadoop dfsadmin -report.

Generally it appears that from the namenode you cannot see the remote
datanode as alive, let alone run a normal cross-node MapReduce job.

Could anyone give some hints / instructions at this point? I really
appreciate it!

Thanks.

Best Regards
Yours Sincerely

Jingwei Lu


RE: Why I cannot see live nodes in a LAN-based cluster setup?

2011-06-27 Thread GOEKE, MATTHEW (AG/1000)
. Already tried 4 time(s).
 14 2011-06-27 13:45:07,643 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: hdl.ucsd.edu/127.0.0.1:54310. Already tried 5 time(s).
 15 2011-06-27 13:45:08,646 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: hdl.ucsd.edu/127.0.0.1:54310. Already tried 6 time(s).
 16 2011-06-27 13:45:09,661 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: hdl.ucsd.edu/127.0.0.1:54310. Already tried 7 time(s).
 17 2011-06-27 13:45:10,664 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: hdl.ucsd.edu/127.0.0.1:54310. Already tried 8 time(s).
 18 2011-06-27 13:45:11,678 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: hdl.ucsd.edu/127.0.0.1:54310. Already tried 9 time(s).
 19 2011-06-27 13:45:11,679 INFO org.apache.hadoop.ipc.RPC: Server at
hdl.ucsd.edu/127.0.0.1:54310 not available yet, Z...


(Just a guess: is this due to some problem with the port?)

Any comments will be greatly appreciated!

Best Regards
Yours Sincerely

Jingwei Lu



On Mon, Jun 27, 2011 at 1:28 PM, GOEKE, MATTHEW (AG/1000) 
matthew.go...@monsanto.com wrote:

 Did you make sure to define the datanode/tasktracker in the slaves file in
 your conf directory and push that to both machines? Also have you checked
 the logs on either to see if there are any errors?

 Matt

 -Original Message-
 From: Jingwei Lu [mailto:j...@ucsd.edu]
 Sent: Monday, June 27, 2011 3:24 PM
 To: HADOOP MLIST
 Subject: Why I cannot see live nodes in a LAN-based cluster setup?

 Hi Everyone:

 I am quite new to hadoop here. I am attempting to set up Hadoop locally in
 two machines, connected by LAN. Both of them pass the single-node test.
 However, I failed in two-node cluster setup, as shown in the 2 cases below:

 1) set one as dedicated namenode and the other as dedicated datanode
 2) set one as both name- and data-node, and the other as just datanode

 I launch *start-dfs.sh *on the namenode. Since I have all the *ssh *issues
 cleared, thus I can always observe the startup of daemon in every datanode.
 However, by website of *http://(URI of namenode):50070 *it shows only 0
 live
 node for (1) and 1 live node for (2), which is the same as the output by
 command-line *hadoop dfsadmin -report*

 Generally it appears that from the namenode you cannot observe the remote
 datanode alive, let alone a normal across-node MapReduce execution.

 Could anyone give some hints / instructions at this point? I really
 appreciate it!

 Thank.

 Best Regards
 Yours Sincerely

 Jingwei Lu



RE: Why I cannot see live nodes in a LAN-based cluster setup?

2011-06-27 Thread GOEKE, MATTHEW (AG/1000)
At this point, if that is the correct IP, then I would see if you can actually 
ssh from the DN to the NN to make sure it can connect to the other box. If you 
can successfully connect through ssh then it's just a matter of figuring out why 
that port is having issues (netstat is your friend in this case). If you see it 
listening on 54310 then just power cycle the box and try 
again.
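
For the port check, something like this on the NN box (assuming a standard Linux
netstat) shows whether the namenode is actually listening on 54310 and on which
address:

    netstat -tln | grep 54310

If it is bound to 127.0.0.1 rather than the box's LAN address or 0.0.0.0, remote
datanodes will not be able to reach it, which would line up with the loopback
address that showed up in the earlier DN log.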

Matt

-Original Message-
From: Jingwei Lu [mailto:j...@ucsd.edu] 
Sent: Monday, June 27, 2011 5:38 PM
To: common-user@hadoop.apache.org
Subject: Re: Why I cannot see live nodes in a LAN-based cluster setup?

Hi Matt and Jeff:

Thanks a lot for your instructions. I corrected the mistakes in the conf files
on the DN, and now the DN log shows:

2011-06-27 15:32:36,025 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 0 time(s).
2011-06-27 15:32:37,028 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 1 time(s).
2011-06-27 15:32:38,031 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 2 time(s).
2011-06-27 15:32:39,034 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 3 time(s).
2011-06-27 15:32:40,037 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 4 time(s).
2011-06-27 15:32:41,040 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 5 time(s).
2011-06-27 15:32:42,043 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 6 time(s).
2011-06-27 15:32:43,046 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 7 time(s).
2011-06-27 15:32:44,049 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 8 time(s).
2011-06-27 15:32:45,052 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 9 time(s).
2011-06-27 15:32:45,053 INFO org.apache.hadoop.ipc.RPC: Server at
clock.ucsd.edu/132.239.95.91:54310 not available yet, Z...

Seems DN is trying to bind with NN but always fails...



Best Regards
Yours Sincerely

Jingwei Lu



On Mon, Jun 27, 2011 at 2:22 PM, GOEKE, MATTHEW (AG/1000) 
matthew.go...@monsanto.com wrote:

 As a follow-up to what Jeff posted: go ahead and ignore the message you got
 on the NN for now.

 If you look at the address that the DN log shows it is 127.0.0.1 and the
 ip:port it is trying to connect to for the NN is 127.0.0.1:54310 --- it
 is trying to bind to itself as if it was still in single machine mode. Make
 sure that you have correctly pushed the URI for the NN into the config files
 on both machines and then bounce DFS.

 Matt

 -Original Message-
 From: jeff.schm...@shell.com [mailto:jeff.schm...@shell.com]
 Sent: Monday, June 27, 2011 4:08 PM
 To: common-user@hadoop.apache.org
 Subject: RE: Why I cannot see live nodes in a LAN-based cluster setup?

 http://www.mentby.com/tim-robertson/error-register-getprotocolversion.html



 -Original Message-
 From: Jingwei Lu [mailto:j...@ucsd.edu]
 Sent: Monday, June 27, 2011 3:58 PM
 To: common-user@hadoop.apache.org
 Subject: Re: Why I cannot see live nodes in a LAN-based cluster setup?

 Hi,

 I just manually modify the masters  slaves files in the both machines.

 I found something wrong in the log files, as shown below:

 -- Master :
 namenote.log:

 
 2011-06-27 13:44:47,055 INFO org.mortbay.log: jetty-6.1.14
 2011-06-27 13:44:47,394 INFO org.mortbay.log: Started
 SelectChannelConnector@0.0.0.0:50070
 2011-06-27 13:44:47,395 INFO
 org.apache.hadoop.hdfs.server.namenode.NameNode: Web-server up at:
 0.0.0.0:50070
 2011-06-27 13:44:47,395 INFO org.apache.hadoop.ipc.Server: IPC Server
 Responder: starting
 2011-06-27 13:44:47,395 INFO org.apache.hadoop.ipc.Server: IPC Server
 listener on 54310: starting
 2011-06-27 13:44:47,396 INFO org.apache.hadoop.ipc.Server: IPC Server
 handler 0 on 54310: starting
 2011-06-27 13:44:47,397 INFO org.apache.hadoop.ipc.Server: IPC Server
 handler 1 on 54310: starting
 2011-06-27 13:44:47,397 INFO org.apache.hadoop.ipc.Server: IPC Server
 handler 2 on 54310: starting
 2011-06-27 13:44:47,397 INFO org.apache.hadoop.ipc.Server: IPC Server
 handler 3 on 54310: starting
 2011-06-27 13:44:47,402 INFO org.apache.hadoop.ipc.Server: IPC Server
 handler 4 on 54310: starting
 2011-06-27 13:44:47,404 INFO org.apache.hadoop.ipc.Server: IPC Server
 handler 5 on 54310: starting
 2011-06-27 13:44:47,406 INFO org.apache.hadoop.ipc.Server: IPC Server
 handler 6 on 54310: starting
 2011-06-27 13:44:47,406 INFO org.apache.hadoop.ipc.Server: IPC Server
 handler 7 on 54310

RE: Poor scalability with map reduce application

2011-06-21 Thread GOEKE, MATTHEW (AG/1000)
Harsh,

Is it possible for mapred.reduce.slowstart.completed.maps to even play a
significant role in this? The only benefit he would find in tweaking it for
his problem would be to spread the network traffic from the shuffle over a longer
period of time, at the cost of having the reducers tie up resources earlier. Either
way he would see this effect across both sets of runs if he is using the
default parameters. I guess it would all depend on the kind of network layout
the cluster is on.

Matt

-Original Message-
From: Harsh J [mailto:ha...@cloudera.com] 
Sent: Tuesday, June 21, 2011 12:09 PM
To: common-user@hadoop.apache.org
Subject: Re: Poor scalability with map reduce application

Alberto,

On Tue, Jun 21, 2011 at 10:27 PM, Alberto Andreotti
albertoandreo...@gmail.com wrote:
 I don't know if speculative maps are on; I'll check. One thing I
 observed is that reduces begin before all maps have finished. Let me also check
 whether the difference is on the map side or the reduce side. I believe it's
 balanced, both are slower when adding more nodes, but I'll confirm that.

Maps and reduces are speculative by default, so they must have been ON. Could
you also post the general input vs. output record counts and similar statistics
from your job runs, to correlate?

The reducers get scheduled early but do not actually run reduce() until
all maps are done; they just keep fetching map outputs. Their scheduling
can be controlled with configuration (say, to start only after
X% of maps are done -- by default they start up when 5% of maps are
done).
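
For reference, the threshold described above is mapred.reduce.slowstart.completed.maps
(default 0.05, i.e. 5%). A minimal sketch of overriding it cluster-wide in
mapred-site.xml; the 0.80 value is purely illustrative and can also be set per job:

  <property>
    <name>mapred.reduce.slowstart.completed.maps</name>
    <!-- do not launch reducers until 80% of the maps have completed -->
    <value>0.80</value>
  </property>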

-- 
Harsh J

RE: large memory tasks

2011-06-15 Thread GOEKE, MATTHEW (AG/1000)
Is the lookup table constant across all of the tasks? You could try putting it
into memcached:

http://hcil.cs.umd.edu/trs/2009-01/2009-01.pdf
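
As a rough illustration of that approach, each mapper could fetch entries from a
shared memcached instance instead of loading the whole table into its own heap.
A minimal sketch, assuming the spymemcached client is on the task classpath
(class name, key scheme, and cache host/port below are hypothetical):

  import java.io.IOException;
  import java.net.InetSocketAddress;

  import net.spy.memcached.MemcachedClient;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {

    private MemcachedClient cache;

    @Override
    protected void setup(Context context) throws IOException {
      // One connection per task attempt; host and port are illustrative.
      cache = new MemcachedClient(new InetSocketAddress("cache-host.example.com", 11211));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Treat the input line as the lookup key (assumes String values were stored).
      String hit = (String) cache.get(value.toString());
      if (hit != null) {
        context.write(value, new Text(hit));
      }
    }

    @Override
    protected void cleanup(Context context) {
      cache.shutdown();
    }
  }

The trade-off is a network round trip per lookup instead of a local in-memory hit,
so batching keys (or the distributed approach in the paper above) may matter for
throughput.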

Matt

-Original Message-
From: Ian Upright [mailto:i...@upright.net] 
Sent: Wednesday, June 15, 2011 3:42 PM
To: common-user@hadoop.apache.org
Subject: large memory tasks

Hello, I'm quite new to Hadoop, so I'd like to get an understanding of
something.

Let's say I have a task that requires 16 GB of memory in order to execute.
Let's say, hypothetically, it's some sort of big lookup table that
needs that kind of memory.

I could have 8 cores run the task in parallel (multithreaded), and all 8
cores can share that 16 GB lookup table.

On another machine, I could have 4 cores run the same task, and they still
share that same 16 GB lookup table.

Now, with my understanding of Hadoop, each task has its own memory.

So if I have 4 tasks running on one machine and 8 tasks on another, then
the 4 tasks need a 64 GB machine and the 8 tasks need a 128 GB machine. But
really, let's say I only have two machines, one with 4 cores and one with 8,
each with only 24 GB.

How can the work be evenly distributed among these machines?  Am I missing
something?  What other ways can this be configured such that this works
properly?

Thanks, Ian