RE: Experience with Hadoop in production
I would add that it also depends on how thoroughly you have vetted your use cases. If you have already ironed out how ad-hoc access works, Kerberos vs. firewall and network segmentation, how code submission works, procedures for various operational issues, backup of your data, etc. (the list is a couple hundred bullets long at minimum...) on your current cluster, then there might be little need for that support. However, if you are still hoping to figure that stuff out, you could potentially be in a world of hurt when you attempt the transition with just your own staff. It also helps to have outside advice in certain situations to resolve cross-department conflicts over how the cluster will be implemented :)

Matt

-----Original Message-----
From: Mike Lyon [mailto:mike.l...@gmail.com]
Sent: Thursday, February 23, 2012 2:33 PM
To: common-user@hadoop.apache.org
Subject: Re: Experience with Hadoop in production

Just be sure you have that corporate card available 24x7 when you need to call support ;)

Sent from my iPhone

On Feb 23, 2012, at 10:30, Serge Blazhievsky <serge.blazhiyevs...@nice.com> wrote:

What I have often seen companies do is use the free version of a commercial vendor's distribution and only buy their support if there are major problems they cannot solve on their own. That way you get the free distribution and the insurance of support if something goes wrong.

Serge

On 2/23/12 10:42 AM, "Jamack, Peter" <pjam...@consilium1.com> wrote:

A lot of it depends on your staff and their experience. Maybe they don't have Hadoop experience, but if they were involved with large databases, data warehouses, etc., they can apply those skills and provide a lot of help. If you have Linux admins, system admins, and network admins with years of experience, they will be a goldmine. At the other end, database developers who know SQL, programmers who know Java, and so on can really help staff up your 'big data' team. Having a few people who know ETL would be great too.
The biggest problem I've run into seems to be how big the Hadoop project/team is or is not. Sometimes it's just an 'experimental' department, and therefore half the people are only 25-50 percent available to help out. And if they aren't really that knowledgeable about Hadoop, it tends to be one of those not-enough-time-in-the-day scenarios, and the few people dedicated to the Hadoop project(s) will get the brunt of the work. It's like any ecosystem. To do it right, you might need system/network admins, a storage person who actually knows how to set up the proper storage architecture, maybe a security expert, a few programmers, and a few data people. If you're combining analytics, that's another group. Of course, most companies outside the Googles and Facebooks of the world will have only a few people dedicated to Hadoop. Which means you need somebody who knows storage, networking, Linux, system administration, security, and maybe other things (e.g. if you have a firewall issue, somebody needs to figure out ways to make it work through or around it), and then you need some programmers who either know MapReduce or can pretty much figure it out because they've done Java for years.

Peter J

On 2/23/12 10:17 AM, Pavel Frolov <pfro...@gmail.com> wrote:

Hi,

We are going into 24x7 production soon and we are considering whether we need vendor support or not. We use a free vendor distribution of cluster provisioning + Hadoop + HBase and looked at their Enterprise version, but it is very expensive for the value it provides (additional functionality + support), given that we've already ironed out many of our performance and tuning issues on our own and with generous help from the community (e.g. all of you). So, I wanted to run it through the community to see if anybody can share their experience of running a Hadoop cluster (50+ nodes, with Apache releases or vendor distributions) in production, with in-house support only, and how difficult it was.
How many people were involved, etc.

Regards,
Pavel

This e-mail message may contain privileged and/or confidential information, and is intended to be received only by persons entitled to receive such information. If you have received this e-mail in error, please notify the sender immediately. Please delete it and all attachments from any servers, hard drives or any other media. Other use of this e-mail by you is strictly prohibited. All e-mails and attachments sent and received are subject to monitoring, reading and archival by Monsanto, including its subsidiaries. The recipient of this e-mail is solely responsible for checking for the presence of Viruses or other Malware. Monsanto, along with its subsidiaries, accepts no liability for any damage caused by any such code transmitted by or accompanying this e-mail or any attachment. The information contained in this email may be subject to the export control laws and regulations of the United States, potentially including but not limited to the Export Administration Regulations (EAR) and sanctions regulations issued by the U.S. Department of Treasury, Office of Foreign Asset Controls (OFAC). As a recipient of this information you are obligated to comply with all applicable U.S. export laws and regulations.
RE: Large server recommendations
Dale,

Talking solely about Hadoop core, you will only need to run 4 daemons on that machine: Namenode, Jobtracker, Datanode and Tasktracker. There is no reason to run multiples of any of them, as the tasktracker will spawn multiple child JVMs, which is where you will get your task parallelism. When you set your mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum configurations you will limit the upper bound of child JVM creation, but this needs to be configured based on job profile (I don't know much about Mahout, but traditionally I set up clusters as 2:1 mappers to reducers until the profile proves otherwise). If you look at blogs / archives you will see that you can assign 1 child task per *logical* core (e.g. hyper-threaded core), and to be safe you will want 1 daemon per *physical* core, so you can divvy it up based on that recommendation.

To summarize the above: if you are sharing the same IO pipe / box, then there is no reason to have multiple daemons running, because you are not really gaining anything from that level of granularity. Others might disagree based on virtualization, but in your case I would say save yourself the headache and keep it simple.

Matt

-----Original Message-----
From: Dale McDiarmid [mailto:d...@ravn.co.uk]
Sent: Thursday, December 15, 2011 1:50 PM
To: common-user@hadoop.apache.org
Subject: Large server recommendations

Hi all,

New to the community and to Hadoop, and I was looking for some advice on optimal configurations for very large servers. I have a single server with 48 cores and 512GB of RAM and am looking to perform an LDA analysis using Mahout across approx 180 million documents. I have configured my namenode and job tracker. My questions are primarily around the optimal number of tasktrackers and datanodes. I have had no issues configuring multiple datanodes, each of which could potentially utilise its own disk location (the underlying disk is SAN - solid state).

However, from my reading, the typical architecture for Hadoop is a larger number of smaller nodes with a single tasktracker on each host. Could someone please clarify the following:

1. Can multiple task trackers be run on a single host? If so, how is this configured, as it doesn't seem possible to control the host:port?
2. Can I confirm that mapred.map.tasks and mapred.reduce.tasks are JobTracker parameters? The recommendation for these settings seems to relate to the number of task trackers. In my architecture I potentially have only 1, if a single task tracker can only be configured on each host. What should I therefore set these values to, considering the box spec?
3. I noticed the parameters mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum - do these control the number of JVM processes spawned to handle the respective steps? Is one tasktracker configured with 48 equivalent to 48 task trackers each configured with a value of 1 for these settings?
4. What are the benefits of a large number of datanodes on a single large server? I can see value where the host has multiple IO interfaces and disk sets to avoid IO contention. In my case, however, a SAN negates this. Are there still benefits to multiple datanodes outside of resiliency and a potential increase in data transfer, i.e. assuming a single datanode is limited and single-threaded?
5. Any other thoughts/recommended settings?

Thanks,
Dale
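Matt's 2:1 guidance above, applied to Dale's 48-core box, might look like the following mapred-site.xml fragment. The exact numbers are only a starting point to be tuned against the job profile, as Matt notes:

```xml
<!-- mapred-site.xml (illustrative values): cap concurrent child JVMs
     on the single tasktracker, roughly 2:1 mappers to reducers,
     leaving a few cores free for the four daemons themselves -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>30</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>15</value>
</property>
```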
RE: Large server recommendations
mapred.map.tasks is a suggestion to the engine, and there is really no reason to define it, as it will be driven by the block-level partitioning of your files (e.g. if you have a file that is 30 blocks, then by default it will spawn 30 map tasks). As for mapred.reduce.tasks, just set it to whatever you set your mapred.tasktracker.reduce.tasks.maximum (the reasoning being that you are running all of this on a single tasktracker, so these two should essentially line up).

By now I should be able to answer whether those are JT-level vs TT-level parameters, but I have heard one thing and personally experienced another, so I will leave that answer to someone who can confirm 100%. Either way, I would recommend that your JT and TT site files not deviate from each other, for clarity, but you can change mapred.reduce.tasks at the app level, so if you have something that needs a global sort order you can invoke it with mapred.reduce.tasks=1 using a job-level conf.

Matt

From: Dale McDiarmid [mailto:d...@ravn.co.uk]
Sent: Thursday, December 15, 2011 3:58 PM
To: common-user@hadoop.apache.org
Cc: GOEKE, MATTHEW [AG/1000]
Subject: Re: Large server recommendations

Thanks Matt,

Assuming therefore I run a single tasktracker and have 48 cores available, based on your recommendation of 2:1 mapper to reducer threads I will be assigning:

mapred.tasktracker.map.tasks.maximum=30
mapred.tasktracker.reduce.tasks.maximum=15

This brings me to my question: can I confirm whether mapred.map.tasks and mapred.reduce.tasks are JobTracker parameters? The recommendation for these settings seems to relate to the number of task trackers. In my architecture I potentially have only 1, if a single task tracker can only be configured on each host. What should I therefore set these values to, considering the box spec? I have read:

mapred.map.tasks = 10x the number of task trackers
mapred.reduce.tasks = 2x the number of task trackers

Given I have a single task tracker with multiple concurrent processes, does this equate to mapred.map.tasks = 300 and mapred.reduce.tasks = 30? Some reasoning behind these values would be appreciated... I appreciate this is a little simplified and we will need to profile; I'm just looking for a sensible starting position.

Thanks,
Dale
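Matt's point about app-level overrides can be made concrete: the tasktracker maximums stay cluster-wide, while a job that needs a global sort order carries its own reduce count. A per-job configuration override (illustrative sketch) would contain just:

```xml
<!-- Per-job override (illustrative): force a single reducer for a job
     that needs globally sorted output, without touching the
     cluster-wide mapred-site.xml defaults -->
<property>
  <name>mapred.reduce.tasks</name>
  <value>1</value>
</property>
```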
RE: HDFS Explained as Comics
Maneesh,

Firstly, I love the comic :) Secondly, I am inclined to agree with Prashant on this latest point. While one code path could take us through the user defining command-line overrides (e.g. hadoop fs -D blah -put foo bar), I think it might confuse a person new to Hadoop. The most common flow would be using admin-determined values from hdfs-site, and the only thing that would need to change is that the conversation happens between client / server and not user / client.

Matt

-----Original Message-----
From: Prashant Kommireddi [mailto:prash1...@gmail.com]
Sent: Wednesday, November 30, 2011 3:28 PM
To: common-user@hadoop.apache.org
Subject: Re: HDFS Explained as Comics

Sure, it's just a case of how readers interpret it:

1. The client is required to specify block size and replication factor each time, or
2. The client does not need to worry about them, since an admin has set the properties in the default configuration files.

A client cannot be allowed to override the default configs if they are set final (well, there are ways to get around that too, as you suggest, by using create() :). The information is great and helpful. I just want to make sure a beginner who wants to write a WordCount in MapReduce does not worry about specifying block size and replication factor in his code.

Thanks,
Prashant

On Wed, Nov 30, 2011 at 1:18 PM, maneesh varshney <mvarsh...@gmail.com> wrote:

Hi Prashant,

Others may correct me if I am wrong here... The client (org.apache.hadoop.hdfs.DFSClient) has knowledge of the block size and replication factor. In the source code, I see the following in the DFSClient constructor:

defaultBlockSize = conf.getLong("dfs.block.size", DEFAULT_BLOCK_SIZE);
defaultReplication = (short) conf.getInt("dfs.replication", 3);

My understanding is that the client considers the following chain for the values:

1. Manual values (the long-form constructor, when a user provides these values)
2. Configuration file values (these are cluster-level defaults: dfs.block.size and dfs.replication)
3. Finally, the hardcoded values (DEFAULT_BLOCK_SIZE and 3)

Moreover, in org.apache.hadoop.hdfs.protocol.ClientProtocol, the API to create a file is:

void create(..., short replication, long blocksize);

I presume this means that the client already has knowledge of these values and passes them to the NameNode when creating a new file. Hope that helps.

Thanks,
Maneesh

On Wed, Nov 30, 2011 at 1:04 PM, Prashant Kommireddi <prash1...@gmail.com> wrote:

Thanks Maneesh. Quick question: does a client really need to know the block size and replication factor? A lot of the time the client has no control over these (they are set at the cluster level).

-Prashant Kommireddi

On Wed, Nov 30, 2011 at 12:51 PM, Dejan Menges <dejan.men...@gmail.com> wrote:

Hi Maneesh,

Thanks a lot for this! Just distributed it over the team and the comments are great :)

Best regards,
Dejan

On Wed, Nov 30, 2011 at 9:28 PM, maneesh varshney <mvarsh...@gmail.com> wrote:

For your reading pleasure! PDF, 3.3MB, uploaded at (the mailing list has a cap of 1MB on attachments): https://docs.google.com/open?id=0B-zw6KHOtbT4MmRkZWJjYzEtYjI3Ni00NTFjLWE0OGItYTU5OGMxYjc0N2M1

I would appreciate it if you could spare some time to peruse this little experiment of mine: using comics as a medium to explain computer science topics. This particular issue explains the protocols and internals of HDFS. I am eager to hear your opinions on the usefulness of this visual medium for teaching complex protocols and algorithms.

[My personal motivations: I have always found text descriptions to be too verbose, as a lot of effort is spent putting the concepts in proper time-space context (which can easily be avoided in a visual medium); sequence diagrams are unwieldy for non-trivial protocols, and they do not explain concepts; and finally, animations/videos happen too fast and do not offer a self-paced learning experience.]
All forms of criticism, comments (and encouragement) welcome :)

Thanks,
Maneesh
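As a footnote to the discussion above: the admin-determined defaults that DFSClient falls back on live in hdfs-site.xml. A sketch with illustrative values; marking a property final is what prevents client-side overrides, per Prashant's point:

```xml
<!-- hdfs-site.xml (illustrative values): cluster-level defaults that
     DFSClient picks up when the caller does not specify them -->
<property>
  <name>dfs.block.size</name>
  <value>67108864</value> <!-- 64 MB -->
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>
  <!-- final prevents job/client configs from overriding this value -->
  <final>true</final>
</property>
```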
RE: No HADOOP_COMMON_HOME set.
Jay,

Did you download stable (0.20.203.X) or 0.23? From what I can tell after looking in the tarball for 0.23, it is a different setup than 0.20 (e.g. hadoop-env.sh doesn't exist anymore and is replaced by yarn-env.sh), and the documentation you referenced below is for setting up 0.20. I would suggest you go back and download stable, and then the setup documentation you are following will make a lot more sense :)

Matt

-----Original Message-----
From: Jay Vyas [mailto:jayunit...@gmail.com]
Sent: Thursday, November 17, 2011 2:07 PM
To: common-user@hadoop.apache.org
Subject: No HADOOP_COMMON_HOME set.

Hi guys:

I followed the exact directions in the Hadoop installation guide for pseudo-distributed mode here: http://hadoop.apache.org/common/docs/current/single_node_setup.html#Configuration

However, I get that several environment variables are not set (for example, HADOOP_COMMON_HOME is not set). Also, hadoop reported that HADOOP CONF was not set as well. I'm wondering whether there is a resource on how to set the environment variables needed to run Hadoop? Thanks.

--
Jay Vyas
MMSB/UCHC
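For anyone hitting the same errors on a 0.23-style tarball, the complaining variables can be exported before running the scripts. A minimal sketch; the install path is a placeholder, and HADOOP_HDFS_HOME is my assumption about the 0.23 layout rather than something confirmed in the thread:

```shell
# Illustrative environment setup for a 0.23-style tarball layout;
# the install path below is a placeholder for wherever the tarball
# was actually unpacked.
export HADOOP_COMMON_HOME=/opt/hadoop-0.23.0
export HADOOP_HDFS_HOME=$HADOOP_COMMON_HOME
export HADOOP_CONF_DIR=$HADOOP_COMMON_HOME/conf
```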
RE: updated example
The old API is still fully usable in 0.20.204.

Matt

-----Original Message-----
From: Jignesh Patel [mailto:jign...@websoft.com]
Sent: Tuesday, October 11, 2011 12:17 PM
To: common-user@hadoop.apache.org
Subject: Re: updated example

That means the old API is not integrated in 0.20.204.0?? When do you expect the release of 0.20.205?

-Jignesh

On Oct 11, 2011, at 12:32 PM, Tom White wrote:

JobConf and the old API are no longer deprecated in the forthcoming 0.20.205 release, so you can continue to use them without issue. The equivalent in the new API is setInputFormatClass() on org.apache.hadoop.mapreduce.Job.

Cheers,
Tom

On Tue, Oct 11, 2011 at 9:18 AM, Keith Thompson <kthom...@binghamton.edu> wrote:

I see that the JobConf class used in the WordCount tutorial is deprecated in favor of the Configuration class. I want to change the file input format (to the StreamInputFormat for XML, as in Hadoop: The Definitive Guide, pp. 212-213), but I don't see a setInputFormat method in the Configuration class as there was in the JobConf class. Is there an updated example using the non-deprecated classes and methods? I have searched but not found one.

Regards,
Keith
RE: Learning curve after MapReduce and HDFS
Are you learning for the sake of experimenting, or are there functional requirements driving you to dive into this space?

* If you are learning for the sake of adding new tools to your portfolio: look into high-level overviews of each of the projects and review architecture solutions that use them. Focus on how they interact, and target the ones that pique your curiosity the most.
* If you are learning the ecosystem to fulfill some customer requirements: just learn the pieces as you need them. Compare the high-level differences between the sub-projects and let the requirements drive which pieces you focus on.

There are plenty of training videos out there (for free) that cover quite a few of the pieces. I recently came across https://www.db2university.com/courses/auth/openid/login.php which has a basic set of reference materials reviewing a few of the sub-projects within the ecosystem, with included labs. The Yahoo developer network and Cloudera have some great resources as well.

Any one of us could point you in a certain direction, but it is all a matter of opinion. Compare your needs with each of the sub-projects and that should filter the list down to a manageable size.

Matt

-----Original Message-----
From: Varad Meru [mailto:meru.va...@gmail.com]
Sent: Friday, September 30, 2011 11:19 AM
To: common-user@hadoop.apache.org; Varad Meru
Subject: Learning curve after MapReduce and HDFS

Hi all,

I have been working with Hadoop core, Hadoop HDFS and Hadoop MapReduce for the past 8 months. Now I want to learn about other projects under Apache Hadoop, such as Pig, Hive, and HBase. Can you suggest a learning path for the Hadoop ecosystem in a structured manner? I am confused between so many alternatives, such as:

Hive vs Jaql vs Pig
HBase vs Hypertable vs Cassandra

and many other projects which are similar to each other.

Thanks in advance,
Varad

---
Varad Meru
Software Engineer
Persistent Systems and Solutions Ltd.
RE: dump configuration
You could always check the web UI's job history for that particular run, open the job.xml, and search for the value that parameter had at runtime.

Matt

-----Original Message-----
From: patrick sang [mailto:silvianhad...@gmail.com]
Sent: Wednesday, September 28, 2011 4:00 PM
To: common-user@hadoop.apache.org
Subject: dump configuration

Hi hadoopers,

I was looking for a way to dump the hadoop configuration in order to check whether what I have just changed in mapred-site.xml has really kicked in. I found that HADOOP-6184 <https://issues.apache.org/jira/browse/HADOOP-6184> is exactly what I want, but the thing is I am running CDH3u0, which is 0.20.2-based. I wonder if anyone here has a magic way to dump the hadoop configuration; it doesn't need to be JSON, as long as I can check whether what I changed in the configuration file really kicked in.

PS, the setting I changed is mapred.user.jobconf.limit

-P
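For reference, the property Patrick changed would sit in mapred-site.xml as below (the value shown is purely illustrative, not a recommendation); checking job.xml in the job history, as Matt suggests, is exactly how to confirm which value actually won at runtime:

```xml
<!-- mapred-site.xml: cap on the serialized size of a user's job
     configuration; the value here is illustrative only -->
<property>
  <name>mapred.user.jobconf.limit</name>
  <value>5242880</value>
</property>
```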
RE: Temporary Files to be sent to DistributedCache
The simplest route I can think of is to ingest the data directly into HDFS using Sqoop, if there is a driver currently available for your database. At that point it would be relatively simple to read directly from HDFS in your MR code.

Matt

-----Original Message-----
From: lessonz [mailto:less...@q.com]
Sent: Tuesday, September 27, 2011 4:48 PM
To: common-user@hadoop.apache.org
Subject: Temporary Files to be sent to DistributedCache

I have a need to write information retrieved from a database to a series of files that need to be made available to my mappers. Because each mapper needs access to all of these files, I want to put them in the DistributedCache. Is there a preferred method for writing new information to the DistributedCache? I can use Java's File.createTempFile(String prefix, String suffix), but that uses the system default temporary folder. While that should usually work, I'd rather have a method that doesn't depend on writing to the local file system before copying files to the DistributedCache. As I'm extremely new to Hadoop, I hope I'm not missing something obvious.

Thank you for your time.
RE: Environment consideration for a research on scheduling
If you are starting from scratch with no prior Hadoop install experience, I would configure stand-alone, migrate to pseudo-distributed, and then to fully distributed, verifying functionality at each step by doing a simple word count run. Also, if you don't mind using the CDH distribution, then SCM / their rpms will greatly simplify both the bin installs and the user creation. Your VM route will most likely work, but I can imagine the number of hiccups during migration from that to the real cluster will not make it worth your time.

Matt

-----Original Message-----
From: Merto Mertek [mailto:masmer...@gmail.com]
Sent: Friday, September 23, 2011 10:00 AM
To: common-user@hadoop.apache.org
Subject: Environment consideration for a research on scheduling

Hi,

In the first phase we are planning to establish a small cluster with a few commodity computers (each 1GB RAM, 200GB disk, ...). The cluster would run Ubuntu Server 10.10 and a Hadoop build from the branch 0.20.204 (I had some issues with version 0.20.203 with missing libraries: http://hadoop-common.472056.n3.nabble.com/Development-enviroment-problems-eclipse-hadoop-0-20-203-td3186022.html#a3188567). Would you suggest any other version?

In the second phase we are planning to analyse, test and modify some of the Hadoop schedulers. Now I am interested in the best way to deploy Ubuntu and Hadoop to these few machines. I was thinking of configuring the system in a local VM and then converting it to each physical machine, but probably this is not the best option. If you know any other, please share.

Thank you!
Question regarding Oozie and Hive replication / backup
I would like to have a robust setup for anything residing on our edge nodes, which is where these two daemons will be, and I was curious if anyone had any suggestions around how to replicate / keep an active clone of the metadata for these components. We already use DRBD and a vip to get around this issue for our master nodes and I know this would work for the edge nodes but I wanted to make sure I wasn't overlooking any options. Hive: Currently tinkering with the built-in db but evaluating whether to go with a dedicated MySQL or PostgreSQL instance, so suggestions can reference either solution. Thanks, Matt
Hadoop RPC and general serialization question
I was reviewing a video from Hadoop Summit 2011[1] where Arun Murthy mentioned that MRv2 was moving towards protocol buffers as the wire format but I feel like this is contrary to an Avro presentation that Doug Cutting did back in Hadoop World '09[2]. I haven't stayed up to date with the Jira for MRv2 but is there a disagreement between contributors as to which format will be the de facto standard going forward, and if so what are the biggest points of contention? The only reason I bring this up is I am trying to integrate a serialization framework into our best practices and, while I am currently working towards Avro, this disconnect caused a little concern. Matt *1 - http://www.youtube.com/watch?v=2FpO7w6X41I *2 - http://www.cloudera.com/videos/hw09_next_steps_for_hadoop
RE: risks of using Hadoop
I would completely agree with Mike's comments with one addition: Hadoop centers around how to manipulate the flow of data in a way that makes the framework work for your specific problem. There are recipes for common problems but depending on your domain that might solve only 30-40% of your use cases. It should take little to no time for a good java dev to understand how to make an MR program. It will take significantly more time for that java dev to understand the domain and Hadoop well enough to consistently write *good* MR programs. Mike listed some great ways to cut down on that curve but you really want someone who has not only an affinity for code but can also apply critical thinking to how you should pipeline your data. If you plan on using it purely with Pig/Hive abstractions on top then this can be negated significantly. Some might disagree but that is my $0.02 Matt -Original Message- From: Michael Segel [mailto:michael_se...@hotmail.com] Sent: Wednesday, September 21, 2011 12:48 PM To: common-user@hadoop.apache.org Subject: RE: risks of using Hadoop Kobina The points 1 and 2 are definitely real risks. SPOF is not. As I pointed out in my mini-rant to Tom, your end users / developers who use the cluster can do more harm to your cluster than a SPOF machine failure. I don't know what one would consider a 'long learning curve'. With the adoption of any new technology, you're talking at least 3-6 months based on the individual and the overall complexity of the environment. Take anyone who is a strong developer, put them through Cloudera's training, plus some play time, and you've shortened the learning curve. The better the java developer, the easier it is for them to pick up Hadoop. I would also suggest taking the approach of hiring a senior person who can cross train and mentor your staff. This too will shorten the runway. 
HTH -Mike Date: Wed, 21 Sep 2011 17:02:45 +0100 Subject: Re: risks of using Hadoop From: kobina.kwa...@gmail.com To: common-user@hadoop.apache.org Jignesh, Will your point 2 still be valid if we hire very experienced Java programmers? Kobina. On 20 September 2011 21:07, Jignesh Patel jign...@websoft.com wrote: @Kobina 1. Lack of skill set 2. Longer learning curve 3. Single point of failure @Uma I am curious to know about .20.2 - is that stable? Is it the same as the one you mention in your email (Federation changes)? If I need a scaled NameNode and append support, which version should I choose? Regarding the single point of failure, I believe Hortonworks (a.k.a. Yahoo) is updating the Hadoop API. When will that be integrated with Hadoop? -Jignesh On Sep 17, 2011, at 12:08 AM, Uma Maheswara Rao G 72686 wrote: Hi Kobina, Some experiences which may be helpful for you with respect to DFS. 1. Selecting the correct version. I will recommend using the 0.20X version. This is a pretty stable version and all other organizations prefer it. Well tested as well. Don't go for the 21 version. That version is not stable. That is a risk. 2. You should perform thorough tests with your customer operations. (of course you will do this :-)) 3. The 0.20x version has the problem of SPOF. If the NameNode goes down you will lose the data. One way of recovering is by using the SecondaryNameNode. You can recover the data up to the last checkpoint, but manual intervention is required there. In the latest trunk, SPOF will be addressed by HDFS-1623. 4. 0.20x NameNodes can not scale. Federation changes are included in the latest versions (I think in 22). This may not be a problem for your cluster, but please consider this aspect as well. 5. Please select the Hadoop version depending on your security requirements. There are versions available with security as well in 0.20X. 6. If you plan to use HBase, it requires append support. 20Append has the support for append. 
The 0.20.205 release also will have append support but is not yet released. Choose your correct version to avoid sudden surprises. Regards, Uma - Original Message - From: Kobina Kwarko kobina.kwa...@gmail.com Date: Saturday, September 17, 2011 3:42 am Subject: Re: risks of using Hadoop To: common-user@hadoop.apache.org We are planning to use Hadoop in my organisation for quality of services analysis out of CDR records from mobile operators. We are thinking of having a small cluster of maybe 10 - 15 nodes and I'm preparing the proposal. My office requires that I provide some risk analysis in the proposal. Thank you. On 16 September 2011 20:34, Uma Maheswara Rao G 72686 mahesw...@huawei.com wrote: Hello, First of all, where are you planning to use Hadoop? Regards, Uma - Original Message - From: Kobina Kwarko kobina.kwa...@gmail.com Date: Saturday, September 17, 2011 0:41 am Subject: risks
RE: how to set the number of mappers with 0 reducers?
Amusingly this is almost the same question that was asked the other day :) quote from Owen O'Malley There isn't currently a way of getting a collated, but unsorted list of key/value pairs. For most applications, the in memory sort is fairly cheap relative to the shuffle and other parts of the processing. /quote If you know that you will be filtering out a significant amount of information to the point where shuffle will be trivial then the impact of a reduce phase should be minimal using an identity reducer. It is either that or aggregate as much data as you feel comfortable with into each split and have 1 file per map. How much data / what percentage of input are you assuming will be output from each of these maps? Matt -Original Message- From: Peng, Wei [mailto:wei.p...@xerox.com] Sent: Tuesday, September 20, 2011 10:22 AM To: common-user@hadoop.apache.org Subject: RE: how to set the number of mappers with 0 reducers? Thank you all for the quick reply!! I think I was wrong. It has nothing to do with the number of mappers because each input file has size 500M, which is not too small in terms of 64M per block. The problem is that the output from each mapper is too small. Is there a way to combine some mappers' output together? Setting the number of reducers to 1 might produce a very huge file. Can I set the number of reducers to 100, but skip sorting, shuffling, etc.? Wei -Original Message- From: Soumya Banerjee [mailto:soumya.sbaner...@gmail.com] Sent: Tuesday, September 20, 2011 2:06 AM To: common-user@hadoop.apache.org Subject: Re: how to set the number of mappers with 0 reducers? Hi, If you want all your map outputs in a single file you can use an IdentityReducer and set the number of reducers to 1. This would ensure that all your mapper output goes into the reducer and it writes into a single file. Soumya On Tue, Sep 20, 2011 at 2:04 PM, Harsh J ha...@cloudera.com wrote: Hello Wei! 
On Tue, Sep 20, 2011 at 1:25 PM, Peng, Wei wei.p...@xerox.com wrote: (snip) However, the output from the mappers result in many small files (size is ~50k, the block size is however 64M, so it wastes a lot of space). How can I set the number of mappers (say 100)? What you're looking for is to 'pack' several files per mapper, if I get it right. In that case, you need to check out the CombineFileInputFormat. It can pack several files per mapper (with some degree of locality). Alternatively, pass a list of files (as a text file) as your input, and have your Mapper logic read them one by one. This way, if you divide 50k filenames over 100 files, you will get 100 mappers as you want - but at the cost of losing almost all locality. If there is no way to set the number of mappers, the only way to solve it is cat some files together? Concatenating is an alternative, if affordable - yes. You can lower the file count (down from 50k) this way. -- Harsh J
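Harsh's second suggestion - dividing a large list of input paths into N list files, one per eventual mapper - can be sketched in plain Java. The class and method names here are illustrative, not from the thread:

```java
import java.util.ArrayList;
import java.util.List;

public class InputListSplitter {
    // Round-robin a list of input file names into numLists buckets, one
    // bucket per eventual map task. Each bucket would then be written out
    // as a text file and those text files used as the job's input, so a
    // job over 50k tiny files runs with only numLists mappers.
    public static List<List<String>> split(List<String> fileNames, int numLists) {
        List<List<String>> buckets = new ArrayList<List<String>>();
        for (int i = 0; i < numLists; i++) {
            buckets.add(new ArrayList<String>());
        }
        for (int i = 0; i < fileNames.size(); i++) {
            buckets.get(i % numLists).add(fileNames.get(i));
        }
        return buckets;
    }
}
```

As Harsh notes, each mapper then opens its listed files itself, so almost all data locality is lost; CombineFileInputFormat is the locality-aware alternative.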
RE: how to set the number of mappers with 0 reducers?
There is currently no way to disable sort/shuffle. You can do many things to alleviate any issues you have with it though, one of which you mentioned below. Is there a reason why you are allowing each of your keys to be unique? If it is truly because you do not care, then just assign keys from an even distribution to allow for more aggregation. On a side note, what is the actual stack trace you are getting when the reducers fail, and what is the reducer doing? I think for your use case using a reduce phase is the best way to go, as long as the job time meets your SLA, so we need to figure out why the job is failing. Matt -Original Message- From: Peng, Wei [mailto:wei.p...@xerox.com] Sent: Tuesday, September 20, 2011 10:44 AM To: common-user@hadoop.apache.org Subject: RE: how to set the number of mappers with 0 reducers? The input is 9010 files (each 500MB), and I would estimate the output to be around 50GB. My Hadoop job failed because it ran out of memory (with 66 reducers). I guess that the key from each mapper output is unique so the sorting would be memory-intensive. Although I can set another key to reduce the number of unique keys, I am curious if there is a way to disable sorting/shuffling. Thanks, Wei
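Matt's suggestion of assigning keys from an even distribution, rather than letting every record carry a unique key, can be sketched as a simple hash-bucket function. The class name and bucket counts are illustrative assumptions:

```java
public class KeyBuckets {
    // Map an arbitrary (possibly unique) record key onto one of
    // numBuckets synthetic keys, so the sort/shuffle has far fewer
    // distinct keys to group, at the cost of doing the real
    // aggregation inside the reducer for each bucket.
    public static int bucketFor(String recordKey, int numBuckets) {
        // Mask off the sign bit so the modulus is never negative.
        return (recordKey.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }
}
```

The mapper would emit `bucketFor(key, N)` as its output key; identical record keys always land in the same bucket, so per-key aggregation in the reducer still works.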
RE: Using HBase for real time transaction
In order to answer your first question we would need to know what types of data you plan on storing and your latency requirements. If it is semi-structured/unstructured data then HBase *can* be a great fit, but I have seen very few cases where you will want to scrap your RDBMS completely. Most organizations that use HBase will still have a need for an RDBMS/MPP solution for real time access to structured data. Matt -Original Message- From: Jignesh Patel [mailto:jign...@websoft.com] Sent: Tuesday, September 20, 2011 4:25 PM To: common-user@hadoop.apache.org Subject: Re: Using HBase for real time transaction Tom, Let me reword: can HBase be used as a transactional database (i.e. as a replacement for mysql)? The requirement is to have real time read and write operations. I mean as soon as data is written the user should see the data (here data should be written in HBase). -Jignesh On Sep 20, 2011, at 5:11 PM, Tom Deutsch wrote: Real-time means different things to different people. Can you share your latency requirements from the time the data is generated to when it needs to be consumed, or how you are thinking of using HBase in the overall flow? Tom Deutsch Program Director CTO Office: Information Management Hadoop Product Manager / Customer Exec IBM 3565 Harbor Blvd Costa Mesa, CA 92626-1420 tdeut...@us.ibm.com Jignesh Patel jign...@websoft.com 09/20/2011 12:57 PM Please respond to common-user@hadoop.apache.org To common-user@hadoop.apache.org cc Subject Using HBase for real time transaction We are exploring the possibility of using HBase for real time transactions. Is that possible? -Jignesh
RE: phases of Hadoop Jobs
Was the command line output really ever intended to be *that* verbose? I can see how it would be useful, but considering how easy it is to interact with the web UI I can't see why much effort should be put into enhancing it. Even if you didn't want to see all of the details the web UI has to offer, it doesn't take long to learn how to skim it and get a 10x more accurate reading on your job progress. Matt -Original Message- From: Arun C Murthy [mailto:a...@hortonworks.com] Sent: Sunday, September 18, 2011 11:27 PM To: common-user@hadoop.apache.org Subject: Re: phases of Hadoop Jobs Agreed. At least, I believe the new web UI for MRv2 is (or will be soon) more verbose about this. On Sep 18, 2011, at 9:23 PM, Kai Voigt wrote: Hi, these 0-33-66-100% phases are really confusing to beginners. We see that in our training classes. The output should be more verbose, such as breaking down the phases into separate progress numbers. Does that make sense? On 19.09.2011 at 06:17, Arun C Murthy wrote: Nan, The 'phase' is implicitly understood by the 'progress' (value) made by the map/reduce tasks (see o.a.h.mapred.TaskStatus.Phase). For e.g. Reduce: 0-33% - Shuffle 34-66% - Sort (actually, just 'merge', there is no sort in the reduce since all map-outputs are sorted) 67-100% - Reduce With 0.23 onwards the Map has phases too: 0-90% - Map 91-100% - Final Sort/merge Now, about starting reduces early - this is done to ensure shuffle can proceed for completed maps while the rest of the maps run, thereby pipelining shuffle and map completion. There is a 'reduce slowstart' feature to control this - by default, reduces aren't started until 5% of maps are complete. Users can set this higher. 
Arun On Sep 18, 2011, at 7:24 PM, Nan Zhu wrote: Hi all, recently I was hit by a question: how is a Hadoop job divided into 2 phases? In textbooks, we are told that MapReduce jobs are divided into 2 phases, map and reduce, and for reduce, we further divide it into 3 stages: shuffle, sort, and reduce. But in the Hadoop code I never thought about this question; I didn't see any member variables in the JobInProgress class to indicate this information. According to my understanding of the source code, the reduce tasks are not started until all mappers are finished, yet in contrast we can see reduce tasks in the shuffle stage while there are mappers still running. So how can I tell which phase a job is in? Thanks -- Nan Zhu School of Electronic, Information and Electrical Engineering, 229 Shanghai Jiao Tong University 800, Dongchuan Road, Shanghai, China E-Mail: zhunans...@gmail.com -- Kai Voigt k...@123.org
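Arun's 0-33-66-100% breakdown means a monitoring script can recover the implied reduce phase from the raw progress fraction alone. A small sketch of that mapping; the class and method names are made up, and the thresholds are the ones listed in the thread:

```java
public class ReducePhase {
    // Translate a reduce task's reported progress fraction into the
    // phase it implies, per the 0-33% shuffle / 34-66% sort (really a
    // merge) / 67-100% reduce convention described in the thread.
    public static String phaseOf(double progress) {
        if (progress < 0.0 || progress > 1.0) {
            throw new IllegalArgumentException("progress must be in [0,1]");
        }
        if (progress <= 0.33) return "shuffle";
        if (progress <= 0.66) return "sort";
        return "reduce";
    }
}
```

This is only the implicit convention the framework's progress numbers follow; the authoritative value lives in o.a.h.mapred.TaskStatus.Phase.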
Hadoop/CDH + Avro
Would anyone happen to be able to share a good reference for Avro integration with Hadoop? I can find plenty of material around using Avro by itself but I have found little to no documentation on how to implement it as both the protocol and as custom key/value types. Thanks, Matt
Hadoop multi tier backup
All, We were discussing how we would back up our data from the various environments we will have and I was hoping someone could chime in with previous experience on this. My primary concern about our cluster is that we would like to be able to recover anything within the last 60 days, so having full backups both on tape and through distcp is preferred. Our initial thoughts can be seen in the jpeg attached, but just in case any of you are wary of attachments it can also be summarized below: Prod Cluster --DistCp-- On-site backup cluster with Fuse mount point running NetBackup daemon --NetBackup-- Media Server -- Tape One of our biggest grey areas so far is how do most people accomplish incremental backups? Our thought was to tie this into our NetBackup configuration as this can be done for other connectors but we do not see anything for HDFS yet. Thanks, Matt
RE: Hadoop in process?
It depends on what scope you want your unit tests to operate at. There is a class you might want to look into called MiniMRCluster if you are dead set on having tests as deep as possible, but you can still cover quite a bit with MRUnit and JUnit 4 / Mockito. Matt -Original Message- From: Frank Astier [mailto:fast...@yahoo-inc.com] Sent: Friday, August 26, 2011 1:30 PM To: common-user@hadoop.apache.org Subject: Hadoop in process? Hi - Is there a way I can start HDFS (the namenode) from a Java main and run unit tests against that? I need to integrate my Java/HDFS program into unit tests, and the unit test machine might not have Hadoop installed. I'm currently running the unit tests by hand with hadoop jar ... My unit tests create a bunch of (small) files in HDFS and manipulate them. I use the fs API for that. I don't have map/reduce jobs (yet!). Thanks! Frank
RE: Making sure I understand HADOOP_CLASSPATH
If you are asking how to make those classes available at run time you can either use the -libjars command for the distributed cache or you can just shade those classes into your jar using maven. I have had enough issues in the past with classpath being flaky that I prefer the shading method but obviously that is not the preferred route. Matt -Original Message- From: W.P. McNeill [mailto:bill...@gmail.com] Sent: Monday, August 22, 2011 1:01 PM To: common-user@hadoop.apache.org Subject: Making sure I understand HADOOP_CLASSPATH What does HADOOP_CLASSPATH set in $HADOOP/conf/hadoop-env.sh do? This isn't clear to me from documentation and books, so I did some experimenting. Here's the conclusion I came to: the paths in HADOOP_CLASSPATH are added to the class path of the Job Client, but they are not added to the class path of the Task Trackers. Therefore if you put a JAR called MyJar.jar on the HADOOP_CLASSPATH and don't do anything to make it available to the Task Trackers as well, calls to MyJar.jar code from the run() method of your job work, but calls from your Mapper or Reducer will fail at runtime. Is this correct? If it is, what is the proper way to make MyJar.jar available to both the Job Client and the Task Trackers? This e-mail message may contain privileged and/or confidential information, and is intended to be received only by persons entitled to receive such information. If you have received this e-mail in error, please notify the sender immediately. Please delete it and all attachments from any servers, hard drives or any other media. Other use of this e-mail by you is strictly prohibited. All e-mails and attachments sent and received are subject to monitoring, reading and archival by Monsanto, including its subsidiaries. The recipient of this e-mail is solely responsible for checking for the presence of Viruses or other Malware. 
Monsanto, along with its subsidiaries, accepts no liability for any damage caused by any such code transmitted by or accompanying this e-mail or any attachment. The information contained in this email may be subject to the export control laws and regulations of the United States, potentially including but not limited to the Export Administration Regulations (EAR) and sanctions regulations issued by the U.S. Department of Treasury, Office of Foreign Asset Controls (OFAC). As a recipient of this information you are obligated to comply with all applicable U.S. export laws and regulations.
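For reference, the two approaches Matt mentions look roughly like this on the command line (a sketch; the jar names and driver class are placeholders, and -libjars is only honored when the driver goes through ToolRunner/GenericOptionsParser):

```
# Makes MyJar.jar available to both the job client and the task JVMs
# (driver must implement Tool and be launched via ToolRunner)
hadoop jar myjob.jar com.example.MyDriver \
  -libjars /local/path/MyJar.jar \
  /input /output

# HADOOP_CLASSPATH by contrast only affects the client-side JVM:
export HADOOP_CLASSPATH=/local/path/MyJar.jar   # visible in run(), not in tasks
```

The shading alternative bundles the dependency's classes directly into myjob.jar at build time (e.g. with the maven-shade-plugin), which avoids the classpath question entirely at the cost of a fatter jar.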
RE: hadoop cluster on VM's
Is this just for testing purposes or are you planning on going into production with this? If it is the latter then I would STRONGLY advise to not give that a second thought due to how the framework handles I/O. However if you are just trying to test out distributed daemon setup and get some ops documentation then have at it :) Matt -Original Message- From: Travis Camechis [mailto:camec...@gmail.com] Sent: Monday, August 15, 2011 12:45 PM To: common-user@hadoop.apache.org Subject: hadoop cluster on VM's Is it recommended to install a hadoop cluster on a set of VM's that are all connected to a SAN? Thanks, Travis
RE: hadoop cluster on VM's
I was referring to multiple VM's on a single machine (that you have in house) for my previous comment and not EC2. FWIW, I would rather see a single heavy data node than to partition off a single box into multiple machines unless you are trying to do more on that server than just Hadoop. Obviously every person / company has their own constraints but if this box is solely for Hadoop then don't partition it otherwise you will incur a decent loss in possible map/reduce slots. Matt -Original Message- From: Liam Friel [mailto:liam.fr...@gmail.com] Sent: Monday, August 15, 2011 3:04 PM To: common-user@hadoop.apache.org Subject: Re: hadoop cluster on VM's On Mon, Aug 15, 2011 at 7:31 PM, GOEKE, MATTHEW (AG/1000) matthew.go...@monsanto.com wrote: Is this just for testing purposes or are you planning on going into production with this? If it is the latter than I would STRONGLY advise to not give that a second thought due to how the framework handles I/O. However if you are just trying to test out distributed daemon setup and get some ops documentation then have at it :) Matt -Original Message- From: Travis Camechis [mailto:camec...@gmail.com] Sent: Monday, August 15, 2011 12:45 PM To: common-user@hadoop.apache.org Subject: hadoop cluster on VM's Is it recommended to install a hadoop cluster on a set of VM's that are all connected to a SAN? Could you expand on that? Do you mean multiple VMs on a single server are a no-no? Or do you mean running Hadoop on something like Amazon EC2 for production is also a no-no? With some pointers to background if the latter please ... Just for my education. I have run some (test I guess you could call them) Hadoop clusters on EC2 and it was working OK. However I didn't have the equivalent pile of physical hardware lying around to do a comparison ... which I guess is why EC2 is so attractive. 
Ta Liam
Unit testing MR without dependency injection
Does anyone have any code examples for how they persist join data across multiple input splits and how they test it? Currently I populate a singleton in the setup method of my mapper (along with having JVM reuse turned on for this job) but with no way to have dependency injection into the mapper I am really having a hard time wrapping a UT around the code. I could have a package-scoped setter simply for testing purposes but that just feels dirty to be honest. Any help is greatly appreciated and I have both MRUnit and Mockito at my disposal.

private BitPackedMarkerMap markerMap =
    BitPackedMarkerMapSingleton.getInstance().getMarkerMap();
private int numberOfIndividuals = -999;
private int numberOfAlleles = -999;

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    LongPackedDoubleInteger inputSizes;
    if (markerMap.getSize() == 0) {
        FileInputStream scoresInputStream = null;
        try {
            Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            if (cacheFiles != null && cacheFiles.length > 0) {
                scoresInputStream = new FileInputStream(cacheFiles[0].toString());
                inputSizes = markerMap.parse(scoresInputStream);
                numberOfIndividuals = inputSizes.getInt1();
                numberOfAlleles = inputSizes.getInt2();
            }
        } catch (IOException e) {
            System.err.println("Exception reading DistributedCache: " + e);
            throw e;
        } finally {
            if (scoresInputStream != null) {
                scoresInputStream.close();
            }
        }
    }
}

Matt
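One common workaround for the testability question above (a sketch, not from the thread; the class names mirror the snippet but the seam itself is hypothetical) is to keep the singleton as the production default while exposing a package-private setter that tests in the same package can reach:

```java
// Production code path still goes through the singleton.
private BitPackedMarkerMap markerMap =
    BitPackedMarkerMapSingleton.getInstance().getMarkerMap();

// Package-private seam: invisible to callers outside the package,
// but a unit test living in the same package can inject a mock here.
void setMarkerMap(BitPackedMarkerMap map) {
    this.markerMap = map;
}
```

With that seam in place, an MRUnit MapDriver can inject a Mockito mock of the marker map before running the mapper, bypassing the DistributedCache and JVM-reuse machinery entirely. It is the "dirty" option the post mentions, but scoping it package-private at least keeps it out of the public API.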
RE: Hadoop--store a sequence file in distributed cache?
Sofia, correct me if I am wrong, but Mike, I think this thread was about using the output of a previous job, in this case already in sequence file format, as in-memory join data for another job. Side note: does anyone know what the rule of thumb on file size is when using the distributed cache vs just reading from HDFS (join data, not binary files)? I always thought that having a setup phase on a mapper read directly from HDFS was asking for trouble and that you should always distribute to each node, but I am hearing more and more people say to just read directly from HDFS for larger file sizes to avoid the IO cost of the distributed cache. Matt -Original Message- From: Ian Michael Gumby [mailto:michael_se...@hotmail.com] Sent: Friday, August 12, 2011 10:54 AM To: common-user@hadoop.apache.org Subject: RE: Hadoop--store a sequence file in distributed cache? This whole thread doesn't make a lot of sense. If your first m/r job creates the sequence files, which you then use as input files to your second job, you don't need to use distributed cache since the output of the first m/r job is going to be in HDFS. (Dino is correct on that account.) Sofia replied saying that she needed to open and close the sequence file to access the data in each Mapper.map() call. Without knowing more about the specific app, Ashook is correct that you could read the file in Mapper.setup() and then access it in memory. Joey is correct that you can put anything in distributed cache, but you don't want to put an HDFS file into distributed cache. Distributed cache is a tool for taking something from your job and distributing it to each task node as a local object. It does have a bit of overhead. A better example is if you're distributing binary objects that you want on each node, e.g. a C++ .so file that you want to call from within your Java m/r. If you're not using all of the data in the sequence file, what about using HBase?
From: ash...@clearedgeit.com To: common-user@hadoop.apache.org Date: Fri, 12 Aug 2011 09:06:39 -0400 Subject: RE: Hadoop--store a sequence file in distributed cache? If you are looking for performance gains, then possibly reading these files once during the setup() call in your Mapper and storing them in some data structure like a Map or a List will give you benefits. Having to open/close the files during each map call will have a lot of unneeded I/O. You have to be conscious of your java heap size though since you are basically storing the files in RAM. If your files are a few MB in size as you said, then it shouldn't be a problem. If the amount of data you need to store won't fit, consider using HBase as a solution to get access to the data you need. But as Joey said, you can put whatever you want in the Distributed Cache -- as long as you have a reader for it. You should have no problems using the SequenceFile.Reader. -- Adam -Original Message- From: Joey Echeverria [mailto:j...@cloudera.com] Sent: Friday, August 12, 2011 6:28 AM To: common-user@hadoop.apache.org; Sofia Georgiakaki Subject: Re: Hadoop--store a sequence file in distributed cache? You can use any kind of format for files in the distributed cache, so yes you can use sequence files. They should be faster to parse than most text formats. -Joey On Fri, Aug 12, 2011 at 4:56 AM, Sofia Georgiakaki geosofie_...@yahoo.com wrote: Thank you for the reply! In each map(), I need to open-read-close these files (more than 2 in the general case, and maybe up to 20 or more), in order to make some checks. Considering the huge amount of data in the input, making all these file operations on HDFS will kill the performance!!! So I think it would be better to store these files in distributed Cache, so that the whole process would be more efficient -I guess this is the point of using Distributed Cache in the first place! My question is, if I can store sequence files in distributed Cache and handle them using e.g. 
the SequenceFile.Reader class, or if I should only keep regular text files in distributed Cache and handle them using the usual java API. Thank you very much Sofia PS: The files have small size, a few KB to few MB maximum. From: Dino Kečo dino.k...@gmail.com To: common-user@hadoop.apache.org; Sofia Georgiakaki geosofie_...@yahoo.com Sent: Friday, August 12, 2011 11:30 AM Subject: Re: Hadoop--store a sequence file in distributed cache? Hi Sofia, I assume that output of first job is stored on HDFS. In that case I would directly read file from Mappers without using distributed cache. If you put file into distributed cache that would add one more copy operation into your process. Thanks, dino On Fri, Aug 12, 2011 at 9:53 AM, Sofia Georgiakaki geosofie_...@yahoo.comwrote: Good morning, I would like to store some files in the
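For what it's worth, reading a cached sequence file in Mapper.setup() looks roughly like this on the 0.20-era API (a hedged sketch; the Text/IntWritable key and value types are placeholders for whatever the first job actually wrote):

```java
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    Path[] cached = DistributedCache.getLocalCacheFiles(conf);
    if (cached != null && cached.length > 0) {
        // Cached copies live on the node's local filesystem, not HDFS
        FileSystem localFs = FileSystem.getLocal(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(localFs, cached[0], conf);
        try {
            Text key = new Text();                 // placeholder key type
            IntWritable value = new IntWritable(); // placeholder value type
            while (reader.next(key, value)) {
                // populate an in-memory structure for the join checks
            }
        } finally {
            reader.close();
        }
    }
}
```

Reading once in setup() and doing the per-record checks against the in-memory structure avoids the open/read/close cost in every map() call, which was the performance concern raised in the thread.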
RE: Question about RAID controllers and hadoop
My assumption would be that having a set of 4 RAID 0 disks would actually be better than having a controller that allowed pure JBOD of 4 disks due to the cache on the controller. If anyone has any personal experience with this I would love to know performance numbers, but our infrastructure guy is doing tests on exactly this over the next couple of days so I will pass it along once we have it. Matt -Original Message- From: Bharath Mundlapudi [mailto:bharathw...@yahoo.com] Sent: Thursday, August 11, 2011 5:00 PM To: common-user@hadoop.apache.org Subject: Re: Question about RAID controllers and hadoop True, you need a P410 controller. You can create RAID0 for each disk to make it act as JBOD. -Bharath From: Koert Kuipers ko...@tresata.com To: common-user@hadoop.apache.org Sent: Thursday, August 11, 2011 2:50 PM Subject: Question about RAID controllers and hadoop Hello all, We are considering using low-end HP ProLiant machines (DL160s and DL180s) for cluster nodes. However with these machines if you want more than 4 hard drives then HP puts in a P410 RAID controller. We would configure the RAID controller to function as JBOD, by simply creating multiple RAID volumes with one disk each. Does anyone have experience with this setup? Is it a good idea, or am I introducing an I/O bottleneck? Thanks for your help! Best, Koert
RE: Giving filename as key to mapper ?
If you have the source downloaded (and if you don't I would suggest you get it) you can do a search for *InputFormat.java and you will have all the references you need. Also you might want to check out http://codedemigod.com/blog/?p=120 or take a look at the books Hadoop in Action or Hadoop: The Definitive Guide. Matt -Original Message- From: praveenesh kumar [mailto:praveen...@gmail.com] Sent: Friday, July 15, 2011 9:42 AM To: common-user@hadoop.apache.org Subject: Re: Giving filename as key to mapper ? I am new to this hadoop API. Can anyone give me some tutorial or code snippet on how to write your own input format to do these kinds of things. Thanks. On Fri, Jul 15, 2011 at 8:07 PM, Robert Evans ev...@yahoo-inc.com wrote: To add to that, if you really want the file name to be the key instead of just calling a different API in your map to get it, you will probably need to write your own input format to do it. It should be fairly simple and you can base it off of an existing input format. --Bobby On 7/15/11 7:40 AM, Harsh J ha...@cloudera.com wrote: You can retrieve the filename in the new API as described here: http://search-hadoop.com/m/ZOmmJ1PZJqt1/map+input+filenamesubj=Retrieving+Filename In the old API, it's available in the configuration instance of the mapper as the key map.input.file. See the table below this section http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Task+JVM+Reuse for more such goodies. On Fri, Jul 15, 2011 at 5:44 PM, praveenesh kumar praveen...@gmail.com wrote: Hi, How can I give filename as key to mapper ? I want to know the occurrence of a word in a set of docs, so I want to keep the key as the filename. Is it possible to give the input key as filename in the map function ? Thanks, Praveenesh -- Harsh J
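In the new API the lookup Harsh describes boils down to a cast of the input split (a sketch; assumes a FileInputFormat-based job, where the split is a FileSplit):

```java
@Override
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // For file-based input formats the split is a FileSplit, which knows its path
    String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
    // Emit with the filename as the key instead of the byte offset
    context.write(new Text(fileName), value);
}
```

This covers the word-occurrence-per-document case without a custom input format; writing your own InputFormat only becomes necessary if the filename must literally arrive as the map input key.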
Issue with MR code not scaling correctly with data sizes
All, I have an MR program that I feed a list of IDs and it generates the unique comparison set as a result. Example: if I have a list {1,2,3,4,5} then the resulting output would be {2x1, 3x2, 3x1, 4x3, 4x2, 4x1, 5x4, 5x3, 5x2, 5x1}, or (n^2-n)/2 comparisons. My code works just fine on smaller scaled sets (I can verify less than 1000 fairly easily) but fails when I try to push the set to 10-20k IDs, which is annoying when the end goal is 1-10 million. The flow of the program is: 1) Partition the IDs evenly, based on amount of output per value, into a set of keys equal to the number of reduce slots we currently have 2) Use the distributed cache to push the ID file out to the various reducers 3) In the setup of the reducer, populate an int array with the values from the ID file in distributed cache 4) Output a comparison only if the current ID from the values iterator is greater than the current iterator through the int array I realize that this could be done many other ways but this will be part of an Oozie workflow so it made sense to just do it in MR for now. My issue is that when I try the larger sized ID files it only outputs part of the resulting data set and there are no errors to be found. Part of me thinks that I need to tweak some site configuration properties, due to the size of data that is spilling to disk, but after scanning through all 3 site files I am having issues pinpointing anything I think could be causing this. I moved from reading the file from HDFS to using the distributed cache for the join read thinking that might solve my problem but there seems to be something else I am overlooking. Any advice is greatly appreciated! Matt
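As a sanity check on the expected output volume, the pair count grows quadratically, which is worth keeping in mind when a run "silently" produces partial output. A self-contained sketch of the count formula and the reducer's emit-if-greater rule (plain Java, no Hadoop dependencies; class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class PairMath {
    // (n^2 - n) / 2 unique unordered comparisons for n IDs
    static long pairCount(long n) {
        return n * (n - 1) / 2;
    }

    // Mirrors step 4 of the flow: emit a pair only when id > candidate,
    // so each unordered pair appears exactly once
    static List<String> pairs(int[] ids) {
        List<String> out = new ArrayList<>();
        for (int id : ids) {
            for (int candidate : ids) {
                if (id > candidate) {
                    out.add(id + "x" + candidate);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(pairCount(5));           // 10 for {1..5}, matching the example
        System.out.println(pairCount(20_000L));     // ~2 * 10^8 already at 20k IDs
        System.out.println(pairs(new int[]{1, 2, 3, 4, 5}).size());
    }
}
```

At 20k IDs that is already ~200 million output records, and at the 10-million goal it is ~5 * 10^13, which supports the suspicion that spill/merge configuration rather than the algorithm is where the larger runs fall over.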
RE: Performance Tunning
Mike, Somewhat of a tangent but it is actually very informative to hear that you are getting bound by I/O with a 2:1 core to disk ratio. Could you share what you used to make those calls? We have been using both a local ganglia daemon as well as the Hadoop ganglia daemon to get an overall look at the cluster and the items of interest, I would assume, would be CPU wait i/o as well as the throughput of block operations. Obviously the disconnect on my side was I didn't realize you were dedicating a physical core per daemon. I am a little surprised that you found that necessary but then again after seeing some of the metrics from my own stress testing I am noticing that we might be over extending with our config on heavy loads. Unfortunately I am working with lower specced hardware at the moment so I don't have the overhead to test that out. Matt -Original Message- From: Michael Segel [mailto:michael_se...@hotmail.com] Sent: Tuesday, June 28, 2011 1:31 PM To: common-user@hadoop.apache.org Subject: RE: Performance Tunning Matthew, I understood that Juan was talking about a 2 socket quad core box. We run boxes with the e5500 (xeon quad core ) chips. Linux sees these as 16 cores. Our data nodes are 32GB Ram w 4 x 2TB SATA. Its a pretty basic configuration. What I was saying was that if you consider 1 core for each TT, DN and RS jobs, thats 3 out of the 8 physical cores, leaving you 5 cores or 10 'hyperthread cores'. So you could put up 10 m/r slots on the machine. Note that on the main tasks (TT, DN, RS) I dedicate the physical core. Of course your mileage may vary if you're doing non-standard or normal things. A good starting point is 6 mappers and 4 reducers. And of course YMMV depending on if you're using MapR's release, Cloudera, and if you're running HBase or something else on the cluster. From our experience... we end up getting disk I/O bound first, and then network or memory becomes the next constraint. Really the xeon chipsets are really good. 
HTH -Mike From: matthew.go...@monsanto.com To: common-user@hadoop.apache.org Subject: RE: Performance Tunning Date: Tue, 28 Jun 2011 14:46:40 + Mike, I'm not really sure I have seen a community consensus around how to handle hyper-threading within Hadoop (although I have seen quite a few articles that discuss it). I was assuming that when Juan mentioned they were 4-core boxes that he meant 4 physical cores and not HT cores. I was more stating that the starting point should be 1 slot per thread (or hyper-threaded core) but obviously reviewing the results from ganglia, or any other monitoring solution, will help you come up with a more concrete configuration based on the load. My brain might not be working this morning but how did you get the 10 slots again? That seems low for an 8 physical core box but somewhat overextending for a 4 physical core box. Matt -Original Message- From: im_gu...@hotmail.com [mailto:im_gu...@hotmail.com] On Behalf Of Michel Segel Sent: Tuesday, June 28, 2011 7:39 AM To: common-user@hadoop.apache.org Subject: Re: Performance Tunning Matt, You have 2 threads per core, so your Linux box thinks an 8 core box has16 cores. In my calcs, I tend to take a whole core for TT DN and RS and then a thread per slot so you end up w 10 slots per node. Of course memory is also a factor. Note this is only a starting point.you can always tune up. Sent from a remote device. Please excuse any typos... Mike Segel On Jun 27, 2011, at 11:11 PM, GOEKE, MATTHEW (AG/1000) matthew.go...@monsanto.com wrote: Per node: 4 cores * 2 processes = 8 slots Datanode: 1 slot Tasktracker: 1 slot Therefore max of 6 slots between mappers and reducers. Below is part of our mapred-site.xml. The thing to keep in mind is the number of maps is defined by the number of input splits (which is defined by your data) so you only need to worry about setting the maximum number of concurrent processes per node. 
In this case the properties you want to hone in on are mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum. Keep in mind there are a LOT of other tuning improvements that can be made but it requires a strong understanding of your job load.

<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.output.compress</name>
    <value>true</value>
  </property>
</configuration>
RE: Queue support from HDFS
Saumitra, Two questions come to mind that could help you narrow down a solution: 1) How quickly do the downstream processes need the transformed data? Reason: If you can delay the processing for a period of time, enough to batch the data into a blob that is a multiple of your block size, then you are obviously going to be working more towards the strong suit of vanilla MR. 2) What else will be running on the cluster? Reason: If the cluster is set up primarily for this use case, then how often the job runs and what resources it consumes only need optimizing if it can't keep up with the incoming events. If it is not, then you could always set up a separate pool for this in the fair scheduler and allow it a certain amount of overhead on the cluster when these events are being generated. Outside of the fact that you would have a lot of small files on the cluster (which can be resolved by running a nightly job to blob them and then delete the originals) I am not sure I would be too concerned about at least trying out this method. It would be helpful to know the size and type of data coming in, as well as what type of operation you are looking to do, if you would like a more concrete suggestion. Log data is a prime example of this type of workflow and there are many suggestions out there as well as projects that attempt to address it (e.g. Chukwa). HTH, Matt -Original Message- From: saumitra.shahap...@gmail.com [mailto:saumitra.shahap...@gmail.com] On Behalf Of Saumitra Shahapure Sent: Friday, June 24, 2011 12:12 PM To: common-user@hadoop.apache.org Subject: Queue support from HDFS Hi, Is a queue-like structure supported on HDFS where a stream of data is processed as it's generated? Specifically, I will have a stream of data coming in, and a data-independent operation needs to be applied to it (so only a Map function; the reducer is identity). I wish to distribute data among nodes using HDFS and start processing it as it arrives, preferably in a single MR job.
I agree that it can be done by starting a new MR job for each batch of data, but is starting many MR jobs frequently for small data chunks a good idea? (Consider a new batch arrives every few seconds and processing of one batch takes a few minutes.) Thanks, -- Saumitra S. Shahapure
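If the fair scheduler route is taken, fencing off the frequent small-batch jobs is a small allocations-file change (a sketch; the pool name and numbers are made up, and the available elements vary by Hadoop version):

```
<?xml version="1.0"?>
<allocations>
  <!-- Hypothetical pool for the stream-ingest jobs: cap concurrency and
       give it a modest share so batch work on the cluster isn't starved -->
  <pool name="stream-ingest">
    <maxRunningJobs>2</maxRunningJobs>
    <weight>0.5</weight>
  </pool>
</allocations>
```

Jobs would then be submitted with mapred.fairscheduler.pool set to stream-ingest so only that slice of the cluster absorbs the per-batch job overhead.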
RE: Performance Tunning
If you are running default configurations then you are only getting 2 mappers and 1 reducer per node. The rule of thumb I have gone by (and backed up by the Definitive Guide) is 2 processes per core, so: tasktracker/datanode and 6 slots left. How you break it up from there is your call but I would suggest either 4 mappers / 2 reducers or 5 mappers / 1 reducer. Check out the below configs for details on what you are *most likely* running currently: http://hadoop.apache.org/common/docs/r0.20.2/mapred-default.html http://hadoop.apache.org/common/docs/r0.20.2/hdfs-default.html http://hadoop.apache.org/common/docs/r0.20.2/core-default.html HTH, Matt -Original Message- From: Juan P. [mailto:gordoslo...@gmail.com] Sent: Monday, June 27, 2011 2:50 PM To: common-user@hadoop.apache.org Subject: Performance Tunning I'm trying to run a MapReduce task against a cluster of 4 DataNodes with 4 cores each. My input data is 4GB in size and it's split into 100MB files. Current configuration is default so block size is 64MB. If I understand it correctly Hadoop should be running 64 Mappers to process the data. I'm running a simple data counting MapReduce and it's taking about 30mins to complete. This seems like way too much, doesn't it? Is there any tuning you guys would recommend to try and see an improvement in performance? Thanks, Pony
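Concretely, the 4 mappers / 2 reducers suggestion comes down to two properties in mapred-site.xml on each worker (values follow the advice above; adjust to taste after watching the cluster under load):

```
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>
```

These are per-tasktracker settings, so they need to be pushed to every node and the tasktrackers restarted; the number of map tasks a job runs is still driven by its input splits.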
RE: Why I cannot see live nodes in a LAN-based cluster setup?
Did you make sure to define the datanode/tasktracker in the slaves file in your conf directory and push that to both machines? Also, have you checked the logs on either to see if there are any errors?

Matt

-----Original Message-----
From: Jingwei Lu [mailto:j...@ucsd.edu]
Sent: Monday, June 27, 2011 3:24 PM
To: HADOOP MLIST
Subject: Why I cannot see live nodes in a LAN-based cluster setup?

Hi Everyone:

I am quite new to Hadoop. I am attempting to set up Hadoop locally on two machines connected by LAN. Both of them pass the single-node test. However, I failed in the two-node cluster setup, as shown in the 2 cases below:

1) set one as a dedicated namenode and the other as a dedicated datanode
2) set one as both name- and data-node, and the other as just a datanode

I launch *start-dfs.sh* on the namenode. Since I have all the *ssh* issues cleared, I can always observe the startup of the daemon on every datanode. However, the web page at *http://(URI of namenode):50070* shows only 0 live nodes for (1) and 1 live node for (2), which is the same as the output of the command-line *hadoop dfsadmin -report*. Generally it appears that from the namenode you cannot observe the remote datanode alive, let alone a normal across-node MapReduce execution.

Could anyone give some hints / instructions at this point? I really appreciate it! Thanks.

Best Regards
Yours Sincerely
Jingwei Lu
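What Matt describes — declaring every worker in conf/slaves and pushing the conf directory to all machines — looks roughly like this (the hostnames are hypothetical placeholders, not from the thread):

```
# $HADOOP_HOME/conf/slaves on the namenode: one worker hostname per line.
# start-dfs.sh / start-mapred.sh ssh into each of these hosts to start
# the datanode / tasktracker daemons.
datanode1.example.com
datanode2.example.com
```

The same conf directory (slaves, masters, core-site.xml, etc.) should then be copied verbatim to each machine so that every node agrees on the namenode URI.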
RE: Why I cannot see live nodes in a LAN-based cluster setup?
. Already tried 4 time(s).
2011-06-27 13:45:07,643 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hdl.ucsd.edu/127.0.0.1:54310. Already tried 5 time(s).
2011-06-27 13:45:08,646 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hdl.ucsd.edu/127.0.0.1:54310. Already tried 6 time(s).
2011-06-27 13:45:09,661 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hdl.ucsd.edu/127.0.0.1:54310. Already tried 7 time(s).
2011-06-27 13:45:10,664 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hdl.ucsd.edu/127.0.0.1:54310. Already tried 8 time(s).
2011-06-27 13:45:11,678 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hdl.ucsd.edu/127.0.0.1:54310. Already tried 9 time(s).
2011-06-27 13:45:11,679 INFO org.apache.hadoop.ipc.RPC: Server at hdl.ucsd.edu/127.0.0.1:54310 not available yet, Z...

(just a guess, is this due to some porting problem?)

Any comments will be greatly appreciated!

Best Regards
Yours Sincerely
Jingwei Lu

On Mon, Jun 27, 2011 at 1:28 PM, GOEKE, MATTHEW (AG/1000) matthew.go...@monsanto.com wrote:

Did you make sure to define the datanode/tasktracker in the slaves file in your conf directory and push that to both machines? Also, have you checked the logs on either to see if there are any errors?

Matt
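The DN log above shows hdl.ucsd.edu resolving to 127.0.0.1, which is the classic /etc/hosts pitfall: if the namenode's hostname is mapped to the loopback line, the NameNode binds its RPC port to 127.0.0.1 and remote datanodes that resolve the same name end up trying to connect to themselves. A hypothetical corrected layout (the IP here is a placeholder, not taken from the thread):

```
# /etc/hosts -- problematic form, cluster hostname on the loopback line:
#   127.0.0.1   localhost hdl.ucsd.edu hdl
# Corrected: keep loopback and the LAN address separate so the
# NameNode binds 54310 on an address other datanodes can reach.
127.0.0.1      localhost
192.168.1.10   hdl.ucsd.edu   hdl
```

This file needs the same shape on every node in the cluster, since both the daemons and the clients resolve the namenode by this name.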
RE: Why I cannot see live nodes in a LAN-based cluster setup?
At this point, if that is the correct IP, then I would see if you can actually ssh from the DN to the NN to make sure it can actually connect to the other box. If you can successfully connect through ssh, then it's just a matter of figuring out why that port is having issues (netstat is your friend in this case). If you see it listening on 54310, then just power cycle the box and try again.

Matt

-----Original Message-----
From: Jingwei Lu [mailto:j...@ucsd.edu]
Sent: Monday, June 27, 2011 5:38 PM
To: common-user@hadoop.apache.org
Subject: Re: Why I cannot see live nodes in a LAN-based cluster setup?

Hi Matt and Jeff:

Thanks a lot for your instructions. I corrected the mistakes in the conf files of the DN, and now the log on the DN becomes:

2011-06-27 15:32:36,025 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 0 time(s).
2011-06-27 15:32:37,028 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 1 time(s).
2011-06-27 15:32:38,031 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 2 time(s).
2011-06-27 15:32:39,034 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 3 time(s).
2011-06-27 15:32:40,037 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 4 time(s).
2011-06-27 15:32:41,040 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 5 time(s).
2011-06-27 15:32:42,043 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 6 time(s).
2011-06-27 15:32:43,046 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 7 time(s).
2011-06-27 15:32:44,049 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 8 time(s).
2011-06-27 15:32:45,052 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 9 time(s).
2011-06-27 15:32:45,053 INFO org.apache.hadoop.ipc.RPC: Server at clock.ucsd.edu/132.239.95.91:54310 not available yet, Z...

It seems the DN is trying to bind with the NN but always fails...

Best Regards
Yours Sincerely
Jingwei Lu

On Mon, Jun 27, 2011 at 2:22 PM, GOEKE, MATTHEW (AG/1000) matthew.go...@monsanto.com wrote:

As a follow-up to what Jeff posted: go ahead and ignore the message you got on the NN for now. If you look at the address that the DN log shows, it is 127.0.0.1, and the ip:port it is trying to connect to for the NN is 127.0.0.1:54310 --- it is trying to bind to itself as if it were still in single-machine mode. Make sure that you have correctly pushed the URI for the NN into the config files on both machines and then bounce DFS.

Matt

-----Original Message-----
From: jeff.schm...@shell.com [mailto:jeff.schm...@shell.com]
Sent: Monday, June 27, 2011 4:08 PM
To: common-user@hadoop.apache.org
Subject: RE: Why I cannot see live nodes in a LAN-based cluster setup?

http://www.mentby.com/tim-robertson/error-register-getprotocolversion.html

-----Original Message-----
From: Jingwei Lu [mailto:j...@ucsd.edu]
Sent: Monday, June 27, 2011 3:58 PM
To: common-user@hadoop.apache.org
Subject: Re: Why I cannot see live nodes in a LAN-based cluster setup?

Hi,

I just manually modified the masters and slaves files on both machines. I found something wrong in the log files, as shown below:

--
Master: namenode.log:

2011-06-27 13:44:47,055 INFO org.mortbay.log: jetty-6.1.14
2011-06-27 13:44:47,394 INFO org.mortbay.log: Started SelectChannelConnector@0.0.0.0:50070
2011-06-27 13:44:47,395 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Web-server up at: 0.0.0.0:50070
2011-06-27 13:44:47,395 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2011-06-27 13:44:47,395 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 54310: starting
2011-06-27 13:44:47,396 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 54310: starting
2011-06-27 13:44:47,397 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 54310: starting
2011-06-27 13:44:47,397 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 54310: starting
2011-06-27 13:44:47,397 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 54310: starting
2011-06-27 13:44:47,402 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 54310: starting
2011-06-27 13:44:47,404 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 54310: starting
2011-06-27 13:44:47,406 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 54310: starting
2011-06-27 13:44:47,406 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 54310
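Matt's checklist above (ssh from the DN to the NN, then netstat on the port) can be sketched as the following commands, run from the datanode. The hostname and port are taken from the thread; adjust for your cluster:

```shell
# 1. Can the datanode reach the namenode host at all?
ssh clock.ucsd.edu true && echo "ssh OK"

# 2. Is the NameNode RPC port listening there, and on which address?
#    A bind to 127.0.0.1:54310 (loopback only) is invisible to remote
#    datanodes; 0.0.0.0:54310 or the LAN address is what you want.
ssh clock.ucsd.edu "netstat -tln | grep 54310"

# 3. Does the NN hostname resolve to a routable address locally?
getent hosts clock.ucsd.edu
```

If step 1 fails, fix ssh/network first; if step 2 shows a loopback-only bind, revisit the namenode URI and /etc/hosts on the NN before power cycling anything.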
RE: Poor scalability with map reduce application
Harsh,

Is it possible for mapred.reduce.slowstart.completed.maps to even play a significant role in this? The only benefit he would find in tweaking that for his problem would be to spread network traffic from the shuffle over a longer period of time, at the cost of having the reducers occupy resources earlier. Either way, he would see this effect across both sets of runs if he is using the default parameters. I guess it would all depend on what kind of network layout the cluster is on.

Matt

-----Original Message-----
From: Harsh J [mailto:ha...@cloudera.com]
Sent: Tuesday, June 21, 2011 12:09 PM
To: common-user@hadoop.apache.org
Subject: Re: Poor scalability with map reduce application

Alberto,

On Tue, Jun 21, 2011 at 10:27 PM, Alberto Andreotti albertoandreo...@gmail.com wrote:

I don't know if speculative maps are on; I'll check it. One thing I observed is that reduces begin before all maps have finished. Let me check also whether the difference is on the map side or in the reduce. I believe it's balanced (both are slower when adding more nodes), but I'll confirm that.

Maps and reduces are speculative by default, so they must have been ON. Could you also post general input vs. output record counts and statistics like that between your job runs, to correlate?

The reducers get scheduled early but do not actually reduce() until all maps are done; they just keep fetching outputs. Their scheduling can be controlled with some configurations (say, to start only after X% of maps are done -- by default they start when 5% of maps are done).

--
Harsh J
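For reference, the knob Matt and Harsh are discussing is set in mapred-site.xml (or per job). A hypothetical fragment that delays reducer launch until most maps have finished, rather than the 5% default Harsh mentions:

```xml
<!-- mapred-site.xml: don't schedule reducers until 80% of the job's
     maps are complete, pushing shuffle traffic later in the job at
     the cost of less overlap between map and shuffle phases. -->
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.80</value>
</property>
```

As Matt notes, this mainly reshapes when shuffle traffic and reduce slots are consumed; it would not by itself explain a slowdown that appears only when nodes are added.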
RE: large memory tasks
Is the lookup table constant across each of the tasks? You could try putting it into memcached: http://hcil.cs.umd.edu/trs/2009-01/2009-01.pdf

Matt

-----Original Message-----
From: Ian Upright [mailto:i...@upright.net]
Sent: Wednesday, June 15, 2011 3:42 PM
To: common-user@hadoop.apache.org
Subject: large memory tasks

Hello,

I'm quite new to Hadoop, so I'd like to get an understanding of something. Let's say I have a task that requires 16 GB of memory in order to execute, say hypothetically some sort of big lookup table that needs that kind of memory. I could have 8 cores run the task in parallel (multithreaded), and all 8 cores can share that 16 GB lookup table. On another machine, I could have 4 cores run the same task, and they still share that same 16 GB lookup table.

Now, with my understanding of Hadoop, each task has its own memory. So if I have 4 tasks that run on one machine and 8 tasks on another, then the 4 tasks need a 64 GB machine and the 8 tasks need a 128 GB machine. But really, let's say I only have two machines, one with 4 cores and one with 8, each machine having only 24 GB. How can the work be evenly distributed among these machines? Am I missing something? What other ways can this be configured such that this works properly?

Thanks,
Ian
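Ian's constraint — many workers reading one big table rather than each holding a private copy — is exactly what sharing memory within a process (or across forked processes) buys you, and what the separate per-task JVM heaps of stock Hadoop map slots do not. A minimal Python sketch of the fork/copy-on-write idea; the table and sizes are illustrative stand-ins, not Hadoop APIs:

```python
import multiprocessing as mp

# Stand-in for the 16 GB lookup table: built once in the parent process.
# On Linux, workers forked by Pool share these pages copy-on-write, so
# N workers do not need N private copies as long as they only read it.
LOOKUP = {i: i * i for i in range(100_000)}

def probe(key):
    # Each worker reads the shared table; no per-worker copy is made.
    return LOOKUP[key]

if __name__ == "__main__":
    with mp.Pool(processes=4) as pool:
        print(pool.map(probe, [1, 2, 3]))  # [1, 4, 9]
```

An external store like the memcached approach Matt links trades this in-memory sharing for network lookups, which lets every node in the cluster share one table at the cost of per-lookup latency.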