[jira] [Created] (MAPREDUCE-5278) Perf: Distributed cache is broken when JT staging dir is not on the default FS
Xi Fang created MAPREDUCE-5278:
--

Summary: Perf: Distributed cache is broken when JT staging dir is not on the default FS
Key: MAPREDUCE-5278
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5278
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: distributed-cache
Affects Versions: 1-win
Environment: Windows
Reporter: Xi Fang

Today, we set the JobTracker staging dir (mapreduce.jobtracker.staging.root.dir) to point to HDFS even though ASV is the default file system. There are a few reasons why this config was chosen:

1. To prevent leaking the storage account creds to the user's storage account (IOW, keep job.xml in the cluster). This is needed until HADOOP-444 is fixed.
2. It uses HDFS for the transient job files, which is good for two reasons: a) it does not flood the user's storage account with irrelevant data/files; b) it leverages HDFS locality for small files.

However, this approach conflicts with how distributed cache caching works, completely negating the feature's functionality.

When files are added to the distributed cache (through the files/archives/libjars Hadoop generic options), they are copied to the job tracker staging dir only if they reside on a file system different from the jobtracker's. Later on, this path is used as a key to cache the files locally on the tasktracker's machine, and to avoid localization (download/unzip) of the distributed cache files if they are already localized.

In our configuration the caching is completely disabled: we always end up copying dist cache files to the JT staging dir first and localizing them on the tasktracker machine second.

This is especially bad for Oozie scenarios, as Oozie uses the dist cache to populate Hive/Pig jars throughout the cluster.

An easy workaround is to configure mapreduce.jobtracker.staging.root.dir in mapred-site.xml to be on the default FS.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
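The interaction reported in MAPREDUCE-5278 can be sketched in a few lines of Python. This is an illustration only, not Hadoop's code: the function names and URIs are hypothetical, the staging dir's file system stands in for "the jobtracker's", and the sketch assumes that copied files land under a per-job subdirectory of the staging dir, so the path used as cache key changes from job to job.

```python
from urllib.parse import urlparse


def same_filesystem(uri_a: str, uri_b: str) -> bool:
    """Treat two URIs as the same file system when scheme and authority match."""
    a, b = urlparse(uri_a), urlparse(uri_b)
    return (a.scheme, a.netloc) == (b.scheme, b.netloc)


def cache_key(file_uri: str, staging_dir: str, job_id: str) -> str:
    """Return the path used as local-cache key in this hypothetical model."""
    if same_filesystem(file_uri, staging_dir):
        # Same file system: the file is used in place, so the key is stable.
        return file_uri
    # Different file system: the file is copied into the staging dir first.
    # Assumption: the copy goes under a per-job subdirectory.
    name = urlparse(file_uri).path.rsplit("/", 1)[-1]
    return f"{staging_dir.rstrip('/')}/{job_id}/{name}"


if __name__ == "__main__":
    jar = "asv://container@account/libs/hive-exec.jar"  # hypothetical URI
    stagings = (
        "hdfs://namenode:8020/tmp/staging",      # staging off the default FS
        "asv://container@account/tmp/staging",   # staging on the default FS
    )
    for staging in stagings:
        keys = {cache_key(jar, staging, job) for job in ("job_1", "job_2")}
        outcome = "cache reused" if len(keys) == 1 else "re-localized per job"
        print(staging, "->", outcome)
```

Under these assumptions, moving the staging dir onto the default file system keeps the key stable across jobs, which is the effect the suggested workaround relies on.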
[jira] [Created] (MAPREDUCE-5279) mapreduce scheduling deadlock
PengZhang created MAPREDUCE-5279:
--

Summary: mapreduce scheduling deadlock
Key: MAPREDUCE-5279
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5279
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mrv2, scheduler
Affects Versions: 2.0.3-alpha
Reporter: PengZhang

YARN-2 introduced CPU-dimension scheduling, but the MR RMContainerAllocator doesn't take virtual cores into account when scheduling reduce tasks. Because memory alone is sufficient, more reduce tasks may be scheduled than the available virtual cores allow. On a small cluster this ends in deadlock: all running containers are reduce tasks, but the map phase has not finished.
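The mismatch reported in MAPREDUCE-5279 can be illustrated with a little arithmetic. The sketch below is not the RMContainerAllocator logic; the function names and the resource figures are made up for illustration. It only contrasts a memory-only estimate of how many reduce containers fit with one that also respects virtual cores.

```python
def reducers_memory_only(free_mb: int, reduce_mb: int) -> int:
    """Reduce containers that fit when only memory is considered."""
    return free_mb // reduce_mb


def reducers_memory_and_vcores(free_mb: int, free_vcores: int,
                               reduce_mb: int, reduce_vcores: int) -> int:
    """Reduce containers that fit when memory and virtual cores are both considered."""
    return min(free_mb // reduce_mb, free_vcores // reduce_vcores)


if __name__ == "__main__":
    # Hypothetical headroom: plenty of memory, very few virtual cores.
    free_mb, free_vcores = 8192, 2
    reduce_mb, reduce_vcores = 1024, 1
    print("memory only:      ", reducers_memory_only(free_mb, reduce_mb))
    print("memory and vcores:", reducers_memory_and_vcores(
        free_mb, free_vcores, reduce_mb, reduce_vcores))
```

With these figures the memory-only estimate asks for far more reducers than the cores can host, which is the kind of over-scheduling the report describes.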
Re: [VOTE] Plan to create release candidate for 0.23.8
The vote passed with 15 +1's (9 binding) and 0 -1's. I will start the release today.

Thanks,
Tom

On 5/17/13 4:10 PM, Thomas Graves tgra...@yahoo-inc.com wrote:

Hello all,

We've had a few critical issues come up in 0.23.7 that I think warrant a 0.23.8 release. The main one is MAPREDUCE-5211. There are a couple of other issues that I want finished up and committed before we spin it. Those include HDFS-3875, HDFS-4805, and HDFS-4835. I think those are on track to finish up early next week, so I hope to spin 0.23.8 soon after this vote completes.

Please vote '+1' to approve this plan. Voting will close on Friday May 24th at 2:00pm PDT.

Thanks,
Tom Graves
[VOTE] Release Apache Hadoop 0.23.8
I've created a release candidate (RC0) for hadoop-0.23.8 that I would like to release. This release is a sustaining release with several important bug fixes in it. The most critical one is MAPREDUCE-5211. The RC is available at: http://people.apache.org/~tgraves/hadoop-0.23.8-candidate-0/ The RC tag in svn is here: http://svn.apache.org/viewvc/hadoop/common/tags/release-0.23.8-rc0/ The maven artifacts are available via repository.apache.org. Please try the release and vote; the vote will run for the usual 7 days. I am +1 (binding). thanks, Tom Graves
Re: [VOTE] Release Apache Hadoop 2.0.4.1-alpha
+1, verified MD5 and signature. Did a full build, started pseudo cluster, ran a few MR jobs, verified httpfs works.

Thanks.

On Sat, May 25, 2013 at 10:01 AM, Sangjin Lee sj...@apache.org wrote:

+1 (non-binding)

Thanks,
Sangjin

On Fri, May 24, 2013 at 8:48 PM, Konstantin Boudnik c...@apache.org wrote:

All,

I have created a release candidate (rc0) for hadoop-2.0.4.1-alpha that I would like to release. This is a stabilization release that includes fixes for a couple of issues discovered in the testing with BigTop 0.6.0 release candidate.

The RC is available at: http://people.apache.org/~cos/hadoop-2.0.4.1-alpha-rc0/
The RC tag in svn is here: http://svn.apache.org/repos/asf/hadoop/common/tags/release-2.0.4.1-alpha-rc0

The maven artifacts are available via repository.apache.org.

Please try the release bits and vote; the vote will run for the usual 7 days.

Thanks for your voting
Cos

--
Alejandro
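For readers who want to repeat the MD5 part of such a check, a minimal sketch in Python follows. It is not the procedure used by the voters in this thread; the helper names are hypothetical, and it assumes the expected digest is available as a hex string (possibly upper case and space-separated). Signature verification is a separate step and is not covered here.

```python
import hashlib
import sys


def md5_hex(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_matches(path: str, expected_hex: str) -> bool:
    """Compare a file's MD5 with an expected hex digest, ignoring case and spaces."""
    normalized = "".join(expected_hex.split()).lower()
    return md5_hex(path) == normalized


if __name__ == "__main__":
    # Usage (hypothetical): python md5check.py <artifact> <expected-md5-hex>
    if len(sys.argv) != 3:
        sys.exit("usage: md5check.py <artifact> <expected-md5-hex>")
    artifact, expected = sys.argv[1], sys.argv[2]
    print("MD5 OK" if md5_matches(artifact, expected) else "MD5 MISMATCH")
```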
Re: [VOTE] Release Apache Hadoop 0.23.8
+1, verified MD5 and signature. Did a full build, started pseudo cluster, ran a few MR jobs, verified httpfs works.

Thanks.

On Tue, May 28, 2013 at 9:00 AM, Thomas Graves tgra...@yahoo-inc.com wrote:

I've created a release candidate (RC0) for hadoop-0.23.8 that I would like to release.

This release is a sustaining release with several important bug fixes in it. The most critical one is MAPREDUCE-5211.

The RC is available at: http://people.apache.org/~tgraves/hadoop-0.23.8-candidate-0/
The RC tag in svn is here: http://svn.apache.org/viewvc/hadoop/common/tags/release-0.23.8-rc0/

The maven artifacts are available via repository.apache.org.

Please try the release and vote; the vote will run for the usual 7 days.

I am +1 (binding).

thanks,
Tom Graves

--
Alejandro
Deadline Extension: 2013 Workshop on Middleware for HPC and Big Data Systems (MHPC'13)
We apologize if you receive multiple copies of this message.

===
CALL FOR PAPERS

2013 Workshop on Middleware for HPC and Big Data Systems
MHPC '13
as part of Euro-Par 2013, Aachen, Germany
===

Date: August 27, 2013
Workshop URL: http://m-hpc.org
Springer LNCS

SUBMISSION DEADLINE:
June 10, 2013 - LNCS Full paper submission (extended)
June 28, 2013 - Lightning Talk abstracts

SCOPE

Extremely large, diverse, and complex data sets are generated from scientific applications, the Internet, social media and other applications. Data may be physically distributed and shared by an ever larger community. Collecting, aggregating, storing and analyzing large data volumes presents major challenges. Processing such amounts of data efficiently has been an issue to scientific discovery and technological advancement. In addition, making the data accessible, understandable and interoperable includes unsolved problems. Novel middleware architectures, algorithms, and application development frameworks are required.

In this workshop we are particularly interested in original work at the intersection of HPC and Big Data with regard to middleware handling and optimizations. Scope is existing and proposed middleware for HPC and big data, including analytics libraries and frameworks.

The goal of this workshop is to bring together software architects, middleware and framework developers, data-intensive application developers as well as users from the scientific and engineering community to exchange their experience in processing large datasets and to report their scientific achievement and innovative ideas. The workshop also offers a dedicated forum for these researchers to access the state of the art, to discuss problems and requirements, to identify gaps in current and planned designs, and to collaborate in strategies for scalable data-intensive computing.

The workshop will be one day in length, composed of 20 min paper presentations, each followed by 10 min discussion sections. Presentations may be accompanied by interactive demonstrations.

TOPICS

Topics of interest include, but are not limited to:

- Middleware including: Hadoop, Apache Drill, YARN, Spark/Shark, Hive, Pig, Sqoop, HBase, HDFS, S4, CIEL, Oozie, Impala, Storm and Hyracks
- Data intensive middleware architecture
- Libraries/Frameworks including: Apache Mahout, Giraph, UIMA and GraphLab
- NG Databases including Apache Cassandra, MongoDB and CouchDB/Couchbase
- Schedulers including Cascading
- Middleware for optimized data locality/in-place data processing
- Data handling middleware for deployment in virtualized HPC environments
- Parallelization and distributed processing architectures at the middleware level
- Integration with cloud middleware and application servers
- Runtime environments and system level support for data-intensive computing
- Skeletons and patterns
- Checkpointing
- Programming models and languages
- Big Data ETL
- Stream processing middleware
- In-memory databases for HPC
- Scalability and interoperability
- Large-scale data storage and distributed file systems
- Content-centric addressing and networking
- Execution engines, languages and environments including CIEL/Skywriting
- Performance analysis, evaluation of data-intensive middleware
- In-depth analysis and performance optimizations in existing data-handling middleware, focusing on indexing/fast storing or retrieval between compute and storage nodes
- Highly scalable middleware optimized for minimum communication
- Use cases and experience for popular Big Data middleware
- Middleware security, privacy and trust architectures

DATES

Papers:
Rolling abstract submission
June 10, 2013 - Full paper submission (extended)
July 8, 2013 - Acceptance notification
October 3, 2013 - Camera-ready version due

Lightning Talks:
June 28, 2013 - Deadline for lightning talk abstracts
July 15, 2013 - Lightning talk notification

August 27, 2013 - Workshop Date

TPC CHAIR

Michael Alexander (chair), TU Wien, Austria
Anastassios Nanos (co-chair), NTUA, Greece
Jie Tao (co-chair), Karlsruhe Institute of Technology, Germany
Lizhe Wang (co-chair), Chinese Academy of Sciences, China
Gianluigi Zanetti (co-chair), CRS4, Italy

PROGRAM COMMITTEE

Amitanand Aiyer, Facebook, USA
Costas Bekas, IBM, Switzerland
Jakob Blomer, CERN, Switzerland
William Gardner, University of Guelph, Canada
José Gracia, HPC Center of the University of Stuttgart, Germany
Zhenghua Guom, Indiana University, USA
Marcus Hardt, Karlsruhe Institute of Technology, Germany
Sverre Jarp, CERN, Switzerland
Christopher Jung, Karlsruhe Institute of Technology, Germany
Andreas Knüpfer, Technische Universität Dresden, Germany
Nectarios Koziris, National Technical University of Athens, Greece
Yan Ma, Chinese Academy of Sciences, China
Martin Schulz, Lawrence Livermore National Laboratory
Re: [VOTE] Release Apache Hadoop 2.0.4.1-alpha
+1

Checksum and signature match, ran some unit tests, verified w/ a diff of release-2.0.4-alpha that the release contains MAPREDUCE-5240 and HADOOP-9407, plus some fixups to the release notes.

-C

On Fri, May 24, 2013 at 8:48 PM, Konstantin Boudnik c...@apache.org wrote:

All,

I have created a release candidate (rc0) for hadoop-2.0.4.1-alpha that I would like to release. This is a stabilization release that includes fixes for a couple of issues discovered in the testing with BigTop 0.6.0 release candidate.

The RC is available at: http://people.apache.org/~cos/hadoop-2.0.4.1-alpha-rc0/
The RC tag in svn is here: http://svn.apache.org/repos/asf/hadoop/common/tags/release-2.0.4.1-alpha-rc0

The maven artifacts are available via repository.apache.org.

Please try the release bits and vote; the vote will run for the usual 7 days.

Thanks for your voting
Cos
5 second minimum shuffle time
Hi, I'm running v0.23 in a large cluster, and have found that the shuffle time for reduce tasks is always at least 5 seconds, even when the amount of data read by the reduce task is tiny (e.g., just 18 bytes). This shuffle time floor suggests that there's a heartbeat interval or something that has to elapse before the shuffle begins, but I can't find any sign of such a delay in the code base. Can anyone shed some light on why this is occurring? Thanks, Kay
[jira] [Reopened] (MAPREDUCE-5036) Default shuffle handler port should not be 8080
[ https://issues.apache.org/jira/browse/MAPREDUCE-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandy Ryza reopened MAPREDUCE-5036:
---

Default shuffle handler port should not be 8080
---

Key: MAPREDUCE-5036
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5036
Project: Hadoop Map/Reduce
Issue Type: Improvement
Affects Versions: 2.0.3-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
Fix For: 2.0.5-beta
Attachments: MAPREDUCE-5036-13562.patch, MAPREDUCE-5036.patch

The shuffle handler port (mapreduce.shuffle.port) defaults to 8080. This is a pretty common port for web services, and is likely to cause unnecessary port conflicts.