[jira] [Commented] (SPARK-4475) PySpark failed to initialize if localhost can not be resolved
[ https://issues.apache.org/jira/browse/SPARK-4475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332161#comment-14332161 ] Sean Owen commented on SPARK-4475: -- When does localhost fail to resolve? See the discussion on the PR. It is not clear that 127.0.0.1 is better, since even Python SocketServer examples don't use it. PySpark failed to initialize if localhost can not be resolved - Key: SPARK-4475 URL: https://issues.apache.org/jira/browse/SPARK-4475 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.2, 1.1.0, 1.2.0 Reporter: Davies Liu {code} Traceback (most recent call last): File "/home/hduser/Downloads/spark-1.1.0/python/pyspark/shell.py", line 44, in <module> sc = SparkContext(appName="PySparkShell", pyFiles=add_files) File "/home/hduser/Downloads/spark-1.1.0/python/pyspark/context.py", line 107, in __init__ conf) File "/home/hduser/Downloads/spark-1.1.0/python/pyspark/context.py", line 159, in _do_init self._accumulatorServer = accumulators._start_update_server() File "/home/hduser/Downloads/spark-1.1.0/python/pyspark/accumulators.py", line 251, in _start_update_server server = AccumulatorServer(("localhost", 0), _UpdateRequestHandler) File "/usr/lib/python2.7/SocketServer.py", line 408, in __init__ self.server_bind() File "/usr/lib/python2.7/SocketServer.py", line 419, in server_bind self.socket.bind(self.server_address) File "/usr/lib/python2.7/socket.py", line 224, in meth return getattr(self._sock,name)(*args) socket.gaierror: [Errno -5] No address associated with hostname {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
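The failure above comes from binding the accumulator server to the hostname {{localhost}}, which must go through name resolution (e.g. an /etc/hosts entry), whereas the literal loopback address needs no lookup. A minimal JVM-side sketch of that difference, written in Scala purely for illustration -- this is not the PySpark code under discussion:
{code}
import java.net.{InetAddress, InetSocketAddress, ServerSocket}

// "localhost" goes through the resolver; on a host with no entry for it the
// lookup fails -- the JVM analogue of the gaierror in the traceback above.
val resolved =
  try Some(InetAddress.getByName("localhost"))
  catch { case _: java.net.UnknownHostException => None }

// "127.0.0.1" is a literal loopback address, so no lookup is needed at all.
val loopback = InetAddress.getByName("127.0.0.1")

println(s"localhost -> $resolved, 127.0.0.1 -> $loopback")

// Binding to the literal address therefore works even when "localhost" cannot
// be resolved; port 0 asks the OS for an ephemeral port.
val server = new ServerSocket()
server.bind(new InetSocketAddress(loopback, 0))
println(s"bound to ${server.getLocalSocketAddress}")
server.close()
{code}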
[jira] [Updated] (SPARK-5710) Combines two adjacent `Cast` expressions into one
[ https://issues.apache.org/jira/browse/SPARK-5710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5710: - Priority: Minor (was: Major) Affects Version/s: (was: 1.3.0) 1.2.1 The original suggested PR couldn't be merged, but I wonder if another is going to be proposed for the narrower change, to merge identical neighboring casts? Combines two adjacent `Cast` expressions into one - Key: SPARK-5710 URL: https://issues.apache.org/jira/browse/SPARK-5710 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.1 Reporter: guowei Priority: Minor A plan after `analyzer` with `typeCoercionRules` may produce many `cast` expressions. We can combine the adjacent ones. For example: create table test(a decimal(3,1)); explain select * from test where a*2-1>1; == Physical Plan == Filter (CAST(CAST((CAST(CAST((CAST(a#5, DecimalType()) * 2), DecimalType(21,1)), DecimalType()) - 1), DecimalType(22,1)), DecimalType()) > 1) HiveTableScan [a#5], (MetastoreRelation default, test, None), None -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
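For the narrower change suggested above -- merging only identical neighboring casts -- a self-contained sketch of the rewrite on a toy expression tree (hypothetical classes, not the actual Catalyst {{Cast}}/{{Expression}} types) could look like this:
{code}
sealed trait Expr
case class Attr(name: String) extends Expr
case class Cast(child: Expr, dataType: String) extends Expr

// Merge a cast whose child is a cast to the *same* type: Cast(Cast(e, t), t) => Cast(e, t).
// Collapsing casts to different types is not generally safe, since the inner cast can
// change truncation/overflow behaviour, which is why only identical neighbors are merged.
def mergeIdenticalCasts(e: Expr): Expr = e match {
  case Cast(Cast(child, inner), outer) if inner == outer =>
    mergeIdenticalCasts(Cast(child, outer))
  case Cast(child, t) => Cast(mergeIdenticalCasts(child), t)
  case other => other
}

// CAST(CAST(a AS DECIMAL) AS DECIMAL) collapses to a single cast.
val nested = Cast(Cast(Attr("a"), "DecimalType()"), "DecimalType()")
assert(mergeIdenticalCasts(nested) == Cast(Attr("a"), "DecimalType()"))
{code}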
[jira] [Commented] (SPARK-5751) Flaky test: o.a.s.sql.hive.thriftserver.HiveThriftServer2Suite sometimes times out
[ https://issues.apache.org/jira/browse/SPARK-5751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332249#comment-14332249 ] Apache Spark commented on SPARK-5751: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/4720 Flaky test: o.a.s.sql.hive.thriftserver.HiveThriftServer2Suite sometimes times out -- Key: SPARK-5751 URL: https://issues.apache.org/jira/browse/SPARK-5751 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Critical Labels: flaky-test The "Test JDBC query execution" test case times out occasionally; all other test cases are just fine. The failure output only contains the service startup command line without any log output. Guess somehow the test case misses the log file path. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4808) Spark fails to spill with small number of large objects
[ https://issues.apache.org/jira/browse/SPARK-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4808: - Labels: backport-needed (was: ) Spark fails to spill with small number of large objects --- Key: SPARK-4808 URL: https://issues.apache.org/jira/browse/SPARK-4808 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0, 1.2.0, 1.2.1 Reporter: Dennis Lawler Labels: backport-needed Spillable's maybeSpill does not allow spill to occur until at least 1000 elements have been spilled, and then will only evaluate spill every 32nd element thereafter. When there is a small number of very large items being tracked, out-of-memory conditions may occur. I suspect that this and the every-32nd-element behavior was to reduce the impact of the estimateSize() call. This method was extracted into SizeTracker, which implements its own exponential backup for size estimation, so now we are only avoiding using the resulting estimated size. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
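A rough sketch of the gating described above (paraphrased with hypothetical names, not the actual {{Spillable}} source): the size check is skipped for the first 1000 elements and then only evaluated on every 32nd element, so a handful of very large records can exhaust memory before a spill is ever considered.
{code}
// Simplified sketch of the spill decision; names and structure are illustrative only.
class SpillGateSketch(memoryThreshold: Long) {
  private var elementsRead = 0L
  private val trackMemoryThreshold = 1000 // no spill is even considered before this many elements

  /** Returns true if the in-memory collection should be spilled to disk now. */
  def maybeSpill(currentMemory: Long): Boolean = {
    elementsRead += 1
    // Only every 32nd element past the initial threshold triggers the size check,
    // so a few huge records can blow well past memoryThreshold unnoticed.
    elementsRead > trackMemoryThreshold &&
      elementsRead % 32 == 0 &&
      currentMemory >= memoryThreshold
  }
}
{code}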
[jira] [Updated] (SPARK-4808) Spark fails to spill with small number of large objects
[ https://issues.apache.org/jira/browse/SPARK-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4808: - Target Version/s: 1.2.2, 1.4.0, 1.3.1 (was: 1.3.0, 1.2.2, 1.4.0) Spark fails to spill with small number of large objects --- Key: SPARK-4808 URL: https://issues.apache.org/jira/browse/SPARK-4808 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0, 1.2.0, 1.2.1 Reporter: Dennis Lawler Labels: backport-needed Spillable's maybeSpill does not allow spill to occur until at least 1000 elements have been spilled, and then will only evaluate spill every 32nd element thereafter. When there is a small number of very large items being tracked, out-of-memory conditions may occur. I suspect that this and the every-32nd-element behavior was to reduce the impact of the estimateSize() call. This method was extracted into SizeTracker, which implements its own exponential backup for size estimation, so now we are only avoiding using the resulting estimated size. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4808) Spark fails to spill with small number of large objects
[ https://issues.apache.org/jira/browse/SPARK-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4808: - Target Version/s: 1.3.0, 1.2.2, 1.4.0 (was: 1.3.0, 1.4.0) Spark fails to spill with small number of large objects --- Key: SPARK-4808 URL: https://issues.apache.org/jira/browse/SPARK-4808 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0, 1.2.0, 1.2.1 Reporter: Dennis Lawler Labels: backport-needed Spillable's maybeSpill does not allow spill to occur until at least 1000 elements have been spilled, and then will only evaluate spill every 32nd element thereafter. When there is a small number of very large items being tracked, out-of-memory conditions may occur. I suspect that this and the every-32nd-element behavior was to reduce the impact of the estimateSize() call. This method was extracted into SizeTracker, which implements its own exponential backup for size estimation, so now we are only avoiding using the resulting estimated size. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4958) Bake common tools like ganglia into Spark AMI
[ https://issues.apache.org/jira/browse/SPARK-4958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas resolved SPARK-4958. - Resolution: Duplicate Fix Version/s: (was: 1.3.0) Closing this as a duplicate of SPARK-3821 since we're covering the addition of stuff like Ganglia to the AMIs in that issue. Bake common tools like ganglia into Spark AMI - Key: SPARK-4958 URL: https://issues.apache.org/jira/browse/SPARK-4958 Project: Spark Issue Type: Sub-task Components: EC2 Reporter: Nicholas Chammas Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5943) Update the API to remove several warns in building for Spark Streaming
Saisai Shao created SPARK-5943: -- Summary: Update the API to remove several warns in building for Spark Streaming Key: SPARK-5943 URL: https://issues.apache.org/jira/browse/SPARK-5943 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.0 Reporter: Saisai Shao Priority: Minor The old {{awaitTermination(timeout: Long)}} is deprecated in favor of {{awaitTerminationOrTimeout(timeout: Long): Boolean}} in version 1.3; this change updates the related code to reduce deprecation warnings while compiling. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
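The change is essentially a one-line swap at each call site; a sketch, assuming a {{StreamingContext}} named {{ssc}}:
{code}
import org.apache.spark.streaming.StreamingContext

def waitFor(ssc: StreamingContext, timeoutMillis: Long): Unit = {
  // Deprecated in 1.3:
  //   ssc.awaitTermination(timeoutMillis)
  // Replacement, which also reports whether the context actually stopped:
  val stopped: Boolean = ssc.awaitTerminationOrTimeout(timeoutMillis)
  if (!stopped) println(s"streaming context still running after $timeoutMillis ms")
}
{code}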
[jira] [Commented] (SPARK-4808) Spark fails to spill with small number of large objects
[ https://issues.apache.org/jira/browse/SPARK-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332266#comment-14332266 ] Andrew Or commented on SPARK-4808: -- [~sowen] not fully. The PR you linked to is just a temporary fix. The real fix probably won't make it into 1.3 Spark fails to spill with small number of large objects --- Key: SPARK-4808 URL: https://issues.apache.org/jira/browse/SPARK-4808 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0, 1.2.0, 1.2.1 Reporter: Dennis Lawler Spillable's maybeSpill does not allow spill to occur until at least 1000 elements have been spilled, and then will only evaluate spill every 32nd element thereafter. When there is a small number of very large items being tracked, out-of-memory conditions may occur. I suspect that this and the every-32nd-element behavior was to reduce the impact of the estimateSize() call. This method was extracted into SizeTracker, which implements its own exponential backup for size estimation, so now we are only avoiding using the resulting estimated size. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332303#comment-14332303 ] Nicholas Chammas commented on SPARK-3821: - For those wanting to use the work being done as part of this issue before it gets merged upstream, I posted some [instructions on Stack Overflow|http://stackoverflow.com/a/28639669/877069] in response to a related question. Develop an automated way of creating Spark images (AMI, Docker, and others) --- Key: SPARK-3821 URL: https://issues.apache.org/jira/browse/SPARK-3821 Project: Spark Issue Type: Improvement Components: Build, EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Attachments: packer-proposal.html Right now the creation of Spark AMIs or Docker containers is done manually. With tools like [Packer|http://www.packer.io/], we should be able to automate this work, and do so in such a way that multiple types of machine images can be created from a single template. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5942) DataFrame should not do query optimization when dataFrameEagerAnalysis is on
[ https://issues.apache.org/jira/browse/SPARK-5942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332074#comment-14332074 ] Apache Spark commented on SPARK-5942: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/4719 DataFrame should not do query optimization when dataFrameEagerAnalysis is on Key: SPARK-5942 URL: https://issues.apache.org/jira/browse/SPARK-5942 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Priority: Minor DataFrame will force query optimization to happen right away for the commands and queries with side effects. However, I think we should not do that when dataFrameEagerAnalysis is on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5942) DataFrame should not do query optimization when dataFrameEagerAnalysis is off
[ https://issues.apache.org/jira/browse/SPARK-5942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-5942: --- Summary: DataFrame should not do query optimization when dataFrameEagerAnalysis is off (was: DataFrame should not do query optimization when dataFrameEagerAnalysis is on) DataFrame should not do query optimization when dataFrameEagerAnalysis is off - Key: SPARK-5942 URL: https://issues.apache.org/jira/browse/SPARK-5942 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Priority: Minor DataFrame will force query optimization to happen right away for the commands and queries with side effects. However, I think we should not do that when dataFrameEagerAnalysis is on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5669) Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS
[ https://issues.apache.org/jira/browse/SPARK-5669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5669. -- Resolution: Won't Fix Fix Version/s: (was: 1.3.0) Target Version/s: (was: 1.3.0, 1.1.2, 1.2.2) I think this is basically 'WontFix' since the net result is that this change only went into master, on the grounds that there's an argument the licensing is OK. However this is made moot anyway by SPARK-5814. Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS Key: SPARK-5669 URL: https://issues.apache.org/jira/browse/SPARK-5669 Project: Spark Issue Type: Bug Components: Build Reporter: Sean Owen Assignee: Sean Owen Priority: Blocker Sorry for Blocker, but it's a license issue. The Spark assembly includes the following, from JBLAS: {code} lib/ lib/static/ lib/static/Mac OS X/ lib/static/Mac OS X/x86_64/ lib/static/Mac OS X/x86_64/libjblas_arch_flavor.jnilib lib/static/Mac OS X/x86_64/sse3/ lib/static/Mac OS X/x86_64/sse3/libjblas.jnilib lib/static/Windows/ lib/static/Windows/x86/ lib/static/Windows/x86/libgfortran-3.dll lib/static/Windows/x86/libgcc_s_dw2-1.dll lib/static/Windows/x86/jblas_arch_flavor.dll lib/static/Windows/x86/sse3/ lib/static/Windows/x86/sse3/jblas.dll lib/static/Windows/amd64/ lib/static/Windows/amd64/libgfortran-3.dll lib/static/Windows/amd64/jblas.dll lib/static/Windows/amd64/libgcc_s_sjlj-1.dll lib/static/Windows/amd64/jblas_arch_flavor.dll lib/static/Linux/ lib/static/Linux/i386/ lib/static/Linux/i386/sse3/ lib/static/Linux/i386/sse3/libjblas.so lib/static/Linux/i386/libjblas_arch_flavor.so lib/static/Linux/amd64/ lib/static/Linux/amd64/sse3/ lib/static/Linux/amd64/sse3/libjblas.so lib/static/Linux/amd64/libjblas_arch_flavor.so {code} Unfortunately the libgfortran and libgcc libraries included for Windows are not licensed in a way that's compatible with Spark and the AL2 -- LGPL at least. It's easy to exclude them. I'm not clear what it does to running on Windows; I assume it can still work but the libs would have to be made available locally and put on the shared library path manually. I don't think there's a package manager as in Linux that would make it easily available. I'm not able to test on Windows. If it doesn't work, the follow-up question is whether that means JBLAS has to be removed on the double, or treated as a known issue for 1.3.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5942) DataFrame should not do query optimization when dataFrameEagerAnalysis is off
[ https://issues.apache.org/jira/browse/SPARK-5942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-5942: --- Description: DataFrame will force query optimization to happen right away for the commands and queries with side effects. However, I think we should not do that when dataFrameEagerAnalysis is off. was: DataFrame will force query optimization to happen right away for the commands and queries with side effects. However, I think we should not do that when dataFrameEagerAnalysis is on. DataFrame should not do query optimization when dataFrameEagerAnalysis is off - Key: SPARK-5942 URL: https://issues.apache.org/jira/browse/SPARK-5942 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Priority: Minor DataFrame will force query optimization to happen right away for the commands and queries with side effects. However, I think we should not do that when dataFrameEagerAnalysis is off. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5942) DataFrame should not do query optimization when dataFrameEagerAnalysis is on
Liang-Chi Hsieh created SPARK-5942: -- Summary: DataFrame should not do query optimization when dataFrameEagerAnalysis is on Key: SPARK-5942 URL: https://issues.apache.org/jira/browse/SPARK-5942 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Priority: Minor DataFrame will force query optimization to happen right away for the commands and queries with side effects. However, I think we should not do that when dataFrameEagerAnalysis is on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4808) Spark fails to spill with small number of large objects
[ https://issues.apache.org/jira/browse/SPARK-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332079#comment-14332079 ] Sean Owen commented on SPARK-4808: -- [~andrewor14] Is this resolved for 1.2.2 / 1.3.0? https://github.com/apache/spark/pull/4420#issuecomment-75178396 Spark fails to spill with small number of large objects --- Key: SPARK-4808 URL: https://issues.apache.org/jira/browse/SPARK-4808 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0, 1.2.0, 1.2.1 Reporter: Dennis Lawler Spillable's maybeSpill does not allow spill to occur until at least 1000 elements have been spilled, and then will only evaluate spill every 32nd element thereafter. When there is a small number of very large items being tracked, out-of-memory conditions may occur. I suspect that this and the every-32nd-element behavior was to reduce the impact of the estimateSize() call. This method was extracted into SizeTracker, which implements its own exponential backup for size estimation, so now we are only avoiding using the resulting estimated size. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4605) Proposed Contribution: Spark Kernel to enable interactive Spark applications
[ https://issues.apache.org/jira/browse/SPARK-4605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332085#comment-14332085 ] Sean Owen commented on SPARK-4605: -- It's only my opinion, but I believe this (and other proposed new packages) is the kind of thing that can exist outside core Spark and doesn't net-net benefit everyone if pushed in the core distribution. Doing so has the benefit of making it more readily available, but also of un-blessing similar efforts in favor of an official one. It makes an already exceptionally complex build even bigger. On the one hand you get synchronized releases; on the other hand, you have to synchronize releases. Of course you can make this argument about a lot of packages, perhaps even ones already in Spark. Proposed Contribution: Spark Kernel to enable interactive Spark applications Key: SPARK-4605 URL: https://issues.apache.org/jira/browse/SPARK-4605 Project: Spark Issue Type: New Feature Components: Project Infra Reporter: Chip Senkbeil Attachments: Kernel Architecture Widescreen.pdf, Kernel Architecture.pdf Project available on Github: https://github.com/ibm-et/spark-kernel This architecture is describing running kernel code that was demonstrated at the StrataConf in Barcelona, Spain. Enables applications to interact with a Spark cluster using Scala in several ways: * Defining and running core Spark Tasks * Collecting results from a cluster without needing to write to external data store ** Ability to stream results using well-defined protocol * Arbitrary Scala code definition and execution (without submitting heavy-weight jars) Applications can be hosted and managed separate from the Spark cluster using the kernel as a proxy to communicate requests. The Spark Kernel implements the server side of the IPython Kernel protocol, the rising “de-facto” protocol for language (Python, Haskell, etc.) execution. Inherits a suite of industry adopted clients such as the IPython Notebook. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5943) Update the API to remove several warns in building for Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-5943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332328#comment-14332328 ] Apache Spark commented on SPARK-5943: - User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/4722 Update the API to remove several warns in building for Spark Streaming -- Key: SPARK-5943 URL: https://issues.apache.org/jira/browse/SPARK-5943 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.0 Reporter: Saisai Shao Priority: Minor The old {{awaitTermination(timeout: Long)}} is deprecated in favor of {{awaitTerminationOrTimeout(timeout: Long): Boolean}} in version 1.3; this change updates the related code to reduce deprecation warnings while compiling. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5944) Python release docs say SNAPSHOT + Author is missing
[ https://issues.apache.org/jira/browse/SPARK-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332394#comment-14332394 ] Nicholas Chammas commented on SPARK-5944: - cc [~davies], [~joshrosen] Python release docs say SNAPSHOT + Author is missing Key: SPARK-5944 URL: https://issues.apache.org/jira/browse/SPARK-5944 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.2.1 Reporter: Nicholas Chammas Priority: Minor http://spark.apache.org/docs/latest/api/python/index.html As of Feb 2015, that link says "PySpark 1.2-SNAPSHOT". It should probably say 1.2.1. Furthermore, in the footer it says "Copyright 2014, Author". It should probably say something else or be removed altogether. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5944) Python release docs say SNAPSHOT + Author is missing
Nicholas Chammas created SPARK-5944: --- Summary: Python release docs say SNAPSHOT + Author is missing Key: SPARK-5944 URL: https://issues.apache.org/jira/browse/SPARK-5944 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.2.1 Reporter: Nicholas Chammas Priority: Minor http://spark.apache.org/docs/latest/api/python/index.html As of Feb 2015, that link says "PySpark 1.2-SNAPSHOT". It should probably say 1.2.1. Furthermore, in the footer it says "Copyright 2014, Author". It should probably say something else or be removed altogether. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5946) Add Python API for Kafka direct stream
Saisai Shao created SPARK-5946: -- Summary: Add Python API for Kafka direct stream Key: SPARK-5946 URL: https://issues.apache.org/jira/browse/SPARK-5946 Project: Spark Issue Type: Improvement Components: PySpark, Streaming Affects Versions: 1.3.0 Reporter: Saisai Shao Add the Python API for Kafka direct stream. Currently this only adds the {{createDirectStream}} API, not the {{createRDD}} API, since that needs some Python wrapping of Java objects; it will be improved according to the comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
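For context, a Python {{createDirectStream}} wrapper would delegate to the JVM-side direct Kafka API introduced in 1.3; a sketch of that Scala API (broker and topic names are placeholders):
{code}
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("DirectKafkaSketch")
val ssc = new StreamingContext(conf, Seconds(2))

// Placeholder broker list and topic set.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics = Set("events")

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.map(_._2).count().print() // message payloads per batch
ssc.start()
ssc.awaitTerminationOrTimeout(60000L)
{code}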
[jira] [Commented] (SPARK-5946) Add Python API for Kafka direct stream
[ https://issues.apache.org/jira/browse/SPARK-5946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14333018#comment-14333018 ] Apache Spark commented on SPARK-5946: - User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/4723 Add Python API for Kafka direct stream -- Key: SPARK-5946 URL: https://issues.apache.org/jira/browse/SPARK-5946 Project: Spark Issue Type: Improvement Components: PySpark, Streaming Affects Versions: 1.3.0 Reporter: Saisai Shao Add the Python API for Kafka direct stream. Currently this only adds the {{createDirectStream}} API, not the {{createRDD}} API, since that needs some Python wrapping of Java objects; it will be improved according to the comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-911) Support map pruning on sorted (K, V) RDD's
[ https://issues.apache.org/jira/browse/SPARK-911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-911: - Affects Version/s: (was: 1.0.0) Support map pruning on sorted (K, V) RDD's -- Key: SPARK-911 URL: https://issues.apache.org/jira/browse/SPARK-911 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Patrick Wendell Fix For: 1.4.0 If someone has sorted a (K, V) rdd, we should offer them a way to filter a range of the partitions that employs map pruning. This would be simple using a small range index within the rdd itself. A good example is I sort my dataset by time and then I want to serve queries that are restricted to a certain time range. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
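A sketch of the usage this would enable, assuming a SparkContext {{sc}} and a {{filterByRange}}-style method on sorted pair RDDs that skips partitions whose key range cannot overlap the query range:
{code}
// Sort events by timestamp once, then serve time-range queries against the sorted RDD.
val events: org.apache.spark.rdd.RDD[(Long, String)] =
  sc.parallelize(Seq((1L, "a"), (5L, "b"), (9L, "c"), (12L, "d")))

val byTime = events.sortByKey(numPartitions = 4)

// With per-partition range metadata, this query only needs to touch partitions
// whose key range overlaps [4, 10] instead of scanning every partition.
val window = byTime.filterByRange(4L, 10L)
println(window.collect().toSeq)
{code}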
[jira] [Commented] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB
[ https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332490#comment-14332490 ] Imran Rashid commented on SPARK-5928: - So I'm pretty sure the issue is not just large records; it's from large shuffle blocks. I tried slight variations on my original failing program, with the same shuffle volume but smaller individual records, and both fail. {code} val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ i => val n = 3e3.toInt val arr = new Array[Byte](n) //need to make sure the array doesn't compress to something small scala.util.Random.nextBytes(arr) (2 * i, arr) } rdd.partitionBy(new org.apache.spark.HashPartitioner(2)).count() {code} or {code} val rdd = sc.parallelize(1 to 2e6.toInt, 2).mapPartitionsWithIndex{ case (part, itr) => if(part == 0) { itr.map{ ignore => val n = 3e3.toInt val arr = new Array[Byte](n) //need to make sure the array doesn't compress to something small scala.util.Random.nextBytes(arr) arr } } else { Iterator() } } rdd.map{ x => (scala.util.Random.nextInt % 1000 * 2, x)}.groupByKey(2).count() {code} However, the failure mode is slightly different, and I don't fully understand it. It does always have a {{FetchFailedException: Adjusted frame length exceeds 2147483647}}. But then during the retry, it decides to change from two tasks in the second stage, to just one task. I don't understand why the number of tasks would change. I understand that one of my executors might get removed, and so both tasks might get executed locally -- but I don't see why it would convert it into one task. In any case, these programs should have smaller records, but still have a 2 GB remote fetch shuffle block. Note that I'm trying to play some games w/ getting data from one partition to another to make a remote fetch -- but I can't actually guarantee that. Sometimes the programs succeed when they are local fetches. So they don't exhibit the issue every time, but they do fail regularly. Sometimes they go on to succeed (I guess after one of the executors is removed completely, and so there are no remote fetches anymore), and sometimes they fail repeatedly. Remote Shuffle Blocks cannot be more than 2 GB -- Key: SPARK-5928 URL: https://issues.apache.org/jira/browse/SPARK-5928 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Imran Rashid If a shuffle block is over 2GB, the shuffle fails, with an uninformative exception. The tasks get retried a few times and then eventually the job fails. Here is an example program which can cause the exception: {code} val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore => val n = 3e3.toInt val arr = new Array[Byte](n) //need to make sure the array doesn't compress to something small scala.util.Random.nextBytes(arr) arr } rdd.map { x => (1, x)}.groupByKey().count() {code} Note that you can't trigger this exception in local mode, it only happens on remote fetches. 
I triggered these exceptions running with {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}} {noformat} 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message= org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 2147483647: 3021252889 - discarded at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46) at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at
[jira] [Commented] (SPARK-765) Test suite should run Spark example programs
[ https://issues.apache.org/jira/browse/SPARK-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332438#comment-14332438 ] Nicholas Chammas commented on SPARK-765: Seems like a good idea. [~joshrosen] I assume this is still to be done, right? Test suite should run Spark example programs Key: SPARK-765 URL: https://issues.apache.org/jira/browse/SPARK-765 Project: Spark Issue Type: New Feature Components: Examples Reporter: Josh Rosen The Spark test suite should also run each of the Spark example programs (the PySpark suite should do the same). This should be done through a shell script or other mechanism to simulate the environment setup used by end users that run those scripts. This would prevent problems like SPARK-764 from making it into releases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1001) Memory leak when reading sequence file and then sorting
[ https://issues.apache.org/jira/browse/SPARK-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332444#comment-14332444 ] Nicholas Chammas commented on SPARK-1001: - It's probably tough given how long ago this was reported, but can anyone confirm whether this is still an issue on the latest release (1.2.1)? Memory leak when reading sequence file and then sorting --- Key: SPARK-1001 URL: https://issues.apache.org/jira/browse/SPARK-1001 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 0.8.0 Reporter: Matthew Cheah Labels: Hadoop, Memory Spark appears to build up a backlog of unreachable byte arrays when an RDD is constructed from a sequence file, and then that RDD is sorted. I have a class that wraps a Java ArrayList, that can be serialized and written to a Hadoop SequenceFile (I.e. Implements the Writable interface). Let's call it WritableDataRow. It can take a Java List as its argument to wrap around, and also has a copy constructor. Setup: 10 slaves, launched via EC2, 65.9GB RAM each, dataset is 100GB of text, 120GB when in sequence file format (not using compression to compact the bytes). CDH4.2.0-backed hadoop cluster. First, building the RDD from a CSV and then sorting on index 1 works fine: {code} scala import scala.collection.JavaConversions._ // Other imports here as well import scala.collection.JavaConversions._ scala val rddAsTextFile = sc.textFile(s3n://some-bucket/events-*.csv) rddAsTextFile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at console:14 scala val rddAsWritableDataRows = rddAsTextFile.map(x = new WritableDataRow(x.split(\\|).toList)) rddAsWritableDataRows: org.apache.spark.rdd.RDD[com.palantir.finance.datatable.server.spark.WritableDataRow] = MappedRDD[2] at map at console:19 scala val rddAsKeyedWritableDataRows = rddAsWritableDataRows.map(x = (x.getContents().get(1).toString(), x)); rddAsKeyedWritableDataRows: org.apache.spark.rdd.RDD[(String, com.palantir.finance.datatable.server.spark.WritableDataRow)] = MappedRDD[4] at map at console:22 scala val orderedFunct = new org.apache.spark.rdd.OrderedRDDFunctions[String, WritableDataRow, (String, WritableDataRow)](rddAsKeyedWritableDataRows) orderedFunct: org.apache.spark.rdd.OrderedRDDFunctions[String,com.palantir.finance.datatable.server.spark.WritableDataRow,(String, com.palantir.finance.datatable.server.spark.WritableDataRow)] = org.apache.spark.rdd.OrderedRDDFunctions@587acb54 scala orderedFunct.sortByKey(true).count(); // Actually triggers the computation, as stated in a different e-mail thread res0: org.apache.spark.rdd.RDD[(String, com.palantir.finance.datatable.server.spark.WritableDataRow)] = MapPartitionsRDD[8] at sortByKey at console:27 {code} The above works without too many surprises. 
I then save it as a Sequence File (using JavaPairRDD as a way to more easily call saveAsHadoopFile(), and this is how it's done in our Java-based application): {code} scala val pairRDD = new JavaPairRDD(rddAsWritableDataRows.map(x = (NullWritable.get(), x))); pairRDD: org.apache.spark.api.java.JavaPairRDD[org.apache.hadoop.io.NullWritable,com.palantir.finance.datatable.server.spark.WritableDataRow] = org.apache.spark.api.java.JavaPairRDD@8d2e9d9 scala pairRDD.saveAsHadoopFile(hdfs://hdfs-master-url:9010/blah, classOf[NullWritable], classOf[WritableDataRow], classOf[org.apache.hadoop.mapred.SequenceFileOutputFormat[NullWritable, WritableDataRow]]); … 2013-12-11 20:09:14,444 [main] INFO org.apache.spark.SparkContext - Job finished: saveAsHadoopFile at console:26, took 1052.116712748 s {code} And now I want to get the RDD from the sequence file and sort THAT, and this is when I monitor Ganglia and ps aux and notice the memory usage climbing ridiculously: {code} scala val rddAsSequenceFile = sc.sequenceFile(hdfs://hdfs-master-url:9010/blah, classOf[NullWritable], classOf[WritableDataRow]).map(x = new WritableDataRow(x._2)); // Invokes copy constructor to get around re-use of writable objects rddAsSequenceFile: org.apache.spark.rdd.RDD[com.palantir.finance.datatable.server.spark.WritableDataRow] = MappedRDD[19] at map at console:19 scala val orderedFunct = new org.apache.spark.rdd.OrderedRDDFunctions[String, WritableDataRow, (String, WritableDataRow)](rddAsSequenceFile.map(x = (x.getContents().get(1).toString(), x))) orderedFunct: org.apache.spark.rdd.OrderedRDDFunctions[String,com.palantir.finance.datatable.server.spark.WritableDataRow,(String, com.palantir.finance.datatable.server.spark.WritableDataRow)] = org.apache.spark.rdd.OrderedRDDFunctions@6262a9a6 scalaorderedFunct.sortByKey().count(); {code} (On the necessity to copy writables
[jira] [Commented] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB
[ https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332480#comment-14332480 ] Imran Rashid commented on SPARK-5928: - Thanks Kay, that does make sense. I think that gave me enough info to at least summarize in another issue -- I just opened SPARK-5945, please add / correct anything I have written there. Good question about the record size -- I'm pretty sure that is not the issue, but I am hitting some other strange behavior which I'm trying to understand, I'll update this issue soon with more info. Remote Shuffle Blocks cannot be more than 2 GB -- Key: SPARK-5928 URL: https://issues.apache.org/jira/browse/SPARK-5928 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Imran Rashid If a shuffle block is over 2GB, the shuffle fails, with an uninformative exception. The tasks get retried a few times and then eventually the job fails. Here is an example program which can cause the exception: {code} val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore = val n = 3e3.toInt val arr = new Array[Byte](n) //need to make sure the array doesn't compress to something small scala.util.Random.nextBytes(arr) arr } rdd.map { x = (1, x)}.groupByKey().count() {code} Note that you can't trigger this exception in local mode, it only happens on remote fetches. I triggered these exceptions running with {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}} {noformat} 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message= org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 2147483647: 3021252889 - discarded at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46) at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame length exceeds 2147483647: 3021252889 - discarded at io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501) at 
io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343) at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at
[jira] [Updated] (SPARK-5945) Spark should not retry a stage infinitely on a FetchFailedException
[ https://issues.apache.org/jira/browse/SPARK-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-5945: Description: While investigating SPARK-5928, I noticed some very strange behavior in the way spark retries stages after a FetchFailedException. It seems that on a FetchFailedException, instead of simply killing the task and retrying, Spark aborts the stage and retries. If it just retried the task, the task might fail 4 times and then trigger the usual job killing mechanism. But by killing the stage instead, the max retry logic is skipped (it looks to me like there is no limit for retries on a stage). After a bit of discussion with Kay Ousterhout, it seems the idea is that if a fetch fails, we assume that the block manager we are fetching from has failed, and that it will succeed if we retry the stage w/out that block manager. In that case, it wouldn't make any sense to retry the task, since its doomed to fail every time, so we might as well kill the whole stage. But this raises two questions: 1) Is it really safe to assume that a FetchFailedException means that the BlockManager has failed, and ti will work if we just try another one? SPARK-5928 shows that there are at least some cases where that assumption is wrong. Even if we fix that case, this logic seems brittle to the next case we find. I guess the idea is that this behavior is what gives us the R in RDD ... but it seems like its not really that robust and maybe should be reconsidered. 2) Should stages only be retried a limited number of times? It would be pretty easy to put in a limited number of retries per stage. Though again, we encounter issues with keeping things resilient. Theoretically one stage could have many retries, but due to failures in different stages further downstream, so we might need to track the cause of each retry as well to still have the desired behavior. In general it just seems there is some flakiness in the retry logic. This is the only reproducible example I have at the moment, but I vaguely recall hitting other cases of strange behavior w/ retries when trying to run long pipelines. Eg., if one executor is stuck in a GC during a fetch, the fetch fails, but the executor eventually comes back and the stage gets retried again, but the same GC issues happen the second time around, etc. Copied from SPARK-5928, here's the example program that can regularly produce a loop of stage failures. Note that it will only fail from a remote fetch, so it can't be run locally -- I ran with {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}} {code} val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore = val n = 3e3.toInt val arr = new Array[Byte](n) //need to make sure the array doesn't compress to something small scala.util.Random.nextBytes(arr) arr } rdd.map { x = (1, x)}.groupByKey().count() {code} was: While investigating SPARK-5928, I noticed some very strange behavior in the way spark retries stages after a FetchFailedException. It seems that on a FetchFailedException, instead of simply killing the task and retrying, Spark aborts the stage and retries. If it just retried the task, the task might fail 4 times and then trigger the usual job killing mechanism. But by killing the stage instead, the max retry logic is skipped (it looks to me like there is no limit for retries on a stage). 
After a bit of discussion with Kay Ousterhout, it seems the idea is that if a fetch fails, we assume that the block manager we are fetching from has failed, and that it will succeed if we retry the stage w/out that block manager. In that case, it wouldn't make any sense to retry the task, since its doomed to fail every time, so we might as well kill the whole stage. But this raises two questions: 1) Is it really safe to assume that a FetchFailedException means that the BlockManager has failed, and ti will work if we just try another one? SPARK-5928 shows that there are at least some cases where that assumption is wrong. Even if we fix that case, this logic seems brittle to the next case we find. I guess the idea is that this behavior is what gives us the R in RDD ... but it seems like its not really that robust and maybe should be reconsidered. 2) Should stages only be retried a limited number of times? It would be pretty easy to put in a limited number of retries per stage. Though again, we encounter issues with keeping things resilient. Theoretically one stage could have many retries, but due to failures in different stages further downstream, so we might need to track the logic. In general it just seems there is some flakiness in the retry logic. This is the only reproducible example I have at the moment, but I vaguely recall hitting other cases of strange behavior w/ retries when trying to run
[jira] [Updated] (SPARK-5945) Spark should not retry a stage infinitely on a FetchFailedException
[ https://issues.apache.org/jira/browse/SPARK-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-5945: Description: While investigating SPARK-5928, I noticed some very strange behavior in the way spark retries stages after a FetchFailedException. It seems that on a FetchFailedException, instead of simply killing the task and retrying, Spark aborts the stage and retries. If it just retried the task, the task might fail 4 times and then trigger the usual job killing mechanism. But by killing the stage instead, the max retry logic is skipped (it looks to me like there is no limit for retries on a stage). After a bit of discussion with Kay Ousterhout, it seems the idea is that if a fetch fails, we assume that the block manager we are fetching from has failed, and that it will succeed if we retry the stage w/out that block manager. In that case, it wouldn't make any sense to retry the task, since its doomed to fail every time, so we might as well kill the whole stage. But this raises two questions: 1) Is it really safe to assume that a FetchFailedException means that the BlockManager has failed, and ti will work if we just try another one? SPARK-5928 shows that there are at least some cases where that assumption is wrong. Even if we fix that case, this logic seems brittle to the next case we find. I guess the idea is that this behavior is what gives us the R in RDD ... but it seems like its not really that robust and maybe should be reconsidered. 2) Should stages only be retried a limited number of times? It would be pretty easy to put in a limited number of retries per stage. Though again, we encounter issues with keeping things resilient. Theoretically one stage could have many retries, but due to failures in different stages further downstream, so we might need to track the logic. In general it just seems there is some flakiness in the retry logic. This is the only reproducible example I have at the moment, but I vaguely recall hitting other cases of strange behavior w/ retries when trying to run long pipelines. Eg., if one executor is stuck in a GC during a fetch, the fetch fails, but the executor eventually comes back and the stage gets retried again, but the same GC issues happen the second time around, etc. Copied from SPARK-5928, here's the example program that can regularly produce a loop of stage failures. Note that it will only fail from a remote fetch, so it can't be run locally -- I ran with {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}} {code} val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore = val n = 3e3.toInt val arr = new Array[Byte](n) //need to make sure the array doesn't compress to something small scala.util.Random.nextBytes(arr) arr } rdd.map { x = (1, x)}.groupByKey().count() {code} was: While investigating SPARK-5928, I noticed some very strange behavior in the way spark retries stages after a FetchFailedException. It seems that on a FetchFailedException, instead of simply killing the task and retrying, Spark aborts the stage and retries. If it just retried the task, the task might fail 4 times and then trigger the usual job killing mechanism. But by killing the stage instead, the max retry logic is skipped (it looks to me like there is no limit for retries on a stage). 
After a bit of discussion with Kay Ousterhout, it seems the idea is that if a fetch fails, we assume that the block manager we are fetching from has failed, and that it will succeed if we retry the stage w/out that block manager. In that case, it wouldn't make any sense to retry the task, since its doomed to fail every time, so we might as well kill the whole stage. But this raises two questions: 1) Is it really safe to assume that a FetchFailedException means that the BlockManager has failed, and ti will work if we just try another one? SPARK-5928 shows that there are at least some cases where that assumption is wrong. Even if we fix that case, this logic seems brittle to the next case we find. I guess the idea is that this behavior is what gives us the R in RDD ... but it seems like its not really that robust and maybe should be reconsidered. 2) Should stages only be retried a limited number of times? It would be pretty easy to put in a limited number of retries per stage. Though again, we encounter issues with also keeping things resilient. Theoretically one stage could have many retries, but due to failures in different stages further downstream, so we might need to track the logic. In general it just seems there is some flakiness in the retry logic. This is the only reproducible example I have at the moment, but I vaguely recall hitting other cases of strange behavior w/ retries when trying to run long pipelines. Eg., if one executor is stuck in a GC
[jira] [Updated] (SPARK-5253) LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package
[ https://issues.apache.org/jira/browse/SPARK-5253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5253: - Summary: LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package (was: LinearRegression with L1/L2 (elastic net) using OWLQN in new ML pacakge) LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package --- Key: SPARK-5253 URL: https://issues.apache.org/jira/browse/SPARK-5253 Project: Spark Issue Type: New Feature Components: ML Reporter: DB Tsai -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1050) Investigate AnyRefMap
[ https://issues.apache.org/jira/browse/SPARK-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-1050: Target Version/s: 1.4.0 Since Spark can now be built with Scala 2.11, I believe this issue can now be looked into. Tagging this with a 1.4.0 target so it can be reviewed by then. Investigate AnyRefMap - Key: SPARK-1050 URL: https://issues.apache.org/jira/browse/SPARK-1050 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Mark Hamstra Priority: Minor Once Spark is built using Scala 2.11, we should investigate (particularly on heavily-used, performance-critical code paths) replacing usage of HashMap with the new, higher-performance AnyRefMap. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
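For anyone investigating, the shape of the change is small; a sketch comparing the two maps (both in {{scala.collection.mutable}}; {{AnyRefMap}} requires Scala 2.11 and AnyRef keys):
{code}
import scala.collection.mutable

// Current pattern: a general-purpose mutable HashMap keyed by a reference type.
val hm = new mutable.HashMap[String, Long]()
hm("stage-1") = 42L

// Candidate replacement: AnyRefMap is specialised for AnyRef keys and trades
// some generality for faster get/update on hot paths.
val arm = new mutable.AnyRefMap[String, Long]()
arm("stage-1") = 42L
arm.getOrElseUpdate("stage-2", 7L)

println(hm.get("stage-1"))
println(arm.getOrElse("stage-2", -1L))
{code}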
[jira] [Created] (SPARK-5945) Spark should not retry a stage infinitely on a FetchFailedException
Imran Rashid created SPARK-5945: --- Summary: Spark should not retry a stage infinitely on a FetchFailedException Key: SPARK-5945 URL: https://issues.apache.org/jira/browse/SPARK-5945 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Imran Rashid While investigating SPARK-5928, I noticed some very strange behavior in the way spark retries stages after a FetchFailedException. It seems that on a FetchFailedException, instead of simply killing the task and retrying, Spark aborts the stage and retries. If it just retried the task, the task might fail 4 times and then trigger the usual job killing mechanism. But by killing the stage instead, the max retry logic is skipped (it looks to me like there is no limit for retries on a stage). After a bit of discussion with Kay Ousterhout, it seems the idea is that if a fetch fails, we assume that the block manager we are fetching from has failed, and that it will succeed if we retry the stage w/out that block manager. In that case, it wouldn't make any sense to retry the task, since its doomed to fail every time, so we might as well kill the whole stage. But this raises two questions: 1) Is it really safe to assume that a FetchFailedException means that the BlockManager has failed, and ti will work if we just try another one? SPARK-5928 shows that there are at least some cases where that assumption is wrong. Even if we fix that case, this logic seems brittle to the next case we find. I guess the idea is that this behavior is what gives us the R in RDD ... but it seems like its not really that robust and maybe should be reconsidered. 2) Should stages only be retried a limited number of times? It would be pretty easy to put in a limited number of retries per stage. Though again, we encounter issues with also keeping things resilient. Theoretically one stage could have many retries, but due to failures in different stages further downstream, so we might need to track the logic. In general it just seems there is some flakiness in the retry logic. This is the only reproducible example I have at the moment, but I vaguely recall hitting other cases of strange behavior w/ retries when trying to run long pipelines. Eg., if one executor is stuck in a GC during a fetch, the fetch fails, but the executor eventually comes back and the stage gets retried again, but the same GC issues happen the second time around, etc. Copied from SPARK-5928, here's the example program that can regularly produce a loop of stage failures. Note that it will only fail from a remote fetch, so it can't be run locally -- I ran with {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}} {code} val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore = val n = 3e3.toInt val arr = new Array[Byte](n) //need to make sure the array doesn't compress to something small scala.util.Random.nextBytes(arr) arr } rdd.map { x = (1, x)}.groupByKey().count() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-1050) Investigate AnyRefMap
[ https://issues.apache.org/jira/browse/SPARK-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-1050. -- Resolution: Won't Fix Investigate AnyRefMap - Key: SPARK-1050 URL: https://issues.apache.org/jira/browse/SPARK-1050 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Mark Hamstra Priority: Minor Once Spark is built using Scala 2.11, we should investigate (particularly on heavily-used, performance-critical code paths) replacing usage of HashMap with the new, higher-performance AnyRefMap. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-1050) Investigate AnyRefMap
[ https://issues.apache.org/jira/browse/SPARK-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-1050. -- Resolution: Fixed I'm going to close this for now. If somebody has time to review the new Scala hash map, we can reopen it. In general the performance characteristics of the Scala collection library are not very satisfying. I have some thoughts on performance optimizations in general and I will post something soon. Investigate AnyRefMap - Key: SPARK-1050 URL: https://issues.apache.org/jira/browse/SPARK-1050 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Mark Hamstra Priority: Minor Once Spark is built using Scala 2.11, we should investigate (particularly on heavily-used, performance-critical code paths) replacing usage of HashMap with the new, higher-performance AnyRefMap. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-1050) Investigate AnyRefMap
[ https://issues.apache.org/jira/browse/SPARK-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin reopened SPARK-1050: Investigate AnyRefMap - Key: SPARK-1050 URL: https://issues.apache.org/jira/browse/SPARK-1050 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Mark Hamstra Priority: Minor Once Spark is built using Scala 2.11, we should investigate (particularly on heavily-used, performance-critical code paths) replacing usage of HashMap with the new, higher-performance AnyRefMap. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org