[jira] [Created] (SPARK-4446) MetadataCleaner schedules its task with a wrong param for delay time
Leo created SPARK-4446: -- Summary: MetadataCleaner schedules its task with a wrong param for delay time Key: SPARK-4446 URL: https://issues.apache.org/jira/browse/SPARK-4446 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Leo MetadataCleaner schedules its task with a wrong param for delay time.
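For context, java.util.Timer.schedule takes the initial delay and the repeat period as separate millisecond arguments, so passing the wrong value as the delay silently shifts when the first cleanup fires. A minimal sketch of that failure mode; the variable names and values are illustrative, not the actual MetadataCleaner code:
{noformat}
import java.util.{Timer, TimerTask}

val delaySeconds = 60    // when the first run should happen
val periodSeconds = 600  // how often it should repeat afterwards

val task = new TimerTask { override def run(): Unit = println("cleaning metadata") }
val timer = new Timer("metadata-cleaner", true) // daemon thread

// Correct: Timer.schedule(task, delay, period), both in milliseconds.
timer.schedule(task, delaySeconds * 1000L, periodSeconds * 1000L)

// The bug class this issue describes: handing the wrong value (e.g. the
// period) to the delay argument, which makes the first cleanup run far too late:
// timer.schedule(task, periodSeconds * 1000L, periodSeconds * 1000L)
{noformat}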
[jira] [Commented] (SPARK-4446) MetadataCleaner schedules its task with a wrong param for delay time
[ https://issues.apache.org/jira/browse/SPARK-4446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214400#comment-14214400 ] Apache Spark commented on SPARK-4446: - User 'Leolh' has created a pull request for this issue: https://github.com/apache/spark/pull/3306 MetadataCleaner schedules its task with a wrong param for delay time - Key: SPARK-4446 URL: https://issues.apache.org/jira/browse/SPARK-4446 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Leo MetadataCleaner schedules its task with a wrong param for delay time.
[jira] [Commented] (SPARK-4306) LogisticRegressionWithLBFGS support for PySpark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214406#comment-14214406 ] Apache Spark commented on SPARK-4306: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/3307 LogisticRegressionWithLBFGS support for PySpark MLlib -- Key: SPARK-4306 URL: https://issues.apache.org/jira/browse/SPARK-4306 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Varadharajan Assignee: Varadharajan Labels: newbie Original Estimate: 48h Remaining Estimate: 48h Currently we support LogisticRegressionWithSGD in the PySpark MLlib interface. This task is to add support for the LogisticRegressionWithLBFGS algorithm.
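For reference, the Scala API that the new Python wrapper would mirror already exists in MLlib; a minimal usage sketch (the app name and data path are illustrative):
{noformat}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

object LBFGSExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lbfgs-example"))
    // Expects LIBSVM-formatted data; the path is made up.
    val training = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
    // The L-BFGS-based trainer this ticket exposes to Python, as opposed to
    // the SGD-based one PySpark already has.
    val model = new LogisticRegressionWithLBFGS().run(training)
    println(model.predict(training.first().features))
    sc.stop()
  }
}
{noformat}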
[jira] [Commented] (SPARK-2208) local metrics tests can fail on fast machines
[ https://issues.apache.org/jira/browse/SPARK-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214407#comment-14214407 ] Apache Spark commented on SPARK-2208: - User 'XuefengWu' has created a pull request for this issue: https://github.com/apache/spark/pull/3300 local metrics tests can fail on fast machines - Key: SPARK-2208 URL: https://issues.apache.org/jira/browse/SPARK-2208 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell Labels: starter I'm temporarily disabling this check. I think the issue is that on fast machines the fetch wait time can actually be zero, even across all tasks. We should see if we can write this in a different way to make sure there is a delay.
[jira] [Created] (SPARK-4447) Remove layers of abstraction in YARN code no longer needed after dropping yarn-alpha
Sandy Ryza created SPARK-4447: - Summary: Remove layers of abstraction in YARN code no longer needed after dropping yarn-alpha Key: SPARK-4447 URL: https://issues.apache.org/jira/browse/SPARK-4447 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.3.0 Reporter: Sandy Ryza
[jira] [Updated] (SPARK-4447) Remove layers of abstraction in YARN code no longer needed after dropping yarn-alpha
[ https://issues.apache.org/jira/browse/SPARK-4447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-4447: -- Description: For example: YarnRMClient and YarnRMClientImpl can be merged; YarnAllocator and YarnAllocationHandler can be merged. Remove layers of abstraction in YARN code no longer needed after dropping yarn-alpha Key: SPARK-4447 URL: https://issues.apache.org/jira/browse/SPARK-4447 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.3.0 Reporter: Sandy Ryza For example: YarnRMClient and YarnRMClientImpl can be merged; YarnAllocator and YarnAllocationHandler can be merged.
[jira] [Commented] (SPARK-4447) Remove layers of abstraction in YARN code no longer needed after dropping yarn-alpha
[ https://issues.apache.org/jira/browse/SPARK-4447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214411#comment-14214411 ] Sandy Ryza commented on SPARK-4447: --- Planning to work on this. Remove layers of abstraction in YARN code no longer needed after dropping yarn-alpha Key: SPARK-4447 URL: https://issues.apache.org/jira/browse/SPARK-4447 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.3.0 Reporter: Sandy Ryza
[jira] [Commented] (SPARK-4306) LogisticRegressionWithLBFGS support for PySpark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214415#comment-14214415 ] Varadharajan commented on SPARK-4306: - [~matei] I'm really sorry. I'm quite occupied this week. I see Davies has already created a pull request. Sorry for the inconvenience :( LogisticRegressionWithLBFGS support for PySpark MLlib -- Key: SPARK-4306 URL: https://issues.apache.org/jira/browse/SPARK-4306 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Varadharajan Assignee: Varadharajan Labels: newbie Original Estimate: 48h Remaining Estimate: 48h Currently we support LogisticRegressionWithSGD in the PySpark MLlib interface. This task is to add support for the LogisticRegressionWithLBFGS algorithm.
[jira] [Created] (SPARK-4448) Support ConstantObjectInspector for unwrapping data
Cheng Hao created SPARK-4448: Summary: Support ConstantObjectInspector for unwrapping data Key: SPARK-4448 URL: https://issues.apache.org/jira/browse/SPARK-4448 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Minor A ClassCastException is thrown when retrieving a primitive object from a ConstantObjectInspector by specifying parameters. This is probably a bug in Hive; we just provide a workaround in HiveInspectors.
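A rough sketch of what such a workaround can look like: instead of asking the inspector to re-extract a primitive from the row data (which is what triggers the cast error for constant inspectors), read the pre-evaluated constant directly. The helper names are assumptions, not the actual HiveInspectors code:
{noformat}
import org.apache.hadoop.hive.serde2.objectinspector.{ConstantObjectInspector, ObjectInspector, PrimitiveObjectInspector}

// Stand-in for whatever converts the constant's Writable into a Catalyst
// value; not a real Spark function.
def unwrapWritable(w: AnyRef): Any = w

def unwrap(data: Any, oi: ObjectInspector): Any = oi match {
  // Constant inspectors already carry their value; touching `data` is what
  // raises the ClassCastException this ticket describes.
  case constOI: ConstantObjectInspector =>
    unwrapWritable(constOI.getWritableConstantValue)
  case primOI: PrimitiveObjectInspector =>
    primOI.getPrimitiveJavaObject(data)
  case _ => data
}
{noformat}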
[jira] [Commented] (SPARK-4448) Support ConstantObjectInspector for unwrapping data
[ https://issues.apache.org/jira/browse/SPARK-4448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214429#comment-14214429 ] Apache Spark commented on SPARK-4448: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/3308 Support ConstantObjectInspector for unwrapping data --- Key: SPARK-4448 URL: https://issues.apache.org/jira/browse/SPARK-4448 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Minor A ClassCastException is thrown when retrieving a primitive object from a ConstantObjectInspector by specifying parameters. This is probably a bug in Hive; we just provide a workaround in HiveInspectors.
[jira] [Commented] (SPARK-4445) Don't display storage level in toDebugString unless RDD is persisted
[ https://issues.apache.org/jira/browse/SPARK-4445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1421#comment-1421 ] Apache Spark commented on SPARK-4445: - User 'ScrapCodes' has created a pull request for this issue: https://github.com/apache/spark/pull/3310 Don't display storage level in toDebugString unless RDD is persisted Key: SPARK-4445 URL: https://issues.apache.org/jira/browse/SPARK-4445 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell Assignee: Prashant Sharma Priority: Blocker The current approach lists the storage level all the time, even if the RDD is not persisted. The storage level should only be listed if the RDD is persisted. We just need to guard it with a check.
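A minimal sketch of such a guard, assuming the debug string is assembled per RDD (the helper itself is invented for illustration):
{noformat}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Only mention the storage level when the RDD is actually persisted.
def storageSuffix(rdd: RDD[_]): String =
  if (rdd.getStorageLevel != StorageLevel.NONE)
    s" [${rdd.getStorageLevel.description}]"
  else
    ""
{noformat}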
[jira] [Commented] (SPARK-4442) Move common unit test utilities into their own package / module
[ https://issues.apache.org/jira/browse/SPARK-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214461#comment-14214461 ] Sean Owen commented on SPARK-4442: -- You can already depend on just core's test code from other modules' test code. Is this more than that? Move common unit test utilities into their own package / module --- Key: SPARK-4442 URL: https://issues.apache.org/jira/browse/SPARK-4442 Project: Spark Issue Type: Improvement Reporter: Josh Rosen Priority: Minor We should move generally-useful unit test fixtures / utility methods into their own test-utilities package / module to make them easier to find / use. See https://github.com/apache/spark/pull/3121#discussion-diff-20413659 for one example of this.
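On Sean's point, a module can already pull in core's test classes through the published test-jar artifact; in sbt form (the version is illustrative):
{noformat}
// Depend on spark-core's test-jar, for test scope only.
libraryDependencies +=
  "org.apache.spark" %% "spark-core" % "1.2.0" % "test" classifier "tests"
{noformat}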
[jira] [Updated] (SPARK-3962) Mark spark dependency as provided in external libraries
[ https://issues.apache.org/jira/browse/SPARK-3962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3962: --- Issue Type: Bug (was: Improvement) Mark spark dependency as provided in external libraries - Key: SPARK-3962 URL: https://issues.apache.org/jira/browse/SPARK-3962 Project: Spark Issue Type: Bug Components: Streaming Reporter: Patrick Wendell Assignee: Prashant Sharma Priority: Blocker Right now there is not an easy way for users to link against the external streaming libraries without accidentally pulling Spark into their assembly jar. We should mark Spark as provided in the external connector POMs so that user applications can simply include those like any other dependency in the user's jar. This is also the best format for third-party libraries that depend on Spark (of which there will eventually be many), so it would be nice for our own build to conform to this.
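In sbt terms, the end state this ticket aims for looks something like the following for a user application (versions illustrative): Spark itself is provided by the cluster at runtime, while the connector travels inside the assembly jar:
{noformat}
libraryDependencies ++= Seq(
  // Provided by the Spark installation at runtime; kept out of the assembly.
  "org.apache.spark" %% "spark-streaming" % "1.2.0" % "provided",
  // Bundled into the user's assembly jar like any ordinary dependency.
  "org.apache.spark" %% "spark-streaming-kafka" % "1.2.0"
)
{noformat}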
[jira] [Commented] (SPARK-3962) Mark spark dependency as provided in external libraries
[ https://issues.apache.org/jira/browse/SPARK-3962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214480#comment-14214480 ] Patrick Wendell commented on SPARK-3962: I think this is causing the build to fail because of this bug: https://jira.codehaus.org/browse/MSHADE-148 Would it be possible to remove the test-jar dependency on streaming for these modules (/cc @tdas)? If that means a moderate amount of code duplication, I think it's fine, because as it stands now these aren't really usable for people. I looked, and I only found some minor dependencies that could be inlined into the modules. Mark spark dependency as provided in external libraries - Key: SPARK-3962 URL: https://issues.apache.org/jira/browse/SPARK-3962 Project: Spark Issue Type: Bug Components: Streaming Reporter: Patrick Wendell Assignee: Prashant Sharma Priority: Blocker Right now there is not an easy way for users to link against the external streaming libraries without accidentally pulling Spark into their assembly jar. We should mark Spark as provided in the external connector POMs so that user applications can simply include those like any other dependency in the user's jar. This is also the best format for third-party libraries that depend on Spark (of which there will eventually be many), so it would be nice for our own build to conform to this.
[jira] [Commented] (SPARK-4402) Output path validation of an action statement resulting in runtime exception
[ https://issues.apache.org/jira/browse/SPARK-4402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214486#comment-14214486 ] Sean Owen commented on SPARK-4402: -- Can the Spark code go back and check this before any of it is called, at the start of your program? No, that isn't possible. It wouldn't even know which RDDs may be executed at the outset, and it couldn't be sure that the output dir isn't cleared up by your code before output happens. Here it seems to happen before the output operation starts, which is about as early as possible. I suggest this is the correct behavior, and it is the current behavior. It's even configurable whether it overwrites or fails when the output dir exists. Of course you can and should check the output directory in your program. In fact, your program is in a better position to know whether it should be an error or a warning, or whether you should just overwrite the output. Output path validation of an action statement resulting in runtime exception Key: SPARK-4402 URL: https://issues.apache.org/jira/browse/SPARK-4402 Project: Spark Issue Type: Wish Reporter: Vijay Priority: Minor Output path validation happens at the time of statement execution, as part of the lazy evaluation of the action statement. But if the path already exists, it throws a runtime exception, so all the processing completed up to that point is lost, which wastes resources (processing time and CPU usage). If this I/O-related validation were done before the RDD action operations, this runtime exception could be avoided. I believe a similar validation feature is implemented in Hadoop as well. Example: SchemaRDD.saveAsTextFile() evaluates the path at runtime.
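Sean's suggestion, checking the output directory up front in the application itself, takes only a few lines with the Hadoop FileSystem API; a minimal sketch (the helper and path handling are illustrative):
{noformat}
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Fail fast if the output directory exists, before any job is triggered.
def saveIfAbsent(sc: SparkContext, rdd: RDD[String], out: String): Unit = {
  val outPath = new Path(out)
  val fs = outPath.getFileSystem(sc.hadoopConfiguration)
  require(!fs.exists(outPath), s"Output path $out already exists")
  rdd.saveAsTextFile(out)
}
{noformat}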
[jira] [Created] (SPARK-4449) Specify port range in Spark
wangfei created SPARK-4449: -- Summary: Specify port range in Spark Key: SPARK-4449 URL: https://issues.apache.org/jira/browse/SPARK-4449 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: wangfei Fix For: 1.2.0 In some cases, we need to specify the range of ports used by Spark.
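For background, the closest existing knob is a fixed starting port per service plus spark.port.maxRetries, which retries successive ports on bind failure and so effectively yields the range [port, port + maxRetries]. A hedged example of those existing settings (values illustrative; the per-service range syntax this ticket proposes exists only in the PR and is not shown):
{noformat}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.port", "40000")       // fixed starting port for the driver
  .set("spark.blockManager.port", "40100") // and for the block manager
  // Each service retries successive ports on bind failure, so it effectively
  // uses [port, port + spark.port.maxRetries].
  .set("spark.port.maxRetries", "32")
{noformat}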
[jira] [Commented] (SPARK-4449) Specify port range in Spark
[ https://issues.apache.org/jira/browse/SPARK-4449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214587#comment-14214587 ] Apache Spark commented on SPARK-4449: - User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/3314 Specify port range in Spark --- Key: SPARK-4449 URL: https://issues.apache.org/jira/browse/SPARK-4449 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: wangfei Fix For: 1.2.0 In some cases, we need to specify the range of ports used by Spark.
[jira] [Commented] (SPARK-4402) Output path validation of an action statement resulting in runtime exception
[ https://issues.apache.org/jira/browse/SPARK-4402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214635#comment-14214635 ] Vijay commented on SPARK-4402: -- Thanks for the explanation. It is clear now. Output path validation of an action statement resulting in runtime exception Key: SPARK-4402 URL: https://issues.apache.org/jira/browse/SPARK-4402 Project: Spark Issue Type: Wish Reporter: Vijay Priority: Minor Output path validation happens at the time of statement execution, as part of the lazy evaluation of the action statement. But if the path already exists, it throws a runtime exception, so all the processing completed up to that point is lost, which wastes resources (processing time and CPU usage). If this I/O-related validation were done before the RDD action operations, this runtime exception could be avoided. I believe a similar validation feature is implemented in Hadoop as well. Example: SchemaRDD.saveAsTextFile() evaluates the path at runtime.
[jira] [Resolved] (SPARK-4402) Output path validation of an action statement resulting in runtime exception
[ https://issues.apache.org/jira/browse/SPARK-4402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vijay resolved SPARK-4402. -- Resolution: Not a Problem Output path validation of an action statement resulting in runtime exception Key: SPARK-4402 URL: https://issues.apache.org/jira/browse/SPARK-4402 Project: Spark Issue Type: Wish Reporter: Vijay Priority: Minor Output path validation happens at the time of statement execution, as part of the lazy evaluation of the action statement. But if the path already exists, it throws a runtime exception, so all the processing completed up to that point is lost, which wastes resources (processing time and CPU usage). If this I/O-related validation were done before the RDD action operations, this runtime exception could be avoided. I believe a similar validation feature is implemented in Hadoop as well. Example: SchemaRDD.saveAsTextFile() evaluates the path at runtime.
[jira] [Commented] (SPARK-4411) Add kill link for jobs in the UI
[ https://issues.apache.org/jira/browse/SPARK-4411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214741#comment-14214741 ] Thomas Graves commented on SPARK-4411: -- Please make sure that the modify acls work apply to it. Add kill link for jobs in the UI -- Key: SPARK-4411 URL: https://issues.apache.org/jira/browse/SPARK-4411 Project: Spark Issue Type: New Feature Components: Web UI Affects Versions: 1.2.0 Reporter: Kay Ousterhout SPARK-4145 changes the default landing page for the UI to show jobs. We should have a kill link for each job, similar to what we have for each stage, so it's easier for users to kill slow jobs (and the semantics of killing a job are slightly different than killing a stage).
[jira] [Comment Edited] (SPARK-4411) Add kill link for jobs in the UI
[ https://issues.apache.org/jira/browse/SPARK-4411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214741#comment-14214741 ] Thomas Graves edited comment on SPARK-4411 at 11/17/14 3:34 PM: Please make sure that the modify acls are applied to it, in order to prevent someone killing someone else's job without permission. was (Author: tgraves): Please make sure that the modify acls work apply to it. Add kill link for jobs in the UI -- Key: SPARK-4411 URL: https://issues.apache.org/jira/browse/SPARK-4411 Project: Spark Issue Type: New Feature Components: Web UI Affects Versions: 1.2.0 Reporter: Kay Ousterhout SPARK-4145 changes the default landing page for the UI to show jobs. We should have a kill link for each job, similar to what we have for each stage, so it's easier for users to kill slow jobs (and the semantics of killing a job are slightly different than killing a stage).
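For reference, the modify-ACL settings the kill link should honor are ordinary Spark conf entries (the user names here are illustrative):
{noformat}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.acls.enable", "true")       // enforce ACL checks in the web UI
  .set("spark.modify.acls", "alice,bob")  // users allowed to kill jobs/stages
  .set("spark.ui.view.acls", "alice,bob") // users allowed to view the UI
{noformat}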
[jira] [Created] (SPARK-4450) SparkSQL producing incorrect answer when using --master yarn
Rick Bischoff created SPARK-4450: Summary: SparkSQL producing incorrect answer when using --master yarn Key: SPARK-4450 URL: https://issues.apache.org/jira/browse/SPARK-4450 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: CDH 5.1 Reporter: Rick Bischoff A simple summary program run with spark-submit --master local MyJob.py vs. spark-submit --master yarn MyJob.py produces different answers: the output produced by local has been independently verified and is correct, but the output from yarn is incorrect. It does not appear to happen with smaller files, only large files. MyJob.py is:
{noformat}
from pyspark import SparkContext, SparkConf
from pyspark.sql import *

def maybeFloat(x):
    """Convert NULLs into 0s"""
    if x == '':
        return 0.
    else:
        return float(x)

def maybeInt(x):
    """Convert NULLs into 0s"""
    if x == '':
        return 0
    else:
        return int(x)

def mapColl(p):
    return {
        "f1": p[0], "f2": p[1], "f3": p[2], "f4": int(p[3]),
        "f5": int(p[4]), "f6": p[5], "f7": p[6], "f8": p[7],
        "f9": p[8], "f10": maybeInt(p[9]), "f11": p[10], "f12": p[11],
        "f13": p[12], "f14": p[13], "f15": maybeFloat(p[14]),
        "f16": maybeInt(p[15]), "f17": maybeFloat(p[16])
    }

sc = SparkContext()
sqlContext = SQLContext(sc)

lines = sc.textFile("sample.csv")
fields = lines.map(lambda l: mapColl(l.split(",")))

collTable = sqlContext.inferSchema(fields)
collTable.registerAsTable("sample")

test = sqlContext.sql("SELECT f9, COUNT(*) AS rows, SUM(f15) AS f15sum " \
                      + "FROM sample " \
                      + "GROUP BY f9")
foo = test.collect()
print foo

sc.stop()
{noformat}
[jira] [Created] (SPARK-4451) force to kill process after 5 seconds
WangTaoTheTonic created SPARK-4451: -- Summary: force to kill process after 5 seconds Key: SPARK-4451 URL: https://issues.apache.org/jira/browse/SPARK-4451 Project: Spark Issue Type: Improvement Components: Deploy Reporter: WangTaoTheTonic When we restart the history server or thrift server, sometimes the new processes will not launch because the old ones have not totally stopped. So I think we'd better wait a few seconds and then kill the process by force. I do this following Hadoop's approach.
[jira] [Commented] (SPARK-4451) force to kill process after 5 seconds
[ https://issues.apache.org/jira/browse/SPARK-4451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214880#comment-14214880 ] Apache Spark commented on SPARK-4451: - User 'WangTaoTheTonic' has created a pull request for this issue: https://github.com/apache/spark/pull/3316 force to kill process after 5 seconds - Key: SPARK-4451 URL: https://issues.apache.org/jira/browse/SPARK-4451 Project: Spark Issue Type: Improvement Components: Deploy Reporter: WangTaoTheTonic When we restart the history server or thrift server, sometimes the new processes will not launch because the old ones have not totally stopped. So I think we'd better wait a few seconds and then kill the process by force. I do this following Hadoop's approach.
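The change itself would live in the daemon control scripts, but the pattern is the usual terminate, wait, then kill-for-sure sequence that Hadoop's scripts use. A Scala analogue for illustration (Process.waitFor with a timeout and destroyForcibly require Java 8; the 5-second grace period mirrors the proposal):
{noformat}
import java.util.concurrent.TimeUnit

// Ask the daemon to stop, give it a grace period, then kill it for sure.
def stopDaemon(p: Process, graceSeconds: Long = 5): Unit = {
  p.destroy() // polite termination request (SIGTERM on Unix)
  if (!p.waitFor(graceSeconds, TimeUnit.SECONDS)) {
    p.destroyForcibly() // still alive after the grace period: SIGKILL
  }
}
{noformat}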
[jira] [Created] (SPARK-4452) Enhance Sort-based Shuffle to avoid spilling small files
tianshuo created SPARK-4452: --- Summary: Enhance Sort-based Shuffle to avoid spilling small files Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Reporter: tianshuo When an Aggregator is used with ExternalSorter in a task, Spark will create many small files and can cause a "too many open files" error during merging. This happens when using the sort-based shuffle. The issue is caused by multiple factors: 1. There seems to be a bug in setting the elementsRead variable in ExternalSorter, which renders the trackMemoryThreshold (defined in Spillable) useless for triggering spilling; the PR to fix it is https://github.com/apache/spark/pull/3302 2. The current ShuffleMemoryManager does not work well when there are 2 spillable objects in a thread, which are ExternalSorter and ExternalAppendOnlyMap (used by Aggregator) in this case. Here is an example: due to the use of map-side aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask for as much memory as it can, which is totalMem/numberOfThreads. Then later on, when ExternalSorter is created in the same thread, the ShuffleMemoryManager can refuse to allocate more memory to it, since the memory has already been given to the previously requesting object (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small files (due to the lack of memory). I'm currently working on a PR to address these two issues. It will include the following changes: 1. The ShuffleMemoryManager should not only track the memory usage for each thread, but also the object that holds the memory. 2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. In this way, if a new object in a thread is requesting memory, the old occupant can be evicted/spilled. This prevents problem 2 from happening. Previously, spillable objects trigger spilling by themselves, so one may not trigger spilling even if another object in the same thread needs more memory; after this change, the ShuffleMemoryManager can trigger the spilling of an object when it needs to. 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously, ExternalAppendOnlyMap returns a destructive iterator that cannot be spilled after the iterator is returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it. Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager Already made change 3 and have a prototype of changes 1 and 2 to evict spillables from the memory manager, still in progress. I will send a PR when it's done.
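To make changes 1 and 2 concrete, here is a very rough sketch of a manager that books memory per (thread, object) and can evict another holder in the same thread; every name here is hypothetical, derived from the description above rather than from the actual branch:
{noformat}
import scala.collection.mutable

// Hypothetical interface for anything that can release its memory to disk.
trait Spillable {
  def spill(): Unit
}

class ObjectTrackingMemoryManager(maxMemory: Long) {
  // Grants keyed by (thread, object) instead of by thread alone (change 1).
  private val grants = mutable.Map.empty[(Long, Spillable), Long]

  def tryToAcquire(requester: Spillable, numBytes: Long): Long = synchronized {
    val threadId = Thread.currentThread().getId
    var free = maxMemory - grants.values.sum
    if (free < numBytes) {
      // Change 2: evict other spillables owned by this thread instead of
      // letting the first allocator (the ExternalAppendOnlyMap above) starve
      // every later one (the ExternalSorter).
      for (((tid, holder), held) <- grants.toSeq
           if tid == threadId && holder != requester) {
        holder.spill()
        grants.remove((tid, holder))
        free += held
      }
    }
    val granted = math.min(numBytes, free)
    if (granted > 0) {
      val key = (threadId, requester)
      grants(key) = grants.getOrElse(key, 0L) + granted
    }
    granted
  }

  def release(holder: Spillable): Unit = synchronized {
    grants.retain { case ((_, h), _) => h != holder }
  }
}
{noformat}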
[jira] [Commented] (SPARK-4452) Enhance Sort-based Shuffle to avoid spilling small files
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214903#comment-14214903 ] Apache Spark commented on SPARK-4452: - User 'tsdeng' has created a pull request for this issue: https://github.com/apache/spark/pull/3302 Enhance Sort-based Shuffle to avoid spilling small files Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Reporter: tianshuo When an Aggregator is used with ExternalSorter in a task, Spark will create many small files and can cause a "too many open files" error during merging. This happens when using the sort-based shuffle. The issue is caused by multiple factors: 1. There seems to be a bug in setting the elementsRead variable in ExternalSorter, which renders the trackMemoryThreshold (defined in Spillable) useless for triggering spilling; the PR to fix it is https://github.com/apache/spark/pull/3302 2. The current ShuffleMemoryManager does not work well when there are 2 spillable objects in a thread, which are ExternalSorter and ExternalAppendOnlyMap (used by Aggregator) in this case. Here is an example: due to the use of map-side aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask for as much memory as it can, which is totalMem/numberOfThreads. Then later on, when ExternalSorter is created in the same thread, the ShuffleMemoryManager can refuse to allocate more memory to it, since the memory has already been given to the previously requesting object (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small files (due to the lack of memory). I'm currently working on a PR to address these two issues. It will include the following changes: 1. The ShuffleMemoryManager should not only track the memory usage for each thread, but also the object that holds the memory. 2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. In this way, if a new object in a thread is requesting memory, the old occupant can be evicted/spilled. This prevents problem 2 from happening. Previously, spillable objects trigger spilling by themselves, so one may not trigger spilling even if another object in the same thread needs more memory; after this change, the ShuffleMemoryManager can trigger the spilling of an object when it needs to. 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously, ExternalAppendOnlyMap returns a destructive iterator that cannot be spilled after the iterator is returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it. Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager Already made change 3 and have a prototype of changes 1 and 2 to evict spillables from the memory manager, still in progress. I will send a PR when it's done.
[jira] [Updated] (SPARK-4452) Enhance Sort-based Shuffle to avoid spilling small files
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tianshuo updated SPARK-4452: Description: When an Aggregator is used with ExternalSorter in a task, Spark will create many small files and can cause a "too many open files" error during merging. This happens when using the sort-based shuffle. The issue is caused by multiple factors: 1. There seems to be a bug in setting the elementsRead variable in ExternalSorter, which renders the trackMemoryThreshold (defined in Spillable) useless for triggering spilling; the PR to fix it is https://github.com/apache/spark/pull/3302 2. The current ShuffleMemoryManager does not work well when there are 2 spillable objects in a thread, which are ExternalSorter and ExternalAppendOnlyMap (used by Aggregator) in this case. Here is an example: due to the use of map-side aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask for as much memory as it can, which is totalMem/numberOfThreads. Then later on, when ExternalSorter is created in the same thread, the ShuffleMemoryManager can refuse to allocate more memory to it, since the memory has already been given to the previously requesting object (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small files (due to the lack of memory). I'm currently working on a PR to address these two issues. It will include the following changes: 1. The ShuffleMemoryManager should not only track the memory usage for each thread, but also the object that holds the memory. 2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. In this way, if a new object in a thread is requesting memory, the old occupant can be evicted/spilled. This prevents problem 2 from happening. Previously, spillable objects trigger spilling by themselves, so one may not trigger spilling even if another object in the same thread needs more memory; after this change, the ShuffleMemoryManager can trigger the spilling of an object when it needs to. 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously, ExternalAppendOnlyMap returns a destructive iterator that cannot be spilled after the iterator is returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it. Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager Already made change 3 and have a prototype of changes 1 and 2 to evict spillables from the memory manager, still in progress. I will send a PR when it's done. was: When an Aggregator is used with ExternalSorter in a task, spark will create many small files and could cause too many files open error during merging. This happens when using the sort-based shuffle. The issue is caused by multiple factors: 1. There seems to be a bug in setting the elementsRead variable in ExternalSorter, which renders the trackMemoryThreshold(defined in Spillable) useless for triggering spilling, the pr to fix it is https://github.com/apache/spark/pull/3302 2. Current ShuffleMemoryManager does not work well when there are 2 spillable objects in a thread, which are ExternalSorter and ExternalAppendOnlyMap(used by Aggregator) in this case. Here is an example: Due to the usage of mapside aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask as much memory as it can, which is totalMem/numberOfThreads. Then later on whe ExternalSorter is created in the same thread, the ShuffleMemoryManager could refuse to allocate more memory to it, since the memory is already given to the previous requested object(ExternalAppendOnlyMap). That causes the ExternalSorter keeps spilling small files(due to the lack of memory) I'm currently working on a PR to addresses these two issues. It will include following changes 1. The ShuffleMemoryManager should not only track the memory usage for each thread, but also the object who holds the memory 2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. In this way, if a new object in a thread is requesting memory, the old occupant could be evicted/spilled. This avoids problem 2 from happening. Previously spillable object triggers spilling by themself. So one may not trigger spilling even if another object in the same thread needs more memory. After this change The ShuffleMemoryManager could trigger the spilling of an object if it needs to 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously ExternalAppendOnlyMap returns an destructive iterator and can not be spilled after the iterator is returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it. Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager Already made change 3 and have a prototype of
[jira] [Updated] (SPARK-4452) Enhance Sort-based Shuffle to avoid spilling small files
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tianshuo updated SPARK-4452: Description: When an Aggregator is used with ExternalSorter in a task, Spark will create many small files and can cause a "too many open files" error during merging. This happens when using the sort-based shuffle. The issue is caused by multiple factors: 1. There seems to be a bug in setting the elementsRead variable in ExternalSorter, which renders the trackMemoryThreshold (defined in Spillable) useless for triggering spilling; the PR to fix it is https://github.com/apache/spark/pull/3302 2. The current ShuffleMemoryManager does not work well when there are 2 spillable objects in a thread, which are ExternalSorter and ExternalAppendOnlyMap (used by Aggregator) in this case. Here is an example: due to the use of map-side aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask for as much memory as it can, which is totalMem/numberOfThreads. Then later on, when ExternalSorter is created in the same thread, the ShuffleMemoryManager can refuse to allocate more memory to it, since the memory has already been given to the previously requesting object (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small files (due to the lack of memory). I'm currently working on a PR to address these two issues. It will include the following changes: 1. The ShuffleMemoryManager should not only track the memory usage for each thread, but also the object that holds the memory. 2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. In this way, if a new object in a thread is requesting memory, the old occupant can be evicted/spilled. This prevents problem 2 from happening. Previously, spillable objects trigger spilling by themselves, so one may not trigger spilling even if another object in the same thread needs more memory; after this change, the ShuffleMemoryManager can trigger the spilling of an object when it needs to. 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously, ExternalAppendOnlyMap returns a destructive iterator that cannot be spilled after the iterator is returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it. Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager Already made change 3 and have a prototype of changes 1 and 2 to evict spillables from the memory manager, still in progress. I will send a PR when it's done. Any feedback or thoughts on this change are highly appreciated! was: When an Aggregator is used with ExternalSorter in a task, spark will create many small files and could cause too many files open error during merging. This happens when using the sort-based shuffle. The issue is caused by multiple factors: 1. There seems to be a bug in setting the elementsRead variable in ExternalSorter, which renders the trackMemoryThreshold(defined in Spillable) useless for triggering spilling, the pr to fix it is https://github.com/apache/spark/pull/3302 2. Current ShuffleMemoryManager does not work well when there are 2 spillable objects in a thread, which are ExternalSorter and ExternalAppendOnlyMap(used by Aggregator) in this case. Here is an example: Due to the usage of mapside aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask as much memory as it can, which is totalMem/numberOfThreads. Then later on whe ExternalSorter is created in the same thread, the ShuffleMemoryManager could refuse to allocate more memory to it, since the memory is already given to the previous requested object(ExternalAppendOnlyMap). That causes the ExternalSorter keeps spilling small files(due to the lack of memory) I'm currently working on a PR to address these two issues. It will include following changes 1. The ShuffleMemoryManager should not only track the memory usage for each thread, but also the object who holds the memory 2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. In this way, if a new object in a thread is requesting memory, the old occupant could be evicted/spilled. This avoids problem 2 from happening. Previously spillable object triggers spilling by themself. So one may not trigger spilling even if another object in the same thread needs more memory. After this change The ShuffleMemoryManager could trigger the spilling of an object if it needs to 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously ExternalAppendOnlyMap returns an destructive iterator and can not be spilled after the iterator is returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it. Currently, I have a working branch in progress:
[jira] [Created] (SPARK-4453) Simplify Parquet record filter generation
Cheng Lian created SPARK-4453: - Summary: Simplify Parquet record filter generation Key: SPARK-4453 URL: https://issues.apache.org/jira/browse/SPARK-4453 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0, 1.0.0 Reporter: Cheng Lian The current Parquet record filter code uses {{CatalystFilter}} and its sub-classes to represent all Spark SQL Parquet filters. Essentially, these classes combine the original Catalyst predicate expression with the generated Parquet filter. {{ParquetFilters.findExpression}} then uses these classes to filter out all expressions that can be pushed down. However, this {{findExpression}} function is not necessary in the first place, since we already know whether a predicate can be pushed down or not while trying to generate its corresponding filter. With this in mind, the code size of Parquet record filter generation can be reduced significantly.
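Concretely, the simplification amounts to generating the Parquet filter directly and signalling pushability with Option, rather than wrapping predicates in {{CatalystFilter}} subclasses and rediscovering them with {{findExpression}}. A rough sketch of that shape, written against the 1.1-era Catalyst and parquet-mr APIs (the function itself is illustrative, not the actual patch):
{noformat}
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.types.IntegerType
import parquet.filter2.predicate.{FilterApi, FilterPredicate}

// Some(filter) when the predicate is pushable, None otherwise; no separate
// findExpression pass is needed.
def createFilter(predicate: Expression): Option[FilterPredicate] = predicate match {
  case LessThan(a: Attribute, Literal(v: Int, IntegerType)) =>
    Some(FilterApi.lt(FilterApi.intColumn(a.name), Int.box(v)))
  case And(left, right) =>
    for (l <- createFilter(left); r <- createFilter(right)) yield FilterApi.and(l, r)
  case _ => None // not pushable: evaluate in Spark after the scan
}
{noformat}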
[jira] [Commented] (SPARK-4213) SparkSQL - ParquetFilters - No support for LT, LTE, GT, GTE operators
[ https://issues.apache.org/jira/browse/SPARK-4213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214950#comment-14214950 ] Apache Spark commented on SPARK-4213: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/3317 SparkSQL - ParquetFilters - No support for LT, LTE, GT, GTE operators - Key: SPARK-4213 URL: https://issues.apache.org/jira/browse/SPARK-4213 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Environment: CDH5.2, Hive 0.13.1, Spark 1.2 snapshot (commit hash 76386e1a23c) Reporter: Terry Siu Priority: Blocker Fix For: 1.2.0 When I issue a hql query against a HiveContext where my predicate uses a column of string type with one of the LT, LTE, GT, or GTE operators, I get the following error:
{noformat}
scala.MatchError: StringType (of class org.apache.spark.sql.catalyst.types.StringType$)
{noformat}
Looking at the code in org.apache.spark.sql.parquet.ParquetFilters, StringType is absent from the corresponding functions for creating these filters. To reproduce, in a Hive 0.13.1 shell, I created the following table (at a specified DB):
{noformat}
create table sparkbug (
  id int,
  event string
) stored as parquet;
{noformat}
Insert some sample data:
{noformat}
insert into table sparkbug select 1, '2011-06-18' from <some table> limit 1;
insert into table sparkbug select 2, '2012-01-01' from <some table> limit 1;
{noformat}
Launch a spark shell and create a HiveContext to the metastore where the table above is located:
{noformat}
import org.apache.spark.sql._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)
hc.setConf("spark.sql.shuffle.partitions", "10")
hc.setConf("spark.sql.hive.convertMetastoreParquet", "true")
hc.setConf("spark.sql.parquet.compression.codec", "snappy")
import hc._
hc.hql("select * from db.sparkbug where event >= '2011-12-01'")
{noformat}
A scala.MatchError will appear in the output.
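The eventual fix is essentially to add the missing StringType cases, comparing strings as Parquet Binary values. A hedged sketch of the LT case using the parquet-mr filter2 API (the helper itself is illustrative, not the actual ParquetFilters code):
{noformat}
import parquet.filter2.predicate.{FilterApi, FilterPredicate}
import parquet.io.api.Binary

// LT over a string column; LTE/GT/GTE are analogous via ltEq/gt/gtEq.
def stringLessThan(columnName: String, value: String): FilterPredicate =
  FilterApi.lt(FilterApi.binaryColumn(columnName), Binary.fromString(value))
{noformat}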
[jira] [Commented] (SPARK-4453) Simplify Parquet record filter generation
[ https://issues.apache.org/jira/browse/SPARK-4453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214949#comment-14214949 ] Apache Spark commented on SPARK-4453: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/3317 Simplify Parquet record filter generation - Key: SPARK-4453 URL: https://issues.apache.org/jira/browse/SPARK-4453 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0, 1.1.0 Reporter: Cheng Lian The current Parquet record filter code uses {{CatalystFilter}} and its sub-classes to represent all Spark SQL Parquet filters. Essentially, these classes combine the original Catalyst predicate expression with the generated Parquet filter. {{ParquetFilters.findExpression}} then uses these classes to filter out all expressions that can be pushed down. However, this {{findExpression}} function is not necessary in the first place, since we already know whether a predicate can be pushed down or not while trying to generate its corresponding filter. With this in mind, the code size of Parquet record filter generation can be reduced significantly.
[jira] [Commented] (SPARK-4452) Enhance Sort-based Shuffle to avoid spilling small files
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214962#comment-14214962 ] tianshuo commented on SPARK-4452: - Originally, we found this problem by seeing a "too many open files" exception when running an SVD job on a big dataset, caused by ExternalSorter dumping too many small files. With the fixes mentioned above (with the prototype), the problem is resolved. I'm still working on finalizing the prototype. Enhance Sort-based Shuffle to avoid spilling small files Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Reporter: tianshuo When an Aggregator is used with ExternalSorter in a task, Spark will create many small files and can cause a "too many open files" error during merging. This happens when using the sort-based shuffle. The issue is caused by multiple factors: 1. There seems to be a bug in setting the elementsRead variable in ExternalSorter, which renders the trackMemoryThreshold (defined in Spillable) useless for triggering spilling; the PR to fix it is https://github.com/apache/spark/pull/3302 2. The current ShuffleMemoryManager does not work well when there are 2 spillable objects in a thread, which are ExternalSorter and ExternalAppendOnlyMap (used by Aggregator) in this case. Here is an example: due to the use of map-side aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask for as much memory as it can, which is totalMem/numberOfThreads. Then later on, when ExternalSorter is created in the same thread, the ShuffleMemoryManager can refuse to allocate more memory to it, since the memory has already been given to the previously requesting object (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small files (due to the lack of memory). I'm currently working on a PR to address these two issues. It will include the following changes: 1. The ShuffleMemoryManager should not only track the memory usage for each thread, but also the object that holds the memory. 2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. In this way, if a new object in a thread is requesting memory, the old occupant can be evicted/spilled. This prevents problem 2 from happening. Previously, spillable objects trigger spilling by themselves, so one may not trigger spilling even if another object in the same thread needs more memory; after this change, the ShuffleMemoryManager can trigger the spilling of an object when it needs to. 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously, ExternalAppendOnlyMap returns a destructive iterator that cannot be spilled after the iterator is returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it. Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager Already made change 3 and have a prototype of changes 1 and 2 to evict spillables from the memory manager, still in progress. I will send a PR when it's done. Any feedback or thoughts on this change are highly appreciated!
[jira] [Created] (SPARK-4454) Race condition in DAGScheduler
Rafal Kwasny created SPARK-4454: --- Summary: Race condition in DAGScheduler Key: SPARK-4454 URL: https://issues.apache.org/jira/browse/SPARK-4454 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: Rafal Kwasny Priority: Minor There seems to be a race condition in DAGScheduler that manifests on jobs with high concurrency:
{noformat}
Exception in thread "main" java.util.NoSuchElementException: key not found: 35
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:58)
    at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
    at org.apache.spark.scheduler.DAGScheduler.getCacheLocs(DAGScheduler.scala:201)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1292)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304)
    at org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1275)
    at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:937)
    at org.apache.spark.rdd.PartitionCoalescer.currPrefLocs(CoalescedRDD.scala:175)
    at org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:192)
    at org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:191)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
[jira] [Updated] (SPARK-4454) Race condition in DAGScheduler
[ https://issues.apache.org/jira/browse/SPARK-4454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rafal Kwasny updated SPARK-4454: Description: It seems to be a race condition in DAGScheduler that manifests on jobs with high concurrency: {noformat} Exception in thread main java.util.NoSuchElementException: key not found: 35 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.DAGScheduler.getCacheLocs(DAGScheduler.scala:201) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1292) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1307) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1306) at scala.collection.immutable.List.foreach(List.scala:318) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1306) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1304) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1304) at org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1275) at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:937) at org.apache.spark.rdd.PartitionCoalescer.currPrefLocs(CoalescedRDD.scala:175) at org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:192) at org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:191) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350) at
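The trace above bottoms out in DAGScheduler.getCacheLocs, where a shared mutable HashMap is read with a bare apply() while another thread may be mutating it. A minimal sketch of the usual remedy for this class of race, with illustrative names only (LocCache is not a Spark class):
{code}
import scala.collection.mutable

// Illustrative only: a shared rddId -> preferred-locations map that is safe to
// touch from both the scheduler event loop and job-submitting threads.
class LocCache {
  private val cacheLocs = mutable.HashMap[Int, Seq[String]]()

  // getOrElseUpdate under a lock avoids the bare apply() that throws
  // "key not found" when another thread removes the entry in between.
  def get(rddId: Int, compute: => Seq[String]): Seq[String] =
    cacheLocs.synchronized { cacheLocs.getOrElseUpdate(rddId, compute) }

  def invalidate(rddId: Int): Unit =
    cacheLocs.synchronized { cacheLocs.remove(rddId) }
}
{code}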
[jira] [Updated] (SPARK-2811) update algebird to 0.8.1
[ https://issues.apache.org/jira/browse/SPARK-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2811: --- Assignee: Adam Pingel update algebird to 0.8.1 Key: SPARK-2811 URL: https://issues.apache.org/jira/browse/SPARK-2811 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Anand Avati Assignee: Adam Pingel First algebird_2.11 0.8.1 has to be released -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2811) update algebird to 0.8.1
[ https://issues.apache.org/jira/browse/SPARK-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2811. Resolution: Fixed Fix Version/s: 1.2.0 update algebird to 0.8.1 Key: SPARK-2811 URL: https://issues.apache.org/jira/browse/SPARK-2811 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Anand Avati Assignee: Adam Pingel Fix For: 1.2.0 First algebird_2.11 0.8.1 has to be released -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4452) Enhance Sort-based Shuffle to avoid spilling small files
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tianshuo updated SPARK-4452: Description: When an Aggregator is used with ExternalSorter in a task, Spark will create many small files and can run into a too many open files error during merging. This happens when using the sort-based shuffle. The issue is caused by multiple factors: 1. There seems to be a bug in setting the elementsRead variable in ExternalSorter, which renders the trackMemoryThreshold (defined in Spillable) useless for triggering spilling; the PR to fix it is https://github.com/apache/spark/pull/3302 2. The current ShuffleMemoryManager does not work well when there are two spillable objects in a thread, which in this case are ExternalSorter and ExternalAppendOnlyMap (used by Aggregator). Here is an example: due to the use of map-side aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask for as much memory as it can, which is totalMem/numberOfThreads. When ExternalSorter is later created in the same thread, the ShuffleMemoryManager may refuse to allocate more memory to it, since the memory has already been given to the previously requesting object (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small files (due to the lack of memory). I'm currently working on a PR to address these two issues. It will include the following changes: 1. The ShuffleMemoryManager should track not only the memory usage of each thread, but also which object holds the memory. 2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. In this way, if a new object in a thread requests memory, the old occupant can be evicted/spilled, which prevents problem 2 from happening. Previously a spillable object triggered spilling by itself, so one object might not trigger spilling even if another object in the same thread needed more memory. After this change, the ShuffleMemoryManager can trigger the spilling of an object whenever it needs to. 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously ExternalAppendOnlyMap returned a destructive iterator and could not be spilled after the iterator was returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it. Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager Change 3 is already made, and a prototype of changes 1 and 2 (evicting a spillable from the memory manager) is in progress. I will send a PR when it's done. Any feedback or thoughts on this change are highly appreciated!
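A rough sketch of what changes 1 and 2 could look like, under assumed, simplified names (the Spillable trait and TrackingMemoryManager below are stand-ins, not the real Spark classes): the manager records which object holds each grant and, before refusing a new request, may call back into an older occupant to make it spill.
{code}
// Sketch only: per-object tracking (change 1) plus manager-triggered
// spilling of older occupants (change 2).
trait Spillable { def spill(): Long } // returns bytes freed

class TrackingMemoryManager(maxMemory: Long) {
  private var granted = Map.empty[Spillable, Long]

  def tryAcquire(who: Spillable, bytes: Long): Boolean = synchronized {
    var free = maxMemory - granted.values.sum
    val others = granted.keys.filter(_ != who).iterator
    while (free < bytes && others.hasNext) {
      val victim = others.next() // evict/spill the old occupant
      free += victim.spill()
      granted -= victim
    }
    if (free >= bytes) { granted += who -> (granted.getOrElse(who, 0L) + bytes); true }
    else false
  }
}
{code}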
[jira] [Created] (SPARK-4455) Exclude dependency on hbase-annotations module
Ted Yu created SPARK-4455: - Summary: Exclude dependency on hbase-annotations module Key: SPARK-4455 URL: https://issues.apache.org/jira/browse/SPARK-4455 Project: Spark Issue Type: Bug Reporter: Ted Yu As Patrick mentioned in the thread 'Has anyone else observed this build break?': The error I've seen is this when building the examples project: {code} spark-examples_2.10: Could not resolve dependencies for project org.apache.spark:spark-examples_2.10:jar:1.2.0-SNAPSHOT: Could not find artifact jdk.tools:jdk.tools:jar:1.7 at specified path /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/../lib/tools.jar {code} The reason for this error is that hbase-annotations uses a system-scoped dependency in its pom, and this doesn't work with certain JDK layouts such as the one provided on Mac OS: http://central.maven.org/maven2/org/apache/hbase/hbase-annotations/0.98.7-hadoop2/hbase-annotations-0.98.7-hadoop2.pom Since the hbase-annotations module is transitively brought in through other HBase modules, we should exclude it from the related modules. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
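For illustration, an exclusion of the kind the ticket proposes looks like this in a module's pom.xml; the hbase-client coordinates and the ${hbase.version} property below are assumptions, since the actual set of affected modules is in the linked PR:
{code}
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-client</artifactId>
  <version>${hbase.version}</version>
  <exclusions>
    <!-- Drop the transitive, system-scoped hbase-annotations dependency. -->
    <exclusion>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-annotations</artifactId>
    </exclusion>
  </exclusions>
</dependency>
{code}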
[jira] [Resolved] (SPARK-4444) Drop VD type parameter from EdgeRDD
[ https://issues.apache.org/jira/browse/SPARK-4444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-4444. Resolution: Fixed Fix Version/s: 1.2.0 Drop VD type parameter from EdgeRDD --- Key: SPARK-4444 URL: https://issues.apache.org/jira/browse/SPARK-4444 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Ankur Dave Assignee: Ankur Dave Priority: Blocker Fix For: 1.2.0 Due to vertex attribute caching, EdgeRDD previously took two type parameters: ED and VD. However, this is an implementation detail that should not be exposed in the interface, so this PR drops the VD type parameter. This requires removing the filter method from the EdgeRDD interface, because it depends on vertex attribute caching. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4409) Additional (but limited) Linear Algebra Utils
[ https://issues.apache.org/jira/browse/SPARK-4409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14214999#comment-14214999 ] Apache Spark commented on SPARK-4409: - User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/3319 Additional (but limited) Linear Algebra Utils - Key: SPARK-4409 URL: https://issues.apache.org/jira/browse/SPARK-4409 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Burak Yavuz Priority: Minor This ticket is to discuss the addition of a very limited number of local matrix manipulation and generation methods that would be helpful in the further development of algorithms on top of BlockMatrix (SPARK-3974), such as Randomized SVD, and Multi Model Training (SPARK-1486). The proposed methods for addition are: For `Matrix` - map: maps the values in the matrix with a given function. Produces a new matrix. - update: the values in the matrix are updated with a given function. Occurs in place. Factory methods for `DenseMatrix`: - *zeros: Generate a matrix consisting of zeros - *ones: Generate a matrix consisting of ones - *eye: Generate an identity matrix - *rand: Generate a matrix consisting of i.i.d. uniform random numbers - *randn: Generate a matrix consisting of i.i.d. Gaussian random numbers - *diag: Generate a diagonal matrix from a supplied vector *These methods already exist in the factory methods for `Matrices`; however, for cases where a `DenseMatrix` is required, you constantly have to add `.asInstanceOf[DenseMatrix]` everywhere, which makes the code dirtier. I propose moving these functions to factory methods for `DenseMatrix`, where the output will be a `DenseMatrix`, and having the factory methods for `Matrices` call these functions directly and output a generic `Matrix`. Factory methods for `SparseMatrix`: - speye: Identity matrix in sparse format. Saves a ton of memory when dimensions are large, especially in Multi Model Training, where each row requires being multiplied by a scalar. - sprand: Generate a sparse matrix with a given density consisting of i.i.d. uniform random numbers. - sprandn: Generate a sparse matrix with a given density consisting of i.i.d. Gaussian random numbers. - diag: Generate a diagonal matrix from a supplied vector, but memory efficient, because it just stores the diagonal. Again, very helpful in Multi Model Training. Factory methods for `Matrices`: - Include all the factory methods given above, but return a generic `Matrix` rather than `SparseMatrix` or `DenseMatrix`. - horzCat: Horizontally concatenate matrices to form one larger matrix. Very useful in both Multi Model Training, and for the repartitioning of BlockMatrix. - vertCat: Vertically concatenate matrices to form one larger matrix. Very useful for the repartitioning of BlockMatrix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
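A minimal sketch of the proposed factory-method layout, assuming MLlib's DenseMatrix(numRows, numCols, values) constructor; only zeros and eye are shown, and the object name is illustrative:
{code}
import org.apache.spark.mllib.linalg.DenseMatrix

// Sketch of the proposal (not the merged API): factories that return a
// DenseMatrix directly, so callers need no .asInstanceOf[DenseMatrix].
object DenseMatrixFactories {
  def zeros(numRows: Int, numCols: Int): DenseMatrix =
    new DenseMatrix(numRows, numCols, new Array[Double](numRows * numCols))

  def eye(n: Int): DenseMatrix = {
    val values = new Array[Double](n * n)
    var i = 0
    while (i < n) { values(i * n + i) = 1.0; i += 1 } // diagonal of a column-major square matrix
    new DenseMatrix(n, n, values)
  }
}
{code}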
[jira] [Commented] (SPARK-4452) Enhance Sort-based Shuffle to avoid spilling small files
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215018#comment-14215018 ] Sandy Ryza commented on SPARK-4452: --- I haven't thought the implications out fully, but it worries me that data structures wouldn't be in charge of their own spilling. It seems like for different data structures, different spilling patterns and sizes could be more efficient, and placing control in the hands of the shuffle memory manager negates this. Another fix (simpler but maybe less flexible) worth considering would be to define a static fraction of memory that goes to each of the aggregator and the external sorter. One way that could be implemented is that each time the aggregator gets memory, it allocates some for the sorter. Also, for SPARK-2926, we are working on a tiered merger that would avoid the 'too many open files' problem by only merging a limited number of files at once. Enhance Sort-based Shuffle to avoid spilling small files Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Reporter: tianshuo -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR
[ https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215019#comment-14215019 ] Ryan Williams commented on SPARK-3630: -- [~aash] I've not seen this since my previous reports; it could be that I just wasn't actually running 1.2.0. I'll let you know here if I see it again, thanks! Identify cause of Kryo+Snappy PARSING_ERROR --- Key: SPARK-3630 URL: https://issues.apache.org/jira/browse/SPARK-3630 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: Andrew Ash Assignee: Josh Rosen A recent GraphX commit caused non-deterministic exceptions in unit tests so it was reverted (see SPARK-3400). Separately, [~aash] observed the same exception stacktrace in an application-specific Kryo registrator: {noformat} com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to uncompress the chunk: PARSING_ERROR(2) com.esotericsoftware.kryo.io.Input.fill(Input.java:142) com.esotericsoftware.kryo.io.Input.require(Input.java:169) com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127) com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) ... {noformat} This ticket is to identify the cause of the exception in the GraphX commit so the faulty commit can be fixed and merged back into master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4434) spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
[ https://issues.apache.org/jira/browse/SPARK-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215022#comment-14215022 ] Apache Spark commented on SPARK-4434: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/3320 spark-submit cluster deploy mode JAR URLs are broken in 1.1.1 - Key: SPARK-4434 URL: https://issues.apache.org/jira/browse/SPARK-4434 Project: Spark Issue Type: Bug Components: Deploy, Spark Core Affects Versions: 1.1.1, 1.2.0 Reporter: Josh Rosen Assignee: Andrew Or Priority: Blocker When submitting a driver using {{spark-submit}} in cluster mode, Spark 1.1.0 allowed you to omit the {{file://}} or {{hdfs://}} prefix from the application JAR URL, e.g. {code} ./bin/spark-submit --deploy-mode cluster --master spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi /Users/joshrosen/Documents/old-spark-releases/spark-1.1.0-bin-hadoop1/lib/spark-examples-1.1.0-hadoop1.0.4.jar {code} In Spark 1.1.1 and 1.2.0, this same command now fails with an error: {code} ./bin/spark-submit --deploy-mode cluster --master spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi /Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar Jar url 'file:/Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar' is not in valid format. Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar) Usage: DriverClient [options] launch <active-master> <jar-url> <main-class> [driver options] Usage: DriverClient kill <active-master> <driver-id> {code} I tried changing my URL to conform to the new format, but this either resulted in an error or a job that failed: {code} ./bin/spark-submit --deploy-mode cluster --master spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar Jar url 'file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar' is not in valid format. Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar) {code} If I omit the extra slash: {code} ./bin/spark-submit --deploy-mode cluster --master spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar Sending launch command to spark://joshs-mbp.att.net:7077 Driver successfully submitted as driver-20141116143235-0002 ... waiting before polling master for driver state ... 
polling master for driver state State of driver-20141116143235-0002 is ERROR Exception from cluster was: java.lang.IllegalArgumentException: Wrong FS: file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar, expected: file:/// java.lang.IllegalArgumentException: Wrong FS: file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381) at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:55) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:393) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329) at org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:157) at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:74) {code} This bug effectively prevents users from using {{spark-submit}} in cluster mode to run drivers whose JARs are stored on shared cluster filesystems. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
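The three spellings behave differently because of how java.net.URI treats the authority component; a quick check (paths illustrative) shows why "file://Users/..." produces the "Wrong FS ... expected: file:///" error:
{code}
// In "file://Users/foo/app.jar", "Users" parses as the *host*, not a path segment.
val good = new java.net.URI("file:///Users/foo/app.jar")
println(good.getHost) // null
println(good.getPath) // /Users/foo/app.jar

val bad = new java.net.URI("file://Users/foo/app.jar")
println(bad.getHost)  // Users
println(bad.getPath)  // /foo/app.jar
{code}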
[jira] [Updated] (SPARK-4452) Enhance Sort-based Shuffle to avoid spilling small files
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-4452: -- Affects Version/s: 1.1.0 Enhance Sort-based Shuffle to avoid spilling small files Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: tianshuo -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4455) Exclude dependency on hbase-annotations module
[ https://issues.apache.org/jira/browse/SPARK-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215028#comment-14215028 ] Apache Spark commented on SPARK-4455: - User 'tedyu' has created a pull request for this issue: https://github.com/apache/spark/pull/3286 Exclude dependency on hbase-annotations module -- Key: SPARK-4455 URL: https://issues.apache.org/jira/browse/SPARK-4455 Project: Spark Issue Type: Bug Reporter: Ted Yu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1358) Continuous integrated test should be involved in Spark ecosystem
[ https://issues.apache.org/jira/browse/SPARK-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215042#comment-14215042 ] shane knapp commented on SPARK-1358: [~aash] -- depending on the hardware reqs, we might be able to make this happen. we have some extra hardware, so a small (3-5 server) dedicated spark e2e test cluster could be deployed. we're still in the process of building out our new infrastructure and will have a better idea of what we will have available for this really soon. Continuous integrated test should be involved in Spark ecosystem - Key: SPARK-1358 URL: https://issues.apache.org/jira/browse/SPARK-1358 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: xiajunluan Currently, Spark only contains unit tests and performance tests, but I think that is not enough for customers to evaluate the status of their cluster and the Spark version they will use; it is necessary to build continuous integrated tests for Spark development. These could include: 1. complex application test cases for Spark/Spark Streaming/GraphX 2. stress test cases 3. fault tolerance test cases 4.. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4452) Enhance Sort-based Shuffle to avoid spilling small files
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215047#comment-14215047 ] Sandy Ryza commented on SPARK-4452: --- A third possible fix would be to have the shuffle memory manager consider fairness in terms of spillable data structures instead of threads. Enhance Sort-based Shuffle to avoid spilling small files Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: tianshuo -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4456) Document why spilling depends on both elements read and memory used
Sandy Ryza created SPARK-4456: - Summary: Document why spilling depends on both elements read and memory used Key: SPARK-4456 URL: https://issues.apache.org/jira/browse/SPARK-4456 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Sandy Ryza Priority: Minor It's not clear to me why the number of records should have an effect on when a data structure spills. It would probably be useful to document this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
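For context, the check in Spillable at the time was roughly of the following shape (paraphrased from memory, not a verbatim quote); the element count both delays the first spill check and amortizes the cost of estimating memory usage, which is presumably what the documentation should spell out:
{code}
// Approximate shape of the spill decision (paraphrased, not verbatim):
def shouldSpill(elementsRead: Long, currentMemory: Long, myMemoryThreshold: Long): Boolean = {
  val trackMemoryThreshold = 1000L // don't even check for the first N elements
  elementsRead > trackMemoryThreshold &&
    elementsRead % 32 == 0 &&          // re-estimate memory only every 32 elements
    currentMemory >= myMemoryThreshold // spill only when over the current grant
}
{code}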
[jira] [Commented] (SPARK-4393) Memory leak in connection manager timeout thread
[ https://issues.apache.org/jira/browse/SPARK-4393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215054#comment-14215054 ] Apache Spark commented on SPARK-4393: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/3321 Memory leak in connection manager timeout thread Key: SPARK-4393 URL: https://issues.apache.org/jira/browse/SPARK-4393 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker Fix For: 1.2.0 This JIRA tracks a fix for a memory leak in ConnectionManager's TimerTasks, originally reported in [a comment|https://issues.apache.org/jira/browse/SPARK-3633?focusedCommentId=14208318page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14208318] on SPARK-3633. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4457) Document how to build for Hadoop versions greater than 2.4
Sandy Ryza created SPARK-4457: - Summary: Document how to build for Hadoop versions greater than 2.4 Key: SPARK-4457 URL: https://issues.apache.org/jira/browse/SPARK-4457 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.2.0 Reporter: Sandy Ryza Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
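To my recollection, the convention the docs ended up describing is to reuse the newest Hadoop profile and override the version property; treat the exact flags below as an assumption to be verified against the merged documentation:
{code}
# Build against a Hadoop version newer than 2.4 by reusing the hadoop-2.4
# profile and overriding the version property (flags assumed, verify in docs):
mvn -Phadoop-2.4 -Dhadoop.version=2.5.0 -DskipTests clean package
{code}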
[jira] [Commented] (SPARK-3717) DecisionTree, RandomForest: Partition by feature
[ https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215058#comment-14215058 ] Joseph K. Bradley commented on SPARK-3717: -- I took a look at the rowToColumnStore method. It may work, but I don't think it will scale well. The main problems are that it will create an enormous number of small objects, and that it does not take advantage of sparse vectors. It would be ideal to batch values and to maintain sparsity. I have an implementation almost done which I can push soon. DecisionTree, RandomForest: Partition by feature Key: SPARK-3717 URL: https://issues.apache.org/jira/browse/SPARK-3717 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley h1. Summary Currently, data are partitioned by row/instance for DecisionTree and RandomForest. This JIRA argues for partitioning by feature for training deep trees. This is especially relevant for random forests, which are often trained to be deeper than single decision trees. h1. Details Dataset dimensions and the depth of the tree to be trained are the main problem parameters determining whether it is better to partition features or instances. For random forests (training many deep trees), partitioning features could be much better. Notation: * P = # workers * N = # instances * M = # features * D = depth of tree h2. Partitioning Features Algorithm sketch: * Each worker stores: ** a subset of columns (i.e., a subset of features). If a worker stores feature j, then the worker stores the feature value for all instances (i.e., the whole column). ** all labels * Train one level at a time. * Invariants: ** Each worker stores a mapping: instance → node in current level * On each iteration: ** Each worker: For each node in level, compute (best feature to split, info gain). ** Reduce (P x M) values to M values to find best split for each node. ** Workers who have features used in best splits communicate left/right for relevant instances. Gather total of N bits to master, then broadcast. * Total communication: ** Depth D iterations ** On each iteration, reduce to M values (~8 bytes each), broadcast N values (1 bit each). ** Estimate: D * (M * 8 + N) h2. Partitioning Instances Algorithm sketch: * Train one group of nodes at a time. * Invariants: * Each worker stores a mapping: instance → node * On each iteration: ** Each worker: For each instance, add to aggregate statistics. ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes) *** (“# classes” is for classification. 3 for regression) ** Reduce aggregate. ** Master chooses best split for each node in group and broadcasts. * Local training: Once all instances for a node fit on one machine, it can be best to shuffle data and training subtrees locally. This can mean shuffling the entire dataset for each tree trained. * Summing over all iterations, reduce to total of: ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each) ** Estimate: 2^D * M * B * C * 8 h2. 
Comparing Partitioning Methods Partitioning features costs less than partitioning instances when: * D * (M * 8 + N) < 2^D * M * B * C * 8 * D * N < 2^D * M * B * C * 8 (assuming D * M * 8 is small compared to the right hand side) * N < [ 2^D * M * B * C * 8 ] / D Example: many instances: * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = 5) * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7 * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
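A quick check of the example arithmetic above, with all values taken straight from the ticket (levels = 6, depth = 5, so 2^depth = 32):
{code}
val N = 2000000L // instances
val M = 3500L    // features
val B = 100L     // bins
val C = 5L       // classes
val byFeatures  = 6 * (M * 8 + N)    // = 12,168,000  ~ 1.2 * 10^7
val byInstances = 32 * M * B * C * 8 // = 448,000,000 ~ 4.5 * 10^8
{code}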
[jira] [Commented] (SPARK-4457) Document how to build for Hadoop versions greater than 2.4
[ https://issues.apache.org/jira/browse/SPARK-4457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215060#comment-14215060 ] Apache Spark commented on SPARK-4457: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/3322 Document how to build for Hadoop versions greater than 2.4 -- Key: SPARK-4457 URL: https://issues.apache.org/jira/browse/SPARK-4457 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.2.0 Reporter: Sandy Ryza Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4434) spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
[ https://issues.apache.org/jira/browse/SPARK-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215061#comment-14215061 ] Apache Spark commented on SPARK-4434: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/3320 spark-submit cluster deploy mode JAR URLs are broken in 1.1.1 - Key: SPARK-4434 URL: https://issues.apache.org/jira/browse/SPARK-4434 Project: Spark Issue Type: Bug Components: Deploy, Spark Core Affects Versions: 1.1.1, 1.2.0 Reporter: Josh Rosen Assignee: Andrew Or Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4439) Expose RandomForest in Python
[ https://issues.apache.org/jira/browse/SPARK-4439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215067#comment-14215067 ] Apache Spark commented on SPARK-4439: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/3320 Expose RandomForest in Python - Key: SPARK-4439 URL: https://issues.apache.org/jira/browse/SPARK-4439 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Matei Zaharia -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4458) Skip compilation of test classes when using make-distribution
Tathagata Das created SPARK-4458: Summary: Skip compilation of test classes when using make-distribution Key: SPARK-4458 URL: https://issues.apache.org/jira/browse/SPARK-4458 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.1.0, 1.0.0 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4458) Skip compilation of test classes when using make-distribution
[ https://issues.apache.org/jira/browse/SPARK-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-4458: - The make-distribution script generates Spark distributions and therefore does not require building the test classes. Maven's -DskipTests only skips running the tests, but does not skip compiling them (see [Maven docs|http://maven.apache.org/surefire/maven-surefire-plugin/test-mojo.html#skipTests]). To skip compiling the tests as well (which reduces compilation time), we need to add -Dmaven.test.skip (again, see [Maven docs|http://maven.apache.org/surefire/maven-surefire-plugin/test-mojo.html#skip]). Skip compilation of test classes when using make-distribution -- Key: SPARK-4458 URL: https://issues.apache.org/jira/browse/SPARK-4458 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.0.0, 1.1.0 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
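The distinction in one line each (these are standard Maven/Surefire flags):
{code}
mvn -DskipTests clean package             # compiles test classes, skips running them
mvn -Dmaven.test.skip=true clean package  # skips compiling *and* running tests
{code}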
[jira] [Commented] (SPARK-4458) Skip compilation of test classes when using make-distribution
[ https://issues.apache.org/jira/browse/SPARK-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215087#comment-14215087 ] Apache Spark commented on SPARK-4458: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/3324 Skip compilation of test classes when using make-distribution -- Key: SPARK-4458 URL: https://issues.apache.org/jira/browse/SPARK-4458 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.0.0, 1.1.0 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4266) Avoid expensive JavaScript for StagePages with huge numbers of tasks
[ https://issues.apache.org/jira/browse/SPARK-4266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-4266: -- Assignee: Kay Ousterhout Avoid expensive JavaScript for StagePages with huge numbers of tasks Key: SPARK-4266 URL: https://issues.apache.org/jira/browse/SPARK-4266 Project: Spark Issue Type: Bug Components: Web UI Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Critical Some of the new JavaScript added to handle hiding metrics significantly slows the page load for stages with a lot of tasks (e.g., for a job with 10K tasks, it took over a minute for the page to finish loading in Chrome on my laptop). There are at least two issues here: (1) The new table-striping JavaScript is much slower than the old CSS. The fancier JavaScript is only needed for the stage summary table, so we should change the task table back to using CSS so that it doesn't slow the page load for jobs with lots of tasks. (2) The JavaScript associated with hiding metrics is expensive when jobs have lots of tasks, I think because the jQuery selectors have to traverse a much larger DOM. The ID selectors are much more efficient, so we should consider switching to these, and/or avoiding this code in additional-metrics.js: $("input:checkbox:not(:checked)").each(function() { var column = "table ." + $(this).attr("name"); $(column).hide(); }); by initially hiding the data when we generate the page in the render function instead, which should be easy to do. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4180) SparkContext constructor should throw exception if another SparkContext is already running
[ https://issues.apache.org/jira/browse/SPARK-4180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4180: --- Fix Version/s: 1.2.0 SparkContext constructor should throw exception if another SparkContext is already running -- Key: SPARK-4180 URL: https://issues.apache.org/jira/browse/SPARK-4180 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Josh Rosen Priority: Blocker Fix For: 1.2.0 Spark does not currently support multiple concurrently-running SparkContexts in the same JVM (see SPARK-2243). Therefore, SparkContext's constructor should throw an exception if there is an active SparkContext that has not been shut down via {{stop()}}. PySpark already does this, but the Scala SparkContext should do the same thing. The current behavior with multiple active contexts is unspecified / not understood and it may be the source of confusing errors (see the user error report in SPARK-4080, for example). This should be pretty easy to add: just add a {{activeSparkContext}} field to the SparkContext companion object and {{synchronize}} on it in the constructor and {{stop()}} methods; see PySpark's {{context.py}} file for an example of this approach. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
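A minimal sketch of the companion-object guard described above; the names (active, lock, SparkContextGuard) are illustrative, not the merged implementation:
{code}
// Sketch only: a JVM-wide guard consulted by the constructor and stop().
object SparkContextGuard {
  private val lock = new Object()
  private var active: Option[String] = None // stand-in for the running context

  def onConstruct(contextName: String): Unit = lock.synchronized {
    if (active.isDefined)
      throw new IllegalStateException(
        "Only one SparkContext may be running in this JVM; call stop() first")
    active = Some(contextName)
  }

  def onStop(): Unit = lock.synchronized { active = None }
}
{code}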
[jira] [Created] (SPARK-4459) JavaRDDLike.groupBy[K](f: JFunction[T, K]) may fail with typechecking errors
Alok Saldanha created SPARK-4459: Summary: JavaRDDLike.groupBy[K](f: JFunction[T, K]) may fail with typechecking errors Key: SPARK-4459 URL: https://issues.apache.org/jira/browse/SPARK-4459 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.1.0, 1.0.2 Reporter: Alok Saldanha I believe this issue is essentially the same as SPARK-668. Original error: {code} [ERROR] /Users/saldaal1/workspace/JavaSparkSimpleApp/src/main/java/SimpleApp.java:[29,105] no suitable method found for groupBy(org.apache.spark.api.java.function.Function<scala.Tuple2<java.lang.String,java.lang.Long>,java.lang.Long>) [ERROR] method org.apache.spark.api.java.JavaPairRDD.<K>groupBy(org.apache.spark.api.java.function.Function<scala.Tuple2<K,java.lang.Long>,K>) is not applicable [ERROR] (inferred type does not conform to equality constraint(s) {code} from core/src/main/scala/org/apache/spark/api/java/JavaRDDLike.scala {code} /** * Return an RDD of grouped elements. Each group consists of a key and a sequence of elements * mapping to that key. */ def groupBy[K](f: JFunction[T, K]): JavaPairRDD[K, JIterable[T]] = { implicit val ctagK: ClassTag[K] = fakeClassTag implicit val ctagV: ClassTag[JList[T]] = fakeClassTag JavaPairRDD.fromRDD(groupByResultToJava(rdd.groupBy(f)(fakeClassTag))) } {code} Then in core/src/main/scala/org/apache/spark/api/java/JavaPairRDD.scala: {code} class JavaPairRDD[K, V](val rdd: RDD[(K, V)]) (implicit val kClassTag: ClassTag[K], implicit val vClassTag: ClassTag[V]) extends JavaRDDLike[(K, V), JavaPairRDD[K, V]] { {code} The problem is that the type parameter T in JavaRDDLike is Tuple2[K,V], which means the combined signature for groupBy in the JavaPairRDD is {code} groupBy[K](f: JFunction[Tuple2[K,V], K]) {code} which imposes an unfortunate correlation between the Tuple2 and the return type of the grouping function, namely that the return type of the grouping function must be the same as the first type of the JavaPairRDD. If we compare the method signature to flatMap: {code} /** * Return a new RDD by first applying a function to all elements of this * RDD, and then flattening the results. */ def flatMap[U](f: FlatMapFunction[T, U]): JavaRDD[U] = { import scala.collection.JavaConverters._ def fn = (x: T) => f.call(x).asScala JavaRDD.fromRDD(rdd.flatMap(fn)(fakeClassTag[U]))(fakeClassTag[U]) } {code} we see there should be an easy fix by changing the type parameter of the groupBy function from K to U. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
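The suggested fix, sketched against the quoted JavaRDDLike code with a fresh type parameter U (mirroring the flatMap signature above); this is the reporter's proposal, not necessarily the change that was eventually merged:
{code}
// Decouple the grouping key from JavaPairRDD's K by introducing U:
def groupBy[U](f: JFunction[T, U]): JavaPairRDD[U, JIterable[T]] = {
  implicit val ctagU: ClassTag[U] = fakeClassTag
  implicit val ctagV: ClassTag[JList[T]] = fakeClassTag
  JavaPairRDD.fromRDD(groupByResultToJava(rdd.groupBy(f)(fakeClassTag)))
}
{code}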
[jira] [Commented] (SPARK-4459) JavaRDDLike.groupBy[K](f: JFunction[T, K]) may fail with typechecking errors
[ https://issues.apache.org/jira/browse/SPARK-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215202#comment-14215202 ] Alok Saldanha commented on SPARK-4459: -- I created a standalone gist to demonstrate the problem, please see https://gist.github.com/alokito/40878fc25af21984463f JavaRDDLike.groupBy[K](f: JFunction[T, K]) may fail with typechecking errors Key: SPARK-4459 URL: https://issues.apache.org/jira/browse/SPARK-4459 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.2, 1.1.0 Reporter: Alok Saldanha -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4452) Enhance Sort-based Shuffle to avoid spilling small files
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215222#comment-14215222 ] tianshuo commented on SPARK-4452: - Hi, [~sandyr]: Your concern about data structures wouldn' be in charge of their spilling is legit. That's why I'm trying to make a incremental change: 1. The data structure still asks the ShuffleMemoryManager and decides if to spill itself. 2. But ShuffleMemoryManager can also trigger the spill of an object if the memory quota of a thread is used up. 2 happens as a last resort when memory is not enough for the requesting object. Also as you mentioned in the third solution, if the shuffle manager consider fairness among objs,it has to have a way to trigger the spilling of an object in a situation where current allocation is not fair. The memory manager has more of a global knowledge about memory allocation, so giving spilling ability to the manager could lead to more optimal memory allocation. The the spilling can only be triggered from the object itself, like currently, one obj may not be aware of the memory usage of other objs and keep holding the memory. My point is the data structure should be able to trigger spilling by itself, but it should also be able to handle when shuffleManager asks it to spill. I'm also considering the obj can reject to spill itself do address the concern you mentioned. Enhance Sort-based Shuffle to avoid spilling small files Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: tianshuo When an Aggregator is used with ExternalSorter in a task, spark will create many small files and could cause too many files open error during merging. This happens when using the sort-based shuffle. The issue is caused by multiple factors: 1. There seems to be a bug in setting the elementsRead variable in ExternalSorter, which renders the trackMemoryThreshold(defined in Spillable) useless for triggering spilling, the pr to fix it is https://github.com/apache/spark/pull/3302 2. Current ShuffleMemoryManager does not work well when there are 2 spillable objects in a thread, which are ExternalSorter and ExternalAppendOnlyMap(used by Aggregator) in this case. Here is an example: Due to the usage of mapside aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask as much memory as it can, which is totalMem/numberOfThreads. Then later on when ExternalSorter is created in the same thread, the ShuffleMemoryManager could refuse to allocate more memory to it, since the memory is already given to the previous requested object(ExternalAppendOnlyMap). That causes the ExternalSorter keeps spilling small files(due to the lack of memory) I'm currently working on a PR to address these two issues. It will include following changes 1. The ShuffleMemoryManager should not only track the memory usage for each thread, but also the object who holds the memory 2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. In this way, if a new object in a thread is requesting memory, the old occupant could be evicted/spilled. This avoids problem 2 from happening. Previously spillable object triggers spilling by themself. So one may not trigger spilling even if another object in the same thread needs more memory. After this change The ShuffleMemoryManager could trigger the spilling of an object if it needs to 3. Make the iterator of ExternalAppendOnlyMap spillable. 
Previously, ExternalAppendOnlyMap returned a destructive iterator and could not be spilled after the iterator was returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it. Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager It already implements change 3 and has a prototype of changes 1 and 2 to evict spillables from the memory manager; work is still in progress. I will send a PR when it's done. Any feedback or thoughts on this change are highly appreciated! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
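To make the proposal concrete, here is a minimal sketch of changes 1 and 2: per-object bookkeeping plus manager-initiated spilling as a last resort. All names here (Spillable, PerObjectMemoryManager, tryToAcquire) are hypothetical and only illustrate the idea; they are not Spark's actual ShuffleMemoryManager API.
{code}
import scala.collection.mutable

// Hypothetical interface: an owner that can be asked to free memory.
// Returning 0 means the object declined to spill (the "reject" option
// discussed in the comment above).
trait Spillable {
  def spill(): Long
}

// Tracks memory per owning object (not just per thread) and, when a new
// request cannot be satisfied, asks earlier occupants to spill.
class PerObjectMemoryManager(maxMemory: Long) {
  private val held = mutable.LinkedHashMap.empty[Spillable, Long]

  def tryToAcquire(owner: Spillable, numBytes: Long): Long = synchronized {
    var free = maxMemory - held.values.sum
    // Last resort: ask older occupants to spill until the request fits.
    for (victim <- held.keys.filter(_ ne owner).toList if free < numBytes) {
      val released = victim.spill()
      if (released > 0) {
        held(victim) = math.max(0L, held(victim) - released)
        free += released
      }
    }
    val granted = math.max(0L, math.min(numBytes, free))
    if (granted > 0) held(owner) = held.getOrElse(owner, 0L) + granted
    granted
  }

  def releaseAll(owner: Spillable): Unit = synchronized { held.remove(owner) }
}
{code}
In this shape, an ExternalAppendOnlyMap that grabbed totalMem/numberOfThreads early could be forced to shrink when an ExternalSorter on the same thread later asks for memory, which is exactly the starvation case described above.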
[jira] [Commented] (SPARK-4452) Enhance Sort-based Shuffle to avoid spilling small files
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215233#comment-14215233 ] tianshuo commented on SPARK-4452: - Currently, the two instances of Spillable, ExternalSorter and ExternalAppendOnlyMap, do not seem to have complex logic for deciding when to spill; they both ask for as much memory as they can get. I also think this general strategy of asking for as much memory as possible applies to most cases, so I would guess it is better if the memory manager can trigger the spilling. Also, would making a spillable able to reject a spill request address your concern? Enhance Sort-based Shuffle to avoid spilling small files Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: tianshuo When an Aggregator is used with ExternalSorter in a task, Spark will create many small files and can cause a "too many open files" error during merging. This happens when using the sort-based shuffle. The issue is caused by multiple factors: 1. There seems to be a bug in setting the elementsRead variable in ExternalSorter, which renders the trackMemoryThreshold (defined in Spillable) useless for triggering spilling; the PR to fix it is https://github.com/apache/spark/pull/3302 2. The current ShuffleMemoryManager does not work well when there are two spillable objects in a thread, which in this case are ExternalSorter and ExternalAppendOnlyMap (used by Aggregator). Here is an example: because of map-side aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask for as much memory as it can get, which is totalMem/numberOfThreads. When ExternalSorter is later created in the same thread, the ShuffleMemoryManager can refuse to allocate more memory to it, since the memory has already been given to the previously requesting object (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small files (due to the lack of memory). I'm currently working on a PR to address these two issues. It will include the following changes: 1. The ShuffleMemoryManager should track not only the memory usage of each thread, but also which object holds the memory. 2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. That way, if a new object in a thread requests memory, the old occupant can be evicted/spilled, which prevents problem 2 from happening. Previously, spillable objects triggered spilling by themselves, so one might not spill even when another object in the same thread needed more memory. After this change, the ShuffleMemoryManager can trigger the spilling of an object whenever it needs to. 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously, ExternalAppendOnlyMap returned a destructive iterator and could not be spilled after the iterator was returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it. Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager It already implements change 3 and has a prototype of changes 1 and 2 to evict spillables from the memory manager; work is still in progress. I will send a PR when it's done. Any feedback or thoughts on this change are highly appreciated! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4460) RandomForest classification uses wrong threshold
Joseph K. Bradley created SPARK-4460: Summary: RandomForest classification uses wrong threshold Key: SPARK-4460 URL: https://issues.apache.org/jira/browse/SPARK-4460 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley RandomForest was modified to use WeightedEnsembleModel, but it needs to use a different threshold for prediction than GradientBoosting does. Fix: WeightedEnsembleModel.scala:70 should use threshold 0.0 for boosting and 0.5 for forests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
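The distinction matters because the two ensembles aggregate differently. The sketch below is illustrative only (a hypothetical predict helper, not the actual WeightedEnsembleModel.scala code): boosting thresholds a weighted sum of real-valued tree outputs at 0.0, while a forest thresholds the average 0/1 vote at 0.5.
{code}
object EnsemblePredictSketch {
  // treePredictions: per-tree outputs; treeWeights: per-tree weights.
  def predict(treePredictions: Seq[Double],
              treeWeights: Seq[Double],
              isBoosting: Boolean): Double = {
    val weightedSum =
      treePredictions.zip(treeWeights).map { case (p, w) => p * w }.sum
    if (isBoosting) {
      if (weightedSum > 0.0) 1.0 else 0.0 // boosting: threshold at 0.0
    } else {
      val avgVote = weightedSum / treeWeights.sum // forest: mean vote in [0, 1]
      if (avgVote > 0.5) 1.0 else 0.0 // forest: threshold at 0.5
    }
  }
}
{code}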
[jira] [Comment Edited] (SPARK-4452) Enhance Sort-based Shuffle to avoid spilling small files
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215222#comment-14215222 ] tianshuo edited comment on SPARK-4452 at 11/17/14 10:00 PM: Hi, [~sandyr]: Your concern that data structures would no longer be in charge of their own spilling is legitimate. That's why I'm trying to make an incremental change: 1. The data structure still asks the ShuffleMemoryManager for memory and decides whether to spill itself. 2. But the ShuffleMemoryManager can also trigger the spill of an object when the memory quota of a thread is used up. Case 2 happens only as a last resort, when there is not enough memory for the requesting object. Also, as you mentioned in the third solution, if the shuffle manager considers fairness among objects, it has to have a way to trigger the spilling of an object when the current allocation is not fair. The memory manager has more global knowledge about memory allocation, so giving the manager the ability to spill could lead to better memory allocation. If spilling can only be triggered by the object itself, as it is today, one object may not be aware of the memory usage of other objects and may keep holding memory. My point is that a data structure should be able to trigger spilling by itself, but it should also be able to handle the ShuffleMemoryManager asking it to spill. I'm also considering letting an object reject a request to spill itself, to address the concern you mentioned. was (Author: tianshuo): Hi, [~sandyr]: Your concern about data structures wouldn' be in charge of their spilling is legit. That's why I'm trying to make a incremental change: 1. The data structure still asks the ShuffleMemoryManager and decides if to spill itself. 2. But ShuffleMemoryManager can also trigger the spill of an object if the memory quota of a thread is used up. 2 happens as a last resort when memory is not enough for the requesting object. Also as you mentioned in the third solution, if the shuffle manager consider fairness among objs,it has to have a way to trigger the spilling of an object in a situation where current allocation is not fair. The memory manager has more of a global knowledge about memory allocation, so giving spilling ability to the manager could lead to more optimal memory allocation. The the spilling can only be triggered from the object itself, like currently, one obj may not be aware of the memory usage of other objs and keep holding the memory. My point is the data structure should be able to trigger spilling by itself, but it should also be able to handle when shuffleManager asks it to spill. I'm also considering the obj can reject to spill itself do address the concern you mentioned. Enhance Sort-based Shuffle to avoid spilling small files Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: tianshuo When an Aggregator is used with ExternalSorter in a task, Spark will create many small files and can cause a "too many open files" error during merging. This happens when using the sort-based shuffle. The issue is caused by multiple factors: 1. There seems to be a bug in setting the elementsRead variable in ExternalSorter, which renders the trackMemoryThreshold (defined in Spillable) useless for triggering spilling; the PR to fix it is https://github.com/apache/spark/pull/3302 2. The current ShuffleMemoryManager does not work well when there are two spillable objects in a thread, which in this case are ExternalSorter and ExternalAppendOnlyMap (used by Aggregator). Here is an example: because of map-side aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask for as much memory as it can get, which is totalMem/numberOfThreads. When ExternalSorter is later created in the same thread, the ShuffleMemoryManager can refuse to allocate more memory to it, since the memory has already been given to the previously requesting object (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small files (due to the lack of memory). I'm currently working on a PR to address these two issues. It will include the following changes: 1. The ShuffleMemoryManager should track not only the memory usage of each thread, but also which object holds the memory. 2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. That way, if a new object in a thread requests memory, the old occupant can be evicted/spilled, which prevents problem 2 from happening. Previously, spillable objects triggered spilling by themselves, so one might not spill even when another object in the same thread needed more memory. After this change, the ShuffleMemoryManager can trigger the spilling of an object whenever it needs to. 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously, ExternalAppendOnlyMap returned a destructive iterator and could not be spilled after the iterator was returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it. Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager It already implements change 3 and has a prototype of changes 1 and 2 to evict spillables from the memory manager; work is still in progress. I will send a PR when it's done. Any feedback or thoughts on this change are highly appreciated! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-4452: -- Summary: Shuffle data structures can starve others on the same thread for memory (was: Enhance Sort-based Shuffle to avoid spilling small files) Shuffle data structures can starve others on the same thread for memory Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: tianshuo When an Aggregator is used with ExternalSorter in a task, Spark will create many small files and can cause a "too many open files" error during merging. This happens when using the sort-based shuffle. The issue is caused by multiple factors: 1. There seems to be a bug in setting the elementsRead variable in ExternalSorter, which renders the trackMemoryThreshold (defined in Spillable) useless for triggering spilling; the PR to fix it is https://github.com/apache/spark/pull/3302 2. The current ShuffleMemoryManager does not work well when there are two spillable objects in a thread, which in this case are ExternalSorter and ExternalAppendOnlyMap (used by Aggregator). Here is an example: because of map-side aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask for as much memory as it can get, which is totalMem/numberOfThreads. When ExternalSorter is later created in the same thread, the ShuffleMemoryManager can refuse to allocate more memory to it, since the memory has already been given to the previously requesting object (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small files (due to the lack of memory). I'm currently working on a PR to address these two issues. It will include the following changes: 1. The ShuffleMemoryManager should track not only the memory usage of each thread, but also which object holds the memory. 2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. That way, if a new object in a thread requests memory, the old occupant can be evicted/spilled, which prevents problem 2 from happening. Previously, spillable objects triggered spilling by themselves, so one might not spill even when another object in the same thread needed more memory. After this change, the ShuffleMemoryManager can trigger the spilling of an object whenever it needs to. 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously, ExternalAppendOnlyMap returned a destructive iterator and could not be spilled after the iterator was returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it. Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager It already implements change 3 and has a prototype of changes 1 and 2 to evict spillables from the memory manager; work is still in progress. I will send a PR when it's done. Any feedback or thoughts on this change are highly appreciated! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215269#comment-14215269 ] Sandy Ryza commented on SPARK-4452: --- Updated the title to reflect the specific problem. Shuffle data structures can starve others on the same thread for memory Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: tianshuo When an Aggregator is used with ExternalSorter in a task, Spark will create many small files and can cause a "too many open files" error during merging. This happens when using the sort-based shuffle. The issue is caused by multiple factors: 1. There seems to be a bug in setting the elementsRead variable in ExternalSorter, which renders the trackMemoryThreshold (defined in Spillable) useless for triggering spilling; the PR to fix it is https://github.com/apache/spark/pull/3302 2. The current ShuffleMemoryManager does not work well when there are two spillable objects in a thread, which in this case are ExternalSorter and ExternalAppendOnlyMap (used by Aggregator). Here is an example: because of map-side aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask for as much memory as it can get, which is totalMem/numberOfThreads. When ExternalSorter is later created in the same thread, the ShuffleMemoryManager can refuse to allocate more memory to it, since the memory has already been given to the previously requesting object (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small files (due to the lack of memory). I'm currently working on a PR to address these two issues. It will include the following changes: 1. The ShuffleMemoryManager should track not only the memory usage of each thread, but also which object holds the memory. 2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. That way, if a new object in a thread requests memory, the old occupant can be evicted/spilled, which prevents problem 2 from happening. Previously, spillable objects triggered spilling by themselves, so one might not spill even when another object in the same thread needed more memory. After this change, the ShuffleMemoryManager can trigger the spilling of an object whenever it needs to. 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously, ExternalAppendOnlyMap returned a destructive iterator and could not be spilled after the iterator was returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it. Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager It already implements change 3 and has a prototype of changes 1 and 2 to evict spillables from the memory manager; work is still in progress. I will send a PR when it's done. Any feedback or thoughts on this change are highly appreciated! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4434) spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
[ https://issues.apache.org/jira/browse/SPARK-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215279#comment-14215279 ] Apache Spark commented on SPARK-4434: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/3326 spark-submit cluster deploy mode JAR URLs are broken in 1.1.1 - Key: SPARK-4434 URL: https://issues.apache.org/jira/browse/SPARK-4434 Project: Spark Issue Type: Bug Components: Deploy, Spark Core Affects Versions: 1.1.1, 1.2.0 Reporter: Josh Rosen Assignee: Andrew Or Priority: Blocker When submitting a driver using {{spark-submit}} in cluster mode, Spark 1.1.0 allowed you to omit the {{file://}} or {{hdfs://}} prefix from the application JAR URL, e.g.
{code}
./bin/spark-submit --deploy-mode cluster --master spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi /Users/joshrosen/Documents/old-spark-releases/spark-1.1.0-bin-hadoop1/lib/spark-examples-1.1.0-hadoop1.0.4.jar
{code}
In Spark 1.1.1 and 1.2.0, this same command now fails with an error:
{code}
./bin/spark-submit --deploy-mode cluster --master spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi /Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
Jar url 'file:/Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar' is not in valid format.
Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)
Usage: DriverClient [options] launch <active-master> <jar-url> <main-class> [driver options]
Usage: DriverClient kill <active-master> <driver-id>
{code}
I tried changing my URL to conform to the new format, but this either resulted in an error or a job that failed:
{code}
./bin/spark-submit --deploy-mode cluster --master spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
Jar url 'file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar' is not in valid format.
Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)
{code}
If I omit the extra slash:
{code}
./bin/spark-submit --deploy-mode cluster --master spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
Sending launch command to spark://joshs-mbp.att.net:7077
Driver successfully submitted as driver-20141116143235-0002
... waiting before polling master for driver state
... polling master for driver state
State of driver-20141116143235-0002 is ERROR
Exception from cluster was: java.lang.IllegalArgumentException: Wrong FS: file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar, expected: file:///
java.lang.IllegalArgumentException: Wrong FS: file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar, expected: file:///
	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381)
	at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:55)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:393)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329)
	at org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:157)
	at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:74)
{code}
This bug effectively prevents users from using {{spark-submit}} in cluster mode to run drivers whose JARs are stored on shared cluster filesystems. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
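One plausible shape for a fix, purely as a sketch under the assumption that all three local spellings should be accepted (JarUrlNormalizer is a hypothetical helper, not the actual patch in the PR above), is to canonicalize the slash count before validation:
{code}
object JarUrlNormalizer {
  // Accepts file:/path, file://path, and file:///path and rewrites them all
  // to the canonical file:///absolute/path form; other schemes pass through.
  def normalize(url: String): String = {
    val filePrefix = "^file:/{1,3}".r
    filePrefix.findFirstIn(url) match {
      case Some(_) => "file://" + filePrefix.replaceFirstIn(url, "/")
      case None    => url // e.g. hdfs:// URLs are left untouched
    }
  }
}
{code}
For example, JarUrlNormalizer.normalize("file:/a/b.jar") and normalize("file:///a/b.jar") both yield "file:///a/b.jar", which would avoid the "Wrong FS" failure shown above.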
[jira] [Commented] (SPARK-4459) JavaRDDLike.groupBy[K](f: JFunction[T, K]) may fail with typechecking errors
[ https://issues.apache.org/jira/browse/SPARK-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215301#comment-14215301 ] Apache Spark commented on SPARK-4459: - User 'alokito' has created a pull request for this issue: https://github.com/apache/spark/pull/3327 JavaRDDLike.groupBy[K](f: JFunction[T, K]) may fail with typechecking errors Key: SPARK-4459 URL: https://issues.apache.org/jira/browse/SPARK-4459 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.2, 1.1.0 Reporter: Alok Saldanha I believe this issue is essentially the same as SPARK-668. Original error:
{code}
[ERROR] /Users/saldaal1/workspace/JavaSparkSimpleApp/src/main/java/SimpleApp.java:[29,105] no suitable method found for groupBy(org.apache.spark.api.java.function.Function<scala.Tuple2<java.lang.String,java.lang.Long>,java.lang.Long>)
[ERROR] method org.apache.spark.api.java.JavaPairRDD.<K>groupBy(org.apache.spark.api.java.function.Function<scala.Tuple2<K,java.lang.Long>,K>) is not applicable
[ERROR] (inferred type does not conform to equality constraint(s)
{code}
from core/src/main/scala/org/apache/spark/api/java/JavaRDDLike.scala
{code}
/**
 * Return an RDD of grouped elements. Each group consists of a key and a sequence of elements
 * mapping to that key.
 */
def groupBy[K](f: JFunction[T, K]): JavaPairRDD[K, JIterable[T]] = {
  implicit val ctagK: ClassTag[K] = fakeClassTag
  implicit val ctagV: ClassTag[JList[T]] = fakeClassTag
  JavaPairRDD.fromRDD(groupByResultToJava(rdd.groupBy(f)(fakeClassTag)))
}
{code}
Then in core/src/main/scala/org/apache/spark/api/java/JavaPairRDD.scala:
{code}
class JavaPairRDD[K, V](val rdd: RDD[(K, V)])
    (implicit val kClassTag: ClassTag[K], implicit val vClassTag: ClassTag[V])
  extends JavaRDDLike[(K, V), JavaPairRDD[K, V]] {
{code}
The problem is that the type parameter T in JavaRDDLike is Tuple2[K,V], which means the combined signature for groupBy in the JavaPairRDD is
{code}
groupBy[K](f: JFunction[Tuple2[K,V], K])
{code}
which imposes an unfortunate correlation between the Tuple2 and the return type of the grouping function, namely that the return type of the grouping function must be the same as the first type of the JavaPairRDD. If we compare the method signature to flatMap:
{code}
/**
 * Return a new RDD by first applying a function to all elements of this
 * RDD, and then flattening the results.
 */
def flatMap[U](f: FlatMapFunction[T, U]): JavaRDD[U] = {
  import scala.collection.JavaConverters._
  def fn = (x: T) => f.call(x).asScala
  JavaRDD.fromRDD(rdd.flatMap(fn)(fakeClassTag[U]))(fakeClassTag[U])
}
{code}
we see there should be an easy fix by changing the type parameter of the groupBy function from K to U. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
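The typing problem can be reproduced without Spark at all. In the standalone sketch below (a hypothetical PairColl class, chosen only to mirror JavaPairRDD's shape), reusing the container's key type K for groupBy forces the grouping function to return K, while a fresh parameter U, the fix proposed above, removes the constraint:
{code}
class PairColl[K, V](val data: Seq[(K, V)]) {
  // Broken shape, mirroring the current groupBy: the grouping function's
  // result type is forced to be the pair's first type K.
  def groupByBad(f: ((K, V)) => K): Map[K, Seq[(K, V)]] = data.groupBy(f)

  // Fixed shape, mirroring flatMap: any result key type U is allowed.
  def groupByGood[U](f: ((K, V)) => U): Map[U, Seq[(K, V)]] = data.groupBy(f)
}

object GroupByDemo extends App {
  val pairs = new PairColl(Seq(("a", 1L), ("b", 2L)))
  // pairs.groupByBad(p => p._2)  // does not compile: Long is not String
  println(pairs.groupByGood(p => p._2)) // fine: groups keyed by Long
}
{code}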
[jira] [Created] (SPARK-4461) Spark should not rely on mapred-site.xml for classpath
Zhan Zhang created SPARK-4461: - Summary: Spark should not rely on mapred-site.xml for classpath Key: SPARK-4461 URL: https://issues.apache.org/jira/browse/SPARK-4461 Project: Spark Issue Type: Bug Components: YARN Reporter: Zhan Zhang Currently Spark reads mapred-site.xml to get the classpath. As of Hadoop 2.6, the library is shipped to the cluster with the distributed cache at run-time and may not be available on every node manager. Instead of relying on mapred-site.xml, Spark should handle this on its own, for example through ADD_JARs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4395) Running a Spark SQL SELECT command from PySpark causes a hang for ~ 1 hour
[ https://issues.apache.org/jira/browse/SPARK-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215327#comment-14215327 ] Davies Liu commented on SPARK-4395: --- Workaround: remove cache(), or cache() after inferSchema():
{code}
ratings_base_RDD = sc.textFile().map(xxx)
schemaRatings = sqlContext.inferSchema(ratings_base_RDD).cache()
schemaRatings.registerTempTable("RatingsTable")
sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect()
{code}
Running a Spark SQL SELECT command from PySpark causes a hang for ~ 1 hour -- Key: SPARK-4395 URL: https://issues.apache.org/jira/browse/SPARK-4395 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Environment: version 1.2.0-SNAPSHOT Reporter: Sameer Farooqui When I run this command it hangs for one to many hours and then finally returns with successful results: sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect() Note, the lab environment below is still active, so let me know if you'd like to just access it directly. +++ My Environment +++ - 1-node cluster in Amazon - RedHat 6.5 64-bit - java version 1.7.0_67 - SBT version: sbt-0.13.5 - Scala version: scala-2.11.2 Ran: sudo yum -y update; git clone https://github.com/apache/spark; sudo sbt assembly +++ Data file used +++ http://blueplastic.com/databricks/movielens/ratings.dat
{code}
>>> import re
>>> import string
>>> from pyspark.sql import SQLContext, Row
>>> sqlContext = SQLContext(sc)
>>> RATINGS_PATTERN = '^(\d+)::(\d+)::(\d+)::(\d+)'
>>> def parse_ratings_line(line):
...     match = re.search(RATINGS_PATTERN, line)
...     if match is None:
...         # Optionally, you can change this to just ignore if each line of data is not critical.
...         raise ValueError("Invalid logline: %s" % line)
...     return Row(
...         UserID    = int(match.group(1)),
...         MovieID   = int(match.group(2)),
...         Rating    = int(match.group(3)),
...         Timestamp = int(match.group(4)))
...
>>> ratings_base_RDD = (sc.textFile("file:///home/ec2-user/movielens/ratings.dat")
...                     # Call the parse_ratings_line function on each line.
...                     .map(parse_ratings_line)
...                     # Caches the objects in memory since they will be queried multiple times.
...                     .cache())
>>> ratings_base_RDD.count()
1000209
>>> ratings_base_RDD.first()
Row(MovieID=1193, Rating=5, Timestamp=978300760, UserID=1)
>>> schemaRatings = sqlContext.inferSchema(ratings_base_RDD)
>>> schemaRatings.registerTempTable("RatingsTable")
>>> sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect()
{code}
(Now the Python shell hangs...) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212852#comment-14212852 ] Anson Abraham edited comment on SPARK-1867 at 11/17/14 10:40 PM: - Yes. I added 3 data nodes just for this, and I had 2 of them as my worker nodes and the other as my master. Still getting that issue. Also, the jar files were all supplied by Cloudera, and all of this was done through Cloudera Manager parcels. I installed Spark as standalone on CDH 5.2 as stated above, so the jars have to be the same. But just in case, I rsync'd them across the machines and am still hitting this issue. This is all occurring when running through spark-shell, of course. was (Author: ansonism): Yes. i added 3 data nodes just for this. And I had 2 of them as my worker node and the other as my master. Still getting that issue. Also the jar files were all supplied by Cloudera. Spark Documentation Error causes java.lang.IllegalStateException: unread block data --- Key: SPARK-1867 URL: https://issues.apache.org/jira/browse/SPARK-1867 Project: Spark Issue Type: Bug Reporter: sam I've employed two System Administrators on a contract basis (for quite a bit of money), and both contractors have independently hit the following exception. What we are doing is: 1. Installing Spark 0.9.1 according to the documentation on the website, along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs. 2. Building a fat jar with a Spark app with sbt, then trying to run it on the cluster. When I've Googled this, there seem to be two somewhat vague responses: a) Mismatching spark versions on nodes/user code b) Need to add more jars to the SparkConf Now I know that (b) is not the problem, having successfully run the same code on other clusters while only including one jar (it's a fat jar). But I have no idea how to check for (a) - it appears Spark doesn't have any version checks or anything - it would be nice if it checked versions and threw a mismatching-version exception: you have user code using version X and node Y has version Z. I would be very grateful for advice on this.
The exception:
Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 32 times (most recent failure: Exception failure: java.lang.IllegalStateException: unread block data)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
	at akka.actor.ActorCell.invoke(ActorCell.scala:456)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
	at akka.dispatch.Mailbox.run(Mailbox.scala:219)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to java.lang.IllegalStateException: unread block data [duplicate 59]
My code snippet:
val conf = new SparkConf()
  .setMaster(clusterMaster)
  .setAppName(appName)
  .setSparkHome(sparkHome)
  .setJars(SparkContext.jarOfClass(this.getClass))
println("count = " + new SparkContext(conf).textFile(someHdfsPath).count())
My SBT dependencies:
// relevant
"org.apache.spark" % "spark-core_2.10" % "0.9.1",
[jira] [Updated] (SPARK-4461) Spark should not rely on mapred-site.xml for classpath
[ https://issues.apache.org/jira/browse/SPARK-4461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-4461: -- Description: Currently Spark reads mapred-site.xml to get the classpath. As of Hadoop 2.6, the library is shipped to the cluster with the distributed cache at run-time and may not be available on every node manager. Instead of relying on mapred-site.xml, Spark should handle this on its own, for example through ADD_JARs, SPARK_CLASSPATH, etc. was: Currently spark read mapred-site.xml to get the class path. From hadoop 2.6, the library is shipped to cluster with distributed cache at run-time, and may not be available at every node manager. Instead of relying on mapred-site.xml, spark should handle this by its own, for example, through ADD_JARs. Spark should not rely on mapred-site.xml for classpath Key: SPARK-4461 URL: https://issues.apache.org/jira/browse/SPARK-4461 Project: Spark Issue Type: Bug Components: YARN Reporter: Zhan Zhang Currently Spark reads mapred-site.xml to get the classpath. As of Hadoop 2.6, the library is shipped to the cluster with the distributed cache at run-time and may not be available on every node manager. Instead of relying on mapred-site.xml, Spark should handle this on its own, for example through ADD_JARs, SPARK_CLASSPATH, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
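A hedged sketch of the suggested direction follows: deriving the extra classpath from Spark-owned settings instead of parsing mapred-site.xml. The config key spark.executor.extraClassPath and the SPARK_CLASSPATH variable are real Spark settings; the helper itself is hypothetical.
{code}
import org.apache.spark.SparkConf

object ClasspathFromSparkSettings {
  // Collect classpath entries from Spark's own configuration rather than
  // from Hadoop's mapred-site.xml.
  def extraClasspath(conf: SparkConf): Seq[String] = {
    val fromConf = conf.getOption("spark.executor.extraClassPath").toSeq
    val fromEnv  = sys.env.get("SPARK_CLASSPATH").toSeq
    (fromConf ++ fromEnv)
      .flatMap(_.split(java.io.File.pathSeparator))
      .filter(_.nonEmpty)
      .distinct
  }
}
{code}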
[jira] [Updated] (SPARK-2087) Clean Multi-user semantics for thrift JDBC/ODBC server.
[ https://issues.apache.org/jira/browse/SPARK-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2087: Target Version/s: 1.3.0 (was: 1.2.0) Clean Multi-user semantics for thrift JDBC/ODBC server. --- Key: SPARK-2087 URL: https://issues.apache.org/jira/browse/SPARK-2087 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Zongheng Yang Priority: Minor Configuration and temporary tables should exist per-user. Cached tables should be shared across users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4338) Remove yarn-alpha support
[ https://issues.apache.org/jira/browse/SPARK-4338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4338: - Assignee: Sandy Ryza Remove yarn-alpha support - Key: SPARK-4338 URL: https://issues.apache.org/jira/browse/SPARK-4338 Project: Spark Issue Type: Sub-task Components: YARN Reporter: Sandy Ryza Assignee: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4462) flume-sink build broken in SBT
Michael Armbrust created SPARK-4462: --- Summary: flume-sink build broken in SBT Key: SPARK-4462 URL: https://issues.apache.org/jira/browse/SPARK-4462 Project: Spark Issue Type: Bug Components: Streaming Reporter: Michael Armbrust Assignee: Tathagata Das
{code}
$ sbt streaming-flume-sink/compile
Using /Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home as default JAVA_HOME.
Note, this will be overridden by -java-home if it is set.
[info] Loading project definition from /Users/marmbrus/workspace/spark/project/project
[info] Loading project definition from /Users/marmbrus/.sbt/0.13/staging/ad8e8574a5bcb2d22d23/sbt-pom-reader/project
[warn] Multiple resolvers having different access mechanism configured with same name 'sbt-plugin-releases'. To avoid conflict, Remove duplicate project resolvers (`resolvers`) or rename publishing resolver (`publishTo`).
[warn] There may be incompatibilities among your library dependencies.
[warn] Here are some of the libraries that were evicted:
[warn] 	* com.typesafe.sbt:sbt-git:0.6.1 -> 0.6.2
[warn] 	* com.typesafe.sbt:sbt-site:0.7.0 -> 0.7.1
[warn] Run 'evicted' to see detailed eviction warnings
[info] Loading project definition from /Users/marmbrus/workspace/spark/project
[warn] There may be incompatibilities among your library dependencies.
[warn] Here are some of the libraries that were evicted:
[warn] 	* org.apache.maven.wagon:wagon-provider-api:1.0-beta-6 -> 2.2
[warn] Run 'evicted' to see detailed eviction warnings
NOTE: SPARK_HIVE is deprecated, please use -Phive and -Phive-thriftserver flags.
Enabled default scala profile
[info] Set current project to spark-parent (in build file:/Users/marmbrus/workspace/spark/)
[warn] There may be incompatibilities among your library dependencies.
[warn] Here are some of the libraries that were evicted:
[warn] 	* com.google.guava:guava:10.0.1 -> 14.0.1
[warn] Run 'evicted' to see detailed eviction warnings
[info] Compiling 5 Scala sources and 3 Java sources to /Users/marmbrus/workspace/spark/external/flume-sink/target/scala-2.10/classes...
[error] /Users/marmbrus/workspace/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/Logging.scala:19: object slf4j is not a member of package org
[error] import org.slf4j.{Logger, LoggerFactory}
[error]            ^
[error] /Users/marmbrus/workspace/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/Logging.scala:29: not found: type Logger
[error]   @transient private var log_ : Logger = null
[error]                                 ^
[error] /Users/marmbrus/workspace/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/Logging.scala:32: not found: type Logger
[error]   protected def log: Logger = {
[error]                      ^
[error] /Users/marmbrus/workspace/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/Logging.scala:40: not found: value LoggerFactory
[error]       log_ = LoggerFactory.getLogger(className)
[error]              ^
[error] /Users/marmbrus/workspace/spark/external/flume-sink/target/scala-2.10/src_managed/main/compiled_avro/org/apache/spark/streaming/flume/sink/EventBatch.java:9: object specific is not a member of package org.apache.avro
[error] public class EventBatch extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord {
[error]                                                 ^
[error] /Users/marmbrus/workspace/spark/external/flume-sink/target/scala-2.10/src_managed/main/compiled_avro/org/apache/spark/streaming/flume/sink/EventBatch.java:9: object specific is not a member of package org.apache.avro
[error] public class EventBatch extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord {
[error]                                                 ^
[error] /Users/marmbrus/workspace/spark/external/flume-sink/target/scala-2.10/src_managed/main/compiled_avro/org/apache/spark/streaming/flume/sink/SparkSinkEvent.java:9: object specific is not a member of package org.apache.avro
[error] public class SparkSinkEvent extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord {
[error]                                                 ^
[error] /Users/marmbrus/workspace/spark/external/flume-sink/target/scala-2.10/src_managed/main/compiled_avro/org/apache/spark/streaming/flume/sink/SparkSinkEvent.java:9: object specific is not a member of package org.apache.avro
[error] public class SparkSinkEvent extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord {
[error]
[jira] [Updated] (SPARK-3184) Allow user to specify num tasks to use for a table
[ https://issues.apache.org/jira/browse/SPARK-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3184: Target Version/s: 1.3.0 (was: 1.2.0) Allow user to specify num tasks to use for a table -- Key: SPARK-3184 URL: https://issues.apache.org/jira/browse/SPARK-3184 Project: Spark Issue Type: Improvement Components: SQL Reporter: Andy Konwinski Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4443) Statistics bug for external table in spark sql hive
[ https://issues.apache.org/jira/browse/SPARK-4443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4443: Priority: Critical (was: Major) Statistics bug for external table in spark sql hive --- Key: SPARK-4443 URL: https://issues.apache.org/jira/browse/SPARK-4443 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Priority: Critical Fix For: 1.2.0 When a table is external, the `totalSize` is always zero, which influences the join strategy (a broadcast join is always used for external tables). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
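The planner decision this bug corrupts is easy to state: broadcast only when the size estimate is both known (non-zero) and under the threshold. A one-line illustrative sketch follows (a hypothetical helper, not the actual Spark SQL planner code):
{code}
object BroadcastDecisionSketch {
  // A zero totalSize should be treated as "unknown", not "tiny".
  def shouldBroadcast(totalSizeBytes: Long, thresholdBytes: Long): Boolean =
    totalSizeBytes > 0 && totalSizeBytes <= thresholdBytes
}
{code}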
[jira] [Updated] (SPARK-2873) OOM happens when group by and join operation with big data
[ https://issues.apache.org/jira/browse/SPARK-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2873: Target Version/s: 1.3.0 (was: 1.2.0) OOM happens when group by and join operation with big data --- Key: SPARK-2873 URL: https://issues.apache.org/jira/browse/SPARK-2873 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: guowei -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4074) No exception for drop nonexistent table
[ https://issues.apache.org/jira/browse/SPARK-4074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4074: Target Version/s: 1.3.0 (was: 1.2.0) No exception for drop nonexistent table --- Key: SPARK-4074 URL: https://issues.apache.org/jira/browse/SPARK-4074 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Michael Armbrust Assignee: Michael Armbrust -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3720) support ORC in spark sql
[ https://issues.apache.org/jira/browse/SPARK-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3720. - Resolution: Duplicate support ORC in spark sql Key: SPARK-3720 URL: https://issues.apache.org/jira/browse/SPARK-3720 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Attachments: orc.diff The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data on HDFS. The ORC file format has many advantages, such as:
1. a single file as the output of each task, which reduces the NameNode's load
2. Hive type support including datetime, decimal, and the complex types (struct, list, map, and union)
3. light-weight indexes stored within the file: skip row groups that don't pass predicate filtering; seek to a given row
4. block-mode compression based on data type: run-length encoding for integer columns, dictionary encoding for string columns
5. concurrent reads of the same file using separate RecordReaders
6. ability to split files without scanning for markers
7. a bounded amount of memory needed for reading or writing
8. metadata stored using Protocol Buffers, which allows addition and removal of fields
Spark SQL currently supports Parquet; supporting ORC would give people more options. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2472) Spark SQL Thrift server sometimes assigns wrong job group name
[ https://issues.apache.org/jira/browse/SPARK-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2472: Target Version/s: 1.3.0 (was: 1.2.0) Spark SQL Thrift server sometimes assigns wrong job group name -- Key: SPARK-2472 URL: https://issues.apache.org/jira/browse/SPARK-2472 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Cheng Lian Priority: Minor Sample beeline session used to reproduce this issue:
{code}
0: jdbc:hive2://localhost:1> drop table test;
+---------+
| result  |
+---------+
+---------+
No rows selected (0.614 seconds)
0: jdbc:hive2://localhost:1> create table hive_table_copy as select * from hive_table;
+------+--------+
| key  | value  |
+------+--------+
+------+--------+
No rows selected (0.493 seconds)
0
{code}
The second statement results in two stages; the first stage is labeled with the first {{drop table}} statement rather than the CTAS statement. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3298) [SQL] registerAsTable / registerTempTable overwrites old tables
[ https://issues.apache.org/jira/browse/SPARK-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3298: Assignee: (was: Michael Armbrust) [SQL] registerAsTable / registerTempTable overwrites old tables --- Key: SPARK-3298 URL: https://issues.apache.org/jira/browse/SPARK-3298 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: Evan Chan Priority: Minor Labels: newbie At least in Spark 1.0.2, calling registerAsTable("a") when "a" has already been registered does not cause an error. However, there is no way to access the old table, even though it may be cached and taking up space. How about at least throwing an error? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3298) [SQL] registerAsTable / registerTempTable overwrites old tables
[ https://issues.apache.org/jira/browse/SPARK-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3298: Target Version/s: 1.3.0 (was: 1.2.0) [SQL] registerAsTable / registerTempTable overwrites old tables --- Key: SPARK-3298 URL: https://issues.apache.org/jira/browse/SPARK-3298 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: Evan Chan Priority: Minor Labels: newbie At least in Spark 1.0.2, calling registerAsTable("a") when "a" has already been registered does not cause an error. However, there is no way to access the old table, even though it may be cached and taking up space. How about at least throwing an error? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
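A minimal sketch of the behavior the reporter asks for, a registry that throws on duplicate registration instead of silently overwriting. TempTableRegistry is hypothetical and only illustrates the check; it is not SQLContext's actual catalog.
{code}
import scala.collection.mutable

class TempTableRegistry[T] {
  private val tables = mutable.Map.empty[String, T]

  // Reject duplicate names instead of silently replacing the old table,
  // which may still be cached and taking up space.
  def register(name: String, table: T): Unit = tables.synchronized {
    if (tables.contains(name))
      throw new IllegalArgumentException(s"Temp table '$name' already exists")
    tables(name) = table
  }

  def lookup(name: String): Option[T] = tables.synchronized(tables.get(name))
}
{code}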
[jira] [Updated] (SPARK-2554) CountDistinct and SumDistinct should do partial aggregation
[ https://issues.apache.org/jira/browse/SPARK-2554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2554: Target Version/s: 1.3.0 (was: 1.2.0) CountDistinct and SumDistinct should do partial aggregation --- Key: SPARK-2554 URL: https://issues.apache.org/jira/browse/SPARK-2554 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.1, 1.0.2 Reporter: Cheng Lian Priority: Minor {{CountDistinct}} and {{SumDistinct}} should first do a partial aggregation and return unique value sets in each partition as partial results. Shuffle IO can be greatly reduced in cases where there are only a few unique values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
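The idea can be demonstrated with the plain RDD API. In this sketch (illustrative of the proposed partial aggregation, not the Catalyst implementation), each partition first reduces to its local set of distinct values, so the shuffle only carries those sets:
{code}
import org.apache.spark.{SparkConf, SparkContext}

object PartialCountDistinctSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("partial-count-distinct").setMaster("local[2]"))
    val values = sc.parallelize(Seq(1, 1, 2, 3, 3, 3), numSlices = 2)

    // Partial aggregation: one small set of distinct values per partition.
    val partials = values.mapPartitions(it => Iterator(it.toSet))

    // Merge phase: shuffle volume is bounded by the number of distinct
    // values, not by the number of rows.
    println(partials.reduce(_ union _).size) // prints 3
    sc.stop()
  }
}
{code}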
[jira] [Updated] (SPARK-3379) Implement 'POWER' for sql
[ https://issues.apache.org/jira/browse/SPARK-3379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3379: Target Version/s: 1.3.0 (was: 1.2.0) Implement 'POWER' for sql - Key: SPARK-3379 URL: https://issues.apache.org/jira/browse/SPARK-3379 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.2, 1.1.0 Environment: All Reporter: Xinyun Huang Priority: Minor Fix For: 1.2.0 Original Estimate: 0h Remaining Estimate: 0h Add support for the mathematical function POWER within Spark SQL. Split from SPARK-3176. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4269) Make wait time in BroadcastHashJoin configurable
[ https://issues.apache.org/jira/browse/SPARK-4269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4269: Target Version/s: 1.3.0 (was: 1.2.0) Make wait time in BroadcastHashJoin configurable Key: SPARK-4269 URL: https://issues.apache.org/jira/browse/SPARK-4269 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Jacky Li Fix For: 1.2.0 In BroadcastHashJoin, a hard-coded value (5 minutes) is currently used as the time to wait for the execution and broadcast of the small table. In my opinion, it should be a configurable value, since the broadcast may exceed 5 minutes in some cases, such as in a busy or congested network environment. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
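A sketch of the requested change: read the wait time from SparkConf rather than hard-coding five minutes. The key name spark.sql.broadcastTimeout is the one Spark later adopted for this setting, but treat the snippet as illustrative, not the actual BroadcastHashJoin code.
{code}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import org.apache.spark.SparkConf

object BroadcastWaitSketch {
  // Wait for the broadcast future, with a configurable timeout in seconds
  // (defaulting to the current hard-coded 5 minutes).
  def awaitBroadcast[T](conf: SparkConf, broadcastWork: Future[T]): T = {
    val timeoutSec = conf.getInt("spark.sql.broadcastTimeout", 5 * 60)
    Await.result(broadcastWork, timeoutSec.seconds)
  }
}
{code}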
[jira] [Updated] (SPARK-3955) Different versions between jackson-mapper-asl and jackson-core-asl
[ https://issues.apache.org/jira/browse/SPARK-3955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3955: Component/s: Build Different versions between jackson-mapper-asl and jackson-core-asl -- Key: SPARK-3955 URL: https://issues.apache.org/jira/browse/SPARK-3955 Project: Spark Issue Type: Bug Components: Build, Spark Core, SQL Affects Versions: 1.1.0 Reporter: Jongyoul Lee The parent pom.xml specifies a version of jackson-mapper-asl, which is used by sql/hive/pom.xml. When mvn assembly runs, however, the jackson-mapper-asl version is not the same as jackson-core-asl's. This is because other libraries use several versions of jackson, so another version of jackson-core-asl is assembled. Simply put, this problem is fixed if pom.xml carries specific version information for jackson-core-asl. If it's not set, version 1.9.11 is merged into assembly.jar and we cannot use the jackson library properly. {code} [INFO] Including org.codehaus.jackson:jackson-mapper-asl:jar:1.8.8 in the shaded jar. [INFO] Including org.codehaus.jackson:jackson-core-asl:jar:1.9.11 in the shaded jar. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3955) Different versions between jackson-mapper-asl and jackson-core-asl
[ https://issues.apache.org/jira/browse/SPARK-3955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3955: Target Version/s: 1.3.0 (was: 1.1.1, 1.2.0) Different versions between jackson-mapper-asl and jackson-core-asl -- Key: SPARK-3955 URL: https://issues.apache.org/jira/browse/SPARK-3955 Project: Spark Issue Type: Bug Components: Build, Spark Core, SQL Affects Versions: 1.1.0 Reporter: Jongyoul Lee The parent pom.xml specifies a version of jackson-mapper-asl, which is used by sql/hive/pom.xml. When mvn assembly runs, however, the jackson-mapper-asl version is not the same as jackson-core-asl's. This is because other libraries use several versions of jackson, so another version of jackson-core-asl is assembled. Simply put, this problem is fixed if pom.xml carries specific version information for jackson-core-asl. If it's not set, version 1.9.11 is merged into assembly.jar and we cannot use the jackson library properly. {code} [INFO] Including org.codehaus.jackson:jackson-mapper-asl:jar:1.8.8 in the shaded jar. [INFO] Including org.codehaus.jackson:jackson-core-asl:jar:1.9.11 in the shaded jar. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2551) Cleanup FilteringParquetRowInputFormat
[ https://issues.apache.org/jira/browse/SPARK-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2551: Target Version/s: 1.3.0 (was: 1.2.0) Cleanup FilteringParquetRowInputFormat -- Key: SPARK-2551 URL: https://issues.apache.org/jira/browse/SPARK-2551 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.1, 1.0.2 Reporter: Cheng Lian Priority: Minor To work around [PARQUET-16|https://issues.apache.org/jira/browse/PARQUET-16] and fix [SPARK-2119|https://issues.apache.org/jira/browse/SPARK-2119], we did some reflection hacks in {{FilteringParquetRowInputFormat}}. This should be cleaned up once PARQUET-16 is fixed. A PR for PARQUET-16 is [here|https://github.com/apache/incubator-parquet-mr/pull/17]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2449) Spark sql reflection code requires a constructor taking all the fields for the table
[ https://issues.apache.org/jira/browse/SPARK-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2449: Target Version/s: 1.3.0 (was: 1.2.0) Spark sql reflection code requires a constructor taking all the fields for the table Key: SPARK-2449 URL: https://issues.apache.org/jira/browse/SPARK-2449 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Ian O Connell The reflection code looks up the fields passed to the constructor to derive the types for the table. Specifically, the code: val params = t.member(nme.CONSTRUCTOR).asMethod.paramss in ScalaReflection.scala. Simple repro case from the spark shell:
trait PersonTrait extends Product
case class Person(a: Int) extends PersonTrait
val l: List[PersonTrait] = List(1, 2, 3, 4).map(Person(_))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
sc.parallelize(l).registerAsTable("people")
scala> sc.parallelize(l).registerAsTable("people")
scala.ScalaReflectionException: none is not a method
	at scala.reflect.api.Symbols$SymbolApi$class.asMethod(Symbols.scala:279)
	at scala.reflect.internal.Symbols$SymbolContextApiImpl.asMethod(Symbols.scala:73)
	at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:52)
	at
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4453) Simplify Parquet record filter generation
[ https://issues.apache.org/jira/browse/SPARK-4453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4453: Assignee: Cheng Lian Simplify Parquet record filter generation - Key: SPARK-4453 URL: https://issues.apache.org/jira/browse/SPARK-4453 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0, 1.1.0 Reporter: Cheng Lian Assignee: Cheng Lian The current Parquet record filter code uses {{CatalystFilter}} and its sub-classes to represent all Spark SQL Parquet filters. Essentially, these classes combine the original Catalyst predicate expression with the generated Parquet filter. {{ParquetFilters.findExpression}} then uses these classes to filter out all expressions that can be pushed down. However, this {{findExpression}} function is not necessary in the first place, since we already know whether a predicate can be pushed down while generating its corresponding filter. With this in mind, the code size of Parquet record filter generation can be reduced significantly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2178) createSchemaRDD is not thread safe
[ https://issues.apache.org/jira/browse/SPARK-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2178: Target Version/s: 1.3.0 (was: 1.2.0) createSchemaRDD is not thread safe -- Key: SPARK-2178 URL: https://issues.apache.org/jira/browse/SPARK-2178 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust This is because implicit type tags are not thread safe. We could fix this with compile time macros (which could also make the conversion a lot faster). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
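A common 2.10-era workaround for this class of problem (a sketch of the general pattern, not necessarily the fix Spark shipped) is to serialize all reflection-based schema derivation behind one lock, since Scala 2.10 runtime reflection and the implicit TypeTags it produces are not thread-safe:
{code}
object ReflectionGuard {
  private val lock = new Object

  // Run any TypeTag / runtime-reflection dependent derivation under a
  // single global lock so concurrent callers cannot corrupt shared state.
  def withReflectionLock[T](body: => T): T = lock.synchronized(body)
}
{code}
A caller would wrap the conversion, e.g. ReflectionGuard.withReflectionLock { createSchemaRDD(rdd) }; the compile-time macro approach mentioned above would avoid both the lock and the runtime cost.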
[jira] [Updated] (SPARK-4453) Simplify Parquet record filter generation
[ https://issues.apache.org/jira/browse/SPARK-4453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4453: Priority: Critical (was: Major) Simplify Parquet record filter generation - Key: SPARK-4453 URL: https://issues.apache.org/jira/browse/SPARK-4453 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0, 1.1.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Critical The current Parquet record filter code uses {{CatalystFilter}} and its subclasses to represent all Spark SQL Parquet filters. Essentially, these classes combine the original Catalyst predicate expression with the generated Parquet filter. {{ParquetFilters.findExpression}} then uses these classes to filter out all expressions that can be pushed down. However, the {{findExpression}} function is unnecessary in the first place, since we already know whether a predicate can be pushed down while generating its corresponding filter. With this in mind, the code size of Parquet record filter generation can be reduced significantly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4266) Avoid expensive JavaScript for StagePages with huge numbers of tasks
[ https://issues.apache.org/jira/browse/SPARK-4266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215391#comment-14215391 ] Apache Spark commented on SPARK-4266: - User 'kayousterhout' has created a pull request for this issue: https://github.com/apache/spark/pull/3328 Avoid expensive JavaScript for StagePages with huge numbers of tasks Key: SPARK-4266 URL: https://issues.apache.org/jira/browse/SPARK-4266 Project: Spark Issue Type: Bug Components: Web UI Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Critical Some of the new JavaScript added to handle hiding metrics significantly slows the page load for stages with a lot of tasks (e.g., for a job with 10K tasks, it took over a minute for the page to finish loading in Chrome on my laptop). There are at least two issues here: (1) The new table-striping JavaScript is much slower than the old CSS. The fancier JavaScript is only needed for the stage summary table, so we should change the task table back to using CSS so that it doesn't slow the page load for jobs with lots of tasks. (2) The JavaScript associated with hiding metrics is expensive when jobs have lots of tasks, I think because the jQuery selectors have to traverse a much larger DOM. ID selectors are much more efficient, so we should consider switching to those, and/or avoiding this code in additional-metrics.js:
{code}
$("input:checkbox:not(:checked)").each(function() {
    var column = "table ." + $(this).attr("name");
    $(column).hide();
});
{code}
by initially hiding the data when we generate the page in the render function instead, which should be easy to do. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
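As one illustration of suggestion (2), a minimal sketch of render-time hiding (hypothetical names, not Spark's actual UI code, though Spark's stage page is indeed rendered server-side in Scala): emitting the off-by-default columns with display: none when the HTML is generated means the browser never runs a jQuery pass over the DOM on load.
{code}
// Hedged sketch: decide visibility while generating the HTML instead of
// hiding columns with JavaScript after the page loads.
object StagePageRenderSketch {
  case class Metric(name: String, shownByDefault: Boolean)

  // Render a header cell; hidden-by-default metrics arrive already hidden,
  // so no expensive selector pass is needed on page load.
  def headerCell(m: Metric): scala.xml.Elem =
    if (m.shownByDefault) {
      <th class={m.name}>{m.name}</th>
    } else {
      <th class={m.name} style="display: none;">{m.name}</th>
    }

  def main(args: Array[String]): Unit = {
    val metrics = Seq(Metric("Duration", true), Metric("GC Time", false))
    metrics.map(headerCell).foreach(println)
  }
}
{code}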
[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215395#comment-14215395 ] Andrew Or commented on SPARK-4452: -- Hey [~tianshuo] do you see this issue only for sort-based shuffle? Have you been able to reproduce it on hash-based shuffle? Shuffle data structures can starve others on the same thread for memory Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: Tianshuo Deng When an Aggregator is used with ExternalSorter in a task, Spark creates many small spill files and can run into a "too many open files" error during merging. This happens when using the sort-based shuffle. The issue is caused by multiple factors:
1. There seems to be a bug in setting the elementsRead variable in ExternalSorter, which renders trackMemoryThreshold (defined in Spillable) useless for triggering spilling; the PR to fix it is https://github.com/apache/spark/pull/3302
2. The current ShuffleMemoryManager does not work well when there are two spillable objects in a thread, which in this case are ExternalSorter and ExternalAppendOnlyMap (used by Aggregator). Here is an example: due to the use of map-side aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask for as much memory as it can get, which is totalMem/numberOfThreads. When ExternalSorter is later created in the same thread, the ShuffleMemoryManager can refuse to allocate more memory to it, since the memory has already been given to the previously requesting object (the ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small files due to the lack of memory.
I'm currently working on a PR to address these two issues. It will include the following changes:
1. The ShuffleMemoryManager should track not only the memory usage of each thread, but also which object holds the memory.
2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. This way, if a new object in a thread requests memory, the old occupant can be evicted/spilled, which prevents problem 2 from happening. Previously each spillable object triggered spilling by itself, so one might not spill even when another object in the same thread needed more memory; after this change the ShuffleMemoryManager can trigger the spilling of an object whenever it needs to.
3. Make the iterator of ExternalAppendOnlyMap spillable. Previously ExternalAppendOnlyMap returned a destructive iterator and could not be spilled once the iterator was handed out; this should be changed so that the ShuffleMemoryManager can still spill the map even after the iterator is returned.
Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager. Change 3 is already made, and a prototype of changes 1 and 2 to evict spillables from the memory manager is still in progress. I will send a PR when it's done. Any feedback or thoughts on this change are highly appreciated! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
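To make proposed changes 1 and 2 concrete, a minimal sketch (names and structure are illustrative assumptions, not Spark's actual ShuffleMemoryManager): the manager records the grant per consumer object, and when a new request cannot be satisfied it forces an existing consumer to spill rather than letting the newcomer starve.
{code}
// Hedged sketch of an object-tracking memory manager that can evict:
// grants are recorded per spillable consumer, and the manager itself may
// trigger spilling of an old occupant when a new request cannot be met.
import scala.collection.mutable

trait SpillableConsumer {
  def spill(): Long // spills to disk, returns bytes released
}

class ObjectTrackingMemoryManager(maxMemory: Long) {
  private val granted = mutable.Map.empty[SpillableConsumer, Long]
  private var used = 0L

  def acquire(consumer: SpillableConsumer, bytes: Long): Long = synchronized {
    if (used + bytes > maxMemory) {
      // Spill other consumers, largest grant first, until the request fits.
      val victims = granted.keys.filter(_ != consumer).toSeq.sortBy(k => -granted(k))
      for (victim <- victims if used + bytes > maxMemory) {
        val released = math.min(victim.spill(), granted(victim))
        granted(victim) -= released
        used -= released
      }
    }
    val grant = math.min(bytes, maxMemory - used)
    used += grant
    granted(consumer) = granted.getOrElse(consumer, 0L) + grant
    grant
  }
}
{code}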
[jira] [Created] (SPARK-4463) Add (de)select all button for additional metrics in webUI
Kay Ousterhout created SPARK-4463: - Summary: Add (de)select all button for additional metrics in webUI Key: SPARK-4463 URL: https://issues.apache.org/jira/browse/SPARK-4463 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Minor The current behavior of the "Show additional metrics" button is a little confusing and has been noted by some ([~andrewor14]) to be annoying. Instead, we should have an option at the top labeled "(De)select all" that selects or deselects all of the options. Clicking "Show additional metrics" should not change whether any of the metrics are shown/hidden. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4460) RandomForest classification uses wrong threshold
[ https://issues.apache.org/jira/browse/SPARK-4460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-4460. Resolution: Invalid Realized this was invalid. The current implementation is fine, except for a corner case at threshold 0.5. RandomForest classification uses wrong threshold Key: SPARK-4460 URL: https://issues.apache.org/jira/browse/SPARK-4460 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley RandomForest was modified to use WeightedEnsembleModel, but it needs to use a different threshold for prediction than GradientBoosting does. Fix: WeightedEnsembleModel.scala:70 should use threshold 0.0 for boosting and 0.5 for forests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215411#comment-14215411 ] Tianshuo Deng commented on SPARK-4452: -- Hi, [~andrewor14]: Actually, hash-based shuffle does not fare as badly as sort-based shuffle on this particular problem; we were able to bypass it by using hash-based shuffle. The problem was especially bad for me because of the elementsRead bug, which could be another reason why hash-based shuffle didn't break as badly. Shuffle data structures can starve others on the same thread for memory Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: Tianshuo Deng When an Aggregator is used with ExternalSorter in a task, Spark creates many small spill files and can run into a "too many open files" error during merging. This happens when using the sort-based shuffle. The issue is caused by multiple factors:
1. There seems to be a bug in setting the elementsRead variable in ExternalSorter, which renders trackMemoryThreshold (defined in Spillable) useless for triggering spilling; the PR to fix it is https://github.com/apache/spark/pull/3302
2. The current ShuffleMemoryManager does not work well when there are two spillable objects in a thread, which in this case are ExternalSorter and ExternalAppendOnlyMap (used by Aggregator). Here is an example: due to the use of map-side aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask for as much memory as it can get, which is totalMem/numberOfThreads. When ExternalSorter is later created in the same thread, the ShuffleMemoryManager can refuse to allocate more memory to it, since the memory has already been given to the previously requesting object (the ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small files due to the lack of memory.
I'm currently working on a PR to address these two issues. It will include the following changes:
1. The ShuffleMemoryManager should track not only the memory usage of each thread, but also which object holds the memory.
2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. This way, if a new object in a thread requests memory, the old occupant can be evicted/spilled, which prevents problem 2 from happening. Previously each spillable object triggered spilling by itself, so one might not spill even when another object in the same thread needed more memory; after this change the ShuffleMemoryManager can trigger the spilling of an object whenever it needs to.
3. Make the iterator of ExternalAppendOnlyMap spillable. Previously ExternalAppendOnlyMap returned a destructive iterator and could not be spilled once the iterator was handed out; this should be changed so that the ShuffleMemoryManager can still spill the map even after the iterator is returned.
Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager. Change 3 is already made, and a prototype of changes 1 and 2 to evict spillables from the memory manager is still in progress. I will send a PR when it's done. Any feedback or thoughts on this change are highly appreciated! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3633) Fetches failure observed after SPARK-2711
[ https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215417#comment-14215417 ] Arun Ahuja commented on SPARK-3633: --- [~andrewor14] We were using hash-based shuffle when we ran into this, due to the Snappy+Kryo issue with sort-based shuffle. Fetches failure observed after SPARK-2711 - Key: SPARK-3633 URL: https://issues.apache.org/jira/browse/SPARK-3633 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.1.0 Reporter: Nishkam Ravi Priority: Blocker Running a variant of PageRank on a 6-node cluster with a 30GB input dataset. Recently upgraded to Spark 1.1. The workload fails with the following error message(s):
{code}
14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120)
14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages
{code}
In order to identify the problem, I carried out change-set analysis. As I go back in time, the error message changes to:
{code}
14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, c1706.halxg.cloudera.com): java.io.FileNotFoundException: /var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034 (Too many open files)
	java.io.FileOutputStream.open(Native Method)
	java.io.FileOutputStream.<init>(FileOutputStream.java:221)
	org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117)
	org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185)
	org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197)
	org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145)
	org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
	org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:51)
	org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
	org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	org.apache.spark.scheduler.Task.run(Task.scala:54)
	org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
	java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	java.lang.Thread.run(Thread.java:745)
{code}
All the way until Aug 4th. Turns out the problem changeset is 4fde28c. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org