[jira] [Updated] (SPARK-3633) Fetches failure observed after SPARK-2711
[ https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3633: --- Summary: Fetches failure observed after SPARK-2711 (was: PR 1707/commit #4fde28c is problematic) > Fetches failure observed after SPARK-2711 > - > > Key: SPARK-3633 > URL: https://issues.apache.org/jira/browse/SPARK-3633 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.1.0 >Reporter: Nishkam Ravi > > Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. > Recently upgraded to Spark 1.1. The workload fails with the following error > message(s): > {code} > 14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, > c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, > c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120) > 14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages > {code} > In order to identify the problem, I carried out change set analysis. 
As I go > back in time, the error message changes to: > {code} > 14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, > c1706.halxg.cloudera.com): java.io.FileNotFoundException: > /var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034 > (Too many open files) > java.io.FileOutputStream.open(Native Method) > java.io.FileOutputStream.<init>(FileOutputStream.java:221) > > org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117) > > org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185) > > org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197) > > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145) > org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) > > org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:51) > > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > org.apache.spark.scheduler.Task.run(Task.scala:54) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > java.lang.Thread.run(Thread.java:745) > {code} > All the way until Aug 4th. Turns out the problem changeset is 4fde28c. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3634) Python modules added through addPyFile should take precedence over system modules
Josh Rosen created SPARK-3634: - Summary: Python modules added through addPyFile should take precedence over system modules Key: SPARK-3634 URL: https://issues.apache.org/jira/browse/SPARK-3634 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.1.0, 1.0.2 Reporter: Josh Rosen Python modules added through {{SparkContext.addPyFile()}} are currently _appended_ to {{sys.path}}; this is probably the opposite of the behavior that we want, since it causes system versions of modules to take precedence over versions explicitly added by users. To fix this, we should change the {{sys.path}} manipulation code in {{context.py}} and {{worker.py}} to prepend files to {{sys.path}}.
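The precedence problem is easy to reproduce with plain `sys.path` manipulation. A minimal sketch (the module name `mymod`, its contents, and the two directories are made up for illustration; this is not PySpark's actual code):

```python
import importlib
import os
import sys
import tempfile

# Two directories, each holding a different version of a module "mymod".
system_dir = tempfile.mkdtemp()
user_dir = tempfile.mkdtemp()
with open(os.path.join(system_dir, "mymod.py"), "w") as f:
    f.write("VERSION = 'system'\n")
with open(os.path.join(user_dir, "mymod.py"), "w") as f:
    f.write("VERSION = 'user'\n")

# Appending (the behavior described in the ticket): earlier sys.path
# entries win, so the "system" copy shadows the user's copy.
sys.path.append(system_dir)
sys.path.append(user_dir)
import mymod
assert mymod.VERSION == "system"

# Prepending (the proposed fix): the user's copy now takes precedence.
sys.path.remove(user_dir)
sys.path.insert(0, user_dir)
del sys.modules["mymod"]  # force a fresh lookup along sys.path
import mymod
assert mymod.VERSION == "user"
```

The `del sys.modules[...]` step only matters in this self-contained demo; on a fresh worker the module would simply be imported once with whichever path order is in effect.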
[jira] [Comment Edited] (SPARK-3622) Provide a custom transformation that can output multiple RDDs
[ https://issues.apache.org/jira/browse/SPARK-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142881#comment-14142881 ] Xuefu Zhang edited comment on SPARK-3622 at 9/22/14 4:39 AM: - Thanks for your comments, [~pwendell]. I understand caching A would be helpful if I need to transform it to get B and C separately. My proposal is to get B and C with just one pass over A, so A doesn't even need to be cached. Here is an example of how it may be used in Hive. {code} JavaPairRDD table = sparkContext.hadoopRDD(..); Map mappedRDDs = table.mapPartitions(mapFunction); JavaPairRDD rddA = mappedRDDs.get("A"); JavaPairRDD rddB = mappedRDDs.get("B"); JavaPairRDD sortedRddA = rddA.sortByKey(); JavaPairRDD groupedRddB = rddB.groupByKey(); // further processing of sortedRddA and groupedRddB. ... {code} In this case, mapFunction can return named iterators for A and B. B is automatically computed whenever A is computed, and vice versa. Since both are computed if either of them is computed, subsequent references to either one should not recompute either of them. The benefits: 1) no need to cache A; 2) only one pass over the input. I'm not sure if this is feasible in Spark, but Hive's map function does exactly this. Its operator tree can branch off anywhere, resulting in multiple output datasets from a single input dataset. Please let me know if there are more questions. was (Author: xuefuz): Thanks for your comments, [~pwendell]. I understand caching A would be helpful if I need to transform it to get B and C separately. My proposal is to get B and C with just one pass over A, so A doesn't even need to be cached. Here is an example of how it may be used in Hive.
{code} JavaPairRDD table = sparkContext.hadoopRDD(..); Map mappedRDDs = table.mapPartitions(mapFunction); JavaPairRDD rddA = mapperRDDs.get("A"); JavaPairRDD rddB = mapperRDDs.get("A"); JavaPairRDD sortedRddA = rddA.sortByKey(); javaPairRDD groupedRddB = rddB.groupByKey(); // further processing sortedRddA and groupedRddB. ... {code} In this case, mapFunction can return named iterators for A and B. B is automatically computed whenever A is computed, and vice versa. Since both are computed if either of them is computed, subsequent references to either one should not recompute either of them. The benefits: 1) no need to cache A; 2) only one pass over the input. I'm not sure if this is feasible in Spark, but Hive's map function does exactly this. Its operator tree can branch off anywhere, resulting in multiple output datasets from a single input dataset. Please let me know if there are more questions. > Provide a custom transformation that can output multiple RDDs > - > > Key: SPARK-3622 > URL: https://issues.apache.org/jira/browse/SPARK-3622 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang > > All existing transformations return at most one RDD, even those > which take user-supplied functions such as mapPartitions(). However, > sometimes a user-provided function may need to output multiple RDDs. For > instance, a filter function that divides the input RDD into several RDDs. > While it's possible to get multiple RDDs by transforming the same RDD > multiple times, it may be more efficient to do this concurrently in one shot, > especially when the user's existing function is already generating different data sets. > This is the case in Hive on Spark, where Hive's map function and reduce function > can output different data sets to be consumed by subsequent stages.
[jira] [Commented] (SPARK-3622) Provide a custom transformation that can output multiple RDDs
[ https://issues.apache.org/jira/browse/SPARK-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142881#comment-14142881 ] Xuefu Zhang commented on SPARK-3622: Thanks for your comments, [~pwendell]. I understand caching A would be helpful if I need to transform it to get B and C separately. My proposal is to get B and C with just one pass over A, so A doesn't even need to be cached. Here is an example of how it may be used in Hive. {code} JavaPairRDD table = sparkContext.hadoopRDD(..); Map mappedRDDs = table.mapPartitions(mapFunction); JavaPairRDD rddA = mappedRDDs.get("A"); JavaPairRDD rddB = mappedRDDs.get("B"); JavaPairRDD sortedRddA = rddA.sortByKey(); JavaPairRDD groupedRddB = rddB.groupByKey(); // further processing of sortedRddA and groupedRddB. ... {code} In this case, mapFunction can return named iterators for A and B. B is automatically computed whenever A is computed, and vice versa. Since both are computed if either of them is computed, subsequent references to either one should not recompute either of them. The benefits: 1) no need to cache A; 2) only one pass over the input. I'm not sure if this is feasible in Spark, but Hive's map function does exactly this. Its operator tree can branch off anywhere, resulting in multiple output datasets from a single input dataset. Please let me know if there are more questions. > Provide a custom transformation that can output multiple RDDs > - > > Key: SPARK-3622 > URL: https://issues.apache.org/jira/browse/SPARK-3622 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang > > All existing transformations return at most one RDD, even those > which take user-supplied functions such as mapPartitions(). However, > sometimes a user-provided function may need to output multiple RDDs. For > instance, a filter function that divides the input RDD into several RDDs.
> While it's possible to get multiple RDDs by transforming the same RDD > multiple times, it may be more efficient to do this concurrently in one shot, > especially when the user's existing function is already generating different data sets. > This is the case in Hive on Spark, where Hive's map function and reduce function > can output different data sets to be consumed by subsequent stages.
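The single-pass, multi-output shape Xuefu is proposing can be sketched in plain Python over one partition (the routing rule, even vs. odd, and the data are made up; a real Hive operator tree branches on query logic instead):

```python
# One traversal of the input routes each record to one of several named
# outputs -- the shape of the proposed multi-output mapPartitions.
def multi_output_map(partition):
    outputs = {"A": [], "B": []}
    for record in partition:  # a single pass over the input
        key = "A" if record % 2 == 0 else "B"
        outputs[key].append(record)
    return outputs

outs = multi_output_map(range(10))
# Both outputs come from the same traversal, so producing "B" does not
# require re-reading (or caching) the input used to produce "A".
```

This contrasts with the persist-based alternative Patrick describes, which computes A once but still makes one pass over the cached A per derived RDD.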
[jira] [Updated] (SPARK-3633) PR 1707/commit #4fde28c is problematic
[ https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3633: --- Description: Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. Recently upgraded to Spark 1.1. The workload fails with the following error message(s): {code} 14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120) 14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages {code} In order to identify the problem, I carried out change set analysis. As I go back in time, the error message changes to: {code} 14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, c1706.halxg.cloudera.com): java.io.FileNotFoundException: /var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034 (Too many open files) java.io.FileOutputStream.open(Native Method) java.io.FileOutputStream.<init>(FileOutputStream.java:221) org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117) org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185) org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145) org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:51) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} All the way until Aug 4th. Turns out the problem changeset is 4fde28c. was: Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. Recently upgraded to Spark 1.1. The workload fails with the following error message(s): 14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120) 14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages In order to identify the problem, I carried out change set analysis. As I go back in time, the error message changes to: 14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, c1706.halxg.cloudera.com): java.io.FileNotFoundException: /var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034 (Too many open files) java.io.FileOutputStream.open(Native Method) java.io.FileOutputStream.<init>(FileOutputStream.java:221) org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117) org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185) org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145) org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:51) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) All the way until Aug 4th. Turns out the problem changeset is 4fde28c. > PR 1707/commit #4fde28c is problematic > -- > > Key: SPARK-3633 > URL: https://issues.apache.org/jira/browse/SPARK-3633 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.1.0 >Reporter: Nishkam Ravi > > Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. > Recently upgraded to Spark 1.1. The workload fails with the following error > message(s): > {code} >
[jira] [Created] (SPARK-3633) PR 1707/commit #4fde28c is problematic
Nishkam Ravi created SPARK-3633: --- Summary: PR 1707/commit #4fde28c is problematic Key: SPARK-3633 URL: https://issues.apache.org/jira/browse/SPARK-3633 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.1.0 Reporter: Nishkam Ravi Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. Recently upgraded to Spark 1.1. The workload fails with the following error message(s): 14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120) 14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages In order to identify the problem, I carried out change set analysis. As I go back in time, the error message changes to: 14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, c1706.halxg.cloudera.com): java.io.FileNotFoundException: /var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034 (Too many open files) java.io.FileOutputStream.open(Native Method) java.io.FileOutputStream.<init>(FileOutputStream.java:221) org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117) org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185) org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145) org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:51) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) All the way until Aug 4th. Turns out the problem changeset is 4fde28c.
[jira] [Commented] (SPARK-3632) ConnectionManager can run out of receive threads with authentication on
[ https://issues.apache.org/jira/browse/SPARK-3632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142866#comment-14142866 ] Apache Spark commented on SPARK-3632: - User 'tgravescs' has created a pull request for this issue: https://github.com/apache/spark/pull/2484 > ConnectionManager can run out of receive threads with authentication on > --- > > Key: SPARK-3632 > URL: https://issues.apache.org/jira/browse/SPARK-3632 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Critical > > If you turn authentication on and you are using a lot of executors, there is > a chance that all of the threads in the handleMessageExecutor could be > waiting to send a message because they are blocked waiting on authentication > to happen. This can cause a temporary deadlock until the connection times out.
[jira] [Comment Edited] (SPARK-3622) Provide a custom transformation that can output multiple RDDs
[ https://issues.apache.org/jira/browse/SPARK-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142865#comment-14142865 ] Patrick Wendell edited comment on SPARK-3622 at 9/22/14 3:24 AM: - Do you mind clarifying a little bit how Hive would use this (maybe with a code example)? Let's say you had a transformation that went from a single RDD A to two RDDs B and C. The normal way to do this if you want to avoid recomputing A would be to persist it, then use it to derive both B and C (this will do multiple passes on A, but it won't fully recompute A twice). I think that doing this in the general case is not possible by definition. The user might use B and C at different times, so it's not possible to guarantee that A will be computed only once unless you persist A. was (Author: pwendell): Do you mind clarifying a little bit how Hive would use this (maybe with a code example)? The normal way to do this if you want to avoid recomputing A would be to persist it, then use it to derive both B and C (this will do multiple passes on A, but it won't fully recompute A twice). I think that doing this in the general case is not possible by definition. Let's say you had a transformation that went from a single RDD A to two RDDs B and C. The user might use B and C at different times, so it's not possible to guarantee that A will be computed only once unless you persist A. > Provide a custom transformation that can output multiple RDDs > - > > Key: SPARK-3622 > URL: https://issues.apache.org/jira/browse/SPARK-3622 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang > > All existing transformations return at most one RDD, even those > which take user-supplied functions such as mapPartitions(). However, > sometimes a user-provided function may need to output multiple RDDs. For > instance, a filter function that divides the input RDD into several RDDs. 
> While it's possible to get multiple RDDs by transforming the same RDD > multiple times, it may be more efficient to do this concurrently in one shot, > especially when the user's existing function is already generating different data sets. > This is the case in Hive on Spark, where Hive's map function and reduce function > can output different data sets to be consumed by subsequent stages.
[jira] [Commented] (SPARK-3622) Provide a custom transformation that can output multiple RDDs
[ https://issues.apache.org/jira/browse/SPARK-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142865#comment-14142865 ] Patrick Wendell commented on SPARK-3622: Do you mind clarifying a little bit how Hive would use this (maybe with a code example)? The normal way to do this if you want to avoid recomputing A would be to persist it, then use it to derive both B and C (this will do multiple passes on A, but it won't fully recompute A twice). I think that doing this in the general case is not possible by definition. Let's say you had a transformation that went from a single RDD A to two RDDs B and C. The user might use B and C at different times, so it's not possible to guarantee that A will be computed only once unless you persist A. > Provide a custom transformation that can output multiple RDDs > - > > Key: SPARK-3622 > URL: https://issues.apache.org/jira/browse/SPARK-3622 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang > > All existing transformations return at most one RDD, even those > which take user-supplied functions such as mapPartitions(). However, > sometimes a user-provided function may need to output multiple RDDs. For > instance, a filter function that divides the input RDD into several RDDs. > While it's possible to get multiple RDDs by transforming the same RDD > multiple times, it may be more efficient to do this concurrently in one shot, > especially when the user's existing function is already generating different data sets. > This is the case in Hive on Spark, where Hive's map function and reduce function > can output different data sets to be consumed by subsequent stages.
[jira] [Commented] (SPARK-3614) Filter on minimum occurrences of a term in IDF
[ https://issues.apache.org/jira/browse/SPARK-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142860#comment-14142860 ] RJ Nowling commented on SPARK-3614: --- Thanks, Andrew! I'll do that. > Filter on minimum occurrences of a term in IDF > --- > > Key: SPARK-3614 > URL: https://issues.apache.org/jira/browse/SPARK-3614 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Jatinpreet Singh >Assignee: RJ Nowling >Priority: Minor > Labels: TFIDF > > The IDF class in MLlib does not provide the capability of defining a minimum > number of documents a term should appear in across the corpus. The idea is to have a > cutoff variable which defines this minimum occurrence value, and the terms > which have lower frequency are ignored. > Mathematically, > IDF(t,D) = log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) >= minimumOccurrence > where, > D is the total number of documents in the corpus > DF(t,D) is the number of documents that contain the term t > minimumOccurrence is the minimum number of documents the term appears in the > document corpus > This would have an impact on accuracy, as terms that appear in fewer than a > certain number of documents have low or no importance in TFIDF vectors.
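The cutoff formula above is simple to express directly. A minimal Python sketch (the function name is made up, and returning 0.0 for dropped terms is one plausible reading of "ignored", not MLlib's actual API):

```python
import math

def idf_with_cutoff(doc_freq, num_docs, min_occurrence):
    """IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)), applied only to terms
    that appear in at least min_occurrence documents; rarer terms get 0.0,
    so they contribute nothing to the TF-IDF vector."""
    if doc_freq < min_occurrence:
        return 0.0
    return math.log((num_docs + 1) / (doc_freq + 1))

# A term seen in 1 document out of 100 is dropped with a cutoff of 2,
# while a term seen in 50 documents keeps its usual IDF weight.
rare = idf_with_cutoff(1, 100, 2)
common = idf_with_cutoff(50, 100, 2)
```

Whether dropped terms should be zeroed out or removed from the vocabulary entirely is a design choice the ticket leaves open; zeroing keeps vector dimensions stable.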
[jira] [Created] (SPARK-3632) ConnectionManager can run out of receive threads with authentication on
Thomas Graves created SPARK-3632: Summary: ConnectionManager can run out of receive threads with authentication on Key: SPARK-3632 URL: https://issues.apache.org/jira/browse/SPARK-3632 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Thomas Graves Assignee: Thomas Graves Priority: Critical If you turn authentication on and you are using a lot of executors, there is a chance that all of the threads in the handleMessageExecutor could be waiting to send a message because they are blocked waiting on authentication to happen. This can cause a temporary deadlock until the connection times out.
[jira] [Commented] (SPARK-3615) Kafka test should not hard code Zookeeper port
[ https://issues.apache.org/jira/browse/SPARK-3615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142855#comment-14142855 ] Saisai Shao commented on SPARK-3615: Hi Patrick, I've submitted a PR to fix this issue; would you mind taking a look at the PR? Thanks a lot. > Kafka test should not hard code Zookeeper port > -- > > Key: SPARK-3615 > URL: https://issues.apache.org/jira/browse/SPARK-3615 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Patrick Wendell >Assignee: Saisai Shao >Priority: Blocker > > This is causing failures in our master build if port 2181 is contended. > Instead of binding to a static port we should refactor this such that it > opens a socket on port 0 and then reads back the port. So we can never have > contention. > {code} > sbt.ForkMain$ForkError: Address already in use > at sun.nio.ch.Net.bind0(Native Method) > at sun.nio.ch.Net.bind(Net.java:444) > at sun.nio.ch.Net.bind(Net.java:436) > at > sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214) > at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) > at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:67) > at > org.apache.zookeeper.server.NIOServerCnxnFactory.configure(NIOServerCnxnFactory.java:95) > at > org.apache.spark.streaming.kafka.KafkaTestUtils$EmbeddedZookeeper.<init>(KafkaStreamSuite.scala:200) > at > org.apache.spark.streaming.kafka.KafkaStreamSuite.beforeFunction(KafkaStreamSuite.scala:62) > at > org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.setUp(JavaKafkaStreamSuite.java:51) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45) > at > 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:27) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222) > at org.junit.runners.ParentRunner.run(ParentRunner.java:300) > at org.junit.runners.Suite.runChild(Suite.java:128) > at org.junit.runners.Suite.runChild(Suite.java:24) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222) > at org.junit.runners.ParentRunner.run(ParentRunner.java:300) > at org.junit.runner.JUnitCore.run(JUnitCore.java:157) > at org.junit.runner.JUnitCore.run(JUnitCore.java:136) > at com.novocode.junit.JUnitRunner.run(JUnitRunner.java:90) > at sbt.RunnerWrapper$1.runRunner2(FrameworkWrapper.java:223) > at sbt.RunnerWrapper$1.execute(FrameworkWrapper.java:236) > at sbt.ForkMain$Run$2.call(ForkMain.java:294) > at sbt.ForkMain$Run$2.call(ForkMain.java:284) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code}
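The bind-to-port-0 pattern the ticket recommends looks like this in any language with a sockets API; a minimal Python sketch (the Spark fix itself is in Scala, so this only illustrates the technique):

```python
import socket

# Bind to port 0 so the OS picks a free ephemeral port, then read the
# assigned port back instead of hard-coding one (e.g. Zookeeper's 2181).
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind(("127.0.0.1", 0))   # port 0 means "any free port"
port = sock.getsockname()[1]  # the port the OS actually assigned
sock.listen(1)
# ... hand `port` to the embedded test server / clients here ...
sock.close()
```

Because the OS only hands out ports that are currently free, two test suites running concurrently can never contend for the same port the way a hard-coded 2181 does.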
[jira] [Commented] (SPARK-3615) Kafka test should not hard code Zookeeper port
[ https://issues.apache.org/jira/browse/SPARK-3615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142847#comment-14142847 ] Apache Spark commented on SPARK-3615: - User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/2483 > Kafka test should not hard code Zookeeper port > -- > > Key: SPARK-3615 > URL: https://issues.apache.org/jira/browse/SPARK-3615 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Patrick Wendell >Assignee: Saisai Shao >Priority: Blocker > > This is causing failures in our master build if port 2181 is contended. > Instead of binding to a static port we should refactor this such that it > opens a socket on port 0 and then reads back the port. So we can never have > contention. > {code} > sbt.ForkMain$ForkError: Address already in use > at sun.nio.ch.Net.bind0(Native Method) > at sun.nio.ch.Net.bind(Net.java:444) > at sun.nio.ch.Net.bind(Net.java:436) > at > sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214) > at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) > at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:67) > at > org.apache.zookeeper.server.NIOServerCnxnFactory.configure(NIOServerCnxnFactory.java:95) > at > org.apache.spark.streaming.kafka.KafkaTestUtils$EmbeddedZookeeper.<init>(KafkaStreamSuite.scala:200) > at > org.apache.spark.streaming.kafka.KafkaStreamSuite.beforeFunction(KafkaStreamSuite.scala:62) > at > org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.setUp(JavaKafkaStreamSuite.java:51) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45) > at > 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:27) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222) > at org.junit.runners.ParentRunner.run(ParentRunner.java:300) > at org.junit.runners.Suite.runChild(Suite.java:128) > at org.junit.runners.Suite.runChild(Suite.java:24) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222) > at org.junit.runners.ParentRunner.run(ParentRunner.java:300) > at org.junit.runner.JUnitCore.run(JUnitCore.java:157) > at org.junit.runner.JUnitCore.run(JUnitCore.java:136) > at com.novocode.junit.JUnitRunner.run(JUnitRunner.java:90) > at sbt.RunnerWrapper$1.runRunner2(FrameworkWrapper.java:223) > at sbt.RunnerWrapper$1.execute(FrameworkWrapper.java:236) > at sbt.ForkMain$Run$2.call(ForkMain.java:294) > at sbt.ForkMain$Run$2.call(ForkMain.java:284) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
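The fix proposed in SPARK-3615 (bind to port 0 and read back the port the OS actually assigned) is a generic OS facility, not Spark-specific code. A minimal sketch of the technique:

```python
import socket

def free_port() -> int:
    """Ask the OS for an ephemeral port by binding to port 0,
    then read back the concrete port it assigned."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))   # port 0 means "pick any free port"
        return s.getsockname()[1]  # the port the OS actually chose

port = free_port()
print(port)
```

Note the close-then-rebind race if a test grabs a port this way and hands it to another process; the cleaner variant of the fix, as the issue implies, is to have the embedded ZooKeeper itself bind port 0 and report the assigned port back, so no window exists in which another process can steal it.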
[jira] [Created] (SPARK-3631) Add docs for checkpoint usage
Andrew Ash created SPARK-3631: - Summary: Add docs for checkpoint usage Key: SPARK-3631 URL: https://issues.apache.org/jira/browse/SPARK-3631 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.1.0 Reporter: Andrew Ash Assignee: Andrew Ash We should include general documentation on using checkpoints. Right now the docs only cover checkpoints in the Spark Streaming use case, which is slightly different from Core. Some content to consider for inclusion from [~brkyvz]: {quote} If you set the checkpointing directory however, the intermediate state of the RDDs will be saved in HDFS, and the lineage will pick up from there. You won't need to keep the shuffle data before the checkpointed state, therefore those can be safely removed (will be removed automatically). However, checkpoint must be called explicitly as in https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L291, just setting the directory will not be enough. {quote} {quote} Yes, writing to HDFS is more expensive, but I feel it is still a small price to pay when compared to having a Disk Space Full error three hours in and having to start from scratch. The main goal of checkpointing is to truncate the lineage. Clearing up shuffle writes comes as a bonus to checkpointing, it is not the main goal. The subtlety here is that .checkpoint() is just like .cache(). Until you call an action, nothing happens. Therefore, if you're going to do 1000 maps in a row and you don't want to checkpoint in the meantime until a shuffle happens, you will still get a StackOverflowError, because the lineage is too long. I went through some of the code for checkpointing. As far as I can tell, it materializes the data in HDFS, and resets all its dependencies, so you start a fresh lineage. My understanding would be that checkpointing still should be done every N operations to reset the lineage. 
However, an action must be performed before the lineage grows too long. {quote} A good place to put this information would be at https://spark.apache.org/docs/latest/programming-guide.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
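The semantics described in the quotes above (checkpoint() only marks the RDD, like cache(); the action materializes it; materialization resets dependencies and truncates the lineage) can be illustrated with a toy model. This is not Spark's implementation, just a self-contained sketch of the idea:

```python
class ToyRDD:
    """Toy stand-in for an RDD: a lazy chain of transformations."""
    def __init__(self, data=None, parent=None, fn=None):
        self.data, self.parent, self.fn = data, parent, fn
        self._checkpoint_requested = False

    def map(self, fn):
        return ToyRDD(parent=self, fn=fn)

    def checkpoint(self):
        # Like rdd.checkpoint(): only *marks* the RDD; nothing is
        # materialized until an action runs.
        self._checkpoint_requested = True

    def lineage_depth(self):
        depth, node = 0, self
        while node.parent is not None:
            depth, node = depth + 1, node.parent
        return depth

    def collect(self):  # the "action"
        if self.data is None:
            self.data = [self.fn(x) for x in self.parent.collect()]
        if self._checkpoint_requested:
            # After materializing, cut the lineage -- mirroring how a
            # real checkpoint saves to HDFS and resets dependencies.
            self.parent, self.fn = None, None
            self._checkpoint_requested = False
        return self.data

rdd = ToyRDD(data=[1, 2, 3])
for _ in range(100):
    rdd = rdd.map(lambda x: x + 1)
rdd.checkpoint()
assert rdd.lineage_depth() == 100  # marking alone changes nothing
rdd.collect()                      # action triggers materialization
assert rdd.lineage_depth() == 0    # lineage truncated
```

This captures the documentation point from the quote: without an action between checkpoint() calls, the chain (and the risk of a StackOverflowError from a too-long lineage) keeps growing.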
[jira] [Commented] (SPARK-559) Automatically register all classes used in fields of a class with Kryo
[ https://issues.apache.org/jira/browse/SPARK-559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142830#comment-14142830 ] Andrew Ash commented on SPARK-559: -- As of today in master we're using Twitter Chill version 0.3.6 which includes Kryo 2.21, so we are on the 2.x branch now > Automatically register all classes used in fields of a class with Kryo > -- > > Key: SPARK-559 > URL: https://issues.apache.org/jira/browse/SPARK-559 > Project: Spark > Issue Type: Bug >Reporter: Matei Zaharia > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3577) Add task metric to report spill time
[ https://issues.apache.org/jira/browse/SPARK-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-3577: -- Description: The {{ExternalSorter}} passes its own {{ShuffleWriteMetrics}} into {{ExternalSorter}}. The write time recorded in those metrics is never used. We should probably add task metrics to report this spill time, since for shuffles, this would have previously been reported as part of shuffle write time (with the original hash-based sorter). (was: The ExternalSorter passes its own ShuffleWriteMetrics into ExternalSorter. The write time recorded in those metrics is never used. We should probably add task metrics to report this spill time, since for shuffles, this would have previously been reported as part of shuffle write time (with the original hash-based sorter).) > Add task metric to report spill time > > > Key: SPARK-3577 > URL: https://issues.apache.org/jira/browse/SPARK-3577 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Kay Ousterhout >Priority: Minor > > The {{ExternalSorter}} passes its own {{ShuffleWriteMetrics}} into > {{ExternalSorter}}. The write time recorded in those metrics is never used. > We should probably add task metrics to report this spill time, since for > shuffles, this would have previously been reported as part of shuffle write > time (with the original hash-based sorter). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3614) Filter on minimum occurrences of a term in IDF
[ https://issues.apache.org/jira/browse/SPARK-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142827#comment-14142827 ] Andrew Ash commented on SPARK-3614: --- Great! I assigned this ticket to you RJ. Please try to have a draft commit within a couple weeks for review so others who might want to work on this can see progress being made. Otherwise it's best to leave tickets unassigned while no one is actively working on them. Thanks! > Filter on minimum occurrences of a term in IDF > --- > > Key: SPARK-3614 > URL: https://issues.apache.org/jira/browse/SPARK-3614 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Jatinpreet Singh >Assignee: RJ Nowling >Priority: Minor > Labels: TFIDF > > The IDF class in MLlib does not provide the capability of defining a minimum > number of documents a term should appear in the corpus. The idea is to have a > cutoff variable which defines this minimum occurrence value, and the terms > which have lower frequency are ignored. > Mathematically, > IDF(t,D)=log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) >=minimumOccurance > where, > D is the total number of documents in the corpus > DF(t,D) is the number of documents that contain the term t > minimumOccurance is the minimum number of documents the term appears in the > document corpus > This would have an impact on accuracy as terms that appear in less than a > certain limit of documents, have low or no importance in TFIDF vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
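The formula proposed in SPARK-3614 is easy to state in code. This sketch assumes, as one plausible reading of "terms which have lower frequency are ignored", that terms below the cutoff get an IDF weight of zero; the function name and parameter are illustrative, not MLlib API:

```python
import math

def idf(num_docs: int, doc_freq: int, min_occurrence: int = 1) -> float:
    """IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)),
    zeroed out when DF(t, D) < min_occurrence (the proposed cutoff)."""
    if doc_freq < min_occurrence:
        return 0.0
    return math.log((num_docs + 1) / (doc_freq + 1))

# 1000-document corpus: a common term is kept, a rare one is filtered
print(idf(1000, 500, min_occurrence=5))  # kept: log(1001/501)
print(idf(1000, 2, min_occurrence=5))    # below cutoff -> 0.0
```

Zeroing (rather than removing the term from the vocabulary) keeps vector dimensions stable across documents, which is one reason a cutoff of this shape is attractive for TF-IDF pipelines.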
[jira] [Updated] (SPARK-3614) Filter on minimum occurrences of a term in IDF
[ https://issues.apache.org/jira/browse/SPARK-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-3614: -- Assignee: RJ Nowling > Filter on minimum occurrences of a term in IDF > --- > > Key: SPARK-3614 > URL: https://issues.apache.org/jira/browse/SPARK-3614 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Jatinpreet Singh >Assignee: RJ Nowling >Priority: Minor > Labels: TFIDF > > The IDF class in MLlib does not provide the capability of defining a minimum > number of documents a term should appear in the corpus. The idea is to have a > cutoff variable which defines this minimum occurrence value, and the terms > which have lower frequency are ignored. > Mathematically, > IDF(t,D)=log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) >=minimumOccurance > where, > D is the total number of documents in the corpus > DF(t,D) is the number of documents that contain the term t > minimumOccurance is the minimum number of documents the term appears in the > document corpus > This would have an impact on accuracy as terms that appear in less than a > certain limit of documents, have low or no importance in TFIDF vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR
[ https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-3630: -- Description: A recent GraphX commit caused non-deterministic exceptions in unit tests so it was reverted (see SPARK-3400). Separately, [~aash] observed the same exception stacktrace in an application-specific Kryo registrator: {noformat} com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to uncompress the chunk: PARSING_ERROR(2) com.esotericsoftware.kryo.io.Input.fill(Input.java:142) com.esotericsoftware.kryo.io.Input.require(Input.java:169) com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127) com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) ... {noformat} This ticket is to identify the cause of the exception in the GraphX commit so the faulty commit can be fixed and merged back into master. was: A recent GraphX commit caused non-deterministic exceptions in unit tests so it was reverted (see SPARK-3400). 
Separately, [~aash] observed the same exception stacktrace in an application-specific Kryo registrator: {noformat} com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to uncompress the chunk: PARSING_ERROR(2) com.esotericsoftware.kryo.io.Input.fill(Input.java:142) com.esotericsoftware.kryo.io.Input.require(Input.java:169) com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127) com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) {noformat} This ticket is to identify the cause of the exception in the GraphX commit so the faulty commit can be fixed and merged back into master. > Identify cause of Kryo+Snappy PARSING_ERROR > --- > > Key: SPARK-3630 > URL: https://issues.apache.org/jira/browse/SPARK-3630 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Ash >Assignee: Ankur Dave > > A recent GraphX commit caused non-deterministic exceptions in unit tests so > it was reverted (see SPARK-3400). 
> Separately, [~aash] observed the same exception stacktrace in an > application-specific Kryo registrator: > {noformat} > com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to > uncompress the chunk: PARSING_ERROR(2) > com.esotericsoftware.kryo.io.Input.fill(Input.java:142) > com.esotericsoftware.kryo.io.Input.require(Input.java:169) > com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) > com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) > com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127) > > com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117) > > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) > > com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) > > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > ... > {noformat} > This ticket is to identify the cause of the exception in the GraphX commit so > the faulty commit can be fixed and merged back into master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3400) GraphX unit tests fail nondeterministically
[ https://issues.apache.org/jira/browse/SPARK-3400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142823#comment-14142823 ] Andrew Ash commented on SPARK-3400: --- Filed as SPARK-3630 > GraphX unit tests fail nondeterministically > --- > > Key: SPARK-3400 > URL: https://issues.apache.org/jira/browse/SPARK-3400 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.1.0 >Reporter: Ankur Dave >Assignee: Ankur Dave >Priority: Blocker > Fix For: 1.1.1 > > > GraphX unit tests have been failing since the fix to SPARK-2823 was merged: > https://github.com/apache/spark/commit/9b225ac3072de522b40b46aba6df1f1c231f13ef. > Failures have appeared as Snappy parsing errors and shuffle > FileNotFoundExceptions. A local test showed that these failures occurred in > about 3/10 test runs. > Reverting the mentioned commit seems to solve the problem. Since this is > blocking everyone else, I'm submitting a hotfix to do that, and we can > diagnose the problem in more detail afterwards. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR
Andrew Ash created SPARK-3630: - Summary: Identify cause of Kryo+Snappy PARSING_ERROR Key: SPARK-3630 URL: https://issues.apache.org/jira/browse/SPARK-3630 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Ash Assignee: Ankur Dave A recent GraphX commit caused non-deterministic exceptions in unit tests so it was reverted (see SPARK-3400). Separately, [~aash] observed the same exception stacktrace in an application-specific Kryo registrator: {noformat} com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to uncompress the chunk: PARSING_ERROR(2) com.esotericsoftware.kryo.io.Input.fill(Input.java:142) com.esotericsoftware.kryo.io.Input.require(Input.java:169) com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127) com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) {noformat} This ticket is to identify the cause of the exception in the GraphX commit so the faulty commit can be fixed and merged back into master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2630) Input data size of CoalescedRDD is incorrect
[ https://issues.apache.org/jira/browse/SPARK-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-2630: -- Description: Given one big file, such as text.4.3G, put it in one task, {code} sc.textFile("text.4.3.G").coalesce(1).count() {code} In Web UI of Spark, you will see that the input size is 5.4M. was: Given one big file, such as text.4.3G, put it in one task, sc.textFile("text.4.3.G").coalesce(1).count() In Web UI of Spark, you will see that the input size is 5.4M. > Input data size of CoalescedRDD is incorrect > > > Key: SPARK-2630 > URL: https://issues.apache.org/jira/browse/SPARK-2630 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 1.0.0, 1.0.1 >Reporter: Davies Liu >Assignee: Andrew Ash >Priority: Blocker > Attachments: overflow.tiff > > > Given one big file, such as text.4.3G, put it in one task, > {code} > sc.textFile("text.4.3.G").coalesce(1).count() > {code} > In Web UI of Spark, you will see that the input size is 5.4M. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
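The numbers in SPARK-2630, together with the attachment name (overflow.tiff), are consistent with the byte counter wrapping at 32 bits: a roughly 4.3 GB input reduced modulo 2^32 lands in the single-digit-megabyte range, the same order as the 5.4M shown in the UI. A back-of-the-envelope check (illustrative arithmetic only; the actual field involved is not shown in the issue):

```python
# If a byte count is accumulated in a 32-bit field, it wraps modulo 2**32.
file_bytes = 4_300_000_000            # a ~4.3 GB input
reported = file_bytes & 0xFFFFFFFF    # only the low 32 bits survive
print(reported / 1e6)                 # ~5.0 MB, vs. the true ~4300 MB
```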
[jira] [Updated] (SPARK-3629) Improvements to YARN doc
[ https://issues.apache.org/jira/browse/SPARK-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3629: - Description: Right now this doc starts off with a big list of config options, and only then tells you how to submit an app. It would be better to put that part and the packaging part first, and the config options only at the end. In addition, the doc mentions yarn-cluster vs yarn-client as separate masters, which is inconsistent with the help output from spark-submit (which says to always use "yarn"). was:Right now this doc starts off with a big list of config options, and only then tells you how to submit an app. It would be better to put that part and the packaging part first, and the config options only at the end. > Improvements to YARN doc > > > Key: SPARK-3629 > URL: https://issues.apache.org/jira/browse/SPARK-3629 > Project: Spark > Issue Type: Documentation > Components: Documentation, YARN >Reporter: Matei Zaharia > > Right now this doc starts off with a big list of config options, and only > then tells you how to submit an app. It would be better to put that part and > the packaging part first, and the config options only at the end. > In addition, the doc mentions yarn-cluster vs yarn-client as separate > masters, which is inconsistent with the help output from spark-submit (which > says to always use "yarn"). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3629) Improvements to YARN doc
[ https://issues.apache.org/jira/browse/SPARK-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3629: - Summary: Improvements to YARN doc (was: Improve ordering of YARN doc) > Improvements to YARN doc > > > Key: SPARK-3629 > URL: https://issues.apache.org/jira/browse/SPARK-3629 > Project: Spark > Issue Type: Documentation > Components: Documentation, YARN >Reporter: Matei Zaharia > > Right now this doc starts off with a big list of config options, and only > then tells you how to submit an app. It would be better to put that part and > the packaging part first, and the config options only at the end. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3629) Improve ordering of YARN doc
Matei Zaharia created SPARK-3629: Summary: Improve ordering of YARN doc Key: SPARK-3629 URL: https://issues.apache.org/jira/browse/SPARK-3629 Project: Spark Issue Type: Documentation Components: Documentation, YARN Reporter: Matei Zaharia Right now this doc starts off with a big list of config options, and only then tells you how to submit an app. It would be better to put that part and the packaging part first, and the config options only at the end. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2321) Design a proper progress reporting & event listener API
[ https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142772#comment-14142772 ] Josh Rosen commented on SPARK-2321: --- The scheduler has some data structures like StageInfo, TaskInfo, RDDInfo, etc. that expose some of the information that we might want in a user-facing progress API, but we can't expose these classes in their current form since they're marked @DeveloperAPI and are full of public, mutable fields (the responses returned from our progress / status API need to be immutable). Maybe we should stabilize these scheduler.*Info classes' public interfaces, make them immutable, and add a JobInfo class for capturing per-job information. We can then register a new, private SparkListener for maintaining a view of stage progress and add methods to SparkContext that provide stable, pull-based access to the snapshots of job/stage/task state. > Design a proper progress reporting & event listener API > --- > > Key: SPARK-2321 > URL: https://issues.apache.org/jira/browse/SPARK-2321 > Project: Spark > Issue Type: Improvement > Components: Java API, Spark Core >Affects Versions: 1.0.0 >Reporter: Reynold Xin >Assignee: Josh Rosen >Priority: Critical > > This is a ticket to track progress on redesigning the SparkListener and > JobProgressListener API. > There are multiple problems with the current design, including: > 0. I'm not sure if the API is usable in Java (there are at least some enums > we used in Scala and a bunch of case classes that might complicate things). > 1. The whole API is marked as DeveloperApi, because we haven't paid a lot of > attention to it yet. Something as important as progress reporting deserves a > more stable API. > 2. There is no easy way to connect jobs with stages. Similarly, there is no > easy way to connect job groups with jobs / stages. > 3. JobProgressListener itself has no encapsulation at all. States can be > arbitrarily mutated by external programs. 
Variable names are sort of randomly > decided and inconsistent. > We should just revisit these and propose a new, concrete design. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
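The design Josh sketches above (immutable *Info snapshots maintained by a private listener, exposed through stable pull-based methods) can be illustrated in miniature. All names here (StageSnapshot, StatusStore) are hypothetical, not Spark classes:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)  # immutable: callers cannot mutate scheduler state
class StageSnapshot:
    stage_id: int
    num_tasks: int
    tasks_done: int = 0

class StatusStore:
    """Private listener-style store: mutated only by scheduler events,
    read by users through immutable snapshots (pull-based access)."""
    def __init__(self):
        self._stages = {}

    # Event-handler side (in Spark this would be driven by a SparkListener).
    def on_stage_submitted(self, stage_id, num_tasks):
        self._stages[stage_id] = StageSnapshot(stage_id, num_tasks)

    def on_task_end(self, stage_id):
        s = self._stages[stage_id]
        # Replace the snapshot rather than mutate it in place.
        self._stages[stage_id] = replace(s, tasks_done=s.tasks_done + 1)

    # User-facing side: a stable, read-only view of progress.
    def stage_info(self, stage_id) -> StageSnapshot:
        return self._stages[stage_id]

store = StatusStore()
store.on_stage_submitted(0, num_tasks=4)
store.on_task_end(0)
snap = store.stage_info(0)
print(snap.tasks_done, "/", snap.num_tasks)
```

Because the snapshot is frozen, handing it to external programs cannot corrupt the listener's internal state, which is exactly the encapsulation problem the comment raises about JobProgressListener.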
[jira] [Updated] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages
[ https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3628: - Target Version/s: 1.1.1, 1.2.0, 0.9.3, 1.0.3 (was: 1.1.1, 1.2.0, 1.0.3) > Don't apply accumulator updates multiple times for tasks in result stages > - > > Key: SPARK-3628 > URL: https://issues.apache.org/jira/browse/SPARK-3628 > Project: Spark > Issue Type: Bug >Reporter: Matei Zaharia >Priority: Blocker > > In previous versions of Spark, accumulator updates only got applied once for > accumulators that are only used in actions (i.e. result stages), letting you > use them to deterministically compute a result. Unfortunately, this got > broken in some recent refactorings. > This is related to https://issues.apache.org/jira/browse/SPARK-732, but that > issue is about applying the same semantics to intermediate stages too, which > is more work and may not be what we want for debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages
[ https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3628: - Target Version/s: 1.1.1, 1.2.0, 1.0.3 (was: 1.1.1, 1.2.0, 0.9.3, 1.0.3) > Don't apply accumulator updates multiple times for tasks in result stages > - > > Key: SPARK-3628 > URL: https://issues.apache.org/jira/browse/SPARK-3628 > Project: Spark > Issue Type: Bug >Reporter: Matei Zaharia >Priority: Blocker > > In previous versions of Spark, accumulator updates only got applied once for > accumulators that are only used in actions (i.e. result stages), letting you > use them to deterministically compute a result. Unfortunately, this got > broken in some recent refactorings. > This is related to https://issues.apache.org/jira/browse/SPARK-732, but that > issue is about applying the same semantics to intermediate stages too, which > is more work and may not be what we want for debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages
[ https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142756#comment-14142756 ] Matei Zaharia edited comment on SPARK-3628 at 9/21/14 10:49 PM: BTW the problem is that this used to be guarded against in the TaskSetManager (see https://github.com/apache/spark/blob/branch-0.6/core/src/main/scala/spark/scheduler/cluster/TaskSetManager.scala#L254 or https://github.com/apache/spark/blob/branch-0.8/core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala#L436), and that went away at some point. was (Author: matei): BTW the problem is that this used to be guarded against in the TaskSetManager (see https://github.com/apache/spark/blob/branch-0.6/core/src/main/scala/spark/scheduler/cluster/TaskSetManager.scala#L254), and that went away at some point. > Don't apply accumulator updates multiple times for tasks in result stages > - > > Key: SPARK-3628 > URL: https://issues.apache.org/jira/browse/SPARK-3628 > Project: Spark > Issue Type: Bug >Reporter: Matei Zaharia >Priority: Blocker > > In previous versions of Spark, accumulator updates only got applied once for > accumulators that are only used in actions (i.e. result stages), letting you > use them to deterministically compute a result. Unfortunately, this got > broken in some recent refactorings. > This is related to https://issues.apache.org/jira/browse/SPARK-732, but that > issue is about applying the same semantics to intermediate stages too, which > is more work and may not be what we want for debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages
[ https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142756#comment-14142756 ] Matei Zaharia edited comment on SPARK-3628 at 9/21/14 10:43 PM: BTW the problem is that this used to be guarded against in the TaskSetManager (see https://github.com/apache/spark/blob/branch-0.6/core/src/main/scala/spark/scheduler/cluster/TaskSetManager.scala#L254), and that went away at some point. was (Author: matei): BTW the problem is that this used to be guarded against in the TaskSetManager (see https://github.com/apache/spark/blob/branch-0.6/core/src/main/scala/spark/scheduler/cluster/TaskSetManager.scala#L253), and that went away at some point. > Don't apply accumulator updates multiple times for tasks in result stages > - > > Key: SPARK-3628 > URL: https://issues.apache.org/jira/browse/SPARK-3628 > Project: Spark > Issue Type: Bug >Reporter: Matei Zaharia >Priority: Blocker > > In previous versions of Spark, accumulator updates only got applied once for > accumulators that are only used in actions (i.e. result stages), letting you > use them to deterministically compute a result. Unfortunately, this got > broken in some recent refactorings. > This is related to https://issues.apache.org/jira/browse/SPARK-732, but that > issue is about applying the same semantics to intermediate stages too, which > is more work and may not be what we want for debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages
[ https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142756#comment-14142756 ] Matei Zaharia commented on SPARK-3628: -- BTW the problem is that this used to be guarded against in the TaskSetManager (see https://github.com/apache/spark/blob/branch-0.6/core/src/main/scala/spark/scheduler/cluster/TaskSetManager.scala#L253), and that went away at some point. > Don't apply accumulator updates multiple times for tasks in result stages > - > > Key: SPARK-3628 > URL: https://issues.apache.org/jira/browse/SPARK-3628 > Project: Spark > Issue Type: Bug >Reporter: Matei Zaharia >Priority: Blocker > > In previous versions of Spark, accumulator updates only got applied once for > accumulators that are only used in actions (i.e. result stages), letting you > use them to deterministically compute a result. Unfortunately, this got > broken in some recent refactorings. > This is related to https://issues.apache.org/jira/browse/SPARK-732, but that > issue is about applying the same semantics to intermediate stages too, which > is more work and may not be what we want for debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages
Matei Zaharia created SPARK-3628: Summary: Don't apply accumulator updates multiple times for tasks in result stages Key: SPARK-3628 URL: https://issues.apache.org/jira/browse/SPARK-3628 Project: Spark Issue Type: Bug Reporter: Matei Zaharia Priority: Blocker In previous versions of Spark, accumulator updates only got applied once for accumulators that are only used in actions (i.e. result stages), letting you use them to deterministically compute a result. Unfortunately, this got broken in some recent refactorings. This is related to https://issues.apache.org/jira/browse/SPARK-732, but that issue is about applying the same semantics to intermediate stages too, which is more work and may not be what we want for debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
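The guard Matei describes (applying a result task's accumulator updates at most once per partition, even when a task is speculated or re-run) can be sketched outside Spark as a small bookkeeping class. This is a toy model, not Spark's actual accumulator API; all names below are illustrative:

```python
class ResultStageAccumulator:
    """Toy accumulator whose updates are applied at most once per
    result-stage partition, even if a task for that partition re-runs."""

    def __init__(self):
        self.value = 0
        self._seen_partitions = set()

    def apply_update(self, partition_id, delta):
        # Speculative or resubmitted tasks report the same partition id;
        # only the first successful report is merged.
        if partition_id in self._seen_partitions:
            return False
        self._seen_partitions.add(partition_id)
        self.value += delta
        return True

acc = ResultStageAccumulator()
acc.apply_update(0, 5)
acc.apply_update(1, 7)
acc.apply_update(0, 5)   # duplicate completion of partition 0: ignored
assert acc.value == 12
```

Without the `_seen_partitions` check, the duplicate report would double-count partition 0 and the result would no longer be deterministic, which is exactly the regression the issue describes.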
[jira] [Created] (SPARK-3627) spark on yarn reports success even though job fails
Thomas Graves created SPARK-3627: Summary: spark on yarn reports success even though job fails Key: SPARK-3627 URL: https://issues.apache.org/jira/browse/SPARK-3627 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Thomas Graves Priority: Critical I was running a wordcount and saving the output to hdfs. If the output directory already exists, yarn reports success even though the job fails since it requires the output directory to not be there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3595) Spark should respect configured OutputCommitter when using saveAsHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-3595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3595. Resolution: Fixed Fix Version/s: 1.2.0 Target Version/s: 1.2.0 Thanks I've merged this into master. We can consider merging this into 1.1 as well later on. I decided not to do that yet because we've often found that changes around Hadoop configurations can produce unanticipated regressions. So let's see how this fares in master and if there is lots of demand we can backport a fix once it's been stable in master for a while. > Spark should respect configured OutputCommitter when using saveAsHadoopFile > --- > > Key: SPARK-3595 > URL: https://issues.apache.org/jira/browse/SPARK-3595 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.1.0 >Reporter: Ian Hummel >Assignee: Ian Hummel > Fix For: 1.2.0 > > > When calling {{saveAsHadoopFile}}, Spark hardcodes the OutputCommitter to be > a {{FileOutputCommitter}}. > When using Spark on an EMR cluster to process and write files to/from S3, the > default Hadoop configuration uses a {{DirectFileOutputCommitter}} to avoid > writing to a temporary directory and doing a copy. > Will submit a patch via GitHub shortly. > Cheers, -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
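The fix amounts to consulting the job configuration before falling back to the hardcoded default committer. A Spark-independent sketch of that lookup follows; `mapred.output.committer.class` is the classic Hadoop configuration key, but treat the exact key and class names here as assumptions rather than the merged patch's code:

```python
def resolve_output_committer(
    conf,
    default="org.apache.hadoop.mapred.FileOutputCommitter",
):
    """Return the committer class name configured by the user, falling
    back to the hardcoded default only when nothing is configured."""
    return conf.get("mapred.output.committer.class", default)

# Unconfigured jobs keep the old behavior...
assert resolve_output_committer({}) == (
    "org.apache.hadoop.mapred.FileOutputCommitter")

# ...while an EMR-style configuration is now respected.
conf = {"mapred.output.committer.class": "com.example.DirectFileOutputCommitter"}
assert resolve_output_committer(conf) == "com.example.DirectFileOutputCommitter"
```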
[jira] [Updated] (SPARK-3595) Spark should respect configured OutputCommitter when using saveAsHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-3595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3595: --- Assignee: Ian Hummel > Spark should respect configured OutputCommitter when using saveAsHadoopFile > --- > > Key: SPARK-3595 > URL: https://issues.apache.org/jira/browse/SPARK-3595 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.1.0 >Reporter: Ian Hummel >Assignee: Ian Hummel > > When calling {{saveAsHadoopFile}}, Spark hardcodes the OutputCommitter to be > a {{FileOutputCommitter}}. > When using Spark on an EMR cluster to process and write files to/from S3, the > default Hadoop configuration uses a {{DirectFileOutputCommitter}} to avoid > writing to a temporary directory and doing a copy. > Will submit a patch via GitHub shortly. > Cheers, -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3614) Filter on minimum occurrences of a term in IDF
[ https://issues.apache.org/jira/browse/SPARK-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142631#comment-14142631 ] RJ Nowling commented on SPARK-3614: --- I would like to work on this. > Filter on minimum occurrences of a term in IDF > --- > > Key: SPARK-3614 > URL: https://issues.apache.org/jira/browse/SPARK-3614 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Jatinpreet Singh >Priority: Minor > Labels: TFIDF > > The IDF class in MLlib does not provide the capability of defining a minimum > number of documents a term should appear in the corpus. The idea is to have a > cutoff variable which defines this minimum occurrence value, and the terms > which have lower frequency are ignored. > Mathematically, > IDF(t,D)=log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) >=minimumOccurance > where, > D is the total number of documents in the corpus > DF(t,D) is the number of documents that contain the term t > minimumOccurance is the minimum number of documents the term appears in the > document corpus > This would have an impact on accuracy as terms that appear in less than a > certain limit of documents, have low or no importance in TFIDF vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
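The proposed formula is easy to sanity-check in isolation. This sketch follows the description above, assigning weight 0 to terms below the cutoff rather than smoothing them:

```python
import math

def idf_with_cutoff(doc_freq, num_docs, min_occurrence):
    """IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)), computed only for
    terms with DF(t, D) >= min_occurrence; rarer terms get weight 0."""
    if doc_freq < min_occurrence:
        return 0.0
    return math.log((num_docs + 1) / (doc_freq + 1))

# A term appearing in 1 of 100 documents is dropped with a cutoff of 2,
# while a term appearing in 10 documents keeps its usual weight.
assert idf_with_cutoff(1, 100, 2) == 0.0
assert abs(idf_with_cutoff(10, 100, 2) - math.log(101 / 11)) < 1e-12
```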
[jira] [Commented] (SPARK-3626) Replace AsyncRDDActions with a more general async. API
[ https://issues.apache.org/jira/browse/SPARK-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142588#comment-14142588 ] Apache Spark commented on SPARK-3626: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/2482 > Replace AsyncRDDActions with a more general async. API > -- > > Key: SPARK-3626 > URL: https://issues.apache.org/jira/browse/SPARK-3626 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Josh Rosen >Assignee: Josh Rosen > > The experimental AsyncRDDActions APIs seem to only exist in order to enable > job cancellation. > We've been considering extending these APIs to support progress monitoring, > but this would require stabilizing them so they're no longer > {{@Experimental}}. > Instead, I propose to replace all of the AsyncRDDActions with a mechanism > based on job groups which allows arbitrary computations to be run in job > groups and supports cancellation / monitoring of Spark jobs launched from > those computations. > (full design pending; see my GitHub PR for more details). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-542) Cache Miss when machine have multiple hostname
[ https://issues.apache.org/jira/browse/SPARK-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Farrellee updated SPARK-542: Priority: Minor (was: Blocker) > Cache Miss when machine have multiple hostname > -- > > Key: SPARK-542 > URL: https://issues.apache.org/jira/browse/SPARK-542 > Project: Spark > Issue Type: Bug > Components: Mesos >Reporter: frankvictor >Priority: Minor > > Hi, I encountered a weird runtime of pagerank in the last few days. > After debugging the job, I found it was caused by the DNS name. > The machines of my cluster have multiple hostnames; for example, slave 1 has > the names (c001 and c001.cm.cluster). > When spark adds a cache entry in cacheTracker, it gets "c001" and registers the cache under it. > But when scheduling a task in SimpleJob, the mesos offer gives spark > "c001.cm.cluster". > So it will never get the preferred location! > I think spark should handle the multiple-hostname case (by using ip instead > of hostname, or some other method). > Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
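One way to make "c001" and "c001.cm.cluster" compare equal for locality purposes is to canonicalize both sides to the same key before matching. Resolving both to an IP address (as the reporter suggests) is one option; stripping the domain suffix is the cheaper sketch shown here, and the function name is illustrative:

```python
def locality_key(host):
    """Reduce a host identifier to its lowercase short name so that
    'c001' and 'c001.cm.cluster' map to the same cache-locality key.
    A real fix might resolve both to an IP address instead."""
    return host.split(".", 1)[0].lower()

# The cache was registered under the short name...
cached_on = {locality_key("c001")}
# ...and the offer arrives under the fully qualified name, yet still matches.
assert locality_key("c001.cm.cluster") in cached_on
```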
[jira] [Created] (SPARK-3626) Replace AsyncRDDActions with a more general async. API
Josh Rosen created SPARK-3626: - Summary: Replace AsyncRDDActions with a more general async. API Key: SPARK-3626 URL: https://issues.apache.org/jira/browse/SPARK-3626 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Josh Rosen Assignee: Josh Rosen The experimental AsyncRDDActions APIs seem to only exist in order to enable job cancellation. We've been considering extending these APIs to support progress monitoring, but this would require stabilizing them so they're no longer {{@Experimental}}. Instead, I propose to replace all of the AsyncRDDActions with a mechanism based on job groups which allows arbitrary computations to be run in job groups and supports cancellation / monitoring of Spark jobs launched from those computations. (full design pending; see my GitHub PR for more details). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
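The job-group idea can be modeled without Spark: tag work with a group id and let cancellation flip a shared flag that every job in the group observes. This loosely mirrors the semantics of `SparkContext.setJobGroup` / `cancelJobGroup`; the class and method names below are illustrative, not the proposed API:

```python
import threading

class JobGroups:
    """Minimal registry mapping a group id to a cancellation flag."""

    def __init__(self):
        self._flags = {}

    def flag(self, group_id):
        return self._flags.setdefault(group_id, threading.Event())

    def cancel(self, group_id):
        self.flag(group_id).set()

    def run(self, group_id, steps):
        """Run `steps` callables in the group, stopping early once the
        group is cancelled. Returns how many steps actually ran."""
        done = 0
        for step in steps:
            if self.flag(group_id).is_set():
                break
            step()
            done += 1
        return done

groups = JobGroups()
out = []
assert groups.run("etl", [lambda: out.append(1)] * 3) == 3
groups.cancel("etl")
assert groups.run("etl", [lambda: out.append(1)] * 3) == 0  # cancelled
```

The appeal of grouping over per-action futures is that arbitrary computations (not just the handful of `AsyncRDDActions` methods) can be cancelled or monitored through one mechanism.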
[jira] [Commented] (SPARK-578) Fix interpreter code generation to only capture needed dependencies
[ https://issues.apache.org/jira/browse/SPARK-578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142558#comment-14142558 ] Matthew Farrellee commented on SPARK-578: - [~matei] is this related to slimming down the assembly? > Fix interpreter code generation to only capture needed dependencies > --- > > Key: SPARK-578 > URL: https://issues.apache.org/jira/browse/SPARK-578 > Project: Spark > Issue Type: Improvement >Reporter: Matei Zaharia > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-575) Maintain a cache of JARs on each node to avoid unnecessary copying
[ https://issues.apache.org/jira/browse/SPARK-575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Farrellee closed SPARK-575. --- Resolution: Incomplete > Maintain a cache of JARs on each node to avoid unnecessary copying > -- > > Key: SPARK-575 > URL: https://issues.apache.org/jira/browse/SPARK-575 > Project: Spark > Issue Type: Improvement >Reporter: Matei Zaharia > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-575) Maintain a cache of JARs on each node to avoid unnecessary copying
[ https://issues.apache.org/jira/browse/SPARK-575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142553#comment-14142553 ] Matthew Farrellee commented on SPARK-575: - [~joshrosen] is quite correct. this issue looks inactive. i'm going to close it out, but as always feel free to re-open. i can think of a few ways this could be done, and not all need spark code to be changed. > Maintain a cache of JARs on each node to avoid unnecessary copying > -- > > Key: SPARK-575 > URL: https://issues.apache.org/jira/browse/SPARK-575 > Project: Spark > Issue Type: Improvement >Reporter: Matei Zaharia > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-584) Pass slave ip address when starting a cluster
[ https://issues.apache.org/jira/browse/SPARK-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142545#comment-14142545 ] Matthew Farrellee commented on SPARK-584: - what's the use case for this? > Pass slave ip address when starting a cluster > -- > > Key: SPARK-584 > URL: https://issues.apache.org/jira/browse/SPARK-584 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 0.6.0 >Priority: Minor > Attachments: 0001-fix-for-SPARK-584.patch > > > Pass slave ip address from conf while starting a cluster: > bin/start-slaves.sh is used to start all the slaves in the cluster. While the > slave class takes a --ip argument, we don't pass the ip address from the > conf/slaves. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-604) reconnect if mesos slaves dies
[ https://issues.apache.org/jira/browse/SPARK-604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Farrellee updated SPARK-604: Component/s: Mesos > reconnect if mesos slaves dies > -- > > Key: SPARK-604 > URL: https://issues.apache.org/jira/browse/SPARK-604 > Project: Spark > Issue Type: Bug > Components: Mesos > > when running on mesos, if a slave goes down, spark doesn't try to reassign > the work to another machine. Even if the slave comes back up, the job is > doomed. > Currently when this happens, we just see this in the driver logs: > 12/11/01 16:48:56 INFO mesos.MesosSchedulerBackend: Mesos slave lost: > 201210312057-1560611338-5050-24091-52 > Exception in thread "Thread-346" java.util.NoSuchElementException: key not > found: value: "201210312057-1560611338-5050-24091-52" > at scala.collection.MapLike$class.default(MapLike.scala:224) > at scala.collection.mutable.HashMap.default(HashMap.scala:43) > at scala.collection.MapLike$class.apply(MapLike.scala:135) > at scala.collection.mutable.HashMap.apply(HashMap.scala:43) > at > spark.scheduler.cluster.ClusterScheduler.slaveLost(ClusterScheduler.scala:255) > at > spark.scheduler.mesos.MesosSchedulerBackend.slaveLost(MesosSchedulerBackend.scala:275) > 12/11/01 16:48:56 INFO mesos.MesosSchedulerBackend: driver.run() returned > with code DRIVER_ABORTED -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-610) Support master failover in standalone mode
[ https://issues.apache.org/jira/browse/SPARK-610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142528#comment-14142528 ] Matthew Farrellee commented on SPARK-610: - [~matei] given YARN and Mesos implementations, is this something the standalone mode should strive to do? > Support master failover in standalone mode > -- > > Key: SPARK-610 > URL: https://issues.apache.org/jira/browse/SPARK-610 > Project: Spark > Issue Type: New Feature >Reporter: Matei Zaharia > > The standalone deploy mode is quite simple, which shouldn't make it too bad > to add support for master failover using ZooKeeper or something similar. This > would really up its usefulness. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3577) Add task metric to report spill time
[ https://issues.apache.org/jira/browse/SPARK-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142524#comment-14142524 ] Sandy Ryza commented on SPARK-3577: --- No problem. Yeah, I agree that a spill time metric would be useful. > Add task metric to report spill time > > > Key: SPARK-3577 > URL: https://issues.apache.org/jira/browse/SPARK-3577 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Kay Ousterhout >Priority: Minor > > The ExternalSorter passes its own ShuffleWriteMetrics into ExternalSorter. > The write time recorded in those metrics is never used. We should probably > add task metrics to report this spill time, since for shuffles, this would > have previously been reported as part of shuffle write time (with the > original hash-based sorter). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2058) SPARK_CONF_DIR should override all present configs
[ https://issues.apache.org/jira/browse/SPARK-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142518#comment-14142518 ] Apache Spark commented on SPARK-2058: - User 'EugenCepoi' has created a pull request for this issue: https://github.com/apache/spark/pull/2481 > SPARK_CONF_DIR should override all present configs > -- > > Key: SPARK-2058 > URL: https://issues.apache.org/jira/browse/SPARK-2058 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 1.0.0, 1.0.1, 1.1.0 >Reporter: Eugen Cepoi >Priority: Critical > > When the user defines SPARK_CONF_DIR I think spark should use all the configs > available there not only spark-env. > This involves changing SparkSubmitArguments to first read from > SPARK_CONF_DIR, and updating the scripts to add SPARK_CONF_DIR to the > computed classpath for configs such as log4j, metrics, etc. > I have already prepared a PR for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
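The intended precedence (anything found under SPARK_CONF_DIR wins over built-in defaults) can be sketched as a simple overlay. File parsing is omitted and the keys are illustrative; `conf_dir_values` stands in for properties parsed from files such as spark-env.sh, log4j.properties, and metrics configs:

```python
import os

def effective_config(defaults, conf_dir_values, env=os.environ):
    """Overlay values loaded from $SPARK_CONF_DIR on top of defaults,
    so a user-provided config directory overrides every built-in value."""
    merged = dict(defaults)
    if env.get("SPARK_CONF_DIR"):
        merged.update(conf_dir_values)
    return merged

base = {"spark.master": "local", "spark.eventLog.enabled": "false"}
override = {"spark.eventLog.enabled": "true"}

# Without SPARK_CONF_DIR set, the defaults stand.
assert effective_config(base, override, env={})["spark.eventLog.enabled"] == "false"
# With it set, the directory's values take precedence.
assert effective_config(
    base, override, env={"SPARK_CONF_DIR": "/etc/spark"}
)["spark.eventLog.enabled"] == "true"
```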
[jira] [Closed] (SPARK-690) Stack overflow when running pagerank more than 10000 iterators
[ https://issues.apache.org/jira/browse/SPARK-690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Farrellee closed SPARK-690. --- Resolution: Unresolved > Stack overflow when running pagerank more than 10000 iterators > -- > > Key: SPARK-690 > URL: https://issues.apache.org/jira/browse/SPARK-690 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 0.6.1 >Reporter: xiajunluan > > when I run PageRank example more than 10000 iterators, Job client will report > stack overflow errors. > 13/02/07 13:41:40 INFO CacheTracker: Registering RDD ID 57993 with cache > Exception in thread "DAGScheduler" java.lang.StackOverflowError > at > java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryAcquireShared(ReentrantReadWriteLock.java:467) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1281) > at > java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:731) > at > org.jboss.netty.akka.util.HashedWheelTimer.scheduleTimeout(HashedWheelTimer.java:277) > at > org.jboss.netty.akka.util.HashedWheelTimer.newTimeout(HashedWheelTimer.java:264) > at akka.actor.DefaultScheduler.scheduleOnce(Scheduler.scala:186) > at akka.pattern.PromiseActorRef$.apply(AskSupport.scala:274) > at akka.pattern.AskSupport$class.ask(AskSupport.scala:83) > at akka.pattern.package$.ask(package.scala:43) > at akka.pattern.AskSupport$AskableActorRef.ask(AskSupport.scala:123) > at spark.CacheTracker.askTracker(CacheTracker.scala:121) > at spark.CacheTracker.communicate(CacheTracker.scala:131) > at spark.CacheTracker.registerRDD(CacheTracker.scala:142) > at spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:149) > at > spark.scheduler.DAGScheduler$$anonfun$visit$1$2.apply(DAGScheduler.scala:155) > at > spark.scheduler.DAGScheduler$$anonfun$visit$1$2.apply(DAGScheduler.scala:150) > at > scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59) > at 
scala.collection.immutable.List.foreach(List.scala:76) > at spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:150) > at spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:160) > at spark.scheduler.DAGScheduler.newStage(DAGScheduler.scala:131) > at > spark.scheduler.DAGScheduler.getShuffleMapStage(DAGScheduler.scala:111) > at > spark.scheduler.DAGScheduler$$anonfun$visit$1$2.apply(DAGScheduler.scala:153) > at > spark.scheduler.DAGScheduler$$anonfun$visit$1$2.apply(DAGScheduler.scala:150) > at > scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
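The overflow in the trace above comes from the scheduler recursing over an iteration-deep lineage. The usual workaround is to materialize (checkpoint) every k iterations so the chain the scheduler must walk stays bounded. A Spark-free sketch of the idea, with illustrative names:

```python
def iterate(value, step, iterations, checkpoint_every=100):
    """Apply `step` many times, but keep at most `checkpoint_every`
    deferred applications pending before forcing evaluation, loosely
    mirroring how RDD.checkpoint() truncates a long lineage."""
    pending = []
    for _ in range(iterations):
        pending.append(step)
        if len(pending) >= checkpoint_every:
            for f in pending:   # force the deferred chain ("checkpoint")
                value = f(value)
            pending = []
    for f in pending:           # flush the final partial chain
        value = f(value)
    return value

# 10000 iterations complete with no chain ever exceeding 100 pending steps.
assert iterate(0, lambda x: x + 1, 10000) == 10000
```

The result is identical to applying all 10000 steps lazily; only the maximum depth of the deferred chain changes.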
[jira] [Commented] (SPARK-690) Stack overflow when running pagerank more than 10000 iterators
[ https://issues.apache.org/jira/browse/SPARK-690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142511#comment-14142511 ] Matthew Farrellee commented on SPARK-690: - [~andrew xia] this is reported against a very old version. i'm going to close it out, but if you can reproduce please re-open > Stack overflow when running pagerank more than 10000 iterators > -- > > Key: SPARK-690 > URL: https://issues.apache.org/jira/browse/SPARK-690 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 0.6.1 >Reporter: xiajunluan > > when I run PageRank example more than 10000 iterators, Job client will report > stack overflow errors. > 13/02/07 13:41:40 INFO CacheTracker: Registering RDD ID 57993 with cache > Exception in thread "DAGScheduler" java.lang.StackOverflowError > at > java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryAcquireShared(ReentrantReadWriteLock.java:467) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1281) > at > java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:731) > at > org.jboss.netty.akka.util.HashedWheelTimer.scheduleTimeout(HashedWheelTimer.java:277) > at > org.jboss.netty.akka.util.HashedWheelTimer.newTimeout(HashedWheelTimer.java:264) > at akka.actor.DefaultScheduler.scheduleOnce(Scheduler.scala:186) > at akka.pattern.PromiseActorRef$.apply(AskSupport.scala:274) > at akka.pattern.AskSupport$class.ask(AskSupport.scala:83) > at akka.pattern.package$.ask(package.scala:43) > at akka.pattern.AskSupport$AskableActorRef.ask(AskSupport.scala:123) > at spark.CacheTracker.askTracker(CacheTracker.scala:121) > at spark.CacheTracker.communicate(CacheTracker.scala:131) > at spark.CacheTracker.registerRDD(CacheTracker.scala:142) > at spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:149) > at > spark.scheduler.DAGScheduler$$anonfun$visit$1$2.apply(DAGScheduler.scala:155) > at > 
spark.scheduler.DAGScheduler$$anonfun$visit$1$2.apply(DAGScheduler.scala:150) > at > scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59) > at scala.collection.immutable.List.foreach(List.scala:76) > at spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:150) > at spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:160) > at spark.scheduler.DAGScheduler.newStage(DAGScheduler.scala:131) > at > spark.scheduler.DAGScheduler.getShuffleMapStage(DAGScheduler.scala:111) > at > spark.scheduler.DAGScheduler$$anonfun$visit$1$2.apply(DAGScheduler.scala:153) > at > spark.scheduler.DAGScheduler$$anonfun$visit$1$2.apply(DAGScheduler.scala:150) > at > scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-709) Dropping a block reports 0 bytes
[ https://issues.apache.org/jira/browse/SPARK-709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Farrellee closed SPARK-709. --- Resolution: Incomplete [~rxin] there isn't enough information to make progress on this, but feel free to re-open if you so desire. > Dropping a block reports 0 bytes > > > Key: SPARK-709 > URL: https://issues.apache.org/jira/browse/SPARK-709 > Project: Spark > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-718) NPE when performing action during transformation
[ https://issues.apache.org/jira/browse/SPARK-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142506#comment-14142506 ] Matthew Farrellee commented on SPARK-718: - Spark simply does not support nesting RDDs in this fashion. you'll get a more prompt response and information with the user list, see http://spark.apache.org/community.html. i'm going to close this issue, but if you want feel free to re-open it. > NPE when performing action during transformation > > > Key: SPARK-718 > URL: https://issues.apache.org/jira/browse/SPARK-718 > Project: Spark > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Krzywicki > > Running the spark shell: > The following code fails with a NPE when trying to collect the resulting RDD: > {code:java} > val data = sc.parallelize(1 to 10) > data.map(i => data.count).collect > {code} > {code:java} > ERROR local.LocalScheduler: Exception in task 0 > java.lang.NullPointerException > at spark.RDD.count(RDD.scala:490) > at > $line16.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply$mcJI$sp(:15) > at $line16.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:15) > at $line16.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:15) > at scala.collection.Iterator$$anon$19.next(Iterator.scala:401) > at scala.collection.Iterator$class.foreach(Iterator.scala:772) > at scala.collection.Iterator$$anon$19.foreach(Iterator.scala:399) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:102) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:250) > at scala.collection.Iterator$$anon$19.toBuffer(Iterator.scala:399) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:237) > at scala.collection.Iterator$$anon$19.toArray(Iterator.scala:399) > at spark.RDD$$anonfun$1.apply(RDD.scala:389) > at spark.RDD$$anonfun$1.apply(RDD.scala:389) > at 
spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:610) > at spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:610) > at spark.scheduler.ResultTask.run(ResultTask.scala:76) > at > spark.scheduler.local.LocalScheduler.runTask$1(LocalScheduler.scala:74) > at > spark.scheduler.local.LocalScheduler$$anon$1.run(LocalScheduler.scala:50) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) > at java.util.concurrent.FutureTask.run(FutureTask.java:166) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:722) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
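The failing pattern above references one RDD inside a transformation of another; the supported equivalent is to run the inner action once on the driver and close over the plain result. Modeled with ordinary lists so it runs anywhere:

```python
data = list(range(1, 11))

# Unsupported in Spark: data.map(i => data.count) nests one distributed
# action inside another, so `data` is null inside the task closure.
# The driver-side equivalent runs the inner action once, up front:
n = len(data)                  # stands in for data.count on the driver
result = [n for _ in data]     # the closure captures a plain value, not an RDD

assert result == [10] * 10
```

The same reshaping (count, collect, or broadcast first, then refer to the materialized value) is the standard answer whenever an RDD appears inside another RDD's closure.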
[jira] [Closed] (SPARK-718) NPE when performing action during transformation
[ https://issues.apache.org/jira/browse/SPARK-718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Farrellee closed SPARK-718. --- Resolution: Done > NPE when performing action during transformation > > > Key: SPARK-718 > URL: https://issues.apache.org/jira/browse/SPARK-718 > Project: Spark > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Krzywicki > > Running the spark shell: > The following code fails with a NPE when trying to collect the resulting RDD: > {code:java} > val data = sc.parallelize(1 to 10) > data.map(i => data.count).collect > {code} > {code:java} > ERROR local.LocalScheduler: Exception in task 0 > java.lang.NullPointerException > at spark.RDD.count(RDD.scala:490) > at > $line16.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply$mcJI$sp(:15) > at $line16.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:15) > at $line16.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:15) > at scala.collection.Iterator$$anon$19.next(Iterator.scala:401) > at scala.collection.Iterator$class.foreach(Iterator.scala:772) > at scala.collection.Iterator$$anon$19.foreach(Iterator.scala:399) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:102) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:250) > at scala.collection.Iterator$$anon$19.toBuffer(Iterator.scala:399) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:237) > at scala.collection.Iterator$$anon$19.toArray(Iterator.scala:399) > at spark.RDD$$anonfun$1.apply(RDD.scala:389) > at spark.RDD$$anonfun$1.apply(RDD.scala:389) > at spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:610) > at spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:610) > at spark.scheduler.ResultTask.run(ResultTask.scala:76) > at > spark.scheduler.local.LocalScheduler.runTask$1(LocalScheduler.scala:74) > at > 
spark.scheduler.local.LocalScheduler$$anon$1.run(LocalScheduler.scala:50) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) > at java.util.concurrent.FutureTask.run(FutureTask.java:166) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:722) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-567) Unified directory structure for temporary data
[ https://issues.apache.org/jira/browse/SPARK-567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Farrellee closed SPARK-567. --- Resolution: Incomplete please re-open with additional details for how this could be implemented > Unified directory structure for temporary data > -- > > Key: SPARK-567 > URL: https://issues.apache.org/jira/browse/SPARK-567 > Project: Spark > Issue Type: Improvement >Reporter: Mosharaf Chowdhury > > Broadcast, shuffle, and unforeseen use cases should use the same directory > structure. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-559) Automatically register all classes used in fields of a class with Kryo
[ https://issues.apache.org/jira/browse/SPARK-559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142479#comment-14142479 ] Matthew Farrellee commented on SPARK-559: - the last comment on this, from 2 years ago, suggests this is resolved w/ an upgrade to kryo 2.x. i'm going to close this, but please re-open if you disagree. > Automatically register all classes used in fields of a class with Kryo > -- > > Key: SPARK-559 > URL: https://issues.apache.org/jira/browse/SPARK-559 > Project: Spark > Issue Type: Bug >Reporter: Matei Zaharia >
[jira] [Closed] (SPARK-559) Automatically register all classes used in fields of a class with Kryo
[ https://issues.apache.org/jira/browse/SPARK-559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Farrellee closed SPARK-559. --- Resolution: Done > Automatically register all classes used in fields of a class with Kryo > -- > > Key: SPARK-559 > URL: https://issues.apache.org/jira/browse/SPARK-559 > Project: Spark > Issue Type: Bug >Reporter: Matei Zaharia > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-550) Hiding the default spark context in the spark shell creates serialization issues
[ https://issues.apache.org/jira/browse/SPARK-550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142477#comment-14142477 ] Matthew Farrellee commented on SPARK-550: - a lot of code has changed in this space over the past 2 years. i'm going to close this, but feel free to re-open if you feel it's still an issue. > Hiding the default spark context in the spark shell creates serialization > issues > > > Key: SPARK-550 > URL: https://issues.apache.org/jira/browse/SPARK-550 > Project: Spark > Issue Type: Bug >Reporter: tjhunter > > I copy-pasted a piece of code along these lines in the spark shell: > ... > val sc = new SparkContext("local[%d]" format num_splits,"myframework") > val my_rdd = sc.textFile(...) > my_rdd.count() > This leads to the shell crashing with a java.io.NotSerializableException: > spark.SparkContext > It took me a while to realize it was due to the new spark context created. > Maybe a warning/error should be triggered if the user tries to change the > definition of sc? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-550) Hiding the default spark context in the spark shell creates serialization issues
[ https://issues.apache.org/jira/browse/SPARK-550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Farrellee closed SPARK-550. --- Resolution: Done > Hiding the default spark context in the spark shell creates serialization > issues > > > Key: SPARK-550 > URL: https://issues.apache.org/jira/browse/SPARK-550 > Project: Spark > Issue Type: Bug >Reporter: tjhunter > > I copy-pasted a piece of code along these lines in the spark shell: > ... > val sc = new SparkContext("local[%d]" format num_splits,"myframework") > val my_rdd = sc.textFile(...) > my_rdd.count() > This leads to the shell crashing with a java.io.NotSerializableException: > spark.SparkContext > It took me a while to realize it was due to the new spark context created. > Maybe a warning/error should be triggered if the user tries to change the > definition of sc? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
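For readers hitting the same trap: the shell already defines `sc`, and shadowing it with a new `SparkContext` leaves closures capturing the shell's line wrapper object, which drags the non-serializable context along. A minimal sketch, assuming a hypothetical input path:

```scala
// In the Spark shell, do NOT create a second context:
// val sc = new SparkContext(...)   // shadows the shell's sc and can lead
//                                  // to NotSerializableException

// Reuse the shell-provided context instead:
val my_rdd = sc.textFile("/tmp/input.txt")  // hypothetical path
my_rdd.count()
```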
[jira] [Updated] (SPARK-542) Cache Miss when machine have multiple hostname
[ https://issues.apache.org/jira/browse/SPARK-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Farrellee updated SPARK-542: Component/s: Mesos Priority: Blocker > Cache Miss when machine have multiple hostname > -- > > Key: SPARK-542 > URL: https://issues.apache.org/jira/browse/SPARK-542 > Project: Spark > Issue Type: Bug > Components: Mesos >Reporter: frankvictor >Priority: Blocker > > Hi, I encountered weird PageRank runtimes in the last few days. > After debugging the job, I found it was caused by the DNS name. > The machines of my cluster have multiple hostnames; for example, slave 1 has > the names (c001 and c001.cm.cluster). > When spark adds a cache entry in cacheTracker, it gets "c001" and registers the cache under it. > But when scheduling a task in SimpleJob, the mesos offer gives spark > "c001.cm.cluster", > so it will never get the preferred location! > I think spark should handle the multiple-hostname case (by using the IP instead > of the hostname, or some other method). > Thanks!
[jira] [Closed] (SPARK-538) INFO spark.MesosScheduler: Ignoring update from TID 9 because its job is gone
[ https://issues.apache.org/jira/browse/SPARK-538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Farrellee closed SPARK-538. --- Resolution: Done > INFO spark.MesosScheduler: Ignoring update from TID 9 because its job is gone > - > > Key: SPARK-538 > URL: https://issues.apache.org/jira/browse/SPARK-538 > Project: Spark > Issue Type: Bug >Reporter: vince67 > > Hi Matei, >Maybe I can't describe it clearly. >We run masters and slaves on different machines, and that succeeds. >But when we run spark.examples.SparkPi on the master, our > process hangs and we never get the result. >Description like this: > > > 12/09/02 16:47:54 INFO spark.BoundedMemoryCache: BoundedMemoryCache.maxBytes > = 339585269 > 12/09/02 16:47:54 INFO spark.CacheTrackerActor: Registered actor on port 7077 > 12/09/02 16:47:54 INFO spark.CacheTrackerActor: Started slave cache (size > 323.9MB) on vince67-ThinkCentre- > 12/09/02 16:47:54 INFO spark.MapOutputTrackerActor: Registered actor on port > 7077 > 12/09/02 16:47:54 INFO spark.ShuffleManager: Shuffle dir: > /tmp/spark-local-3e79b235-1b94-44d1-823b-0369f6698688/shuffle > 12/09/02 16:47:54 INFO server.Server: jetty-7.5.3.v20111011 > 12/09/02 16:47:54 INFO server.AbstractConnector: Started > SelectChannelConnector@0.0.0.0:49578 STARTING > 12/09/02 16:47:54 INFO spark.ShuffleManager: Local URI: > http://ip.ip.ip.ip:49578 > 12/09/02 16:47:55 INFO server.Server: jetty-7.5.3.v20111011 > 12/09/02 16:47:55 INFO server.AbstractConnector: Started > SelectChannelConnector@0.0.0.0:49600 STARTING > 12/09/02 16:47:55 INFO broadcast.HttpBroadcast: Broadcast server started at > http://ip.ip.ip.ip:49600 > 12/09/02 16:47:55 INFO spark.MesosScheduler: Registered as framework ID > 201209021640-74572372-5050-16898-0004 > 12/09/02 16:47:55 INFO spark.SparkContext: Starting job... 
> 12/09/02 16:47:55 INFO spark.CacheTracker: Registering RDD ID 1 with cache > 12/09/02 16:47:55 INFO spark.CacheTrackerActor: Registering RDD 1 with 2 > partitions > 12/09/02 16:47:55 INFO spark.CacheTracker: Registering RDD ID 0 with cache > 12/09/02 16:47:55 INFO spark.CacheTrackerActor: Registering RDD 0 with 2 > partitions > 12/09/02 16:47:55 INFO spark.CacheTrackerActor: Asked for current cache > locations > 12/09/02 16:47:55 INFO spark.MesosScheduler: Final stage: Stage 0 > 12/09/02 16:47:55 INFO spark.MesosScheduler: Parents of final stage: List() > 12/09/02 16:47:55 INFO spark.MesosScheduler: Missing parents: List() > 12/09/02 16:47:55 INFO spark.MesosScheduler: Submitting Stage 0, which has no > missing parents > 12/09/02 16:47:55 INFO spark.MesosScheduler: Got a job with 2 tasks > 12/09/02 16:47:55 INFO spark.MesosScheduler: Adding job with ID 0 > 12/09/02 16:47:55 INFO spark.SimpleJob: Starting task 0:0 as TID 0 on slave > 201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred) > 12/09/02 16:47:55 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and > took 151 ms to serialize by spark.JavaSerializerInstance > 12/09/02 16:47:55 INFO spark.SimpleJob: Starting task 0:1 as TID 1 on slave > 201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred) > 12/09/02 16:47:55 INFO spark.SimpleJob: Size of task 0:1 is 1606 bytes and > took 1 ms to serialize by spark.JavaSerializerInstance > 12/09/02 16:47:56 INFO spark.SimpleJob: Lost TID 0 (task 0:0) > 12/09/02 16:47:56 INFO spark.SimpleJob: Starting task 0:0 as TID 2 on slave > 201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred) > 12/09/02 16:47:56 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and > took 1 ms to serialize by spark.JavaSerializerInstance > 12/09/02 16:47:56 INFO spark.SimpleJob: Lost TID 1 (task 0:1) > 12/09/02 16:47:56 INFO spark.SimpleJob: Starting task 0:1 as TID 3 on slave > 201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred) > 12/09/02 
16:47:56 INFO spark.SimpleJob: Size of task 0:1 is 1606 bytes and > took 5 ms to serialize by spark.JavaSerializerInstance > 12/09/02 16:47:57 INFO spark.SimpleJob: Lost TID 2 (task 0:0) > 12/09/02 16:47:57 INFO spark.SimpleJob: Starting task 0:0 as TID 4 on slave > 201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred) > 12/09/02 16:47:57 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and > took 1 ms to serialize by spark.JavaSerializerInstance > 12/09/02 16:47:57 INFO spark.SimpleJob: Lost TID 3 (task 0:1) > 12/09/02 16:47:57 INFO spark.SimpleJob: Starting task 0:1 as TID 5 on slave > 201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred) > 12/09/02 16:47:57 INFO spark.SimpleJob: Size of task 0:1 is 1606 bytes and > took 2 ms to serialize by spark.JavaSerializerInstance > 12/09/02 16:47:58 INFO spark.SimpleJob: Lost TID 4 (task 0:0)
[jira] [Commented] (SPARK-538) INFO spark.MesosScheduler: Ignoring update from TID 9 because its job is gone
[ https://issues.apache.org/jira/browse/SPARK-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142475#comment-14142475 ] Matthew Farrellee commented on SPARK-538: - this is a reasonable question for the user list, see http://spark.apache.org/community.html. i'm going to close this in favor of user list interaction. if you disagree, please re-open. > INFO spark.MesosScheduler: Ignoring update from TID 9 because its job is gone > - > > Key: SPARK-538 > URL: https://issues.apache.org/jira/browse/SPARK-538 > Project: Spark > Issue Type: Bug >Reporter: vince67 > > Hi Matei, >Maybe I can't describe it clearly. >We run masters and slaves on different machines, and that succeeds. >But when we run spark.examples.SparkPi on the master, our > process hangs and we never get the result. >Description like this: > > > 12/09/02 16:47:54 INFO spark.BoundedMemoryCache: BoundedMemoryCache.maxBytes > = 339585269 > 12/09/02 16:47:54 INFO spark.CacheTrackerActor: Registered actor on port 7077 > 12/09/02 16:47:54 INFO spark.CacheTrackerActor: Started slave cache (size > 323.9MB) on vince67-ThinkCentre- > 12/09/02 16:47:54 INFO spark.MapOutputTrackerActor: Registered actor on port > 7077 > 12/09/02 16:47:54 INFO spark.ShuffleManager: Shuffle dir: > /tmp/spark-local-3e79b235-1b94-44d1-823b-0369f6698688/shuffle > 12/09/02 16:47:54 INFO server.Server: jetty-7.5.3.v20111011 > 12/09/02 16:47:54 INFO server.AbstractConnector: Started > SelectChannelConnector@0.0.0.0:49578 STARTING > 12/09/02 16:47:54 INFO spark.ShuffleManager: Local URI: > http://ip.ip.ip.ip:49578 > 12/09/02 16:47:55 INFO server.Server: jetty-7.5.3.v20111011 > 12/09/02 16:47:55 INFO server.AbstractConnector: Started > SelectChannelConnector@0.0.0.0:49600 STARTING > 12/09/02 16:47:55 INFO broadcast.HttpBroadcast: Broadcast server started at > http://ip.ip.ip.ip:49600 > 12/09/02 16:47:55 INFO spark.MesosScheduler: Registered as framework ID > 201209021640-74572372-5050-16898-0004 > 12/09/02 
16:47:55 INFO spark.SparkContext: Starting job... > 12/09/02 16:47:55 INFO spark.CacheTracker: Registering RDD ID 1 with cache > 12/09/02 16:47:55 INFO spark.CacheTrackerActor: Registering RDD 1 with 2 > partitions > 12/09/02 16:47:55 INFO spark.CacheTracker: Registering RDD ID 0 with cache > 12/09/02 16:47:55 INFO spark.CacheTrackerActor: Registering RDD 0 with 2 > partitions > 12/09/02 16:47:55 INFO spark.CacheTrackerActor: Asked for current cache > locations > 12/09/02 16:47:55 INFO spark.MesosScheduler: Final stage: Stage 0 > 12/09/02 16:47:55 INFO spark.MesosScheduler: Parents of final stage: List() > 12/09/02 16:47:55 INFO spark.MesosScheduler: Missing parents: List() > 12/09/02 16:47:55 INFO spark.MesosScheduler: Submitting Stage 0, which has no > missing parents > 12/09/02 16:47:55 INFO spark.MesosScheduler: Got a job with 2 tasks > 12/09/02 16:47:55 INFO spark.MesosScheduler: Adding job with ID 0 > 12/09/02 16:47:55 INFO spark.SimpleJob: Starting task 0:0 as TID 0 on slave > 201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred) > 12/09/02 16:47:55 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and > took 151 ms to serialize by spark.JavaSerializerInstance > 12/09/02 16:47:55 INFO spark.SimpleJob: Starting task 0:1 as TID 1 on slave > 201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred) > 12/09/02 16:47:55 INFO spark.SimpleJob: Size of task 0:1 is 1606 bytes and > took 1 ms to serialize by spark.JavaSerializerInstance > 12/09/02 16:47:56 INFO spark.SimpleJob: Lost TID 0 (task 0:0) > 12/09/02 16:47:56 INFO spark.SimpleJob: Starting task 0:0 as TID 2 on slave > 201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred) > 12/09/02 16:47:56 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and > took 1 ms to serialize by spark.JavaSerializerInstance > 12/09/02 16:47:56 INFO spark.SimpleJob: Lost TID 1 (task 0:1) > 12/09/02 16:47:56 INFO spark.SimpleJob: Starting task 0:1 as TID 3 on slave > 
201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred) > 12/09/02 16:47:56 INFO spark.SimpleJob: Size of task 0:1 is 1606 bytes and > took 5 ms to serialize by spark.JavaSerializerInstance > 12/09/02 16:47:57 INFO spark.SimpleJob: Lost TID 2 (task 0:0) > 12/09/02 16:47:57 INFO spark.SimpleJob: Starting task 0:0 as TID 4 on slave > 201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred) > 12/09/02 16:47:57 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and > took 1 ms to serialize by spark.JavaSerializerInstance > 12/09/02 16:47:57 INFO spark.SimpleJob: Lost TID 3 (task 0:1) > 12/09/02 16:47:57 INFO spark.SimpleJob: Starting task 0:1 as TID 5 on slave > 201209021640-74572372-5050-16898-2: lmr
[jira] [Resolved] (SPARK-537) driver.run() returned with code DRIVER_ABORTED
[ https://issues.apache.org/jira/browse/SPARK-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Farrellee resolved SPARK-537. - Resolution: Fixed Fix Version/s: 1.0.0 > driver.run() returned with code DRIVER_ABORTED > -- > > Key: SPARK-537 > URL: https://issues.apache.org/jira/browse/SPARK-537 > Project: Spark > Issue Type: Bug >Reporter: yshaw > Fix For: 1.0.0 > > > Hi there, > When I try to run Spark on Mesos as a cluster, errors like this happen: > ``` > ./run spark.examples.SparkPi *.*.*.*:5050 > 12/09/07 14:49:28 INFO spark.BoundedMemoryCache: BoundedMemoryCache.maxBytes > = 994836480 > 12/09/07 14:49:28 INFO spark.CacheTrackerActor: Registered actor on port 7077 > 12/09/07 14:49:28 INFO spark.CacheTrackerActor: Started slave cache (size > 948.8MB) on shawpc > 12/09/07 14:49:28 INFO spark.MapOutputTrackerActor: Registered actor on port > 7077 > 12/09/07 14:49:28 INFO spark.ShuffleManager: Shuffle dir: > /tmp/spark-local-81220c47-bc43-4809-ac48-5e3e8e023c8a/shuffle > 12/09/07 14:49:28 INFO server.Server: jetty-7.5.3.v20111011 > 12/09/07 14:49:28 INFO server.AbstractConnector: Started > SelectChannelConnector@0.0.0.0:57595 STARTING > 12/09/07 14:49:28 INFO spark.ShuffleManager: Local URI: http://127.0.1.1:57595 > 12/09/07 14:49:28 INFO server.Server: jetty-7.5.3.v20111011 > 12/09/07 14:49:28 INFO server.AbstractConnector: Started > SelectChannelConnector@0.0.0.0:60113 STARTING > 12/09/07 14:49:28 INFO broadcast.HttpBroadcast: Broadcast server started at > http://127.0.1.1:60113 > 12/09/07 14:49:28 INFO spark.MesosScheduler: Temp directory for JARs: > /tmp/spark-d541f37c-ae35-476c-b2fc-9908b0739f50 > 12/09/07 14:49:28 INFO server.Server: jetty-7.5.3.v20111011 > 12/09/07 14:49:28 INFO server.AbstractConnector: Started > SelectChannelConnector@0.0.0.0:50511 STARTING > 12/09/07 14:49:28 INFO spark.MesosScheduler: JAR server started at > http://127.0.1.1:50511 > 12/09/07 14:49:28 INFO spark.MesosScheduler: Registered as framework ID 
> 201209071448-846324308-5050-26925- > 12/09/07 14:49:29 INFO spark.SparkContext: Starting job... > 12/09/07 14:49:29 INFO spark.CacheTracker: Registering RDD ID 1 with cache > 12/09/07 14:49:29 INFO spark.CacheTrackerActor: Registering RDD 1 with 2 > partitions > 12/09/07 14:49:29 INFO spark.CacheTracker: Registering RDD ID 0 with cache > 12/09/07 14:49:29 INFO spark.CacheTrackerActor: Registering RDD 0 with 2 > partitions > 12/09/07 14:49:29 INFO spark.CacheTrackerActor: Asked for current cache > locations > 12/09/07 14:49:29 INFO spark.MesosScheduler: Final stage: Stage 0 > 12/09/07 14:49:29 INFO spark.MesosScheduler: Parents of final stage: List() > 12/09/07 14:49:29 INFO spark.MesosScheduler: Missing parents: List() > 12/09/07 14:49:29 INFO spark.MesosScheduler: Submitting Stage 0, which has no > missing parents > 12/09/07 14:49:29 INFO spark.MesosScheduler: Got a job with 2 tasks > 12/09/07 14:49:29 INFO spark.MesosScheduler: Adding job with ID 0 > 12/09/07 14:49:29 INFO spark.SimpleJob: Starting task 0:0 as TID 0 on slave > 201209071448-846324308-5050-26925-0: shawpc (preferred) > 12/09/07 14:49:29 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and > took 52 ms to serialize by spark.JavaSerializerInstance > 12/09/07 14:49:29 INFO spark.SimpleJob: Starting task 0:1 as TID 1 on slave > 201209071448-846324308-5050-26925-0: shawpc (preferred) > 12/09/07 14:49:29 INFO spark.SimpleJob: Size of task 0:1 is 1606 bytes and > took 1 ms to serialize by spark.JavaSerializerInstance > 12/09/07 14:49:30 INFO spark.SimpleJob: Lost TID 0 (task 0:0) > 12/09/07 14:49:30 INFO spark.SimpleJob: Starting task 0:0 as TID 2 on slave > 201209071448-846324308-5050-26925-0: shawpc (preferred) > 12/09/07 14:49:30 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and > took 0 ms to serialize by spark.JavaSerializerInstance > 12/09/07 14:49:30 INFO spark.SimpleJob: Lost TID 1 (task 0:1) > 12/09/07 14:49:30 INFO spark.SimpleJob: Lost TID 2 (task 0:0) > 12/09/07 14:49:30 INFO 
spark.SimpleJob: Starting task 0:0 as TID 3 on slave > 201209071448-846324308-5050-26925-0: shawpc (preferred) > 12/09/07 14:49:30 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and > took 2 ms to serialize by spark.JavaSerializerInstance > 12/09/07 14:49:32 INFO spark.SimpleJob: Starting task 0:1 as TID 4 on slave > 201209071448-846324308-5050-26925-0: shawpc (preferred) > 12/09/07 14:49:32 INFO spark.SimpleJob: Size of task 0:1 is 1606 bytes and > took 1 ms to serialize by spark.JavaSerializerInstance > 12/09/07 14:49:32 INFO spark.SimpleJob: Lost TID 3 (task 0:0) > 12/09/07 14:49:32 INFO spark.SimpleJob: Starting task 0:0 as TID 5 on slave > 201209071448-846324308-5050-26925-0: shawpc (preferred) > 12/09/07 14:49:32 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and > t
[jira] [Commented] (SPARK-537) driver.run() returned with code DRIVER_ABORTED
[ https://issues.apache.org/jira/browse/SPARK-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142474#comment-14142474 ] Matthew Farrellee commented on SPARK-537: - this should be resolved by a number of fixes in 1.0. please re-open if it still reproduces. > driver.run() returned with code DRIVER_ABORTED > -- > > Key: SPARK-537 > URL: https://issues.apache.org/jira/browse/SPARK-537 > Project: Spark > Issue Type: Bug >Reporter: yshaw > > Hi there, > When I try to run Spark on Mesos as a cluster, errors like this happen: > ``` > ./run spark.examples.SparkPi *.*.*.*:5050 > 12/09/07 14:49:28 INFO spark.BoundedMemoryCache: BoundedMemoryCache.maxBytes > = 994836480 > 12/09/07 14:49:28 INFO spark.CacheTrackerActor: Registered actor on port 7077 > 12/09/07 14:49:28 INFO spark.CacheTrackerActor: Started slave cache (size > 948.8MB) on shawpc > 12/09/07 14:49:28 INFO spark.MapOutputTrackerActor: Registered actor on port > 7077 > 12/09/07 14:49:28 INFO spark.ShuffleManager: Shuffle dir: > /tmp/spark-local-81220c47-bc43-4809-ac48-5e3e8e023c8a/shuffle > 12/09/07 14:49:28 INFO server.Server: jetty-7.5.3.v20111011 > 12/09/07 14:49:28 INFO server.AbstractConnector: Started > SelectChannelConnector@0.0.0.0:57595 STARTING > 12/09/07 14:49:28 INFO spark.ShuffleManager: Local URI: http://127.0.1.1:57595 > 12/09/07 14:49:28 INFO server.Server: jetty-7.5.3.v20111011 > 12/09/07 14:49:28 INFO server.AbstractConnector: Started > SelectChannelConnector@0.0.0.0:60113 STARTING > 12/09/07 14:49:28 INFO broadcast.HttpBroadcast: Broadcast server started at > http://127.0.1.1:60113 > 12/09/07 14:49:28 INFO spark.MesosScheduler: Temp directory for JARs: > /tmp/spark-d541f37c-ae35-476c-b2fc-9908b0739f50 > 12/09/07 14:49:28 INFO server.Server: jetty-7.5.3.v20111011 > 12/09/07 14:49:28 INFO server.AbstractConnector: Started > SelectChannelConnector@0.0.0.0:50511 STARTING > 12/09/07 14:49:28 INFO spark.MesosScheduler: JAR server started at > 
http://127.0.1.1:50511 > 12/09/07 14:49:28 INFO spark.MesosScheduler: Registered as framework ID > 201209071448-846324308-5050-26925- > 12/09/07 14:49:29 INFO spark.SparkContext: Starting job... > 12/09/07 14:49:29 INFO spark.CacheTracker: Registering RDD ID 1 with cache > 12/09/07 14:49:29 INFO spark.CacheTrackerActor: Registering RDD 1 with 2 > partitions > 12/09/07 14:49:29 INFO spark.CacheTracker: Registering RDD ID 0 with cache > 12/09/07 14:49:29 INFO spark.CacheTrackerActor: Registering RDD 0 with 2 > partitions > 12/09/07 14:49:29 INFO spark.CacheTrackerActor: Asked for current cache > locations > 12/09/07 14:49:29 INFO spark.MesosScheduler: Final stage: Stage 0 > 12/09/07 14:49:29 INFO spark.MesosScheduler: Parents of final stage: List() > 12/09/07 14:49:29 INFO spark.MesosScheduler: Missing parents: List() > 12/09/07 14:49:29 INFO spark.MesosScheduler: Submitting Stage 0, which has no > missing parents > 12/09/07 14:49:29 INFO spark.MesosScheduler: Got a job with 2 tasks > 12/09/07 14:49:29 INFO spark.MesosScheduler: Adding job with ID 0 > 12/09/07 14:49:29 INFO spark.SimpleJob: Starting task 0:0 as TID 0 on slave > 201209071448-846324308-5050-26925-0: shawpc (preferred) > 12/09/07 14:49:29 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and > took 52 ms to serialize by spark.JavaSerializerInstance > 12/09/07 14:49:29 INFO spark.SimpleJob: Starting task 0:1 as TID 1 on slave > 201209071448-846324308-5050-26925-0: shawpc (preferred) > 12/09/07 14:49:29 INFO spark.SimpleJob: Size of task 0:1 is 1606 bytes and > took 1 ms to serialize by spark.JavaSerializerInstance > 12/09/07 14:49:30 INFO spark.SimpleJob: Lost TID 0 (task 0:0) > 12/09/07 14:49:30 INFO spark.SimpleJob: Starting task 0:0 as TID 2 on slave > 201209071448-846324308-5050-26925-0: shawpc (preferred) > 12/09/07 14:49:30 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and > took 0 ms to serialize by spark.JavaSerializerInstance > 12/09/07 14:49:30 INFO spark.SimpleJob: Lost TID 1 (task 
0:1) > 12/09/07 14:49:30 INFO spark.SimpleJob: Lost TID 2 (task 0:0) > 12/09/07 14:49:30 INFO spark.SimpleJob: Starting task 0:0 as TID 3 on slave > 201209071448-846324308-5050-26925-0: shawpc (preferred) > 12/09/07 14:49:30 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and > took 2 ms to serialize by spark.JavaSerializerInstance > 12/09/07 14:49:32 INFO spark.SimpleJob: Starting task 0:1 as TID 4 on slave > 201209071448-846324308-5050-26925-0: shawpc (preferred) > 12/09/07 14:49:32 INFO spark.SimpleJob: Size of task 0:1 is 1606 bytes and > took 1 ms to serialize by spark.JavaSerializerInstance > 12/09/07 14:49:32 INFO spark.SimpleJob: Lost TID 3 (task 0:0) > 12/09/07 14:49:32 INFO spark.SimpleJob: Starting task 0:0 as TID 5 on slave > 201209071448-846324308-5050-26925-0: shawpc (preferred) > 12/09/07 14
[jira] [Commented] (SPARK-3625) In some cases, the RDD.checkpoint does not work
[ https://issues.apache.org/jira/browse/SPARK-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142471#comment-14142471 ] Apache Spark commented on SPARK-3625: - User 'witgo' has created a pull request for this issue: https://github.com/apache/spark/pull/2480 > In some cases, the RDD.checkpoint does not work > --- > > Key: SPARK-3625 > URL: https://issues.apache.org/jira/browse/SPARK-3625 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.2, 1.1.0 >Reporter: Guoqiang Li >Assignee: Guoqiang Li >Priority: Blocker > > The reproduce code: > {code} > sc.setCheckpointDir(checkpointDir) > val c = sc.parallelize((1 to 1000)) > c.count > c.checkpoint() > c.count > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3625) In some cases, the RDD.checkpoint does not work
[ https://issues.apache.org/jira/browse/SPARK-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-3625: --- Description: The reproduce code: {code} sc.setCheckpointDir(checkpointDir) val c = sc.parallelize((1 to 1000)) c.count c.checkpoint() c.count {code} was: The reproduce code: {code} sc.setCheckpointDir(checkpointDir) val c = sc.parallelize((1 to 1000)) c.count c.checkpoint() {code} > In some cases, the RDD.checkpoint does not work > --- > > Key: SPARK-3625 > URL: https://issues.apache.org/jira/browse/SPARK-3625 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.2, 1.1.0 >Reporter: Guoqiang Li >Assignee: Guoqiang Li >Priority: Blocker > > The reproduce code: > {code} > sc.setCheckpointDir(checkpointDir) > val c = sc.parallelize((1 to 1000)) > c.count > c.checkpoint() > c.count > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3625) In some cases, the RDD.checkpoint does not work
Guoqiang Li created SPARK-3625: -- Summary: In some cases, the RDD.checkpoint does not work Key: SPARK-3625 URL: https://issues.apache.org/jira/browse/SPARK-3625 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0, 1.0.2 Reporter: Guoqiang Li Priority: Blocker The reproduce code: {code} sc.setCheckpointDir(checkpointDir) val c = sc.parallelize((1 to 1000)) c.count c.checkpoint() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
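As background for the reproduction above: `RDD.checkpoint()` only marks the RDD, and the documented contract is to call it before any job has been run on that RDD; the checkpoint files are then written by the next action. A hedged sketch of the ordering that is expected to work, with a hypothetical checkpoint directory:

```scala
sc.setCheckpointDir("/tmp/checkpoints")  // hypothetical directory
val c = sc.parallelize(1 to 1000)
c.checkpoint()   // mark for checkpointing BEFORE the first action
c.count()        // this job materializes the checkpoint
assert(c.isCheckpointed)
```

The sketch assumes a live `sc`; the bug in this ticket concerns the case where `checkpoint()` is called after the RDD has already been computed.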
[jira] [Commented] (SPARK-3593) Support Sorting of Binary Type Data
[ https://issues.apache.org/jira/browse/SPARK-3593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142468#comment-14142468 ] Matthew Farrellee commented on SPARK-3593: -- [~pmagid] will you provide some example code that demonstrates your issue? > Support Sorting of Binary Type Data > --- > > Key: SPARK-3593 > URL: https://issues.apache.org/jira/browse/SPARK-3593 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.1.0 >Reporter: Paul Magid > > If you try sorting on a binary field you currently get an exception. Please > add support for binary data type sorting. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access
[ https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142466#comment-14142466 ] Xuefu Zhang commented on SPARK-3621: I understand RDD is a concept existing only in the driver. However, accessing the data in a Spark job doesn't have to be in the form of an RDD. An iterator over the underlying data is sufficient, as long as the data has already been shipped to the node when the job starts to run. One way to identify the shipped RDD and the iterator afterwards could be a UUID. Hive on Spark isn't using Spark's transformations to do map-join, or join in general. Hive's own implementation builds hash maps for the small tables when the join starts, and then does key lookups while streaming through the big table. For this, small table data (which can be a result RDD of another Spark job) needs to be shipped to all nodes that do the join. > Provide a way to broadcast an RDD (instead of just a variable made of the > RDD) so that a job can access > --- > > Key: SPARK-3621 > URL: https://issues.apache.org/jira/browse/SPARK-3621 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0, 1.1.0 >Reporter: Xuefu Zhang > > In some cases, such as Hive's way of doing map-side join, it would be > beneficial to allow the client program to broadcast RDDs rather than just > variables made of these RDDs. Broadcasting a variable made of RDDs requires > all RDD data be collected to the driver and that the variable be shipped to > the cluster after being made. It would be more performant if the driver just > broadcast the RDDs and used the corresponding data in jobs (such as building > hashmaps at executors). > Tez has a broadcast edge which can ship data from the previous stage to the next > stage, which doesn't require driver-side processing. 
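Until RDD broadcast exists, the map-join described here is typically approximated by collecting the small side to the driver and broadcasting a hash map, which is exactly the driver round trip this issue wants to eliminate. A hedged sketch of that workaround (table contents are hypothetical):

```scala
// Small table: collect to the driver, then broadcast it as a hash map.
val small = sc.parallelize(Seq(1 -> "a", 2 -> "b"))
val smallMap = sc.broadcast(small.collectAsMap())  // the driver round trip

// Big table: stream through it, probing the broadcast map per key.
val big = sc.parallelize(Seq(1 -> 10.0, 2 -> 20.0, 3 -> 30.0))
val joined = big.flatMap { case (k, v) =>
  smallMap.value.get(k).map(s => (k, (s, v)))  // inner join semantics
}
joined.collect()
```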
[jira] [Commented] (SPARK-637) Create troubleshooting checklist
[ https://issues.apache.org/jira/browse/SPARK-637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142463#comment-14142463 ] Matthew Farrellee commented on SPARK-637: - this is a good idea, and it will take a significant amount of effort. it looks like nothing has happened for almost 2 years. i'm going to close this, but feel free to re-open and push forward with it. > Create troubleshooting checklist > > > Key: SPARK-637 > URL: https://issues.apache.org/jira/browse/SPARK-637 > Project: Spark > Issue Type: New Feature > Components: Documentation >Reporter: Josh Rosen > > We should provide a checklist for troubleshooting common Spark problems. > For example, it could include steps like "check that the Spark code was > copied to all nodes" and "check that the workers successfully connect to the > master." -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-637) Create troubleshooting checklist
[ https://issues.apache.org/jira/browse/SPARK-637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Farrellee closed SPARK-637. --- Resolution: Later > Create troubleshooting checklist > > > Key: SPARK-637 > URL: https://issues.apache.org/jira/browse/SPARK-637 > Project: Spark > Issue Type: New Feature > Components: Documentation >Reporter: Josh Rosen > > We should provide a checklist for troubleshooting common Spark problems. > For example, it could include steps like "check that the Spark code was > copied to all nodes" and "check that the workers successfully connect to the > master." -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-719) Add FAQ page to documentation or webpage
[ https://issues.apache.org/jira/browse/SPARK-719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142459#comment-14142459 ] Matthew Farrellee commented on SPARK-719: - it looks like this has some good content, but it's stale and likely needs vetting. the new FAQ location is http://spark.apache.org/faq.html i'm going to close this since there has been no progress. note - it'll still be available via search feel free to re-open if you disagree. > Add FAQ page to documentation or webpage > > > Key: SPARK-719 > URL: https://issues.apache.org/jira/browse/SPARK-719 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Andy Konwinski >Assignee: Andy Konwinski > > Lots of issues on the mailing list are redundant (e.g., Patrick mentioned > this question has been asked/answered multiple times > https://groups.google.com/d/msg/spark-users/-mYn6BF-Y5Y/8qeXuxs8_d0J). > We should put the solutions to common problems on an FAQ page in the > documentation or on the webpage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-719) Add FAQ page to documentation or webpage
[ https://issues.apache.org/jira/browse/SPARK-719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Farrellee closed SPARK-719. --- Resolution: Done Fix Version/s: (was: 0.7.1) > Add FAQ page to documentation or webpage > > > Key: SPARK-719 > URL: https://issues.apache.org/jira/browse/SPARK-719 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Andy Konwinski >Assignee: Andy Konwinski > > Lots of issues on the mailing list are redundant (e.g., Patrick mentioned > this question has been asked/answered multiple times > https://groups.google.com/d/msg/spark-users/-mYn6BF-Y5Y/8qeXuxs8_d0J). > We should put the solutions to common problems on an FAQ page in the > documentation or on the webpage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-614) Make last 4 digits of framework id in standalone mode logging monotonically increasing
[ https://issues.apache.org/jira/browse/SPARK-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142456#comment-14142456 ] Matthew Farrellee commented on SPARK-614: - it looks like nothing has happened with this in the past 23 months. i'm going to close this, but feel free to re-open. > Make last 4 digits of framework id in standalone mode logging monotonically > increasing > -- > > Key: SPARK-614 > URL: https://issues.apache.org/jira/browse/SPARK-614 > Project: Spark > Issue Type: Improvement >Reporter: Reynold Xin >Assignee: Denny Britz > > In mesos mode, the work log directories are monotonically increasing, and > makes it very easy to spot a folder and go into it (e.g. only need to type > *[last4digit]). > We lost this in the standalone mode, as seen in this example. The last four > digits would go up and down > drwxr-xr-x 3 root root 4096 Nov 8 08:03 job-20121108080355- > drwxr-xr-x 3 root root 4096 Nov 8 08:04 job-20121108080450-0001 > drwxr-xr-x 3 root root 4096 Nov 8 08:07 job-20121108080757-0002 > drwxr-xr-x 3 root root 4096 Nov 8 08:10 job-20121108081014-0003 > drwxr-xr-x 3 root root 4096 Nov 8 08:23 job-20121108082316-0004 > drwxr-xr-x 3 root root 4096 Nov 8 08:26 job-20121108082616-0005 > drwxr-xr-x 3 root root 4096 Nov 8 08:30 job-20121108083034-0006 > drwxr-xr-x 3 root root 4096 Nov 8 08:35 job-20121108083514-0007 > drwxr-xr-x 3 root root 4096 Nov 8 08:38 job-20121108083807-0008 > drwxr-xr-x 3 root root 4096 Nov 8 08:41 job-20121108084105-0009 > drwxr-xr-x 3 root root 4096 Nov 8 08:42 job-20121108084242-0010 > drwxr-xr-x 3 root root 4096 Nov 8 08:45 job-20121108084512-0011 > drwxr-xr-x 3 root root 4096 Nov 8 09:01 job-20121108090113- > drwxr-xr-x 3 root root 4096 Nov 8 09:15 job-20121108091536-0001 > drwxr-xr-x 3 root root 4096 Nov 8 09:24 job-20121108092341-0003 > drwxr-xr-x 3 root root 4096 Nov 8 09:27 job-20121108092703- > drwxr-xr-x 3 root root 4096 Nov 8 09:46 job-20121108094629-0001 > drwxr-xr-x 3 root 
root 4096 Nov 8 09:48 job-20121108094809-0002 > drwxr-xr-x 3 root root 4096 Nov 8 10:04 job-20121108100418-0003 > drwxr-xr-x 3 root root 4096 Nov 8 10:18 job-20121108101814-0004 > drwxr-xr-x 3 root root 4096 Nov 8 10:22 job-20121108102207-0005 > drwxr-xr-x 3 root root 4096 Nov 8 18:48 job-20121108184842-0006 > drwxr-xr-x 3 root root 4096 Nov 8 18:49 job-20121108184932-0007 > drwxr-xr-x 3 root root 4096 Nov 8 18:50 job-20121108185007-0008 > drwxr-xr-x 3 root root 4096 Nov 8 18:50 job-20121108185040-0009 > drwxr-xr-x 3 root root 4096 Nov 8 18:51 job-20121108185127-0010 > drwxr-xr-x 3 root root 4096 Nov 8 18:54 job-20121108185428-0011 > drwxr-xr-x 3 root root 4096 Nov 8 18:58 job-20121108185837-0012 > drwxr-xr-x 3 root root 4096 Nov 8 18:58 job-20121108185854-0013 > drwxr-xr-x 3 root root 4096 Nov 8 19:00 job-20121108190005-0014 > drwxr-xr-x 3 root root 4096 Nov 8 19:00 job-20121108190059-0015 > drwxr-xr-x 3 root root 4096 Nov 8 19:10 job-20121108191010-0016 > drwxr-xr-x 3 root root 4096 Nov 8 19:15 job-20121108191508-0017 > drwxr-xr-x 3 root root 4096 Nov 8 19:21 job-20121108192125-0018 > drwxr-xr-x 3 root root 4096 Nov 8 19:23 job-20121108192329-0019 > drwxr-xr-x 3 root root 4096 Nov 8 19:26 job-20121108192638-0020 > drwxr-xr-x 3 root root 4096 Nov 8 19:35 job-20121108193554-0022 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
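[Editor's note] The request above is for suffixes that only ever increase. One way to get that property is to derive the four-digit suffix from a counter that survives master restarts instead of resetting with each one. A minimal sketch in plain Python — this is not Spark's actual standalone Master code, and the counter file is an illustrative assumption:

```python
import os
import tempfile
import time

def next_job_id(counter_file):
    """Return a job id whose four-digit suffix is monotonically
    increasing, even across restarts, by persisting the counter."""
    count = 0
    if os.path.exists(counter_file):
        with open(counter_file) as f:
            count = int(f.read().strip() or 0)
    count += 1
    with open(counter_file, "w") as f:
        f.write(str(count))
    timestamp = time.strftime("%Y%m%d%H%M%S")
    return f"job-{timestamp}-{count:04d}"

# Demo in a scratch directory; suffixes keep climbing no matter how
# often the process restarts, so typing *0002 is again unambiguous.
counter = os.path.join(tempfile.mkdtemp(), "counter")
print(next_job_id(counter))  # ends in -0001
print(next_job_id(counter))  # ends in -0002
```

With a persisted counter, a directory listing sorted by name matches creation order, which is the Mesos-mode convenience the ticket says was lost.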
[jira] [Closed] (SPARK-614) Make last 4 digits of framework id in standalone mode logging monotonically increasing
[ https://issues.apache.org/jira/browse/SPARK-614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Farrellee closed SPARK-614. --- Resolution: Unresolved Fix Version/s: (was: 0.7.1) > Make last 4 digits of framework id in standalone mode logging monotonically > increasing > -- > > Key: SPARK-614 > URL: https://issues.apache.org/jira/browse/SPARK-614 > Project: Spark > Issue Type: Improvement >Reporter: Reynold Xin >Assignee: Denny Britz > > In mesos mode, the work log directories are monotonically increasing, and > makes it very easy to spot a folder and go into it (e.g. only need to type > *[last4digit]). > We lost this in the standalone mode, as seen in this example. The last four > digits would go up and down > drwxr-xr-x 3 root root 4096 Nov 8 08:03 job-20121108080355- > drwxr-xr-x 3 root root 4096 Nov 8 08:04 job-20121108080450-0001 > drwxr-xr-x 3 root root 4096 Nov 8 08:07 job-20121108080757-0002 > drwxr-xr-x 3 root root 4096 Nov 8 08:10 job-20121108081014-0003 > drwxr-xr-x 3 root root 4096 Nov 8 08:23 job-20121108082316-0004 > drwxr-xr-x 3 root root 4096 Nov 8 08:26 job-20121108082616-0005 > drwxr-xr-x 3 root root 4096 Nov 8 08:30 job-20121108083034-0006 > drwxr-xr-x 3 root root 4096 Nov 8 08:35 job-20121108083514-0007 > drwxr-xr-x 3 root root 4096 Nov 8 08:38 job-20121108083807-0008 > drwxr-xr-x 3 root root 4096 Nov 8 08:41 job-20121108084105-0009 > drwxr-xr-x 3 root root 4096 Nov 8 08:42 job-20121108084242-0010 > drwxr-xr-x 3 root root 4096 Nov 8 08:45 job-20121108084512-0011 > drwxr-xr-x 3 root root 4096 Nov 8 09:01 job-20121108090113- > drwxr-xr-x 3 root root 4096 Nov 8 09:15 job-20121108091536-0001 > drwxr-xr-x 3 root root 4096 Nov 8 09:24 job-20121108092341-0003 > drwxr-xr-x 3 root root 4096 Nov 8 09:27 job-20121108092703- > drwxr-xr-x 3 root root 4096 Nov 8 09:46 job-20121108094629-0001 > drwxr-xr-x 3 root root 4096 Nov 8 09:48 job-20121108094809-0002 > drwxr-xr-x 3 root root 4096 Nov 8 10:04 job-20121108100418-0003 > 
drwxr-xr-x 3 root root 4096 Nov 8 10:18 job-20121108101814-0004 > drwxr-xr-x 3 root root 4096 Nov 8 10:22 job-20121108102207-0005 > drwxr-xr-x 3 root root 4096 Nov 8 18:48 job-20121108184842-0006 > drwxr-xr-x 3 root root 4096 Nov 8 18:49 job-20121108184932-0007 > drwxr-xr-x 3 root root 4096 Nov 8 18:50 job-20121108185007-0008 > drwxr-xr-x 3 root root 4096 Nov 8 18:50 job-20121108185040-0009 > drwxr-xr-x 3 root root 4096 Nov 8 18:51 job-20121108185127-0010 > drwxr-xr-x 3 root root 4096 Nov 8 18:54 job-20121108185428-0011 > drwxr-xr-x 3 root root 4096 Nov 8 18:58 job-20121108185837-0012 > drwxr-xr-x 3 root root 4096 Nov 8 18:58 job-20121108185854-0013 > drwxr-xr-x 3 root root 4096 Nov 8 19:00 job-20121108190005-0014 > drwxr-xr-x 3 root root 4096 Nov 8 19:00 job-20121108190059-0015 > drwxr-xr-x 3 root root 4096 Nov 8 19:10 job-20121108191010-0016 > drwxr-xr-x 3 root root 4096 Nov 8 19:15 job-20121108191508-0017 > drwxr-xr-x 3 root root 4096 Nov 8 19:21 job-20121108192125-0018 > drwxr-xr-x 3 root root 4096 Nov 8 19:23 job-20121108192329-0019 > drwxr-xr-x 3 root root 4096 Nov 8 19:26 job-20121108192638-0020 > drwxr-xr-x 3 root root 4096 Nov 8 19:35 job-20121108193554-0022 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1748) I installed the spark_standalone,but I did not know how to use sbt to compile the programme of spark?
[ https://issues.apache.org/jira/browse/SPARK-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142455#comment-14142455 ] Matthew Farrellee commented on SPARK-1748: -- thanks for the question. you'll get a better response asking on the mailing lists, see http://spark.apache.org/community.html, so i'm going to close this out. > I installed the spark_standalone,but I did not know how to use sbt to compile > the programme of spark? > - > > Key: SPARK-1748 > URL: https://issues.apache.org/jira/browse/SPARK-1748 > Project: Spark > Issue Type: Test > Components: Build >Affects Versions: 0.8.1 > Environment: spark standalone >Reporter: lxflyl > > I installed the mode of spark standalone ,but I did not understand how to use > sbt to compile the program of spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-1748) I installed the spark_standalone,but I did not know how to use sbt to compile the programme of spark?
[ https://issues.apache.org/jira/browse/SPARK-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Farrellee closed SPARK-1748. Resolution: Done Fix Version/s: (was: 0.8.1) > I installed the spark_standalone,but I did not know how to use sbt to compile > the programme of spark? > - > > Key: SPARK-1748 > URL: https://issues.apache.org/jira/browse/SPARK-1748 > Project: Spark > Issue Type: Test > Components: Build >Affects Versions: 0.8.1 > Environment: spark standalone >Reporter: lxflyl > > I installed the mode of spark standalone ,but I did not understand how to use > sbt to compile the program of spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1176) Adding port configuration for HttpBroadcast
[ https://issues.apache.org/jira/browse/SPARK-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Farrellee resolved SPARK-1176. -- Resolution: Fixed Fix Version/s: (was: 0.9.0) 1.1.0 > Adding port configuration for HttpBroadcast > --- > > Key: SPARK-1176 > URL: https://issues.apache.org/jira/browse/SPARK-1176 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 0.9.0 >Reporter: Egor Pakhomov >Priority: Minor > Fix For: 1.1.0 > > > I run spark in big organization, where to open port accessible to other > computers in network, I need to create a ticket on DevOps and it executes for > days. I can't have port for some spark service to be changed all the time. I > need ability to configure this port. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1176) Adding port configuration for HttpBroadcast
[ https://issues.apache.org/jira/browse/SPARK-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142454#comment-14142454 ] Matthew Farrellee commented on SPARK-1176: -- [~epakhomov] it looks like this was resolved by SPARK-2157. i'm going to close this, but please feel free to re-open if it is still an issue for you. > Adding port configuration for HttpBroadcast > --- > > Key: SPARK-1176 > URL: https://issues.apache.org/jira/browse/SPARK-1176 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 0.9.0 >Reporter: Egor Pakhomov >Priority: Minor > Fix For: 0.9.0 > > > I run spark in big organization, where to open port accessible to other > computers in network, I need to create a ticket on DevOps and it executes for > days. I can't have port for some spark service to be changed all the time. I > need ability to configure this port. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-1177) Allow SPARK_JAR to be set in system properties
[ https://issues.apache.org/jira/browse/SPARK-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Farrellee closed SPARK-1177. Resolution: Fixed Fix Version/s: (was: 0.9.0) > Allow SPARK_JAR to be set in system properties > -- > > Key: SPARK-1177 > URL: https://issues.apache.org/jira/browse/SPARK-1177 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 0.9.0 >Reporter: Egor Pakhomov >Priority: Minor > > I'd like to be able to do from my scala code: > System.setProperty("SPARK_YARN_APP_JAR", > SparkContext.jarOfClass(this.getClass).head) > System.setProperty("SPARK_JAR", > SparkContext.jarOfClass(SparkContext.getClass).head) > And do nothing on OS level. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1177) Allow SPARK_JAR to be set in system properties
[ https://issues.apache.org/jira/browse/SPARK-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142447#comment-14142447 ] Matthew Farrellee commented on SPARK-1177: -- [~epakhomov] it looks like this has been resolved in other change, for instance being able to use spark.yarn.jar. i'm going to close this, but feel free to re-open if you think it is still important. > Allow SPARK_JAR to be set in system properties > -- > > Key: SPARK-1177 > URL: https://issues.apache.org/jira/browse/SPARK-1177 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 0.9.0 >Reporter: Egor Pakhomov >Priority: Minor > Fix For: 0.9.0 > > > I'd like to be able to do from my scala code: > System.setProperty("SPARK_YARN_APP_JAR", > SparkContext.jarOfClass(this.getClass).head) > System.setProperty("SPARK_JAR", > SparkContext.jarOfClass(SparkContext.getClass).head) > And do nothing on OS level. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1443) Unable to Access MongoDB GridFS data with Spark using mongo-hadoop API
[ https://issues.apache.org/jira/browse/SPARK-1443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142446#comment-14142446 ] Matthew Farrellee commented on SPARK-1443: -- [~PavanKumarVarma] i hope you've been able to resolve your issue over the past 5 months. since you'll get a better response asking on the spark user list than in jira, see http://spark.apache.org/community.html, i'm going to close this out. > Unable to Access MongoDB GridFS data with Spark using mongo-hadoop API > -- > > Key: SPARK-1443 > URL: https://issues.apache.org/jira/browse/SPARK-1443 > Project: Spark > Issue Type: Improvement > Components: Input/Output, Java API, Spark Core >Affects Versions: 0.9.0 > Environment: Java 1.7,Hadoop 2.2.0,Spark 0.9.0,Ubuntu 12.4, >Reporter: Pavan Kumar Varma >Priority: Critical > Labels: GridFS, MongoDB, Spark, hadoop2, java > Original Estimate: 12h > Remaining Estimate: 12h > > I saved a 2GB pdf file into MongoDB using GridFS. now i want process those > GridFS collection data using Java Spark Mapreduce API. previously i have > successfully processed mongoDB collections with Apache spark using > Mongo-Hadoop connector. now i'm unable to GridFS collections with the > following code. > MongoConfigUtil.setInputURI(config, > "mongodb://localhost:27017/pdfbooks.fs.chunks" ); > MongoConfigUtil.setOutputURI(config,"mongodb://localhost:27017/"+output ); > JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config, > com.mongodb.hadoop.MongoInputFormat.class, Object.class, > BSONObject.class); > JavaRDD<String> words = mongoRDD.flatMap(new > FlatMapFunction<Tuple2<Object, BSONObject>, > String>() { >@Override >public Iterable<String> call(Tuple2<Object, BSONObject> arg) { >System.out.println(arg._2.toString()); >... > Please suggest/provide better API methods to access MongoDB GridFS data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1443) Unable to Access MongoDB GridFS data with Spark using mongo-hadoop API
[ https://issues.apache.org/jira/browse/SPARK-1443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Farrellee resolved SPARK-1443. -- Resolution: Done Fix Version/s: (was: 0.9.0) > Unable to Access MongoDB GridFS data with Spark using mongo-hadoop API > -- > > Key: SPARK-1443 > URL: https://issues.apache.org/jira/browse/SPARK-1443 > Project: Spark > Issue Type: Improvement > Components: Input/Output, Java API, Spark Core >Affects Versions: 0.9.0 > Environment: Java 1.7,Hadoop 2.2.0,Spark 0.9.0,Ubuntu 12.4, >Reporter: Pavan Kumar Varma >Priority: Critical > Labels: GridFS, MongoDB, Spark, hadoop2, java > Original Estimate: 12h > Remaining Estimate: 12h > > I saved a 2GB pdf file into MongoDB using GridFS. now i want process those > GridFS collection data using Java Spark Mapreduce API. previously i have > successfully processed mongoDB collections with Apache spark using > Mongo-Hadoop connector. now i'm unable to GridFS collections with the > following code. > MongoConfigUtil.setInputURI(config, > "mongodb://localhost:27017/pdfbooks.fs.chunks" ); > MongoConfigUtil.setOutputURI(config,"mongodb://localhost:27017/"+output ); > JavaPairRDD mongoRDD = sc.newAPIHadoopRDD(config, > com.mongodb.hadoop.MongoInputFormat.class, Object.class, > BSONObject.class); > JavaRDD words = mongoRDD.flatMap(new > FlatMapFunction, >String>() { >@Override >public Iterable call(Tuple2 arg) { >System.out.println(arg._2.toString()); >... > Please suggest/provide better API methods to access MongoDB GridFS data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3580) Add Consistent Method To Get Number of RDD Partitions Across Different Languages
[ https://issues.apache.org/jira/browse/SPARK-3580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142445#comment-14142445 ] Apache Spark commented on SPARK-3580: - User 'mattf' has created a pull request for this issue: https://github.com/apache/spark/pull/2478 > Add Consistent Method To Get Number of RDD Partitions Across Different > Languages > > > Key: SPARK-3580 > URL: https://issues.apache.org/jira/browse/SPARK-3580 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 1.1.0 >Reporter: Pat McDonough > Labels: starter > > Programmatically retrieving the number of partitions is not consistent > between python and scala. A consistent method should be defined and made > public across both languages. > RDD.partitions.size is also used quite frequently throughout the internal > code, so that might be worth refactoring as well once the new method is > available. > What we have today is below. > In Scala: > {code} > scala> someRDD.partitions.size > res0: Int = 30 > {code} > In Python: > {code} > In [2]: someRDD.getNumPartitions() > Out[2]: 30 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
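[Editor's note] The inconsistency quoted above is essentially a naming question, which a toy model makes concrete. This is plain Python, not PySpark; `ToyRDD` is a made-up stand-in showing how one shared accessor name can delegate to whatever internal partition collection each language already has (Scala's `partitions`, in the real code):

```python
class ToyRDD:
    """Made-up stand-in for an RDD; only models the partition list."""

    def __init__(self, partitions):
        # In Scala this is the collection `rdd.partitions.size` measures.
        self.partitions = list(partitions)

    def getNumPartitions(self):
        # The consistent public accessor the ticket asks for: one name
        # in every language, backed by the existing internal collection.
        return len(self.partitions)

rdd = ToyRDD(range(30))
print(rdd.getNumPartitions())  # 30
```

The refactoring note in the ticket follows naturally: internal callers of `partitions.size` could route through the same public method once it exists.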
[jira] [Commented] (SPARK-3624) "Failed to find Spark assembly in /usr/share/spark/lib" for RELEASED debian packages
[ https://issues.apache.org/jira/browse/SPARK-3624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142412#comment-14142412 ] Apache Spark commented on SPARK-3624: - User 'tzolov' has created a pull request for this issue: https://github.com/apache/spark/pull/2477 > "Failed to find Spark assembly in /usr/share/spark/lib" for RELEASED debian > packages > > > Key: SPARK-3624 > URL: https://issues.apache.org/jira/browse/SPARK-3624 > Project: Spark > Issue Type: Bug > Components: Build, Deploy >Affects Versions: 1.1.0 >Reporter: Christian Tzolov >Priority: Minor > > The compute-classpath.sh requires that for a 'RELEASED' package the Spark > assembly jar is accessible from a /lib folder. > Currently the jdeb packaging (assembly module) bundles the assembly jar into > a folder called 'jars'. > The result is: > /usr/share/spark/bin/spark-submit --num-executors 10 --master > yarn-cluster --class org.apache.spark.examples.SparkPi > /usr/share/spark/jars/spark-examples-1.1.0-hadoop2.2.0-gphd-3.0.1.0.jar 10 > ls: cannot access /usr/share/spark/lib: No such file or directory > Failed to find Spark assembly in /usr/share/spark/lib > You need to build Spark before running this program. > Trivial solution is to rename the '${deb.install.path}/jars' > inside assembly/pom.xml to ${deb.install.path}/lib. > Another less impactful (considering backward compatibility) solution is to > define a lib->jars symlink in the assembly/pom.xml -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3624) "Failed to find Spark assembly in /usr/share/spark/lib" for RELEASED debian packages
Christian Tzolov created SPARK-3624: --- Summary: "Failed to find Spark assembly in /usr/share/spark/lib" for RELEASED debian packages Key: SPARK-3624 URL: https://issues.apache.org/jira/browse/SPARK-3624 Project: Spark Issue Type: Bug Components: Build, Deploy Affects Versions: 1.1.0 Reporter: Christian Tzolov Priority: Minor The compute-classpath.sh requires that for a 'RELEASED' package the Spark assembly jar is accessible from a /lib folder. Currently the jdeb packaging (assembly module) bundles the assembly jar into a folder called 'jars'. The result is: /usr/share/spark/bin/spark-submit --num-executors 10 --master yarn-cluster --class org.apache.spark.examples.SparkPi /usr/share/spark/jars/spark-examples-1.1.0-hadoop2.2.0-gphd-3.0.1.0.jar 10 ls: cannot access /usr/share/spark/lib: No such file or directory Failed to find Spark assembly in /usr/share/spark/lib You need to build Spark before running this program. Trivial solution is to rename the '${deb.install.path}/jars' inside assembly/pom.xml to ${deb.install.path}/lib. Another less impactful (considering backward compatibility) solution is to define a lib->jars symlink in the assembly/pom.xml -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
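[Editor's note] The backward-compatible fix suggested at the end (a lib->jars symlink) can be exercised without touching a real Debian package. This Python sketch rebuilds the reported layout under a temporary directory rather than /usr/share/spark, with a placeholder jar name, and shows that the symlink makes the assembly visible at the path compute-classpath.sh expects:

```python
import os
import tempfile

# Recreate the reported Debian layout under a temp dir instead of
# /usr/share/spark; the jar file name here is a placeholder.
root = tempfile.mkdtemp()
jars = os.path.join(root, "jars")
os.makedirs(jars)
open(os.path.join(jars, "spark-assembly.jar"), "w").close()

# The backward-compatible fix: lib is a symlink to jars, so scripts
# that look under <root>/lib still find the assembly jar.
lib = os.path.join(root, "lib")
os.symlink("jars", lib)

print(os.path.exists(os.path.join(lib, "spark-assembly.jar")))  # True
```

In the actual package the symlink would be declared in assembly/pom.xml as the report suggests; the point here is only that existing paths under jars/ keep working while lib/ satisfies the script.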
[jira] [Updated] (SPARK-3620) Refactor config option handling code for spark-submit
[ https://issues.apache.org/jira/browse/SPARK-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dale Richardson updated SPARK-3620: --- Description: I'm proposing it's time to refactor the configuration argument handling code in spark-submit. The code has grown organically in a short period of time, handles a pretty complicated logic flow, and is now pretty fragile. Some issues that have been identified: 1. Hand-crafted property file readers that do not support the property file format as specified in http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html#load(java.io.Reader) 2. ResolveURI not called on paths read from conf/prop files 3. Inconsistent means of merging / overriding values from different sources (some get overridden by file, others by manual settings of a field on an object, some by properties) 4. Argument validation should be done after combining config files, system properties and command-line arguments 5. Alternate conf file location not handled in shell scripts 6. Some options can only be passed as command-line arguments 7. Defaults for options are hard-coded (and sometimes overridden multiple times) in many places throughout the code, e.g. master = local[*] Initial proposal is to use Typesafe Config to read in the config information and merge the various config sources was: I'm proposing its time to refactor the configuration argument handling code in spark-submit. The code has grown organically in a short period of time, handles a pretty complicated logic flow, and is now pretty fragile. Some issues that have been identified: 1. Hand-crafted property file readers that do not support the property file format as specified in http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html#load(java.io.Reader) 2. ResolveURI not called on paths read from conf/prop files 3. inconsistent means of merging / overriding values from different sources (Some get overridden by file, others by manual settings of field on object, Some by properties) 4. Argument validation should be done after combining config files, system properties and command line arguments, 5. Alternate conf file location not handled in shell scripts 6. Some options can only be passed as command line arguments 7. Defaults for options are hard-coded (and sometimes overridden multiple times) in many through-out the code e.g. master = local[*] > Refactor config option handling code for spark-submit > - > > Key: SPARK-3620 > URL: https://issues.apache.org/jira/browse/SPARK-3620 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 1.0.0, 1.1.0 >Reporter: Dale Richardson >Assignee: Dale Richardson >Priority: Minor > > I'm proposing it's time to refactor the configuration argument handling code > in spark-submit. The code has grown organically in a short period of time, > handles a pretty complicated logic flow, and is now pretty fragile. Some > issues that have been identified: > 1. Hand-crafted property file readers that do not support the property file > format as specified in > http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html#load(java.io.Reader) > 2. ResolveURI not called on paths read from conf/prop files > 3. Inconsistent means of merging / overriding values from different sources > (some get overridden by file, others by manual settings of a field on an > object, some by properties) > 4. Argument validation should be done after combining config files, system > properties and command-line arguments > 5. Alternate conf file location not handled in shell scripts > 6. Some options can only be passed as command-line arguments > 7. Defaults for options are hard-coded (and sometimes overridden multiple > times) in many places throughout the code, e.g. master = local[*] > Initial proposal is to use Typesafe Config to read in the config information > and merge the various config sources -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
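[Editor's note] Points 3 and 4 of the description boil down to merging the config sources in one well-defined precedence order and validating only the merged result. A minimal sketch of that order in plain Python — not the proposed Typesafe Config implementation, and the option names are only illustrative:

```python
def merge_config(defaults, prop_file, sys_props, cli_args):
    """Merge flat dicts of dotted option names; later sources win."""
    merged = {}
    for source in (defaults, prop_file, sys_props, cli_args):
        merged.update(source)
    return merged

conf = merge_config(
    defaults={"spark.master": "local[*]", "spark.executor.memory": "1g"},
    prop_file={"spark.executor.memory": "4g"},
    sys_props={},
    cli_args={"spark.master": "yarn-cluster"},
)
# Argument validation (point 4) would run on `conf`, after merging,
# rather than separately on each source.
print(conf["spark.master"])           # yarn-cluster
print(conf["spark.executor.memory"])  # 4g
```

Hard-coding the defaults in exactly one place (the `defaults` dict) also addresses point 7, where `master = local[*]` is currently repeated throughout the code.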
[jira] [Comment Edited] (SPARK-3620) Refactor config option handling code for spark-submit
[ https://issues.apache.org/jira/browse/SPARK-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142394#comment-14142394 ] Dale Richardson edited comment on SPARK-3620 at 9/21/14 9:30 AM: - Seems to be discussion about moving to typesafe config and back again for version 0.9 http://apache-spark-developers-list.1001551.n3.nabble.com/Moving-to-Typesafe-Config-td381.html was (Author: tigerquoll): Seems to be discussion about moving to typesafe config and back again at http://apache-spark-developers-list.1001551.n3.nabble.com/Moving-to-Typesafe-Config-td381.html > Refactor config option handling code for spark-submit > - > > Key: SPARK-3620 > URL: https://issues.apache.org/jira/browse/SPARK-3620 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 1.0.0, 1.1.0 >Reporter: Dale Richardson >Assignee: Dale Richardson >Priority: Minor > > I'm proposing its time to refactor the configuration argument handling code > in spark-submit. The code has grown organically in a short period of time, > handles a pretty complicated logic flow, and is now pretty fragile. Some > issues that have been identified: > 1. Hand-crafted property file readers that do not support the property file > format as specified in > http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html#load(java.io.Reader) > 2. ResolveURI not called on paths read from conf/prop files > 3. inconsistent means of merging / overriding values from different sources > (Some get overridden by file, others by manual settings of field on object, > Some by properties) > 4. Argument validation should be done after combining config files, system > properties and command line arguments, > 5. Alternate conf file location not handled in shell scripts > 6. Some options can only be passed as command line arguments > 7. Defaults for options are hard-coded (and sometimes overridden multiple > times) in many through-out the code e.g. 
master = local[*] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3620) Refactor config option handling code for spark-submit
[ https://issues.apache.org/jira/browse/SPARK-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142395#comment-14142395 ] Dale Richardson commented on SPARK-3620: Also some notes about config properties that do not have unique prefixes at http://apache-spark-developers-list.1001551.n3.nabble.com/Config-properties-broken-in-master-td208.html It seems the following options have non-unique prefixes, which means that some Typesafe Config functionality may be broken: spark.locality.wait spark.locality.wait.node spark.locality.wait.process spark.locality.wait.rack spark.speculation spark.speculation.interval spark.speculation.multiplier spark.speculation.quantile > Refactor config option handling code for spark-submit > - > > Key: SPARK-3620 > URL: https://issues.apache.org/jira/browse/SPARK-3620 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 1.0.0, 1.1.0 >Reporter: Dale Richardson >Assignee: Dale Richardson >Priority: Minor > > I'm proposing it's time to refactor the configuration argument handling code > in spark-submit. The code has grown organically in a short period of time, > handles a pretty complicated logic flow, and is now pretty fragile. Some > issues that have been identified: > 1. Hand-crafted property file readers that do not support the property file > format as specified in > http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html#load(java.io.Reader) > 2. ResolveURI not called on paths read from conf/prop files > 3. Inconsistent means of merging / overriding values from different sources > (some get overridden by file, others by manually setting a field on an object, > some by properties) > 4. Argument validation should be done after combining config files, system > properties and command line arguments > 5. Alternate conf file location not handled in shell scripts > 6. Some options can only be passed as command line arguments > 7. Defaults for options are hard-coded (and sometimes overridden multiple > times) in many places throughout the code, e.g. master = local[*] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
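Point 1 above is easy to demonstrate: `java.util.Properties.load(Reader)` supports backslash line continuations (and `\uXXXX` escapes) that a hand-rolled split-on-`=` reader silently mishandles. A minimal plain-Scala sketch, using stdlib only; the property values and the `naiveParse` helper are invented for illustration and are not spark-submit's actual parser:

```scala
import java.io.StringReader
import java.util.Properties

object PropsSketch {
  // A value continued across two natural lines with a trailing backslash,
  // which is legal in the .properties format.
  val text: String =
    "spark.master=local[*]\n" +
    "spark.executor.extraJavaOptions=-XX:+UseG1GC \\\n" +
    "  -XX:MaxGCPauseMillis=200\n"

  // Spec-compliant parse via java.util.Properties.load
  def specParse(s: String): Properties = {
    val p = new Properties()
    p.load(new StringReader(s))
    p
  }

  // Naive "split on '='" reader of the kind the issue criticises
  def naiveParse(s: String): Map[String, String] =
    s.split("\n").filter(_.contains("=")).map { line =>
      val i = line.indexOf('=')
      line.substring(0, i).trim -> line.substring(i + 1)
    }.toMap

  def main(args: Array[String]): Unit = {
    // prints: -XX:+UseG1GC -XX:MaxGCPauseMillis=200  (continuation joined)
    println(specParse(text).getProperty("spark.executor.extraJavaOptions"))
    // prints: -XX:+UseG1GC \   (continuation lost, stray backslash kept,
    // and the second line is misread as a separate key/value pair)
    println(naiveParse(text)("spark.executor.extraJavaOptions"))
  }
}
```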
[jira] [Commented] (SPARK-3620) Refactor config option handling code for spark-submit
[ https://issues.apache.org/jira/browse/SPARK-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142394#comment-14142394 ] Dale Richardson commented on SPARK-3620: There seems to be discussion about moving to Typesafe Config and back again at http://apache-spark-developers-list.1001551.n3.nabble.com/Moving-to-Typesafe-Config-td381.html > Refactor config option handling code for spark-submit > - > > Key: SPARK-3620 > URL: https://issues.apache.org/jira/browse/SPARK-3620 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 1.0.0, 1.1.0 >Reporter: Dale Richardson >Assignee: Dale Richardson >Priority: Minor > > I'm proposing it's time to refactor the configuration argument handling code > in spark-submit. The code has grown organically in a short period of time, > handles a pretty complicated logic flow, and is now pretty fragile. Some > issues that have been identified: > 1. Hand-crafted property file readers that do not support the property file > format as specified in > http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html#load(java.io.Reader) > 2. ResolveURI not called on paths read from conf/prop files > 3. Inconsistent means of merging / overriding values from different sources > (some get overridden by file, others by manually setting a field on an object, > some by properties) > 4. Argument validation should be done after combining config files, system > properties and command line arguments > 5. Alternate conf file location not handled in shell scripts > 6. Some options can only be passed as command line arguments > 7. Defaults for options are hard-coded (and sometimes overridden multiple > times) in many places throughout the code, e.g. master = local[*]
[jira] [Updated] (SPARK-3623) Graph should support the checkpoint operation
[ https://issues.apache.org/jira/browse/SPARK-3623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-3623: --- Summary: Graph should support the checkpoint operation (was: GraphX does not support the checkpoint operation) > Graph should support the checkpoint operation > - > > Key: SPARK-3623 > URL: https://issues.apache.org/jira/browse/SPARK-3623 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 1.0.2, 1.1.0 >Reporter: Guoqiang Li >Priority: Critical > > Consider the following code: > {code} > for (i <- 0 until totalIter) { > val previousCorpus = corpus > logInfo("Start Gibbs sampling (Iteration %d/%d)".format(i, totalIter)) > val corpusTopicDist = collectTermTopicDist(corpus, globalTopicCounter, > sumTerms, > numTerms, numTopics, alpha, beta).persist(storageLevel) > val corpusSampleTopics = sampleTopics(corpusTopicDist, > globalTopicCounter, sumTerms, numTerms, > numTopics, alpha, beta).persist(storageLevel) > corpus = updateCounter(corpusSampleTopics, > numTopics).persist(storageLevel) > globalTopicCounter = collectGlobalCounter(corpus, numTopics) > assert(bsum(globalTopicCounter) == sumTerms) > previousCorpus.unpersistVertices() > corpusTopicDist.unpersistVertices() > corpusSampleTopics.unpersistVertices() > } > {code} > Without a checkpoint operation, the following problems appear: > 1. The lineage of the corpus RDD grows too deep > 2. The shuffle files grow too large > 3. Any server crash forces the algorithm to recompute from the beginning
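The lineage problem described above is not specific to GraphX; it is a property of any lazily composed chain of computations. A hypothetical plain-Scala sketch (no Spark; `run`, its parameters, and the closure-chain analogy are invented for illustration) of why forcing a materialisation every few iterations, in the spirit of RDD.checkpoint, keeps the chain shallow:

```scala
object LineageSketch {
  // Run `totalIter` increments, materialising ("checkpointing") every
  // `every` steps so the closure chain never grows unboundedly deep.
  def run(totalIter: Int, every: Int): Long = {
    var compute: () => Long = () => 0L     // analogue of the initial corpus
    for (i <- 0 until totalIter) {
      val prev = compute                   // each step depends on the whole chain so far
      compute = () => prev() + 1
      if (every > 0 && i % every == every - 1) {
        val snapshot = compute()           // force evaluation, like a checkpoint
        compute = () => snapshot           // cut the lineage: later steps see only the snapshot
      }
    }
    compute()
  }

  def main(args: Array[String]): Unit = {
    // prints: 1000 -- checkpointing changes the cost of evaluation
    // and recovery, not the result
    println(run(1000, 100))
  }
}
```

With `every = 0` (no checkpointing) and a large enough `totalIter`, the nested closures would likely overflow the stack, which is the analogue of a lineage that is too deep to recompute cheaply after a failure.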
[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access
[ https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142383#comment-14142383 ] Sean Owen commented on SPARK-3621: -- My understanding is that this is fairly fundamentally not possible in Spark. The metadata and machinery necessary to operate on RDDs live with the driver. RDDs are not accessible within transformations or actions. I'm interested in whether that is in fact true, how much of an issue it really is for Hive-on-Spark to use collect + broadcast, and whether these sorts of requirements can be met with join, cogroup, etc. > Provide a way to broadcast an RDD (instead of just a variable made of the > RDD) so that a job can access > --- > > Key: SPARK-3621 > URL: https://issues.apache.org/jira/browse/SPARK-3621 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0, 1.1.0 >Reporter: Xuefu Zhang > > In some cases, such as Hive's way of doing map-side join, it would be > beneficial to allow the client program to broadcast RDDs rather than just > variables made of these RDDs. Broadcasting a variable made of RDDs requires > all RDD data to be collected to the driver and the variable to be shipped to > the cluster after being made. It would perform better if the driver just > broadcast the RDDs and used the corresponding data in jobs (such as building > hashmaps at executors). > Tez has a broadcast edge which can ship data from the previous stage to the > next stage, which doesn't require driver-side processing.
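For context on the collect + broadcast approach Sean mentions: the small relation is collected to the driver, broadcast as an ordinary variable, and each partition of the large relation is probed against it with no shuffle. A hypothetical plain-Scala sketch of the per-partition logic (no Spark; the names and data are invented for illustration):

```scala
object MapSideJoinSketch {
  // Stands in for smallRdd.collectAsMap() followed by sc.broadcast(...):
  // the small side as a hash map shipped to every task.
  val broadcastMap: Map[Int, String] = Map(1 -> "a", 2 -> "b")

  // Stands in for the work done inside largeRdd.mapPartitions { ... }:
  // probe each record of the big side against the broadcast hash map.
  def joinPartition(part: Seq[(Int, Double)]): Seq[(Int, String, Double)] =
    part.flatMap { case (k, v) =>
      broadcastMap.get(k).map(name => (k, name, v))
    }

  def main(args: Array[String]): Unit = {
    // prints: List((1,a,10.0), (2,b,20.0)) -- key 3 has no match on the small side
    println(joinPartition(Seq((1, 10.0), (2, 20.0), (3, 30.0))))
  }
}
```

The cost the issue objects to is the round trip: all of the small side must pass through the driver before it can be broadcast, whereas a broadcast edge (as in Tez) would ship stage output to the next stage directly.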
[jira] [Updated] (SPARK-3623) GraphX does not support the checkpoint operation
[ https://issues.apache.org/jira/browse/SPARK-3623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-3623: --- Description: Consider the following code: {code} for (i <- 0 until totalIter) { val previousCorpus = corpus logInfo("Start Gibbs sampling (Iteration %d/%d)".format(i, totalIter)) val corpusTopicDist = collectTermTopicDist(corpus, globalTopicCounter, sumTerms, numTerms, numTopics, alpha, beta).persist(storageLevel) val corpusSampleTopics = sampleTopics(corpusTopicDist, globalTopicCounter, sumTerms, numTerms, numTopics, alpha, beta).persist(storageLevel) corpus = updateCounter(corpusSampleTopics, numTopics).persist(storageLevel) globalTopicCounter = collectGlobalCounter(corpus, numTopics) assert(bsum(globalTopicCounter) == sumTerms) previousCorpus.unpersistVertices() corpusTopicDist.unpersistVertices() corpusSampleTopics.unpersistVertices() } {code} Without a checkpoint operation, the following problems appear: 1. The lineage of the corpus RDD grows too deep 2. The shuffle files grow too large 3. Any server crash forces the algorithm to recompute from the beginning was: Consider the following code: {code} for (i <- 0 until totalIter) { val previousCorpus = corpus logInfo("Start Gibbs sampling (Iteration %d/%d)".format(i, totalIter)) val corpusTopicDist = collectTermTopicDist(corpus, globalTopicCounter, sumTerms, numTerms, numTopics, alpha, beta).persist(storageLevel) val corpusSampleTopics = sampleTopics(corpusTopicDist, globalTopicCounter, sumTerms, numTerms, numTopics, alpha, beta).persist(storageLevel) corpus = updateCounter(corpusSampleTopics, numTopics).persist(storageLevel) globalTopicCounter = collectGlobalCounter(corpus, numTopics) assert(bsum(globalTopicCounter) == sumTerms) previousCorpus.unpersistVertices() corpusTopicDist.unpersistVertices() corpusSampleTopics.unpersistVertices() } {code} Without a checkpoint operation, the following problems appear: 1. The lineage of the corpus RDD grows too deep 2. The shuffle files grow too large > GraphX does not support the checkpoint operation > > > Key: SPARK-3623 > URL: https://issues.apache.org/jira/browse/SPARK-3623 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 1.0.2, 1.1.0 >Reporter: Guoqiang Li >Priority: Critical > > Consider the following code: > {code} > for (i <- 0 until totalIter) { > val previousCorpus = corpus > logInfo("Start Gibbs sampling (Iteration %d/%d)".format(i, totalIter)) > val corpusTopicDist = collectTermTopicDist(corpus, globalTopicCounter, > sumTerms, > numTerms, numTopics, alpha, beta).persist(storageLevel) > val corpusSampleTopics = sampleTopics(corpusTopicDist, > globalTopicCounter, sumTerms, numTerms, > numTopics, alpha, beta).persist(storageLevel) > corpus = updateCounter(corpusSampleTopics, > numTopics).persist(storageLevel) > globalTopicCounter = collectGlobalCounter(corpus, numTopics) > assert(bsum(globalTopicCounter) == sumTerms) > previousCorpus.unpersistVertices() > corpusTopicDist.unpersistVertices() > corpusSampleTopics.unpersistVertices() > } > {code} > Without a checkpoint operation, the following problems appear: > 1. The lineage of the corpus RDD grows too deep > 2. The shuffle files grow too large > 3. Any server crash forces the algorithm to recompute from the beginning
[jira] [Created] (SPARK-3623) GraphX does not support the checkpoint operation
Guoqiang Li created SPARK-3623: -- Summary: GraphX does not support the checkpoint operation Key: SPARK-3623 URL: https://issues.apache.org/jira/browse/SPARK-3623 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.1.0, 1.0.2 Reporter: Guoqiang Li Priority: Critical Consider the following code: {code} for (i <- 0 until totalIter) { val previousCorpus = corpus logInfo("Start Gibbs sampling (Iteration %d/%d)".format(i, totalIter)) val corpusTopicDist = collectTermTopicDist(corpus, globalTopicCounter, sumTerms, numTerms, numTopics, alpha, beta).persist(storageLevel) val corpusSampleTopics = sampleTopics(corpusTopicDist, globalTopicCounter, sumTerms, numTerms, numTopics, alpha, beta).persist(storageLevel) corpus = updateCounter(corpusSampleTopics, numTopics).persist(storageLevel) globalTopicCounter = collectGlobalCounter(corpus, numTopics) assert(bsum(globalTopicCounter) == sumTerms) previousCorpus.unpersistVertices() corpusTopicDist.unpersistVertices() corpusSampleTopics.unpersistVertices() } {code} Without a checkpoint operation, the following problems appear: 1. The lineage of the corpus RDD grows too deep 2. The shuffle files grow too large