[jira] [Updated] (SPARK-3633) Fetches failure observed after SPARK-2711

2014-09-21 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3633:
---
Summary: Fetches failure observed after SPARK-2711  (was: PR 1707/commit 
#4fde28c is problematic)

> Fetches failure observed after SPARK-2711
> -
>
> Key: SPARK-3633
> URL: https://issues.apache.org/jira/browse/SPARK-3633
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 1.1.0
>Reporter: Nishkam Ravi
>
> Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. 
> Recently upgraded to Spark 1.1. The workload fails with the following error 
> message(s):
> {code}
> 14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, 
> c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, 
> c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120)
> 14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages
> {code}
> In order to identify the problem, I carried out changeset analysis. As I went 
> back in time, the error message changed to:
> {code}
> 14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, 
> c1706.halxg.cloudera.com): java.io.FileNotFoundException: 
> /var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034
>  (Too many open files)
> java.io.FileOutputStream.open(Native Method)
> java.io.FileOutputStream.<init>(FileOutputStream.java:221)
> 
> org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117)
> 
> org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185)
> 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197)
> 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145)
> org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
> 
> org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:51)
> 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> org.apache.spark.scheduler.Task.run(Task.scala:54)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
> {code}
> This remains the error message all the way back until Aug 4th. It turns out the problematic changeset is 4fde28c.






[jira] [Created] (SPARK-3634) Python modules added through addPyFile should take precedence over system modules

2014-09-21 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-3634:
-

 Summary: Python modules added through addPyFile should take 
precedence over system modules
 Key: SPARK-3634
 URL: https://issues.apache.org/jira/browse/SPARK-3634
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.1.0, 1.0.2
Reporter: Josh Rosen


Python modules added through {{SparkContext.addPyFile()}} are currently 
_appended_ to {{sys.path}}; this is probably the opposite of the behavior that 
we want, since it causes system versions of modules to take precedence over 
versions explicitly added by users.

To fix this, we should change the {{sys.path}} manipulation code in 
{{context.py}} and {{worker.py}} to prepend files to {{sys.path}}.
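A minimal illustration of the difference in plain Python (not the actual PySpark code; the path is made up):

{code}
import sys

added = "/tmp/spark-files/mymodule.zip"   # hypothetical path shipped via addPyFile

# Current behavior: appending means a system-installed copy of the same module
# that already sits earlier on sys.path wins the import.
sys.path.append(added)

# Proposed behavior: prepending lets the user-supplied copy shadow the system one.
sys.path.insert(0, added)
{code}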






[jira] [Comment Edited] (SPARK-3622) Provide a custom transformation that can output multiple RDDs

2014-09-21 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142881#comment-14142881
 ] 

Xuefu Zhang edited comment on SPARK-3622 at 9/22/14 4:39 AM:
-

Thanks for your comments, [~pwendell]. I understand caching A would be helpful 
if I need to transform it to get B and C separately. My proposal is to get B 
and C with just one pass over A, so A doesn't even need to be cached.

Here is an example of how it might be used in Hive.
{code}
JavaPairRDD table = sparkContext.hadoopRDD(..);
Map mappedRDDs = table.mapPartitions(mapFunction);
JavaPairRDD rddA = mappedRDDs.get("A");
JavaPairRDD rddB = mappedRDDs.get("B");
JavaPairRDD sortedRddA = rddA.sortByKey();
JavaPairRDD groupedRddB = rddB.groupByKey();
// further processing sortedRddA and groupedRddB.
...
{code}
In this case, mapFunction can return named iterators for A and B. B is 
automatically computed whenever A is computed, and vice versa. Since both are 
computed if either of them is computed, subsequent references to either one 
should not recompute either of them.

The benefits: 1) no need to cache A; 2) only one pass over the input.

I'm not sure if this is feasible in Spark, but Hive's map function does exactly 
this. Its operator tree can branch off anywhere, resulting in multiple output 
datasets from a single input dataset.

Please let me know if there are more questions.



was (Author: xuefuz):
Thanks for your comments, [~pwendell]. I understand caching A would be helpful 
if I need to transform it to get B and C separately. My proposal is to get B 
and C with just one pass over A, so A doesn't even need to be cached.

Here is an example of how it might be used in Hive.
{code}
JavaPairRDD table = sparkContext.hadoopRDD(..);
Map mappedRDDs = table.mapPartitions(mapFunction);
JavaPairRDD rddA = mappedRDDs.get("A");
JavaPairRDD rddB = mappedRDDs.get("A");
JavaPairRDD sortedRddA = rddA.sortByKey();
JavaPairRDD groupedRddB = rddB.groupByKey();
// further processing sortedRddA and groupedRddB.
...
{code}
In this case, mapFunction can return named iterators for A and B. B is 
automatically computed whenever A is computed, and vice versa. Since both are 
computed if either of them is computed, subsequent references to either one 
should not recompute either of them.

The benefits: 1) no need to cache A; 2) only one pass over the input.

I'm not sure if this is feasible in Spark, but Hive's map function does exactly 
this. Its operator tree can branch off anywhere, resulting in multiple output 
datasets from a single input dataset.

Please let me know if there are more questions.


> Provide a custom transformation that can output multiple RDDs
> -
>
> Key: SPARK-3622
> URL: https://issues.apache.org/jira/browse/SPARK-3622
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Xuefu Zhang
>
> All existing transformations return at most one RDD, even those which take 
> user-supplied functions such as mapPartitions(). However, sometimes a 
> user-provided function may need to output multiple RDDs. For instance, a 
> filter function that divides the input RDD into several RDDs. While it's 
> possible to get multiple RDDs by transforming the same RDD multiple times, it 
> may be more efficient to do this concurrently in one shot, especially when the 
> user's existing function is already generating different data sets.
> This is the case in Hive on Spark, where Hive's map function and reduce 
> function can output different data sets to be consumed by subsequent stages.






[jira] [Commented] (SPARK-3622) Provide a custom transformation that can output multiple RDDs

2014-09-21 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142881#comment-14142881
 ] 

Xuefu Zhang commented on SPARK-3622:


Thanks for your comments, [~pwendell]. I understand caching A would be helpful 
if I need to transform it to get B and C separately. My proposal is to get B 
and C with just one pass over A, so A doesn't even need to be cached.

Here is an example of how it might be used in Hive.
{code}
JavaPairRDD table = sparkContext.hadoopRDD(..);
Map mappedRDDs = table.mapPartitions(mapFunction);
JavaPairRDD rddA = mappedRDDs.get("A");
JavaPairRDD rddB = mappedRDDs.get("A");
JavaPairRDD sortedRddA = rddA.sortByKey();
JavaPairRDD groupedRddB = rddB.groupByKey();
// further processing sortedRddA and groupedRddB.
...
{code}
In this case, mapFunction can return named iterators for A and B. B is 
automatically computed whenever A is computed, and vice versa. Since both are 
computed if either of them is computed, subsequent references to either one 
should not recompute either of them.

The benefits: 1) no need to cache A; 2) only one pass over the input.

I'm not sure if this is feasible in Spark, but Hive's map function does exactly 
this. Its operator tree can branch off anywhere, resulting in multiple output 
datasets from a single input dataset.

Please let me know if there are more questions.


> Provide a custom transformation that can output multiple RDDs
> -
>
> Key: SPARK-3622
> URL: https://issues.apache.org/jira/browse/SPARK-3622
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Xuefu Zhang
>
> All existing transformations return at most one RDD, even those which take 
> user-supplied functions such as mapPartitions(). However, sometimes a 
> user-provided function may need to output multiple RDDs. For instance, a 
> filter function that divides the input RDD into several RDDs. While it's 
> possible to get multiple RDDs by transforming the same RDD multiple times, it 
> may be more efficient to do this concurrently in one shot, especially when the 
> user's existing function is already generating different data sets.
> This is the case in Hive on Spark, where Hive's map function and reduce 
> function can output different data sets to be consumed by subsequent stages.






[jira] [Updated] (SPARK-3633) PR 1707/commit #4fde28c is problematic

2014-09-21 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-3633:
---
Description: 
Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. 
Recently upgraded to Spark 1.1. The workload fails with the following error 
message(s):

{code}
14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, 
c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, 
c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120)

14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages
{code}

In order to identify the problem, I carried out changeset analysis. As I went 
back in time, the error message changed to:

{code}
14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, 
c1706.halxg.cloudera.com): java.io.FileNotFoundException: 
/var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034
 (Too many open files)
java.io.FileOutputStream.open(Native Method)
java.io.FileOutputStream.<init>(FileOutputStream.java:221)

org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117)

org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185)

org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197)

org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145)
org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)

org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:51)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
{code}

This remains the error message all the way back until Aug 4th. It turns out the problematic changeset is 4fde28c.

  was:
Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. 
Recently upgraded to Spark 1.1. The workload fails with the following error 
message(s):

14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, 
c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, 
c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120)

14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages

In order to identify the problem, I carried out changeset analysis. As I went 
back in time, the error message changed to:

14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, 
c1706.halxg.cloudera.com): java.io.FileNotFoundException: 
/var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034
 (Too many open files)
java.io.FileOutputStream.open(Native Method)
java.io.FileOutputStream.<init>(FileOutputStream.java:221)

org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117)

org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185)

org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197)

org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145)
org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)

org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:51)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)

This remains the error message all the way back until Aug 4th. It turns out the problematic changeset is 4fde28c.


> PR 1707/commit #4fde28c is problematic
> --
>
> Key: SPARK-3633
> URL: https://issues.apache.org/jira/browse/SPARK-3633
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 1.1.0
>Reporter: Nishkam Ravi
>
> Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. 
> Recently upgraded to Spark 1.1. The workload fails with the following error 
> message(s):
> {code}
>

[jira] [Created] (SPARK-3633) PR 1707/commit #4fde28c is problematic

2014-09-21 Thread Nishkam Ravi (JIRA)
Nishkam Ravi created SPARK-3633:
---

 Summary: PR 1707/commit #4fde28c is problematic
 Key: SPARK-3633
 URL: https://issues.apache.org/jira/browse/SPARK-3633
 Project: Spark
  Issue Type: Bug
  Components: Block Manager
Affects Versions: 1.1.0
Reporter: Nishkam Ravi


Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. 
Recently upgraded to Spark 1.1. The workload fails with the following error 
message(s):

14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, 
c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, 
c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120)

14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages

In order to identify the problem, I carried out changeset analysis. As I went 
back in time, the error message changed to:

14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, 
c1706.halxg.cloudera.com): java.io.FileNotFoundException: 
/var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034
 (Too many open files)
java.io.FileOutputStream.open(Native Method)
java.io.FileOutputStream.<init>(FileOutputStream.java:221)

org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117)

org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185)

org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197)

org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145)
org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)

org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:51)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)

This remains the error message all the way back until Aug 4th. It turns out the problematic changeset is 4fde28c.






[jira] [Commented] (SPARK-3632) ConnectionManager can run out of receive threads with authentication on

2014-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142866#comment-14142866
 ] 

Apache Spark commented on SPARK-3632:
-

User 'tgravescs' has created a pull request for this issue:
https://github.com/apache/spark/pull/2484

> ConnectionManager can run out of receive threads with authentication on
> ---
>
> Key: SPARK-3632
> URL: https://issues.apache.org/jira/browse/SPARK-3632
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Critical
>
> If you turn authentication on and are using a lot of executors, there is a 
> chance that all of the threads in the handleMessageExecutor could be waiting 
> to send a message because they are blocked waiting on authentication to 
> happen. This can cause a temporary deadlock until the connection times out.






[jira] [Comment Edited] (SPARK-3622) Provide a custom transformation that can output multiple RDDs

2014-09-21 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142865#comment-14142865
 ] 

Patrick Wendell edited comment on SPARK-3622 at 9/22/14 3:24 AM:
-

Do you mind clarifying a little bit how Hive would use this (maybe with a code 
example)?

Let's say you had a transformation that went from a single RDD A to two RDDs B 
and C. The normal way to do this if you want to avoid recomputing A would be to 
persist it, then use it to derive both B and C (this will do multiple passes on 
A, but it won't fully recompute A twice).

I think that doing this in the general case is not possible by definition. The 
user might use B and C at different times, so it's not possible to guarantee 
that A will be computed only once unless you persist A.


was (Author: pwendell):
Do you mind clarifying a little bit how Hive would use this (maybe with a code 
example)? The normal way to do this if you want to avoid recomputing A would be 
to persist it, then use it to derive both B and C (this will do multiple passes 
on A, but it won't fully recompute A twice).

I think that doing this in the general case is not possible by definition. 
Let's say you had a transformation that went from a single RDD A to two RDDs B 
and C. The user might use B and C at different times, so it's not possible to 
guarantee that A will be computed only once unless you persist A.

> Provide a custom transformation that can output multiple RDDs
> -
>
> Key: SPARK-3622
> URL: https://issues.apache.org/jira/browse/SPARK-3622
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Xuefu Zhang
>
> All existing transformations return at most one RDD, even those which take 
> user-supplied functions such as mapPartitions(). However, sometimes a 
> user-provided function may need to output multiple RDDs. For instance, a 
> filter function that divides the input RDD into several RDDs. While it's 
> possible to get multiple RDDs by transforming the same RDD multiple times, it 
> may be more efficient to do this concurrently in one shot, especially when the 
> user's existing function is already generating different data sets.
> This is the case in Hive on Spark, where Hive's map function and reduce 
> function can output different data sets to be consumed by subsequent stages.






[jira] [Commented] (SPARK-3622) Provide a custom transformation that can output multiple RDDs

2014-09-21 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142865#comment-14142865
 ] 

Patrick Wendell commented on SPARK-3622:


Do you mind clarifying a little bit how Hive would use this (maybe with a code 
example)? The normal way to do this if you want to avoid recomputing A would be 
to persist it, then use it to derive both B and C (this will do multiple passes 
on A, but it won't fully recompute A twice).

I think that doing this in the general case is not possible by definition. 
Let's say you had a transformation that went from a single RDD A to two RDDs B 
and C. The user might use B and C at different times, so it's not possible to 
guarantee that A will be computed only once unless you persist A.
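For reference, a minimal PySpark sketch of the persist-then-derive approach described above (the input path, split logic, and keys are made up for illustration):

{code}
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="persist-then-derive")

# A is computed once and kept around; B and C are derived from the persisted
# copy, so building them takes two passes over A but never fully recomputes A.
a = sc.textFile("hdfs:///tmp/input").map(lambda line: line.split("\t"))
a.persist(StorageLevel.MEMORY_AND_DISK)

b = a.filter(lambda cols: cols[0] == "A").sortBy(lambda cols: cols[1])
c = a.filter(lambda cols: cols[0] == "B").groupBy(lambda cols: cols[1])

print(b.count(), c.count())  # each action reads the persisted copy of A
a.unpersist()
{code}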

> Provide a custom transformation that can output multiple RDDs
> -
>
> Key: SPARK-3622
> URL: https://issues.apache.org/jira/browse/SPARK-3622
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Xuefu Zhang
>
> All existing transformations return at most one RDD, even those which take 
> user-supplied functions such as mapPartitions(). However, sometimes a 
> user-provided function may need to output multiple RDDs. For instance, a 
> filter function that divides the input RDD into several RDDs. While it's 
> possible to get multiple RDDs by transforming the same RDD multiple times, it 
> may be more efficient to do this concurrently in one shot, especially when the 
> user's existing function is already generating different data sets.
> This is the case in Hive on Spark, where Hive's map function and reduce 
> function can output different data sets to be consumed by subsequent stages.






[jira] [Commented] (SPARK-3614) Filter on minimum occurrences of a term in IDF

2014-09-21 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142860#comment-14142860
 ] 

RJ Nowling commented on SPARK-3614:
---

Thanks, Andrew! I'll do that.




-- 
em rnowl...@gmail.com
c 954.496.2314


> Filter on minimum occurrences of a term in IDF 
> ---
>
> Key: SPARK-3614
> URL: https://issues.apache.org/jira/browse/SPARK-3614
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Jatinpreet Singh
>Assignee: RJ Nowling
>Priority: Minor
>  Labels: TFIDF
>
> The IDF class in MLlib does not provide the capability of defining a minimum 
> number of documents in the corpus that a term should appear in. The idea is to 
> have a cutoff variable which defines this minimum occurrence value; terms with 
> a lower document frequency are ignored.
> Mathematically,
> IDF(t,D) = log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) >= minimumOccurance
> where
> D is the document corpus and |D| the total number of documents in it,
> DF(t,D) is the number of documents that contain the term t, and
> minimumOccurance is the minimum number of documents the term must appear in.
> This would have an impact on accuracy, as terms that appear in fewer than a 
> certain number of documents have low or no importance in TF-IDF vectors.
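A small plain-Python sketch of the proposed cutoff, following the formula above (the function and parameter names are illustrative, not the eventual MLlib API):

{code}
import math

def idf_with_cutoff(num_docs, doc_freq, minimum_occurrence):
    # IDF(t,D) = log((|D|+1)/(DF(t,D)+1)) when DF(t,D) >= minimum_occurrence;
    # below the cutoff the term contributes nothing to the TF-IDF vector.
    if doc_freq < minimum_occurrence:
        return 0.0
    return math.log((num_docs + 1.0) / (doc_freq + 1.0))

print(idf_with_cutoff(1000, 2, 5))    # 0.0  -- rare term is ignored
print(idf_with_cutoff(1000, 50, 5))   # ~2.98
{code}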






[jira] [Created] (SPARK-3632) ConnectionManager can run out of receive threads with authentication on

2014-09-21 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-3632:


 Summary: ConnectionManager can run out of receive threads with 
authentication on
 Key: SPARK-3632
 URL: https://issues.apache.org/jira/browse/SPARK-3632
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Thomas Graves
Assignee: Thomas Graves
Priority: Critical


If you turn authentication on and are using a lot of executors, there is a 
chance that all of the threads in the handleMessageExecutor could be waiting to 
send a message because they are blocked waiting on authentication to happen. 
This can cause a temporary deadlock until the connection times out.
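The failure mode is the classic bounded-pool starvation problem. A generic Python illustration (not Spark's ConnectionManager code) is below: every pool thread blocks waiting on work that can only be done by a task still sitting in the queue, so nothing progresses until the wait times out.

{code}
import concurrent.futures
import threading

pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
auth_done = threading.Event()

def send_message(i):
    auth_done.wait(timeout=5)   # blocked on "authentication", tying up a worker
    return "sent %d" % i

def authenticate():
    auth_done.set()
    return "authenticated"

# Both workers are consumed by senders waiting for authentication, so the
# authenticate() task never gets a thread until the 5 second wait times out.
sends = [pool.submit(send_message, i) for i in range(2)]
auth = pool.submit(authenticate)
print([f.result() for f in sends], auth.result())
pool.shutdown()
{code}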






[jira] [Commented] (SPARK-3615) Kafka test should not hard code Zookeeper port

2014-09-21 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142855#comment-14142855
 ] 

Saisai Shao commented on SPARK-3615:


Hi Patrick, I've submitted a PR to fix this issue; would you mind taking a look? 
Thanks a lot.

> Kafka test should not hard code Zookeeper port
> --
>
> Key: SPARK-3615
> URL: https://issues.apache.org/jira/browse/SPARK-3615
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Patrick Wendell
>Assignee: Saisai Shao
>Priority: Blocker
>
> This is causing failures in our master build if port 2181 is contended. 
> Instead of binding to a static port, we should refactor this so that it opens 
> a socket on port 0 and then reads back the assigned port, so we can never 
> have contention.
> {code}
> sbt.ForkMain$ForkError: Address already in use
>   at sun.nio.ch.Net.bind0(Native Method)
>   at sun.nio.ch.Net.bind(Net.java:444)
>   at sun.nio.ch.Net.bind(Net.java:436)
>   at 
> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
>   at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
>   at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:67)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.configure(NIOServerCnxnFactory.java:95)
>   at 
> org.apache.spark.streaming.kafka.KafkaTestUtils$EmbeddedZookeeper.<init>(KafkaStreamSuite.scala:200)
>   at 
> org.apache.spark.streaming.kafka.KafkaStreamSuite.beforeFunction(KafkaStreamSuite.scala:62)
>   at 
> org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.setUp(JavaKafkaStreamSuite.java:51)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:27)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
>   at org.junit.runners.Suite.runChild(Suite.java:128)
>   at org.junit.runners.Suite.runChild(Suite.java:24)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:157)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:136)
>   at com.novocode.junit.JUnitRunner.run(JUnitRunner.java:90)
>   at sbt.RunnerWrapper$1.runRunner2(FrameworkWrapper.java:223)
>   at sbt.RunnerWrapper$1.execute(FrameworkWrapper.java:236)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:294)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:284)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
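A minimal Python sketch of the bind-to-port-0 approach suggested above (plain sockets, not the actual Kafka/ZooKeeper test code):

{code}
import socket

# Bind to port 0 so the OS picks any free port, then read the port back.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind(("127.0.0.1", 0))
port = sock.getsockname()[1]
print("start the embedded ZooKeeper on port", port)   # instead of hard-coded 2181
sock.close()
{code}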






[jira] [Commented] (SPARK-3615) Kafka test should not hard code Zookeeper port

2014-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142847#comment-14142847
 ] 

Apache Spark commented on SPARK-3615:
-

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/2483

> Kafka test should not hard code Zookeeper port
> --
>
> Key: SPARK-3615
> URL: https://issues.apache.org/jira/browse/SPARK-3615
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Patrick Wendell
>Assignee: Saisai Shao
>Priority: Blocker
>
> This is causing failures in our master build if port 2181 is contended. 
> Instead of binding to a static port, we should refactor this so that it opens 
> a socket on port 0 and then reads back the assigned port, so we can never 
> have contention.
> {code}
> sbt.ForkMain$ForkError: Address already in use
>   at sun.nio.ch.Net.bind0(Native Method)
>   at sun.nio.ch.Net.bind(Net.java:444)
>   at sun.nio.ch.Net.bind(Net.java:436)
>   at 
> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
>   at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
>   at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:67)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.configure(NIOServerCnxnFactory.java:95)
>   at 
> org.apache.spark.streaming.kafka.KafkaTestUtils$EmbeddedZookeeper.<init>(KafkaStreamSuite.scala:200)
>   at 
> org.apache.spark.streaming.kafka.KafkaStreamSuite.beforeFunction(KafkaStreamSuite.scala:62)
>   at 
> org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.setUp(JavaKafkaStreamSuite.java:51)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:27)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
>   at org.junit.runners.Suite.runChild(Suite.java:128)
>   at org.junit.runners.Suite.runChild(Suite.java:24)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:157)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:136)
>   at com.novocode.junit.JUnitRunner.run(JUnitRunner.java:90)
>   at sbt.RunnerWrapper$1.runRunner2(FrameworkWrapper.java:223)
>   at sbt.RunnerWrapper$1.execute(FrameworkWrapper.java:236)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:294)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:284)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Created] (SPARK-3631) Add docs for checkpoint usage

2014-09-21 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-3631:
-

 Summary: Add docs for checkpoint usage
 Key: SPARK-3631
 URL: https://issues.apache.org/jira/browse/SPARK-3631
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.1.0
Reporter: Andrew Ash
Assignee: Andrew Ash


We should include general documentation on using checkpoints.  Right now the 
docs only cover checkpoints in the Spark Streaming use case, which is slightly 
different from Core.

Some content to consider for inclusion from [~brkyvz]:

{quote}
If you set the checkpointing directory, however, the intermediate state of the 
RDDs will be saved in HDFS, and the lineage will pick up from there.
You won't need to keep the shuffle data from before the checkpointed state, so 
those files can be safely removed (and will be removed automatically).
However, checkpoint must be called explicitly (as in 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L291
); just setting the directory will not be enough.
{quote}

{quote}
Yes, writing to HDFS is more expensive, but I feel it is still a small price to 
pay when compared to having a Disk Space Full error three hours in
and having to start from scratch.

The main goal of checkpointing is to truncate the lineage. Clearing up shuffle 
writes comes as a bonus of checkpointing; it is not the main goal. The
subtlety here is that .checkpoint() is just like .cache(). Until you call an 
action, nothing happens. Therefore, if you're going to do 1000 maps in a
row and you don't want to checkpoint in the meantime until a shuffle happens, 
you will still get a StackOverflowError, because the lineage is too long.

I went through some of the code for checkpointing. As far as I can tell, it 
materializes the data in HDFS, and resets all its dependencies, so you start
a fresh lineage. My understanding would be that checkpointing still should be 
done every N operations to reset the lineage. However, an action must be
performed before the lineage grows too long.
{quote}

A good place to put this information would be at 
https://spark.apache.org/docs/latest/programming-guide.html
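For reference, a minimal PySpark sketch of the usage pattern described in the quotes above (the directory and the loop are illustrative):

{code}
from pyspark import SparkContext

sc = SparkContext(appName="checkpoint-demo")
sc.setCheckpointDir("hdfs:///tmp/checkpoints")  # setting the directory alone does nothing

rdd = sc.parallelize(range(1000))
for i in range(100):
    rdd = rdd.map(lambda x: x + 1)
    if i % 10 == 0:
        rdd.checkpoint()   # like cache(): only takes effect once an action runs
        rdd.count()        # the action materializes the RDD and truncates the lineage
{code}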






[jira] [Commented] (SPARK-559) Automatically register all classes used in fields of a class with Kryo

2014-09-21 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142830#comment-14142830
 ] 

Andrew Ash commented on SPARK-559:
--

As of today, master is using Twitter Chill version 0.3.6, which includes Kryo 
2.21, so we are on the Kryo 2.x branch now.

> Automatically register all classes used in fields of a class with Kryo
> --
>
> Key: SPARK-559
> URL: https://issues.apache.org/jira/browse/SPARK-559
> Project: Spark
>  Issue Type: Bug
>Reporter: Matei Zaharia
>







[jira] [Updated] (SPARK-3577) Add task metric to report spill time

2014-09-21 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-3577:
--
Description: The {{ExternalSorter}} passes its own {{ShuffleWriteMetrics}} 
into {{ExternalSorter}}.  The write time recorded in those metrics is never 
used.  We should probably add task metrics to report this spill time, since for 
shuffles, this would have previously been reported as part of shuffle write 
time (with the original hash-based sorter).  (was: The ExternalSorter passes 
its own ShuffleWriteMetrics into ExternalSorter.  The write time recorded in 
those metrics is never used.  We should probably add task metrics to report 
this spill time, since for shuffles, this would have previously been reported 
as part of shuffle write time (with the original hash-based sorter).)

> Add task metric to report spill time
> 
>
> Key: SPARK-3577
> URL: https://issues.apache.org/jira/browse/SPARK-3577
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Kay Ousterhout
>Priority: Minor
>
> The {{ExternalSorter}} passes its own {{ShuffleWriteMetrics}} into 
> {{ExternalSorter}}.  The write time recorded in those metrics is never used.  
> We should probably add task metrics to report this spill time, since for 
> shuffles, this would have previously been reported as part of shuffle write 
> time (with the original hash-based sorter).






[jira] [Commented] (SPARK-3614) Filter on minimum occurrences of a term in IDF

2014-09-21 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142827#comment-14142827
 ] 

Andrew Ash commented on SPARK-3614:
---

Great! I assigned this ticket to you, RJ.  Please try to have a draft commit 
ready for review within a couple of weeks so others who might want to work on this can 
see progress being made.  Otherwise it's best to leave tickets unassigned while 
no one is actively working on them.

Thanks!

> Filter on minimum occurrences of a term in IDF 
> ---
>
> Key: SPARK-3614
> URL: https://issues.apache.org/jira/browse/SPARK-3614
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Jatinpreet Singh
>Assignee: RJ Nowling
>Priority: Minor
>  Labels: TFIDF
>
> The IDF class in MLlib does not provide the capability of defining a minimum 
> number of documents in the corpus that a term should appear in. The idea is to 
> have a cutoff variable which defines this minimum occurrence value; terms with 
> a lower document frequency are ignored.
> Mathematically,
> IDF(t,D) = log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) >= minimumOccurance
> where
> D is the document corpus and |D| the total number of documents in it,
> DF(t,D) is the number of documents that contain the term t, and
> minimumOccurance is the minimum number of documents the term must appear in.
> This would have an impact on accuracy, as terms that appear in fewer than a 
> certain number of documents have low or no importance in TF-IDF vectors.






[jira] [Updated] (SPARK-3614) Filter on minimum occurrences of a term in IDF

2014-09-21 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-3614:
--
Assignee: RJ Nowling

> Filter on minimum occurrences of a term in IDF 
> ---
>
> Key: SPARK-3614
> URL: https://issues.apache.org/jira/browse/SPARK-3614
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Jatinpreet Singh
>Assignee: RJ Nowling
>Priority: Minor
>  Labels: TFIDF
>
> The IDF class in MLlib does not provide the capability of defining a minimum 
> number of documents in the corpus that a term should appear in. The idea is to 
> have a cutoff variable which defines this minimum occurrence value; terms with 
> a lower document frequency are ignored.
> Mathematically,
> IDF(t,D) = log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) >= minimumOccurance
> where
> D is the document corpus and |D| the total number of documents in it,
> DF(t,D) is the number of documents that contain the term t, and
> minimumOccurance is the minimum number of documents the term must appear in.
> This would have an impact on accuracy, as terms that appear in fewer than a 
> certain number of documents have low or no importance in TF-IDF vectors.






[jira] [Updated] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR

2014-09-21 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-3630:
--
Description: 
A recent GraphX commit caused non-deterministic exceptions in unit tests so it 
was reverted (see SPARK-3400).

Separately, [~aash] observed the same exception stacktrace in an 
application-specific Kryo registrator:

{noformat}
com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to 
uncompress the chunk: PARSING_ERROR(2)
com.esotericsoftware.kryo.io.Input.fill(Input.java:142) 
com.esotericsoftware.kryo.io.Input.require(Input.java:169) 
com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) 
com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) 
com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127)
 
com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117)
 
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) 
com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109)
 
com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
 
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
...
{noformat}

This ticket is to identify the cause of the exception in the GraphX commit so 
the faulty commit can be fixed and merged back into master.

  was:
A recent GraphX commit caused non-deterministic exceptions in unit tests so it 
was reverted (see SPARK-3400).

Separately, [~aash] observed the same exception stacktrace in an 
application-specific Kryo registrator:

{noformat}
com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to 
uncompress the chunk: PARSING_ERROR(2) 
com.esotericsoftware.kryo.io.Input.fill(Input.java:142) 
com.esotericsoftware.kryo.io.Input.require(Input.java:169) 
com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) 
com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) 
com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127)
 
com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) 
com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109)
 
com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
{noformat}

This ticket is to identify the cause of the exception in the GraphX commit so 
the faulty commit can be fixed and merged back into master.


> Identify cause of Kryo+Snappy PARSING_ERROR
> ---
>
> Key: SPARK-3630
> URL: https://issues.apache.org/jira/browse/SPARK-3630
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Ash
>Assignee: Ankur Dave
>
> A recent GraphX commit caused non-deterministic exceptions in unit tests so 
> it was reverted (see SPARK-3400).
> Separately, [~aash] observed the same exception stacktrace in an 
> application-specific Kryo registrator:
> {noformat}
> com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to 
> uncompress the chunk: PARSING_ERROR(2)
> com.esotericsoftware.kryo.io.Input.fill(Input.java:142) 
> com.esotericsoftware.kryo.io.Input.require(Input.java:169) 
> com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) 
> com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127)
>  
> com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117)
>  
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109)
>  
> com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
>  
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> ...
> {noformat}
> This ticket is to identify the cause of the exception in the GraphX commit so 
> the faulty commit can be fixed and merged back into master.






[jira] [Commented] (SPARK-3400) GraphX unit tests fail nondeterministically

2014-09-21 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142823#comment-14142823
 ] 

Andrew Ash commented on SPARK-3400:
---

Filed as SPARK-3630

> GraphX unit tests fail nondeterministically
> ---
>
> Key: SPARK-3400
> URL: https://issues.apache.org/jira/browse/SPARK-3400
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.1.0
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>Priority: Blocker
> Fix For: 1.1.1
>
>
> GraphX unit tests have been failing since the fix to SPARK-2823 was merged: 
> https://github.com/apache/spark/commit/9b225ac3072de522b40b46aba6df1f1c231f13ef.
>  Failures have appeared as Snappy parsing errors and shuffle 
> FileNotFoundExceptions. A local test showed that these failures occurred in 
> about 3/10 test runs.
> Reverting the mentioned commit seems to solve the problem. Since this is 
> blocking everyone else, I'm submitting a hotfix to do that, and we can 
> diagnose the problem in more detail afterwards.






[jira] [Created] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR

2014-09-21 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-3630:
-

 Summary: Identify cause of Kryo+Snappy PARSING_ERROR
 Key: SPARK-3630
 URL: https://issues.apache.org/jira/browse/SPARK-3630
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Ash
Assignee: Ankur Dave


A recent GraphX commit caused non-deterministic exceptions in unit tests so it 
was reverted (see SPARK-3400).

Separately, [~aash] observed the same exception stacktrace in an 
application-specific Kryo registrator:

{noformat}
com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to 
uncompress the chunk: PARSING_ERROR(2) 
com.esotericsoftware.kryo.io.Input.fill(Input.java:142) 
com.esotericsoftware.kryo.io.Input.require(Input.java:169) 
com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) 
com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) 
com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127)
 
com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) 
com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109)
 
com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
{noformat}

This ticket is to identify the cause of the exception in the GraphX commit so 
the faulty commit can be fixed and merged back into master.






[jira] [Updated] (SPARK-2630) Input data size of CoalescedRDD is incorrect

2014-09-21 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-2630:
--
Description: 
Given one big file, such as text.4.3G, put it in one task, 

{code}
sc.textFile("text.4.3.G").coalesce(1).count()
{code}

In Web UI of Spark, you will see that the input size is 5.4M. 

  was:
Given one big file, such as text.4.3G, put it in one task, 

sc.textFile("text.4.3.G").coalesce(1).count()

In Web UI of Spark, you will see that the input size is 5.4M. 


> Input data size of CoalescedRDD is incorrect
> 
>
> Key: SPARK-2630
> URL: https://issues.apache.org/jira/browse/SPARK-2630
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Davies Liu
>Assignee: Andrew Ash
>Priority: Blocker
> Attachments: overflow.tiff
>
>
> Given one big file, such as text.4.3G, put it in one task, 
> {code}
> sc.textFile("text.4.3.G").coalesce(1).count()
> {code}
> In Web UI of Spark, you will see that the input size is 5.4M. 






[jira] [Updated] (SPARK-3629) Improvements to YARN doc

2014-09-21 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-3629:
-
Description: 
Right now this doc starts off with a big list of config options, and only then 
tells you how to submit an app. It would be better to put that part and the 
packaging part first, and the config options only at the end.

In addition, the doc mentions yarn-cluster vs yarn-client as separate masters, 
which is inconsistent with the help output from spark-submit (which says to 
always use "yarn").

  was:Right now this doc starts off with a big list of config options, and only 
then tells you how to submit an app. It would be better to put that part and 
the packaging part first, and the config options only at the end.


> Improvements to YARN doc
> 
>
> Key: SPARK-3629
> URL: https://issues.apache.org/jira/browse/SPARK-3629
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, YARN
>Reporter: Matei Zaharia
>
> Right now this doc starts off with a big list of config options, and only 
> then tells you how to submit an app. It would be better to put that part and 
> the packaging part first, and the config options only at the end.
> In addition, the doc mentions yarn-cluster vs yarn-client as separate 
> masters, which is inconsistent with the help output from spark-submit (which 
> says to always use "yarn").






[jira] [Updated] (SPARK-3629) Improvements to YARN doc

2014-09-21 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-3629:
-
Summary: Improvements to YARN doc  (was: Improve ordering of YARN doc)

> Improvements to YARN doc
> 
>
> Key: SPARK-3629
> URL: https://issues.apache.org/jira/browse/SPARK-3629
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, YARN
>Reporter: Matei Zaharia
>
> Right now this doc starts off with a big list of config options, and only 
> then tells you how to submit an app. It would be better to put that part and 
> the packaging part first, and the config options only at the end.






[jira] [Created] (SPARK-3629) Improve ordering of YARN doc

2014-09-21 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-3629:


 Summary: Improve ordering of YARN doc
 Key: SPARK-3629
 URL: https://issues.apache.org/jira/browse/SPARK-3629
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, YARN
Reporter: Matei Zaharia


Right now this doc starts off with a big list of config options, and only then 
tells you how to submit an app. It would be better to put that part and the 
packaging part first, and the config options only at the end.






[jira] [Commented] (SPARK-2321) Design a proper progress reporting & event listener API

2014-09-21 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142772#comment-14142772
 ] 

Josh Rosen commented on SPARK-2321:
---

The scheduler has some data structures like StageInfo, TaskInfo, RDDInfo, etc. 
that expose some of the information that we might want in a user-facing 
progress API, but we can't  expose these classes in their current form since 
they're marked @DeveloperAPI and are full of public, mutable fields (the 
responses returned from our progress / status API need to be immutable).

Maybe we should stabilize these scheduler.*Info classes' public interfaces, 
make them immutable, and add a JobInfo class for capturing per-job information. 
 We can then register a new, private SparkListener for maintaining a view of 
stage progress and add methods to SparkContext that provide stable, pull-based 
access to the snapshots of job/stage/task state.
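
For illustration, a minimal sketch of the pull-based view described above: a listener folds 
scheduler events into an immutable snapshot that callers can poll. StageProgress and 
ProgressListener are illustrative names, not Spark classes.

{code}
import org.apache.spark.scheduler._

// Immutable snapshot type returned to callers (illustrative name).
case class StageProgress(stageId: Int, name: String, numTasks: Int, completedTasks: Int)

// A listener that folds scheduler events into an immutable view. Spark delivers
// listener events on a single bus thread, so the simple var update is enough here.
class ProgressListener extends SparkListener {
  @volatile private var stages = Map.empty[Int, StageProgress]

  override def onStageSubmitted(event: SparkListenerStageSubmitted) {
    val info = event.stageInfo
    stages += info.stageId -> StageProgress(info.stageId, info.name, info.numTasks, 0)
  }

  override def onTaskEnd(event: SparkListenerTaskEnd) {
    stages.get(event.stageId).foreach { p =>
      stages += p.stageId -> p.copy(completedTasks = p.completedTasks + 1)
    }
  }

  // Pull-based access: callers get an immutable copy, never the mutable internals.
  def snapshot(): Seq[StageProgress] = stages.values.toSeq
}

// Wiring (addSparkListener is a developer API):
//   val listener = new ProgressListener
//   sc.addSparkListener(listener)
//   ... later, from any thread: listener.snapshot()
{code}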

> Design a proper progress reporting & event listener API
> ---
>
> Key: SPARK-2321
> URL: https://issues.apache.org/jira/browse/SPARK-2321
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, Spark Core
>Affects Versions: 1.0.0
>Reporter: Reynold Xin
>Assignee: Josh Rosen
>Priority: Critical
>
> This is a ticket to track progress on redesigning the SparkListener and 
> JobProgressListener API.
> There are multiple problems with the current design, including:
> 0. I'm not sure if the API is usable in Java (there are at least some enums 
> we used in Scala and a bunch of case classes that might complicate things).
> 1. The whole API is marked as DeveloperApi, because we haven't paid a lot of 
> attention to it yet. Something as important as progress reporting deserves a 
> more stable API.
> 2. There is no easy way to connect jobs with stages. Similarly, there is no 
> easy way to connect job groups with jobs / stages.
> 3. JobProgressListener itself has no encapsulation at all. States can be 
> arbitrarily mutated by external programs. Variable names are sort of randomly 
> decided and inconsistent. 
> We should just revisit these and propose a new, concrete design. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages

2014-09-21 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-3628:
-
Target Version/s: 1.1.1, 1.2.0, 0.9.3, 1.0.3  (was: 1.1.1, 1.2.0, 1.0.3)

> Don't apply accumulator updates multiple times for tasks in result stages
> -
>
> Key: SPARK-3628
> URL: https://issues.apache.org/jira/browse/SPARK-3628
> Project: Spark
>  Issue Type: Bug
>Reporter: Matei Zaharia
>Priority: Blocker
>
> In previous versions of Spark, accumulator updates only got applied once for 
> accumulators that are only used in actions (i.e. result stages), letting you 
> use them to deterministically compute a result. Unfortunately, this got 
> broken in some recent refactorings.
> This is related to https://issues.apache.org/jira/browse/SPARK-732, but that 
> issue is about applying the same semantics to intermediate stages too, which 
> is more work and may not be what we want for debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages

2014-09-21 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-3628:
-
Target Version/s: 1.1.1, 1.2.0, 1.0.3  (was: 1.1.1, 1.2.0, 0.9.3, 1.0.3)

> Don't apply accumulator updates multiple times for tasks in result stages
> -
>
> Key: SPARK-3628
> URL: https://issues.apache.org/jira/browse/SPARK-3628
> Project: Spark
>  Issue Type: Bug
>Reporter: Matei Zaharia
>Priority: Blocker
>
> In previous versions of Spark, accumulator updates only got applied once for 
> accumulators that are only used in actions (i.e. result stages), letting you 
> use them to deterministically compute a result. Unfortunately, this got 
> broken in some recent refactorings.
> This is related to https://issues.apache.org/jira/browse/SPARK-732, but that 
> issue is about applying the same semantics to intermediate stages too, which 
> is more work and may not be what we want for debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages

2014-09-21 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142756#comment-14142756
 ] 

Matei Zaharia edited comment on SPARK-3628 at 9/21/14 10:49 PM:


BTW the problem is that this used to be guarded against in the TaskSetManager 
(see 
https://github.com/apache/spark/blob/branch-0.6/core/src/main/scala/spark/scheduler/cluster/TaskSetManager.scala#L254
 or  
https://github.com/apache/spark/blob/branch-0.8/core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala#L436),
 and that went away at some point.


was (Author: matei):
BTW the problem is that this used to be guarded against in the TaskSetManager 
(see 
https://github.com/apache/spark/blob/branch-0.6/core/src/main/scala/spark/scheduler/cluster/TaskSetManager.scala#L254),
 and that went away at some point.

> Don't apply accumulator updates multiple times for tasks in result stages
> -
>
> Key: SPARK-3628
> URL: https://issues.apache.org/jira/browse/SPARK-3628
> Project: Spark
>  Issue Type: Bug
>Reporter: Matei Zaharia
>Priority: Blocker
>
> In previous versions of Spark, accumulator updates only got applied once for 
> accumulators that are only used in actions (i.e. result stages), letting you 
> use them to deterministically compute a result. Unfortunately, this got 
> broken in some recent refactorings.
> This is related to https://issues.apache.org/jira/browse/SPARK-732, but that 
> issue is about applying the same semantics to intermediate stages too, which 
> is more work and may not be what we want for debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages

2014-09-21 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142756#comment-14142756
 ] 

Matei Zaharia edited comment on SPARK-3628 at 9/21/14 10:43 PM:


BTW the problem is that this used to be guarded against in the TaskSetManager 
(see 
https://github.com/apache/spark/blob/branch-0.6/core/src/main/scala/spark/scheduler/cluster/TaskSetManager.scala#L254),
 and that went away at some point.


was (Author: matei):
BTW the problem is that this used to be guarded against in the TaskSetManager 
(see 
https://github.com/apache/spark/blob/branch-0.6/core/src/main/scala/spark/scheduler/cluster/TaskSetManager.scala#L253),
 and that went away at some point.

> Don't apply accumulator updates multiple times for tasks in result stages
> -
>
> Key: SPARK-3628
> URL: https://issues.apache.org/jira/browse/SPARK-3628
> Project: Spark
>  Issue Type: Bug
>Reporter: Matei Zaharia
>Priority: Blocker
>
> In previous versions of Spark, accumulator updates only got applied once for 
> accumulators that are only used in actions (i.e. result stages), letting you 
> use them to deterministically compute a result. Unfortunately, this got 
> broken in some recent refactorings.
> This is related to https://issues.apache.org/jira/browse/SPARK-732, but that 
> issue is about applying the same semantics to intermediate stages too, which 
> is more work and may not be what we want for debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages

2014-09-21 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142756#comment-14142756
 ] 

Matei Zaharia commented on SPARK-3628:
--

BTW the problem is that this used to be guarded against in the TaskSetManager 
(see 
https://github.com/apache/spark/blob/branch-0.6/core/src/main/scala/spark/scheduler/cluster/TaskSetManager.scala#L253),
 and that went away at some point.
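
For illustration, a standalone sketch of that kind of guard: apply a task's accumulator 
updates only the first time that task index succeeds, and ignore later completions. The 
names are illustrative, not Spark's actual TaskSetManager fields.

{code}
// Apply a task's accumulator updates only the first time that task index
// succeeds; later (speculative or re-run) completions are ignored.
class ResultTaskTracker(numTasks: Int) {
  private val finished = Array.fill(numTasks)(false)

  def handleSuccess(index: Int,
                    accumUpdates: Map[Long, Any],
                    applyUpdates: Map[Long, Any] => Unit): Unit = {
    if (!finished(index)) {
      finished(index) = true
      applyUpdates(accumUpdates)   // first successful completion: count it once
    }
    // Duplicate completion: keep the result, but never re-apply the updates.
  }
}
{code}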

> Don't apply accumulator updates multiple times for tasks in result stages
> -
>
> Key: SPARK-3628
> URL: https://issues.apache.org/jira/browse/SPARK-3628
> Project: Spark
>  Issue Type: Bug
>Reporter: Matei Zaharia
>Priority: Blocker
>
> In previous versions of Spark, accumulator updates only got applied once for 
> accumulators that are only used in actions (i.e. result stages), letting you 
> use them to deterministically compute a result. Unfortunately, this got 
> broken in some recent refactorings.
> This is related to https://issues.apache.org/jira/browse/SPARK-732, but that 
> issue is about applying the same semantics to intermediate stages too, which 
> is more work and may not be what we want for debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages

2014-09-21 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-3628:


 Summary: Don't apply accumulator updates multiple times for tasks 
in result stages
 Key: SPARK-3628
 URL: https://issues.apache.org/jira/browse/SPARK-3628
 Project: Spark
  Issue Type: Bug
Reporter: Matei Zaharia
Priority: Blocker


In previous versions of Spark, accumulator updates only got applied once for 
accumulators that are only used in actions (i.e. result stages), letting you 
use them to deterministically compute a result. Unfortunately, this got broken 
in some recent refactorings.

This is related to https://issues.apache.org/jira/browse/SPARK-732, but that 
issue is about applying the same semantics to intermediate stages too, which is 
more work and may not be what we want for debugging.
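
For illustration, a small example of the contract being described: an accumulator 
incremented in an action should end up with exactly one increment per input element, even 
if tasks are resubmitted.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object AccumulatorOnce {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("accumulator-once").setMaster("local[2]"))
    val acc = sc.accumulator(0)

    // foreach runs in a result stage; under the intended semantics each element
    // contributes exactly one increment even if a task is resubmitted, so the
    // printed value is deterministically 100.
    sc.parallelize(1 to 100, 4).foreach(_ => acc += 1)
    println(acc.value)

    sc.stop()
  }
}
{code}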



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3627) spark on yarn reports success even though job fails

2014-09-21 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-3627:


 Summary: spark on yarn reports success even though job fails
 Key: SPARK-3627
 URL: https://issues.apache.org/jira/browse/SPARK-3627
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Thomas Graves
Priority: Critical


I was running a word count and saving the output to HDFS. If the output 
directory already exists, YARN reports success even though the job fails, since 
the job requires the output directory not to exist.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3595) Spark should respect configured OutputCommitter when using saveAsHadoopFile

2014-09-21 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3595.

  Resolution: Fixed
   Fix Version/s: 1.2.0
Target Version/s: 1.2.0

Thanks, I've merged this into master. We can consider merging it into 1.1 as 
well later on. I decided not to do that yet because we've often found that 
changes around Hadoop configuration can produce unanticipated regressions. So 
let's see how this fares in master; if there is a lot of demand, we can 
backport the fix once it has been stable in master for a while.

> Spark should respect configured OutputCommitter when using saveAsHadoopFile
> ---
>
> Key: SPARK-3595
> URL: https://issues.apache.org/jira/browse/SPARK-3595
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Ian Hummel
>Assignee: Ian Hummel
> Fix For: 1.2.0
>
>
> When calling {{saveAsHadoopFile}}, Spark hardcodes the OutputCommitter to be 
> a {{FileOutputCommitter}}.
> When using Spark on an EMR cluster to process and write files to/from S3, the 
> default Hadoop configuration uses a {{DirectFileOutputCommitter}} to avoid 
> writing to a temporary directory and doing a copy.
> Will submit a patch via GitHub shortly.
> Cheers,
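
For illustration, a hedged sketch of the configuration this change is meant to honor: name 
the desired committer on the JobConf handed to saveAsHadoopFile. The committer class 
"com.example.DirectOutputCommitter" is a placeholder for whatever committer the cluster 
actually provides.

{code}
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapred.{JobConf, TextOutputFormat}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

// Hand saveAsHadoopFile a JobConf that names the committer we want; with this
// change Spark reads the committer from the JobConf instead of always using
// FileOutputCommitter. The committer class name below is a placeholder.
def saveWithCustomCommitter(sc: SparkContext, path: String) {
  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
    .map { case (k, v) => (new Text(k), new IntWritable(v)) }
  val conf = new JobConf(sc.hadoopConfiguration)
  conf.set("mapred.output.committer.class", "com.example.DirectOutputCommitter")
  pairs.saveAsHadoopFile(path, classOf[Text], classOf[IntWritable],
    classOf[TextOutputFormat[Text, IntWritable]], conf)
}
{code}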



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3595) Spark should respect configured OutputCommitter when using saveAsHadoopFile

2014-09-21 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3595:
---
Assignee: Ian Hummel

> Spark should respect configured OutputCommitter when using saveAsHadoopFile
> ---
>
> Key: SPARK-3595
> URL: https://issues.apache.org/jira/browse/SPARK-3595
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Ian Hummel
>Assignee: Ian Hummel
>
> When calling {{saveAsHadoopFile}}, Spark hardcodes the OutputCommitter to be 
> a {{FileOutputCommitter}}.
> When using Spark on an EMR cluster to process and write files to/from S3, the 
> default Hadoop configuration uses a {{DirectFileOutputCommitter}} to avoid 
> writing to a temporary directory and doing a copy.
> Will submit a patch via GitHub shortly.
> Cheers,



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3614) Filter on minimum occurrences of a term in IDF

2014-09-21 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142631#comment-14142631
 ] 

RJ Nowling commented on SPARK-3614:
---

I would like to work on this.

> Filter on minimum occurrences of a term in IDF 
> ---
>
> Key: SPARK-3614
> URL: https://issues.apache.org/jira/browse/SPARK-3614
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Jatinpreet Singh
>Priority: Minor
>  Labels: TFIDF
>
> The IDF class in MLlib does not provide the capability of defining a minimum 
> number of documents a term should appear in the corpus. The idea is to have a 
> cutoff variable which defines this minimum occurrence value, and the terms 
> which have lower frequency are ignored.
> Mathematically,
> IDF(t,D)=log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) >=minimumOccurance
> where, 
> D is the total number of documents in the corpus
> DF(t,D) is the number of documents that contain the term t
> minimumOccurance is the minimum number of documents the term appears in the 
> document corpus
> This would have an impact on accuracy as terms that appear in less than a 
> certain limit of documents, have low or no importance in TFIDF vectors.
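
For illustration, the proposed cutoff can be computed directly with RDD operations. The 
sketch below is not MLlib API (IDF had no such parameter at the time of this report); it 
just evaluates the formula above with a minimumOccurrence filter.

{code}
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)), kept only for terms whose
// document frequency reaches the minimumOccurrence cutoff.
def idfWithCutoff(docs: RDD[Seq[String]], minimumOccurrence: Long): RDD[(String, Double)] = {
  val numDocs = docs.count()
  val docFreq = docs
    .flatMap(_.distinct)            // count each term at most once per document
    .map(term => (term, 1L))
    .reduceByKey(_ + _)
  docFreq
    .filter { case (_, df) => df >= minimumOccurrence }
    .mapValues(df => math.log((numDocs + 1.0) / (df + 1.0)))
}
{code}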



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3626) Replace AsyncRDDActions with a more general async. API

2014-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142588#comment-14142588
 ] 

Apache Spark commented on SPARK-3626:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/2482

> Replace AsyncRDDActions with a more general async. API
> --
>
> Key: SPARK-3626
> URL: https://issues.apache.org/jira/browse/SPARK-3626
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> The experimental AsyncRDDActions APIs seem to only exist in order to enable 
> job cancellation.
> We've been considering extending these APIs to support progress monitoring, 
> but this would require stabilizing them so they're no longer 
> {{@Experimental}}.
> Instead, I propose to replace all of the AsyncRDDActions with a mechanism 
> based on job groups which allows arbitrary computations to be run in job 
> groups and supports cancellation / monitoring of Spark jobs launched from 
> those computations.
> (full design pending; see my GitHub PR for more details).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-542) Cache Miss when machine have multiple hostname

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee updated SPARK-542:

Priority: Minor  (was: Blocker)

> Cache Miss when machine have multiple hostname
> --
>
> Key: SPARK-542
> URL: https://issues.apache.org/jira/browse/SPARK-542
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Reporter: frankvictor
>Priority: Minor
>
> Hi, I observed unusually slow PageRank runs over the last few days.
> After debugging the job, I found the cause was DNS naming.
> The machines in my cluster have multiple hostnames; for example, slave 1 has 
> the names c001 and c001.cm.cluster.
> When Spark registers a cache entry in CacheTracker, it gets "c001" and records 
> the cache under that name. But when SimpleJob schedules a task, the Mesos 
> offer gives Spark "c001.cm.cluster", so the task never gets a preferred 
> location!
> I think Spark should handle the multiple-hostname case (by using the IP 
> instead of the hostname, or some other method).
> Thanks!
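
For illustration, one way to make the two names compare equal, as the reporter suggests, 
is to normalize hosts to an IP before matching. A standalone sketch, not the actual 
CacheTracker/SimpleJob code.

{code}
import java.net.{InetAddress, UnknownHostException}

// Normalize a host string to its IP so "c001" and "c001.cm.cluster" compare
// equal when they resolve to the same machine; fall back to the original name
// if resolution fails.
def normalizeHost(host: String): String =
  try InetAddress.getByName(host).getHostAddress
  catch { case _: UnknownHostException => host }

// normalizeHost("c001") == normalizeHost("c001.cm.cluster") on such a slave
{code}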



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3626) Replace AsyncRDDActions with a more general async. API

2014-09-21 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-3626:
-

 Summary: Replace AsyncRDDActions with a more general async. API
 Key: SPARK-3626
 URL: https://issues.apache.org/jira/browse/SPARK-3626
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Josh Rosen
Assignee: Josh Rosen


The experimental AsyncRDDActions APIs seem to only exist in order to enable job 
cancellation.

We've been considering extending these APIs to support progress monitoring, but 
this would require stabilizing them so they're no longer {{@Experimental}}.

Instead, I propose to replace all of the AsyncRDDActions with a mechanism based 
on job groups which allows arbitrary computations to be run in job groups and 
supports cancellation / monitoring of Spark jobs launched from those 
computations.

(full design pending; see my GitHub PR for more details).
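
SparkContext already exposes job groups, which are the building block this proposal 
generalizes. A minimal sketch of running an action under a group from a background thread 
and cancelling the whole group:

{code}
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.SparkContext

// Run an action under a job group so everything it launches can be cancelled
// together; interruptOnCancel = true asks executors to interrupt task threads.
def countInGroup(sc: SparkContext, groupId: String): Future[Long] = Future {
  sc.setJobGroup(groupId, "background count", interruptOnCancel = true)
  sc.parallelize(1 to 1000000, 100).map(_ * 2).count()
}

// From any other thread:
//   sc.cancelJobGroup(groupId)   // cancels every job started under groupId
{code}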



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-578) Fix interpreter code generation to only capture needed dependencies

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142558#comment-14142558
 ] 

Matthew Farrellee commented on SPARK-578:
-

[~matei] is this related to slimming down the assembly?

> Fix interpreter code generation to only capture needed dependencies
> ---
>
> Key: SPARK-578
> URL: https://issues.apache.org/jira/browse/SPARK-578
> Project: Spark
>  Issue Type: Improvement
>Reporter: Matei Zaharia
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-575) Maintain a cache of JARs on each node to avoid unnecessary copying

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee closed SPARK-575.
---
Resolution: Incomplete

> Maintain a cache of JARs on each node to avoid unnecessary copying
> --
>
> Key: SPARK-575
> URL: https://issues.apache.org/jira/browse/SPARK-575
> Project: Spark
>  Issue Type: Improvement
>Reporter: Matei Zaharia
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-575) Maintain a cache of JARs on each node to avoid unnecessary copying

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142553#comment-14142553
 ] 

Matthew Farrellee commented on SPARK-575:
-

[~joshrosen] is quite correct.

This issue looks inactive. I'm going to close it out, but as always feel free 
to re-open. I can think of a few ways this could be done, and not all of them 
need Spark code changes.

> Maintain a cache of JARs on each node to avoid unnecessary copying
> --
>
> Key: SPARK-575
> URL: https://issues.apache.org/jira/browse/SPARK-575
> Project: Spark
>  Issue Type: Improvement
>Reporter: Matei Zaharia
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-584) Pass slave ip address when starting a cluster

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142545#comment-14142545
 ] 

Matthew Farrellee commented on SPARK-584:
-

what's the use case for this?

> Pass slave ip address when starting a cluster 
> --
>
> Key: SPARK-584
> URL: https://issues.apache.org/jira/browse/SPARK-584
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 0.6.0
>Priority: Minor
> Attachments: 0001-fix-for-SPARK-584.patch
>
>
> Pass slave ip address from conf while starting a cluster:
> bin/start-slaves.sh is used to start all the slaves in the cluster. While the 
> slave class takes a --ip argument, we don't pass the ip address from the 
> conf/slaves. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-604) reconnect if mesos slaves dies

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee updated SPARK-604:

Component/s: Mesos

> reconnect if mesos slaves dies
> --
>
> Key: SPARK-604
> URL: https://issues.apache.org/jira/browse/SPARK-604
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>
> When running on Mesos, if a slave goes down, Spark doesn't try to reassign 
> the work to another machine. Even if the slave comes back up, the job is 
> doomed.
> Currently when this happens, we just see this in the driver logs:
> 12/11/01 16:48:56 INFO mesos.MesosSchedulerBackend: Mesos slave lost: 
> 201210312057-1560611338-5050-24091-52
> Exception in thread "Thread-346" java.util.NoSuchElementException: key not 
> found: value: "201210312057-1560611338-5050-24091-52"
> at scala.collection.MapLike$class.default(MapLike.scala:224)
> at scala.collection.mutable.HashMap.default(HashMap.scala:43)
> at scala.collection.MapLike$class.apply(MapLike.scala:135)
> at scala.collection.mutable.HashMap.apply(HashMap.scala:43)
> at 
> spark.scheduler.cluster.ClusterScheduler.slaveLost(ClusterScheduler.scala:255)
> at 
> spark.scheduler.mesos.MesosSchedulerBackend.slaveLost(MesosSchedulerBackend.scala:275)
> 12/11/01 16:48:56 INFO mesos.MesosSchedulerBackend: driver.run() returned 
> with code DRIVER_ABORTED
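
For illustration only: the crash above is a bare HashMap.apply on a slave ID that has 
already been removed, and a lookup that tolerates unknown IDs keeps the driver alive. 
slaveIdToHost is a stand-in, not the actual ClusterScheduler state.

{code}
import scala.collection.mutable

// Tolerate a "slave lost" callback for an ID that is no longer tracked instead
// of letting HashMap.apply throw NoSuchElementException and abort the driver.
val slaveIdToHost = mutable.HashMap.empty[String, String]

def slaveLost(slaveId: String): Unit = slaveIdToHost.get(slaveId) match {
  case Some(host) =>
    slaveIdToHost -= slaveId
    println(s"Executor on $host lost; rescheduling its tasks")
  case None =>
    println(s"Ignoring lost message for unknown slave $slaveId")
}
{code}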



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-610) Support master failover in standalone mode

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142528#comment-14142528
 ] 

Matthew Farrellee commented on SPARK-610:
-

[~matei] given YARN and Mesos implementations, is this something the standalone 
mode should strive to do?

> Support master failover in standalone mode
> --
>
> Key: SPARK-610
> URL: https://issues.apache.org/jira/browse/SPARK-610
> Project: Spark
>  Issue Type: New Feature
>Reporter: Matei Zaharia
>
> The standalone deploy mode is quite simple, which shouldn't make it too bad 
> to add support for master failover using ZooKeeper or something similar. This 
> would really up its usefulness.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3577) Add task metric to report spill time

2014-09-21 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142524#comment-14142524
 ] 

Sandy Ryza commented on SPARK-3577:
---

No problem.  Yeah, I agree that a spill time metric would be useful.

> Add task metric to report spill time
> 
>
> Key: SPARK-3577
> URL: https://issues.apache.org/jira/browse/SPARK-3577
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Kay Ousterhout
>Priority: Minor
>
> The ExternalSorter records its spill writes in its own ShuffleWriteMetrics.  
> The write time recorded in those metrics is never used. We should probably 
> add a task metric to report this spill time, since for shuffles this would 
> have previously been reported as part of shuffle write time (with the 
> original hash-based sorter).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2058) SPARK_CONF_DIR should override all present configs

2014-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142518#comment-14142518
 ] 

Apache Spark commented on SPARK-2058:
-

User 'EugenCepoi' has created a pull request for this issue:
https://github.com/apache/spark/pull/2481

> SPARK_CONF_DIR should override all present configs
> --
>
> Key: SPARK-2058
> URL: https://issues.apache.org/jira/browse/SPARK-2058
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.0.0, 1.0.1, 1.1.0
>Reporter: Eugen Cepoi
>Priority: Critical
>
> When the user defines SPARK_CONF_DIR, I think Spark should use all the configs 
> available there, not only spark-env.sh.
> This involves changing SparkSubmitArguments to first read from 
> SPARK_CONF_DIR, and updating the scripts to add SPARK_CONF_DIR to the 
> computed classpath for configs such as log4j, metrics, etc.
> I have already prepared a PR for this. 
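
For illustration, a sketch of the resolution order being requested (SPARK_CONF_DIR first, 
then SPARK_HOME/conf); this is illustrative only, not the actual SparkSubmitArguments 
logic.

{code}
import java.io.File

// Resolve a config file (spark-defaults.conf, log4j.properties, ...) by
// checking SPARK_CONF_DIR first and SPARK_HOME/conf second.
def resolveConfFile(name: String): Option[File] = {
  val dirs = Seq(
    sys.env.get("SPARK_CONF_DIR"),
    sys.env.get("SPARK_HOME").map(_ + File.separator + "conf")
  ).flatten
  dirs.map(dir => new File(dir, name)).find(_.isFile)
}

// resolveConfFile("log4j.properties") picks the SPARK_CONF_DIR copy when both exist.
{code}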



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-690) Stack overflow when running pagerank more than 10000 iterators

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee closed SPARK-690.
---
Resolution: Unresolved

> Stack overflow when running pagerank more than 10000 iterators
> --
>
> Key: SPARK-690
> URL: https://issues.apache.org/jira/browse/SPARK-690
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.6.1
>Reporter: xiajunluan
>
> When I run the PageRank example for more than 10000 iterations, the job 
> client reports stack overflow errors.
> 13/02/07 13:41:40 INFO CacheTracker: Registering RDD ID 57993 with cache
> Exception in thread "DAGScheduler" java.lang.StackOverflowError
>   at 
> java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryAcquireShared(ReentrantReadWriteLock.java:467)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1281)
>   at 
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:731)
>   at 
> org.jboss.netty.akka.util.HashedWheelTimer.scheduleTimeout(HashedWheelTimer.java:277)
>   at 
> org.jboss.netty.akka.util.HashedWheelTimer.newTimeout(HashedWheelTimer.java:264)
>   at akka.actor.DefaultScheduler.scheduleOnce(Scheduler.scala:186)
>   at akka.pattern.PromiseActorRef$.apply(AskSupport.scala:274)
>   at akka.pattern.AskSupport$class.ask(AskSupport.scala:83)
>   at akka.pattern.package$.ask(package.scala:43)
>   at akka.pattern.AskSupport$AskableActorRef.ask(AskSupport.scala:123)
>   at spark.CacheTracker.askTracker(CacheTracker.scala:121)
>   at spark.CacheTracker.communicate(CacheTracker.scala:131)
>   at spark.CacheTracker.registerRDD(CacheTracker.scala:142)
>   at spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:149)
>   at 
> spark.scheduler.DAGScheduler$$anonfun$visit$1$2.apply(DAGScheduler.scala:155)
>   at 
> spark.scheduler.DAGScheduler$$anonfun$visit$1$2.apply(DAGScheduler.scala:150)
>   at 
> scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>   at scala.collection.immutable.List.foreach(List.scala:76)
>   at spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:150)
>   at spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:160)
>   at spark.scheduler.DAGScheduler.newStage(DAGScheduler.scala:131)
>   at 
> spark.scheduler.DAGScheduler.getShuffleMapStage(DAGScheduler.scala:111)
>   at 
> spark.scheduler.DAGScheduler$$anonfun$visit$1$2.apply(DAGScheduler.scala:153)
>   at 
> spark.scheduler.DAGScheduler$$anonfun$visit$1$2.apply(DAGScheduler.scala:150)
>   at 
> scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
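
Not part of the report itself, but the usual workaround for very deep iterative lineages 
(which is what the scheduler is recursing over here) is to checkpoint the iterated RDD 
periodically. A minimal sketch, assuming sc.setCheckpointDir has been configured:

{code}
import org.apache.spark.rdd.RDD

// Truncate lineage every checkpointInterval iterations so the scheduler never
// has to recurse over tens of thousands of chained dependencies.
def iterate(ranks: RDD[(String, Double)],
            step: RDD[(String, Double)] => RDD[(String, Double)],
            iterations: Int,
            checkpointInterval: Int = 50): RDD[(String, Double)] = {
  var current = ranks
  for (i <- 1 to iterations) {
    current = step(current).cache()
    if (i % checkpointInterval == 0) {
      current.checkpoint()   // the next action writes this RDD out and cuts its lineage
      current.count()        // force materialization so the checkpoint happens now
    }
  }
  current
}
{code}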



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-690) Stack overflow when running pagerank more than 10000 iterators

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142511#comment-14142511
 ] 

Matthew Farrellee commented on SPARK-690:
-

[~andrew xia] this is reported against a very old version. I'm going to close 
it out, but if you can reproduce it, please re-open.

> Stack overflow when running pagerank more than 10000 iterators
> --
>
> Key: SPARK-690
> URL: https://issues.apache.org/jira/browse/SPARK-690
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.6.1
>Reporter: xiajunluan
>
> When I run the PageRank example for more than 10000 iterations, the job 
> client reports stack overflow errors.
> 13/02/07 13:41:40 INFO CacheTracker: Registering RDD ID 57993 with cache
> Exception in thread "DAGScheduler" java.lang.StackOverflowError
>   at 
> java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryAcquireShared(ReentrantReadWriteLock.java:467)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1281)
>   at 
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:731)
>   at 
> org.jboss.netty.akka.util.HashedWheelTimer.scheduleTimeout(HashedWheelTimer.java:277)
>   at 
> org.jboss.netty.akka.util.HashedWheelTimer.newTimeout(HashedWheelTimer.java:264)
>   at akka.actor.DefaultScheduler.scheduleOnce(Scheduler.scala:186)
>   at akka.pattern.PromiseActorRef$.apply(AskSupport.scala:274)
>   at akka.pattern.AskSupport$class.ask(AskSupport.scala:83)
>   at akka.pattern.package$.ask(package.scala:43)
>   at akka.pattern.AskSupport$AskableActorRef.ask(AskSupport.scala:123)
>   at spark.CacheTracker.askTracker(CacheTracker.scala:121)
>   at spark.CacheTracker.communicate(CacheTracker.scala:131)
>   at spark.CacheTracker.registerRDD(CacheTracker.scala:142)
>   at spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:149)
>   at 
> spark.scheduler.DAGScheduler$$anonfun$visit$1$2.apply(DAGScheduler.scala:155)
>   at 
> spark.scheduler.DAGScheduler$$anonfun$visit$1$2.apply(DAGScheduler.scala:150)
>   at 
> scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>   at scala.collection.immutable.List.foreach(List.scala:76)
>   at spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:150)
>   at spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:160)
>   at spark.scheduler.DAGScheduler.newStage(DAGScheduler.scala:131)
>   at 
> spark.scheduler.DAGScheduler.getShuffleMapStage(DAGScheduler.scala:111)
>   at 
> spark.scheduler.DAGScheduler$$anonfun$visit$1$2.apply(DAGScheduler.scala:153)
>   at 
> spark.scheduler.DAGScheduler$$anonfun$visit$1$2.apply(DAGScheduler.scala:150)
>   at 
> scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-709) Dropping a block reports 0 bytes

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee closed SPARK-709.
---
Resolution: Incomplete

[~rxin] there isn't enough information to make progress on this, but feel free 
to re-open if you so desire.

> Dropping a block reports 0 bytes
> 
>
> Key: SPARK-709
> URL: https://issues.apache.org/jira/browse/SPARK-709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-718) NPE when performing action during transformation

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142506#comment-14142506
 ] 

Matthew Farrellee commented on SPARK-718:
-

Spark simply does not support nesting RDD operations in this fashion. You'll 
get a more prompt and informative response on the user list; see 
http://spark.apache.org/community.html. I'm going to close this issue, but feel 
free to re-open it if you want.
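
For reference, the usual rewrite of the snippet quoted below is to compute the inner 
result on the driver first and let the closure capture a plain value instead of an RDD 
(spark-shell session assumed):

{code}
// Instead of data.map(i => data.count).collect, which references one RDD inside
// another RDD's transformation running on executors, compute the count on the
// driver and close over the plain value:
val data = sc.parallelize(1 to 10)
val total = data.count()                      // its own job, runs fine
val result = data.map(i => total).collect()   // the closure captures a Long, not an RDD
{code}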

> NPE when performing action during transformation
> 
>
> Key: SPARK-718
> URL: https://issues.apache.org/jira/browse/SPARK-718
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Krzywicki
>
> Running the spark shell:
> The following code fails with a NPE when trying to collect the resulting RDD:
> {code:java}
> val data = sc.parallelize(1 to 10)
> data.map(i => data.count).collect
> {code}
> {code:java}
> ERROR local.LocalScheduler: Exception in task 0
> java.lang.NullPointerException
> at spark.RDD.count(RDD.scala:490)
> at 
> $line16.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply$mcJI$sp(:15)
> at $line16.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:15)
> at $line16.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:15)
> at scala.collection.Iterator$$anon$19.next(Iterator.scala:401)
> at scala.collection.Iterator$class.foreach(Iterator.scala:772)
> at scala.collection.Iterator$$anon$19.foreach(Iterator.scala:399)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:102)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:250)
> at scala.collection.Iterator$$anon$19.toBuffer(Iterator.scala:399)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:237)
> at scala.collection.Iterator$$anon$19.toArray(Iterator.scala:399)
> at spark.RDD$$anonfun$1.apply(RDD.scala:389)
> at spark.RDD$$anonfun$1.apply(RDD.scala:389)
> at spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:610)
> at spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:610)
> at spark.scheduler.ResultTask.run(ResultTask.scala:76)
> at 
> spark.scheduler.local.LocalScheduler.runTask$1(LocalScheduler.scala:74)
> at 
> spark.scheduler.local.LocalScheduler$$anon$1.run(LocalScheduler.scala:50)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-718) NPE when performing action during transformation

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee closed SPARK-718.
---
Resolution: Done

> NPE when performing action during transformation
> 
>
> Key: SPARK-718
> URL: https://issues.apache.org/jira/browse/SPARK-718
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Krzywicki
>
> Running the spark shell:
> The following code fails with a NPE when trying to collect the resulting RDD:
> {code:java}
> val data = sc.parallelize(1 to 10)
> data.map(i => data.count).collect
> {code}
> {code:java}
> ERROR local.LocalScheduler: Exception in task 0
> java.lang.NullPointerException
> at spark.RDD.count(RDD.scala:490)
> at 
> $line16.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply$mcJI$sp(:15)
> at $line16.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:15)
> at $line16.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:15)
> at scala.collection.Iterator$$anon$19.next(Iterator.scala:401)
> at scala.collection.Iterator$class.foreach(Iterator.scala:772)
> at scala.collection.Iterator$$anon$19.foreach(Iterator.scala:399)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:102)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:250)
> at scala.collection.Iterator$$anon$19.toBuffer(Iterator.scala:399)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:237)
> at scala.collection.Iterator$$anon$19.toArray(Iterator.scala:399)
> at spark.RDD$$anonfun$1.apply(RDD.scala:389)
> at spark.RDD$$anonfun$1.apply(RDD.scala:389)
> at spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:610)
> at spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:610)
> at spark.scheduler.ResultTask.run(ResultTask.scala:76)
> at 
> spark.scheduler.local.LocalScheduler.runTask$1(LocalScheduler.scala:74)
> at 
> spark.scheduler.local.LocalScheduler$$anon$1.run(LocalScheduler.scala:50)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-567) Unified directory structure for temporary data

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee closed SPARK-567.
---
Resolution: Incomplete

Please re-open with additional details on how this could be implemented.

> Unified directory structure for temporary data
> --
>
> Key: SPARK-567
> URL: https://issues.apache.org/jira/browse/SPARK-567
> Project: Spark
>  Issue Type: Improvement
>Reporter: Mosharaf Chowdhury
>
> Broadcast, shuffle, and unforeseen use cases should use the same directory 
> structure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-559) Automatically register all classes used in fields of a class with Kryo

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142479#comment-14142479
 ] 

Matthew Farrellee commented on SPARK-559:
-

The last comment on this, from two years ago, suggests this was resolved by the 
upgrade to Kryo 2.x. I'm going to close this, but please re-open if you 
disagree.
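
For reference, the Kryo-2-era way to register classes, including the classes used in 
their fields, is a KryoRegistrator. A minimal sketch, with MyKey and MyValue as 
placeholder application classes:

{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Placeholder application classes standing in for whatever gets serialized.
case class MyKey(id: Int)
case class MyValue(payload: Array[Byte])

// Register the top-level classes and the classes used in their fields.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[MyKey])
    kryo.register(classOf[MyValue])
    kryo.register(classOf[Array[Byte]])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[MyRegistrator].getName)
{code}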

> Automatically register all classes used in fields of a class with Kryo
> --
>
> Key: SPARK-559
> URL: https://issues.apache.org/jira/browse/SPARK-559
> Project: Spark
>  Issue Type: Bug
>Reporter: Matei Zaharia
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-559) Automatically register all classes used in fields of a class with Kryo

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee closed SPARK-559.
---
Resolution: Done

> Automatically register all classes used in fields of a class with Kryo
> --
>
> Key: SPARK-559
> URL: https://issues.apache.org/jira/browse/SPARK-559
> Project: Spark
>  Issue Type: Bug
>Reporter: Matei Zaharia
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-550) Hiding the default spark context in the spark shell creates serialization issues

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142477#comment-14142477
 ] 

Matthew Farrellee commented on SPARK-550:
-

A lot of code has changed in this space over the past two years. I'm going to 
close this, but feel free to re-open if you feel it's still an issue.

> Hiding the default spark context in the spark shell creates serialization 
> issues
> 
>
> Key: SPARK-550
> URL: https://issues.apache.org/jira/browse/SPARK-550
> Project: Spark
>  Issue Type: Bug
>Reporter: tjhunter
>
> I copy-pasted a piece of code along these lines in the spark shell:
> ...
> val sc = new SparkContext("local[%d]" format num_splits,"myframework")
> val my_rdd = sc.textFile(...)
> my_rdd.count()
> This leads to the shell crashing with a java.io.NotSerializableException: 
> spark.SparkContext
> It took me a while to realize it was due to the newly created SparkContext. 
> Maybe a warning/error should be triggered if the user tries to change the 
> definition of sc?
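
For reference, the straightforward fix in the shell is to reuse the SparkContext the shell 
already provides rather than constructing a second one named sc, so no closure ever 
captures a context object. A minimal sketch, with a placeholder input path:

{code}
// In the spark shell, reuse the provided context instead of building a new one
// named sc; none of these closures ever captures a SparkContext.
val my_rdd = sc.textFile("hdfs:///path/to/input")   // placeholder path
my_rdd.count()
{code}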



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-550) Hiding the default spark context in the spark shell creates serialization issues

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee closed SPARK-550.
---
Resolution: Done

> Hiding the default spark context in the spark shell creates serialization 
> issues
> 
>
> Key: SPARK-550
> URL: https://issues.apache.org/jira/browse/SPARK-550
> Project: Spark
>  Issue Type: Bug
>Reporter: tjhunter
>
> I copy-pasted a piece of code along these lines in the spark shell:
> ...
> val sc = new SparkContext("local[%d]" format num_splits,"myframework")
> val my_rdd = sc.textFile(...)
> my_rdd.count()
> This leads to the shell crashing with a java.io.NotSerializableException: 
> spark.SparkContext
> It took me a while to realize it was due to the newly created SparkContext. 
> Maybe a warning/error should be triggered if the user tries to change the 
> definition of sc?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-542) Cache Miss when machine have multiple hostname

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee updated SPARK-542:

Component/s: Mesos
   Priority: Blocker

> Cache Miss when machine have multiple hostname
> --
>
> Key: SPARK-542
> URL: https://issues.apache.org/jira/browse/SPARK-542
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Reporter: frankvictor
>Priority: Blocker
>
> Hi, I observed unusually slow PageRank runs over the last few days.
> After debugging the job, I found the cause was DNS naming.
> The machines in my cluster have multiple hostnames; for example, slave 1 has 
> the names c001 and c001.cm.cluster.
> When Spark registers a cache entry in CacheTracker, it gets "c001" and records 
> the cache under that name. But when SimpleJob schedules a task, the Mesos 
> offer gives Spark "c001.cm.cluster", so the task never gets a preferred 
> location!
> I think Spark should handle the multiple-hostname case (by using the IP 
> instead of the hostname, or some other method).
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-538) INFO spark.MesosScheduler: Ignoring update from TID 9 because its job is gone

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee closed SPARK-538.
---
Resolution: Done

> INFO spark.MesosScheduler: Ignoring update from TID 9 because its job is gone
> -
>
> Key: SPARK-538
> URL: https://issues.apache.org/jira/browse/SPARK-538
> Project: Spark
>  Issue Type: Bug
>Reporter: vince67
>
> Hi Matei,
>Maybe I can't describe it clearly.
>We run the master and slaves on different machines, and that works.
>But when we run spark.examples.SparkPi on the master, our 
> process hangs and we never get the result.
>Description follows:
>  
>
> 12/09/02 16:47:54 INFO spark.BoundedMemoryCache: BoundedMemoryCache.maxBytes 
> = 339585269
> 12/09/02 16:47:54 INFO spark.CacheTrackerActor: Registered actor on port 7077
> 12/09/02 16:47:54 INFO spark.CacheTrackerActor: Started slave cache (size 
> 323.9MB) on vince67-ThinkCentre-
> 12/09/02 16:47:54 INFO spark.MapOutputTrackerActor: Registered actor on port 
> 7077
> 12/09/02 16:47:54 INFO spark.ShuffleManager: Shuffle dir: 
> /tmp/spark-local-3e79b235-1b94-44d1-823b-0369f6698688/shuffle
> 12/09/02 16:47:54 INFO server.Server: jetty-7.5.3.v20111011
> 12/09/02 16:47:54 INFO server.AbstractConnector: Started 
> SelectChannelConnector@0.0.0.0:49578 STARTING
> 12/09/02 16:47:54 INFO spark.ShuffleManager: Local URI: 
> http://ip.ip.ip.ip:49578
> 12/09/02 16:47:55 INFO server.Server: jetty-7.5.3.v20111011
> 12/09/02 16:47:55 INFO server.AbstractConnector: Started 
> SelectChannelConnector@0.0.0.0:49600 STARTING
> 12/09/02 16:47:55 INFO broadcast.HttpBroadcast: Broadcast server started at 
> http://ip.ip.ip.ip:49600
> 12/09/02 16:47:55 INFO spark.MesosScheduler: Registered as framework ID 
> 201209021640-74572372-5050-16898-0004
> 12/09/02 16:47:55 INFO spark.SparkContext: Starting job...
> 12/09/02 16:47:55 INFO spark.CacheTracker: Registering RDD ID 1 with cache
> 12/09/02 16:47:55 INFO spark.CacheTrackerActor: Registering RDD 1 with 2 
> partitions
> 12/09/02 16:47:55 INFO spark.CacheTracker: Registering RDD ID 0 with cache
> 12/09/02 16:47:55 INFO spark.CacheTrackerActor: Registering RDD 0 with 2 
> partitions
> 12/09/02 16:47:55 INFO spark.CacheTrackerActor: Asked for current cache 
> locations
> 12/09/02 16:47:55 INFO spark.MesosScheduler: Final stage: Stage 0
> 12/09/02 16:47:55 INFO spark.MesosScheduler: Parents of final stage: List()
> 12/09/02 16:47:55 INFO spark.MesosScheduler: Missing parents: List()
> 12/09/02 16:47:55 INFO spark.MesosScheduler: Submitting Stage 0, which has no 
> missing parents
> 12/09/02 16:47:55 INFO spark.MesosScheduler: Got a job with 2 tasks
> 12/09/02 16:47:55 INFO spark.MesosScheduler: Adding job with ID 0
> 12/09/02 16:47:55 INFO spark.SimpleJob: Starting task 0:0 as TID 0 on slave 
> 201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred)
> 12/09/02 16:47:55 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and 
> took 151 ms to serialize by spark.JavaSerializerInstance
> 12/09/02 16:47:55 INFO spark.SimpleJob: Starting task 0:1 as TID 1 on slave 
> 201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred)
> 12/09/02 16:47:55 INFO spark.SimpleJob: Size of task 0:1 is 1606 bytes and 
> took 1 ms to serialize by spark.JavaSerializerInstance
> 12/09/02 16:47:56 INFO spark.SimpleJob: Lost TID 0 (task 0:0)
> 12/09/02 16:47:56 INFO spark.SimpleJob: Starting task 0:0 as TID 2 on slave 
> 201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred)
> 12/09/02 16:47:56 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and 
> took 1 ms to serialize by spark.JavaSerializerInstance
> 12/09/02 16:47:56 INFO spark.SimpleJob: Lost TID 1 (task 0:1)
> 12/09/02 16:47:56 INFO spark.SimpleJob: Starting task 0:1 as TID 3 on slave 
> 201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred)
> 12/09/02 16:47:56 INFO spark.SimpleJob: Size of task 0:1 is 1606 bytes and 
> took 5 ms to serialize by spark.JavaSerializerInstance
> 12/09/02 16:47:57 INFO spark.SimpleJob: Lost TID 2 (task 0:0)
> 12/09/02 16:47:57 INFO spark.SimpleJob: Starting task 0:0 as TID 4 on slave 
> 201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred)
> 12/09/02 16:47:57 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and 
> took 1 ms to serialize by spark.JavaSerializerInstance
> 12/09/02 16:47:57 INFO spark.SimpleJob: Lost TID 3 (task 0:1)
> 12/09/02 16:47:57 INFO spark.SimpleJob: Starting task 0:1 as TID 5 on slave 
> 201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred)
> 12/09/02 16:47:57 INFO spark.SimpleJob: Size of task 0:1 is 1606 bytes and 
> took 2 ms to serialize by spark.JavaSerializerInstance
> 12/09/02 16:47:58 INFO spark.SimpleJob: Lost TID 4 (task 0:0)

[jira] [Commented] (SPARK-538) INFO spark.MesosScheduler: Ignoring update from TID 9 because its job is gone

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142475#comment-14142475
 ] 

Matthew Farrellee commented on SPARK-538:
-

This is a reasonable question for the user list; see 
http://spark.apache.org/community.html. I'm going to close this in favor of 
user-list interaction. If you disagree, please re-open.

> INFO spark.MesosScheduler: Ignoring update from TID 9 because its job is gone
> -
>
> Key: SPARK-538
> URL: https://issues.apache.org/jira/browse/SPARK-538
> Project: Spark
>  Issue Type: Bug
>Reporter: vince67
>
> Hi Matei,
>Maybe I can't describe it clearly.
>We run the master and slaves on different machines, and that works.
>But when we run spark.examples.SparkPi on the master, our 
> process hangs and we never get the result.
>Description follows:
>  
>
> 12/09/02 16:47:54 INFO spark.BoundedMemoryCache: BoundedMemoryCache.maxBytes 
> = 339585269
> 12/09/02 16:47:54 INFO spark.CacheTrackerActor: Registered actor on port 7077
> 12/09/02 16:47:54 INFO spark.CacheTrackerActor: Started slave cache (size 
> 323.9MB) on vince67-ThinkCentre-
> 12/09/02 16:47:54 INFO spark.MapOutputTrackerActor: Registered actor on port 
> 7077
> 12/09/02 16:47:54 INFO spark.ShuffleManager: Shuffle dir: 
> /tmp/spark-local-3e79b235-1b94-44d1-823b-0369f6698688/shuffle
> 12/09/02 16:47:54 INFO server.Server: jetty-7.5.3.v20111011
> 12/09/02 16:47:54 INFO server.AbstractConnector: Started 
> SelectChannelConnector@0.0.0.0:49578 STARTING
> 12/09/02 16:47:54 INFO spark.ShuffleManager: Local URI: 
> http://ip.ip.ip.ip:49578
> 12/09/02 16:47:55 INFO server.Server: jetty-7.5.3.v20111011
> 12/09/02 16:47:55 INFO server.AbstractConnector: Started 
> SelectChannelConnector@0.0.0.0:49600 STARTING
> 12/09/02 16:47:55 INFO broadcast.HttpBroadcast: Broadcast server started at 
> http://ip.ip.ip.ip:49600
> 12/09/02 16:47:55 INFO spark.MesosScheduler: Registered as framework ID 
> 201209021640-74572372-5050-16898-0004
> 12/09/02 16:47:55 INFO spark.SparkContext: Starting job...
> 12/09/02 16:47:55 INFO spark.CacheTracker: Registering RDD ID 1 with cache
> 12/09/02 16:47:55 INFO spark.CacheTrackerActor: Registering RDD 1 with 2 
> partitions
> 12/09/02 16:47:55 INFO spark.CacheTracker: Registering RDD ID 0 with cache
> 12/09/02 16:47:55 INFO spark.CacheTrackerActor: Registering RDD 0 with 2 
> partitions
> 12/09/02 16:47:55 INFO spark.CacheTrackerActor: Asked for current cache 
> locations
> 12/09/02 16:47:55 INFO spark.MesosScheduler: Final stage: Stage 0
> 12/09/02 16:47:55 INFO spark.MesosScheduler: Parents of final stage: List()
> 12/09/02 16:47:55 INFO spark.MesosScheduler: Missing parents: List()
> 12/09/02 16:47:55 INFO spark.MesosScheduler: Submitting Stage 0, which has no 
> missing parents
> 12/09/02 16:47:55 INFO spark.MesosScheduler: Got a job with 2 tasks
> 12/09/02 16:47:55 INFO spark.MesosScheduler: Adding job with ID 0
> 12/09/02 16:47:55 INFO spark.SimpleJob: Starting task 0:0 as TID 0 on slave 
> 201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred)
> 12/09/02 16:47:55 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and 
> took 151 ms to serialize by spark.JavaSerializerInstance
> 12/09/02 16:47:55 INFO spark.SimpleJob: Starting task 0:1 as TID 1 on slave 
> 201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred)
> 12/09/02 16:47:55 INFO spark.SimpleJob: Size of task 0:1 is 1606 bytes and 
> took 1 ms to serialize by spark.JavaSerializerInstance
> 12/09/02 16:47:56 INFO spark.SimpleJob: Lost TID 0 (task 0:0)
> 12/09/02 16:47:56 INFO spark.SimpleJob: Starting task 0:0 as TID 2 on slave 
> 201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred)
> 12/09/02 16:47:56 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and 
> took 1 ms to serialize by spark.JavaSerializerInstance
> 12/09/02 16:47:56 INFO spark.SimpleJob: Lost TID 1 (task 0:1)
> 12/09/02 16:47:56 INFO spark.SimpleJob: Starting task 0:1 as TID 3 on slave 
> 201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred)
> 12/09/02 16:47:56 INFO spark.SimpleJob: Size of task 0:1 is 1606 bytes and 
> took 5 ms to serialize by spark.JavaSerializerInstance
> 12/09/02 16:47:57 INFO spark.SimpleJob: Lost TID 2 (task 0:0)
> 12/09/02 16:47:57 INFO spark.SimpleJob: Starting task 0:0 as TID 4 on slave 
> 201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred)
> 12/09/02 16:47:57 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and 
> took 1 ms to serialize by spark.JavaSerializerInstance
> 12/09/02 16:47:57 INFO spark.SimpleJob: Lost TID 3 (task 0:1)
> 12/09/02 16:47:57 INFO spark.SimpleJob: Starting task 0:1 as TID 5 on slave 
> 201209021640-74572372-5050-16898-2: lmr

[jira] [Resolved] (SPARK-537) driver.run() returned with code DRIVER_ABORTED

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee resolved SPARK-537.
-
   Resolution: Fixed
Fix Version/s: 1.0.0

> driver.run() returned with code DRIVER_ABORTED
> --
>
> Key: SPARK-537
> URL: https://issues.apache.org/jira/browse/SPARK-537
> Project: Spark
>  Issue Type: Bug
>Reporter: yshaw
> Fix For: 1.0.0
>
>
> Hi there,
> When I try to run Spark on Mesos as a cluster, errors like this happen:
> ```
>  ./run spark.examples.SparkPi *.*.*.*:5050
> 12/09/07 14:49:28 INFO spark.BoundedMemoryCache: BoundedMemoryCache.maxBytes 
> = 994836480
> 12/09/07 14:49:28 INFO spark.CacheTrackerActor: Registered actor on port 7077
> 12/09/07 14:49:28 INFO spark.CacheTrackerActor: Started slave cache (size 
> 948.8MB) on shawpc
> 12/09/07 14:49:28 INFO spark.MapOutputTrackerActor: Registered actor on port 
> 7077
> 12/09/07 14:49:28 INFO spark.ShuffleManager: Shuffle dir: 
> /tmp/spark-local-81220c47-bc43-4809-ac48-5e3e8e023c8a/shuffle
> 12/09/07 14:49:28 INFO server.Server: jetty-7.5.3.v20111011
> 12/09/07 14:49:28 INFO server.AbstractConnector: Started 
> SelectChannelConnector@0.0.0.0:57595 STARTING
> 12/09/07 14:49:28 INFO spark.ShuffleManager: Local URI: http://127.0.1.1:57595
> 12/09/07 14:49:28 INFO server.Server: jetty-7.5.3.v20111011
> 12/09/07 14:49:28 INFO server.AbstractConnector: Started 
> SelectChannelConnector@0.0.0.0:60113 STARTING
> 12/09/07 14:49:28 INFO broadcast.HttpBroadcast: Broadcast server started at 
> http://127.0.1.1:60113
> 12/09/07 14:49:28 INFO spark.MesosScheduler: Temp directory for JARs: 
> /tmp/spark-d541f37c-ae35-476c-b2fc-9908b0739f50
> 12/09/07 14:49:28 INFO server.Server: jetty-7.5.3.v20111011
> 12/09/07 14:49:28 INFO server.AbstractConnector: Started 
> SelectChannelConnector@0.0.0.0:50511 STARTING
> 12/09/07 14:49:28 INFO spark.MesosScheduler: JAR server started at 
> http://127.0.1.1:50511
> 12/09/07 14:49:28 INFO spark.MesosScheduler: Registered as framework ID 
> 201209071448-846324308-5050-26925-
> 12/09/07 14:49:29 INFO spark.SparkContext: Starting job...
> 12/09/07 14:49:29 INFO spark.CacheTracker: Registering RDD ID 1 with cache
> 12/09/07 14:49:29 INFO spark.CacheTrackerActor: Registering RDD 1 with 2 
> partitions
> 12/09/07 14:49:29 INFO spark.CacheTracker: Registering RDD ID 0 with cache
> 12/09/07 14:49:29 INFO spark.CacheTrackerActor: Registering RDD 0 with 2 
> partitions
> 12/09/07 14:49:29 INFO spark.CacheTrackerActor: Asked for current cache 
> locations
> 12/09/07 14:49:29 INFO spark.MesosScheduler: Final stage: Stage 0
> 12/09/07 14:49:29 INFO spark.MesosScheduler: Parents of final stage: List()
> 12/09/07 14:49:29 INFO spark.MesosScheduler: Missing parents: List()
> 12/09/07 14:49:29 INFO spark.MesosScheduler: Submitting Stage 0, which has no 
> missing parents
> 12/09/07 14:49:29 INFO spark.MesosScheduler: Got a job with 2 tasks
> 12/09/07 14:49:29 INFO spark.MesosScheduler: Adding job with ID 0
> 12/09/07 14:49:29 INFO spark.SimpleJob: Starting task 0:0 as TID 0 on slave 
> 201209071448-846324308-5050-26925-0: shawpc (preferred)
> 12/09/07 14:49:29 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and 
> took 52 ms to serialize by spark.JavaSerializerInstance
> 12/09/07 14:49:29 INFO spark.SimpleJob: Starting task 0:1 as TID 1 on slave 
> 201209071448-846324308-5050-26925-0: shawpc (preferred)
> 12/09/07 14:49:29 INFO spark.SimpleJob: Size of task 0:1 is 1606 bytes and 
> took 1 ms to serialize by spark.JavaSerializerInstance
> 12/09/07 14:49:30 INFO spark.SimpleJob: Lost TID 0 (task 0:0)
> 12/09/07 14:49:30 INFO spark.SimpleJob: Starting task 0:0 as TID 2 on slave 
> 201209071448-846324308-5050-26925-0: shawpc (preferred)
> 12/09/07 14:49:30 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and 
> took 0 ms to serialize by spark.JavaSerializerInstance
> 12/09/07 14:49:30 INFO spark.SimpleJob: Lost TID 1 (task 0:1)
> 12/09/07 14:49:30 INFO spark.SimpleJob: Lost TID 2 (task 0:0)
> 12/09/07 14:49:30 INFO spark.SimpleJob: Starting task 0:0 as TID 3 on slave 
> 201209071448-846324308-5050-26925-0: shawpc (preferred)
> 12/09/07 14:49:30 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and 
> took 2 ms to serialize by spark.JavaSerializerInstance
> 12/09/07 14:49:32 INFO spark.SimpleJob: Starting task 0:1 as TID 4 on slave 
> 201209071448-846324308-5050-26925-0: shawpc (preferred)
> 12/09/07 14:49:32 INFO spark.SimpleJob: Size of task 0:1 is 1606 bytes and 
> took 1 ms to serialize by spark.JavaSerializerInstance
> 12/09/07 14:49:32 INFO spark.SimpleJob: Lost TID 3 (task 0:0)
> 12/09/07 14:49:32 INFO spark.SimpleJob: Starting task 0:0 as TID 5 on slave 
> 201209071448-846324308-5050-26925-0: shawpc (preferred)
> 12/09/07 14:49:32 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and 
> t

[jira] [Commented] (SPARK-537) driver.run() returned with code DRIVER_ABORTED

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142474#comment-14142474
 ] 

Matthew Farrellee commented on SPARK-537:
-

this should be resolved by a number of fixes in 1.0. please re-open if it still 
reproduces.

> driver.run() returned with code DRIVER_ABORTED
> --
>
> Key: SPARK-537
> URL: https://issues.apache.org/jira/browse/SPARK-537
> Project: Spark
>  Issue Type: Bug
>Reporter: yshaw
>
> Hi there,
> When I try to run Spark on Mesos as a cluster, errors like this happen:
> ```
>  ./run spark.examples.SparkPi *.*.*.*:5050
> 12/09/07 14:49:28 INFO spark.BoundedMemoryCache: BoundedMemoryCache.maxBytes 
> = 994836480
> 12/09/07 14:49:28 INFO spark.CacheTrackerActor: Registered actor on port 7077
> 12/09/07 14:49:28 INFO spark.CacheTrackerActor: Started slave cache (size 
> 948.8MB) on shawpc
> 12/09/07 14:49:28 INFO spark.MapOutputTrackerActor: Registered actor on port 
> 7077
> 12/09/07 14:49:28 INFO spark.ShuffleManager: Shuffle dir: 
> /tmp/spark-local-81220c47-bc43-4809-ac48-5e3e8e023c8a/shuffle
> 12/09/07 14:49:28 INFO server.Server: jetty-7.5.3.v20111011
> 12/09/07 14:49:28 INFO server.AbstractConnector: Started 
> SelectChannelConnector@0.0.0.0:57595 STARTING
> 12/09/07 14:49:28 INFO spark.ShuffleManager: Local URI: http://127.0.1.1:57595
> 12/09/07 14:49:28 INFO server.Server: jetty-7.5.3.v20111011
> 12/09/07 14:49:28 INFO server.AbstractConnector: Started 
> SelectChannelConnector@0.0.0.0:60113 STARTING
> 12/09/07 14:49:28 INFO broadcast.HttpBroadcast: Broadcast server started at 
> http://127.0.1.1:60113
> 12/09/07 14:49:28 INFO spark.MesosScheduler: Temp directory for JARs: 
> /tmp/spark-d541f37c-ae35-476c-b2fc-9908b0739f50
> 12/09/07 14:49:28 INFO server.Server: jetty-7.5.3.v20111011
> 12/09/07 14:49:28 INFO server.AbstractConnector: Started 
> SelectChannelConnector@0.0.0.0:50511 STARTING
> 12/09/07 14:49:28 INFO spark.MesosScheduler: JAR server started at 
> http://127.0.1.1:50511
> 12/09/07 14:49:28 INFO spark.MesosScheduler: Registered as framework ID 
> 201209071448-846324308-5050-26925-
> 12/09/07 14:49:29 INFO spark.SparkContext: Starting job...
> 12/09/07 14:49:29 INFO spark.CacheTracker: Registering RDD ID 1 with cache
> 12/09/07 14:49:29 INFO spark.CacheTrackerActor: Registering RDD 1 with 2 
> partitions
> 12/09/07 14:49:29 INFO spark.CacheTracker: Registering RDD ID 0 with cache
> 12/09/07 14:49:29 INFO spark.CacheTrackerActor: Registering RDD 0 with 2 
> partitions
> 12/09/07 14:49:29 INFO spark.CacheTrackerActor: Asked for current cache 
> locations
> 12/09/07 14:49:29 INFO spark.MesosScheduler: Final stage: Stage 0
> 12/09/07 14:49:29 INFO spark.MesosScheduler: Parents of final stage: List()
> 12/09/07 14:49:29 INFO spark.MesosScheduler: Missing parents: List()
> 12/09/07 14:49:29 INFO spark.MesosScheduler: Submitting Stage 0, which has no 
> missing parents
> 12/09/07 14:49:29 INFO spark.MesosScheduler: Got a job with 2 tasks
> 12/09/07 14:49:29 INFO spark.MesosScheduler: Adding job with ID 0
> 12/09/07 14:49:29 INFO spark.SimpleJob: Starting task 0:0 as TID 0 on slave 
> 201209071448-846324308-5050-26925-0: shawpc (preferred)
> 12/09/07 14:49:29 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and 
> took 52 ms to serialize by spark.JavaSerializerInstance
> 12/09/07 14:49:29 INFO spark.SimpleJob: Starting task 0:1 as TID 1 on slave 
> 201209071448-846324308-5050-26925-0: shawpc (preferred)
> 12/09/07 14:49:29 INFO spark.SimpleJob: Size of task 0:1 is 1606 bytes and 
> took 1 ms to serialize by spark.JavaSerializerInstance
> 12/09/07 14:49:30 INFO spark.SimpleJob: Lost TID 0 (task 0:0)
> 12/09/07 14:49:30 INFO spark.SimpleJob: Starting task 0:0 as TID 2 on slave 
> 201209071448-846324308-5050-26925-0: shawpc (preferred)
> 12/09/07 14:49:30 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and 
> took 0 ms to serialize by spark.JavaSerializerInstance
> 12/09/07 14:49:30 INFO spark.SimpleJob: Lost TID 1 (task 0:1)
> 12/09/07 14:49:30 INFO spark.SimpleJob: Lost TID 2 (task 0:0)
> 12/09/07 14:49:30 INFO spark.SimpleJob: Starting task 0:0 as TID 3 on slave 
> 201209071448-846324308-5050-26925-0: shawpc (preferred)
> 12/09/07 14:49:30 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and 
> took 2 ms to serialize by spark.JavaSerializerInstance
> 12/09/07 14:49:32 INFO spark.SimpleJob: Starting task 0:1 as TID 4 on slave 
> 201209071448-846324308-5050-26925-0: shawpc (preferred)
> 12/09/07 14:49:32 INFO spark.SimpleJob: Size of task 0:1 is 1606 bytes and 
> took 1 ms to serialize by spark.JavaSerializerInstance
> 12/09/07 14:49:32 INFO spark.SimpleJob: Lost TID 3 (task 0:0)
> 12/09/07 14:49:32 INFO spark.SimpleJob: Starting task 0:0 as TID 5 on slave 
> 201209071448-846324308-5050-26925-0: shawpc (preferred)
> 12/09/07 14

[jira] [Commented] (SPARK-3625) In some cases, the RDD.checkpoint does not work

2014-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142471#comment-14142471
 ] 

Apache Spark commented on SPARK-3625:
-

User 'witgo' has created a pull request for this issue:
https://github.com/apache/spark/pull/2480

> In some cases, the RDD.checkpoint does not work
> ---
>
> Key: SPARK-3625
> URL: https://issues.apache.org/jira/browse/SPARK-3625
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>Priority: Blocker
>
> The code to reproduce the issue:
> {code}
>  sc.setCheckpointDir(checkpointDir)
> val c = sc.parallelize((1 to 1000))
> c.count
> c.checkpoint()
> c.count
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3625) In some cases, the RDD.checkpoint does not work

2014-09-21 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-3625:
---
Description: 
The code to reproduce the issue:

{code}
 sc.setCheckpointDir(checkpointDir)
val c = sc.parallelize((1 to 1000))
c.count
c.checkpoint()
c.count
{code}

  was:
The code to reproduce the issue:

{code}
 sc.setCheckpointDir(checkpointDir)
val c = sc.parallelize((1 to 1000))
c.count
c.checkpoint()
{code}


> In some cases, the RDD.checkpoint does not work
> ---
>
> Key: SPARK-3625
> URL: https://issues.apache.org/jira/browse/SPARK-3625
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>Priority: Blocker
>
> The code to reproduce the issue:
> {code}
>  sc.setCheckpointDir(checkpointDir)
> val c = sc.parallelize((1 to 1000))
> c.count
> c.checkpoint()
> c.count
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3625) In some cases, the RDD.checkpoint does not work

2014-09-21 Thread Guoqiang Li (JIRA)
Guoqiang Li created SPARK-3625:
--

 Summary: In some cases, the RDD.checkpoint does not work
 Key: SPARK-3625
 URL: https://issues.apache.org/jira/browse/SPARK-3625
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0, 1.0.2
Reporter: Guoqiang Li
Priority: Blocker


The code to reproduce the issue:

{code}
 sc.setCheckpointDir(checkpointDir)
val c = sc.parallelize((1 to 1000))
c.count
c.checkpoint()
{code}
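
For reference, the RDD.checkpoint() scaladoc says the method must be called before 
any job has been executed on the RDD. A minimal sketch of the ordering that is 
expected to work (checkpointDir is a placeholder path, not taken from the report):

{code}
sc.setCheckpointDir(checkpointDir)  // any reliable (e.g. HDFS) directory
val c = sc.parallelize(1 to 1000)
c.checkpoint()                      // mark for checkpointing before the first action
c.count()                           // this job materializes the RDD and writes the checkpoint
assert(c.isCheckpointed)            // expected to be true once the job has finished
{code}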



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3593) Support Sorting of Binary Type Data

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142468#comment-14142468
 ] 

Matthew Farrellee commented on SPARK-3593:
--

[~pmagid] will you provide some example code that demonstrates your issue?
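
While waiting on the reporter, here is a minimal sketch of the kind of query the 
description suggests is failing; the table name "docs" and binary column "payload" 
are hypothetical, not taken from the report:

{code}
// Spark 1.1-era SQL API; assumes a table "docs" with a BinaryType column "payload"
// has already been registered with the SQLContext.
val sorted = sqlContext.sql("SELECT id, payload FROM docs ORDER BY payload")
sorted.collect()  // per the report, ordering by the binary column throws an exception
{code}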

> Support Sorting of Binary Type Data
> ---
>
> Key: SPARK-3593
> URL: https://issues.apache.org/jira/browse/SPARK-3593
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Paul Magid
>
> If you try sorting on a binary field you currently get an exception.   Please 
> add support for binary data type sorting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access

2014-09-21 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142466#comment-14142466
 ] 

Xuefu Zhang commented on SPARK-3621:


I understand RDD is a concept existing only in the driver. However, accessing 
the data in a Spark job doesn't have to be in the form of an RDD. An iterator 
over the underlying data is sufficient, as long as the data is already shipped 
to the node when the job starts to run. One way to identify the shipped RDD and 
the iterator afterwards could be a UUID.

Hive on Spark isn't using Spark's transformations to do map-join, or join in 
general. Hive's own implementation is to build hash maps for the small tables 
when the join starts, and then do key lookups while streaming through the big 
table. For this, small table data (which can be the result RDD of another Spark 
job) needs to be shipped to all nodes that do the join.
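
For comparison, a rough sketch of the collect-plus-broadcast map-side join that is 
possible today, which is exactly the driver round trip this issue wants to avoid; 
smallRDD and bigRDD are hypothetical stand-ins for the small and big tables:

{code}
// Existing approach: pull the small table through the driver, then broadcast it.
val smallRDD = sc.parallelize(Seq(1 -> "a", 2 -> "b"))               // small table
val bigRDD   = sc.parallelize(Seq(1 -> 10.0, 2 -> 20.0, 3 -> 30.0))  // big table

val smallBc = sc.broadcast(smallRDD.collect().toMap)                 // collect to driver, ship once

val joined = bigRDD.mapPartitions { iter =>
  val lookup = smallBc.value                                         // per-executor hash map
  iter.flatMap { case (k, v) => lookup.get(k).map(s => (k, (v, s))) } // stream big table, probe map
}
joined.collect()
{code}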

> Provide a way to broadcast an RDD (instead of just a variable made of the 
> RDD) so that a job can access
> ---
>
> Key: SPARK-3621
> URL: https://issues.apache.org/jira/browse/SPARK-3621
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.1.0
>Reporter: Xuefu Zhang
>
> In some cases, such as Hive's way of doing map-side join, it would be 
> beneficial to allow the client program to broadcast RDDs rather than just 
> variables made of these RDDs. Broadcasting a variable made of RDDs requires 
> all RDD data be collected to the driver and that the variable be shipped to 
> the cluster after being made. It would perform better if the driver just 
> broadcast the RDDs and used the corresponding data in jobs (such as building 
> hashmaps at executors).
> Tez has a broadcast edge which can ship data from previous stage to the next 
> stage, which doesn't require driver side processing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-637) Create troubleshooting checklist

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142463#comment-14142463
 ] 

Matthew Farrellee commented on SPARK-637:
-

this is a good idea, and it will take a significant amount of effort. it looks 
like nothing has happened for almost 2 years. i'm going to close this, but feel 
free to re-open and push forward with it.

> Create troubleshooting checklist
> 
>
> Key: SPARK-637
> URL: https://issues.apache.org/jira/browse/SPARK-637
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation
>Reporter: Josh Rosen
>
> We should provide a checklist for troubleshooting common Spark problems.
> For example, it could include steps like "check that the Spark code was 
> copied to all nodes" and "check that the workers successfully connect to the 
> master."



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-637) Create troubleshooting checklist

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee closed SPARK-637.
---
Resolution: Later

> Create troubleshooting checklist
> 
>
> Key: SPARK-637
> URL: https://issues.apache.org/jira/browse/SPARK-637
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation
>Reporter: Josh Rosen
>
> We should provide a checklist for troubleshooting common Spark problems.
> For example, it could include steps like "check that the Spark code was 
> copied to all nodes" and "check that the workers successfully connect to the 
> master."



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-719) Add FAQ page to documentation or webpage

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142459#comment-14142459
 ] 

Matthew Farrellee commented on SPARK-719:
-

it looks like this has some good content, but it's stale and likely needs 
vetting.

the new FAQ location is http://spark.apache.org/faq.html

i'm going to close this since there has been no progress. note - it'll still be 
available via search

feel free to re-open if you disagree.

> Add FAQ page to documentation or webpage
> 
>
> Key: SPARK-719
> URL: https://issues.apache.org/jira/browse/SPARK-719
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Andy Konwinski
>Assignee: Andy Konwinski
>
> Lots of issues on the mailing list are redundant (e.g., Patrick mentioned 
> this question has been asked/answered multiple times 
> https://groups.google.com/d/msg/spark-users/-mYn6BF-Y5Y/8qeXuxs8_d0J).
> We should put the solutions to common problems on an FAQ page in the 
> documentation or on the webpage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-719) Add FAQ page to documentation or webpage

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee closed SPARK-719.
---
   Resolution: Done
Fix Version/s: (was: 0.7.1)

> Add FAQ page to documentation or webpage
> 
>
> Key: SPARK-719
> URL: https://issues.apache.org/jira/browse/SPARK-719
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Andy Konwinski
>Assignee: Andy Konwinski
>
> Lots of issues on the mailing list are redundant (e.g., Patrick mentioned 
> this question has been asked/answered multiple times 
> https://groups.google.com/d/msg/spark-users/-mYn6BF-Y5Y/8qeXuxs8_d0J).
> We should put the solutions to common problems on an FAQ page in the 
> documentation or on the webpage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-614) Make last 4 digits of framework id in standalone mode logging monotonically increasing

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142456#comment-14142456
 ] 

Matthew Farrellee commented on SPARK-614:
-

it looks like nothing has happened with this in the past 23 months. i'm going 
to close this, but feel free to re-open.

> Make last 4 digits of framework id in standalone mode logging monotonically 
> increasing
> --
>
> Key: SPARK-614
> URL: https://issues.apache.org/jira/browse/SPARK-614
> Project: Spark
>  Issue Type: Improvement
>Reporter: Reynold Xin
>Assignee: Denny Britz
>
> In Mesos mode, the work log directory names are monotonically increasing, which 
> makes it very easy to spot a folder and go into it (e.g. you only need to type 
> *[last4digit]).
> We lost this in standalone mode, as seen in this example, where the last four 
> digits go up and down: 
> drwxr-xr-x 3 root root 4096 Nov  8 08:03 job-20121108080355-
> drwxr-xr-x 3 root root 4096 Nov  8 08:04 job-20121108080450-0001
> drwxr-xr-x 3 root root 4096 Nov  8 08:07 job-20121108080757-0002
> drwxr-xr-x 3 root root 4096 Nov  8 08:10 job-20121108081014-0003
> drwxr-xr-x 3 root root 4096 Nov  8 08:23 job-20121108082316-0004
> drwxr-xr-x 3 root root 4096 Nov  8 08:26 job-20121108082616-0005
> drwxr-xr-x 3 root root 4096 Nov  8 08:30 job-20121108083034-0006
> drwxr-xr-x 3 root root 4096 Nov  8 08:35 job-20121108083514-0007
> drwxr-xr-x 3 root root 4096 Nov  8 08:38 job-20121108083807-0008
> drwxr-xr-x 3 root root 4096 Nov  8 08:41 job-20121108084105-0009
> drwxr-xr-x 3 root root 4096 Nov  8 08:42 job-20121108084242-0010
> drwxr-xr-x 3 root root 4096 Nov  8 08:45 job-20121108084512-0011
> drwxr-xr-x 3 root root 4096 Nov  8 09:01 job-20121108090113-
> drwxr-xr-x 3 root root 4096 Nov  8 09:15 job-20121108091536-0001
> drwxr-xr-x 3 root root 4096 Nov  8 09:24 job-20121108092341-0003
> drwxr-xr-x 3 root root 4096 Nov  8 09:27 job-20121108092703-
> drwxr-xr-x 3 root root 4096 Nov  8 09:46 job-20121108094629-0001
> drwxr-xr-x 3 root root 4096 Nov  8 09:48 job-20121108094809-0002
> drwxr-xr-x 3 root root 4096 Nov  8 10:04 job-20121108100418-0003
> drwxr-xr-x 3 root root 4096 Nov  8 10:18 job-20121108101814-0004
> drwxr-xr-x 3 root root 4096 Nov  8 10:22 job-20121108102207-0005
> drwxr-xr-x 3 root root 4096 Nov  8 18:48 job-20121108184842-0006
> drwxr-xr-x 3 root root 4096 Nov  8 18:49 job-20121108184932-0007
> drwxr-xr-x 3 root root 4096 Nov  8 18:50 job-20121108185007-0008
> drwxr-xr-x 3 root root 4096 Nov  8 18:50 job-20121108185040-0009
> drwxr-xr-x 3 root root 4096 Nov  8 18:51 job-20121108185127-0010
> drwxr-xr-x 3 root root 4096 Nov  8 18:54 job-20121108185428-0011
> drwxr-xr-x 3 root root 4096 Nov  8 18:58 job-20121108185837-0012
> drwxr-xr-x 3 root root 4096 Nov  8 18:58 job-20121108185854-0013
> drwxr-xr-x 3 root root 4096 Nov  8 19:00 job-20121108190005-0014
> drwxr-xr-x 3 root root 4096 Nov  8 19:00 job-20121108190059-0015
> drwxr-xr-x 3 root root 4096 Nov  8 19:10 job-20121108191010-0016
> drwxr-xr-x 3 root root 4096 Nov  8 19:15 job-20121108191508-0017
> drwxr-xr-x 3 root root 4096 Nov  8 19:21 job-20121108192125-0018
> drwxr-xr-x 3 root root 4096 Nov  8 19:23 job-20121108192329-0019
> drwxr-xr-x 3 root root 4096 Nov  8 19:26 job-20121108192638-0020
> drwxr-xr-x 3 root root 4096 Nov  8 19:35 job-20121108193554-0022



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-614) Make last 4 digits of framework id in standalone mode logging monotonically increasing

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee closed SPARK-614.
---
   Resolution: Unresolved
Fix Version/s: (was: 0.7.1)

> Make last 4 digits of framework id in standalone mode logging monotonically 
> increasing
> --
>
> Key: SPARK-614
> URL: https://issues.apache.org/jira/browse/SPARK-614
> Project: Spark
>  Issue Type: Improvement
>Reporter: Reynold Xin
>Assignee: Denny Britz
>
> In Mesos mode, the work log directory names are monotonically increasing, which 
> makes it very easy to spot a folder and go into it (e.g. you only need to type 
> *[last4digit]).
> We lost this in standalone mode, as seen in this example, where the last four 
> digits go up and down: 
> drwxr-xr-x 3 root root 4096 Nov  8 08:03 job-20121108080355-
> drwxr-xr-x 3 root root 4096 Nov  8 08:04 job-20121108080450-0001
> drwxr-xr-x 3 root root 4096 Nov  8 08:07 job-20121108080757-0002
> drwxr-xr-x 3 root root 4096 Nov  8 08:10 job-20121108081014-0003
> drwxr-xr-x 3 root root 4096 Nov  8 08:23 job-20121108082316-0004
> drwxr-xr-x 3 root root 4096 Nov  8 08:26 job-20121108082616-0005
> drwxr-xr-x 3 root root 4096 Nov  8 08:30 job-20121108083034-0006
> drwxr-xr-x 3 root root 4096 Nov  8 08:35 job-20121108083514-0007
> drwxr-xr-x 3 root root 4096 Nov  8 08:38 job-20121108083807-0008
> drwxr-xr-x 3 root root 4096 Nov  8 08:41 job-20121108084105-0009
> drwxr-xr-x 3 root root 4096 Nov  8 08:42 job-20121108084242-0010
> drwxr-xr-x 3 root root 4096 Nov  8 08:45 job-20121108084512-0011
> drwxr-xr-x 3 root root 4096 Nov  8 09:01 job-20121108090113-
> drwxr-xr-x 3 root root 4096 Nov  8 09:15 job-20121108091536-0001
> drwxr-xr-x 3 root root 4096 Nov  8 09:24 job-20121108092341-0003
> drwxr-xr-x 3 root root 4096 Nov  8 09:27 job-20121108092703-
> drwxr-xr-x 3 root root 4096 Nov  8 09:46 job-20121108094629-0001
> drwxr-xr-x 3 root root 4096 Nov  8 09:48 job-20121108094809-0002
> drwxr-xr-x 3 root root 4096 Nov  8 10:04 job-20121108100418-0003
> drwxr-xr-x 3 root root 4096 Nov  8 10:18 job-20121108101814-0004
> drwxr-xr-x 3 root root 4096 Nov  8 10:22 job-20121108102207-0005
> drwxr-xr-x 3 root root 4096 Nov  8 18:48 job-20121108184842-0006
> drwxr-xr-x 3 root root 4096 Nov  8 18:49 job-20121108184932-0007
> drwxr-xr-x 3 root root 4096 Nov  8 18:50 job-20121108185007-0008
> drwxr-xr-x 3 root root 4096 Nov  8 18:50 job-20121108185040-0009
> drwxr-xr-x 3 root root 4096 Nov  8 18:51 job-20121108185127-0010
> drwxr-xr-x 3 root root 4096 Nov  8 18:54 job-20121108185428-0011
> drwxr-xr-x 3 root root 4096 Nov  8 18:58 job-20121108185837-0012
> drwxr-xr-x 3 root root 4096 Nov  8 18:58 job-20121108185854-0013
> drwxr-xr-x 3 root root 4096 Nov  8 19:00 job-20121108190005-0014
> drwxr-xr-x 3 root root 4096 Nov  8 19:00 job-20121108190059-0015
> drwxr-xr-x 3 root root 4096 Nov  8 19:10 job-20121108191010-0016
> drwxr-xr-x 3 root root 4096 Nov  8 19:15 job-20121108191508-0017
> drwxr-xr-x 3 root root 4096 Nov  8 19:21 job-20121108192125-0018
> drwxr-xr-x 3 root root 4096 Nov  8 19:23 job-20121108192329-0019
> drwxr-xr-x 3 root root 4096 Nov  8 19:26 job-20121108192638-0020
> drwxr-xr-x 3 root root 4096 Nov  8 19:35 job-20121108193554-0022



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1748) I installed the spark_standalone,but I did not know how to use sbt to compile the programme of spark?

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142455#comment-14142455
 ] 

Matthew Farrellee commented on SPARK-1748:
--

thanks for the question. you'll get a better response asking on the mailing 
lists, see http://spark.apache.org/community.html, so i'm going to close this 
out.

> I installed the spark_standalone,but I did not know how to use sbt to compile 
> the programme of spark?
> -
>
> Key: SPARK-1748
> URL: https://issues.apache.org/jira/browse/SPARK-1748
> Project: Spark
>  Issue Type: Test
>  Components: Build
>Affects Versions: 0.8.1
> Environment: spark standalone 
>Reporter: lxflyl
>
> I installed Spark in standalone mode, but I do not understand how to use 
> sbt to compile a Spark program.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-1748) I installed the spark_standalone,but I did not know how to use sbt to compile the programme of spark?

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee closed SPARK-1748.

   Resolution: Done
Fix Version/s: (was: 0.8.1)

> I installed the spark_standalone,but I did not know how to use sbt to compile 
> the programme of spark?
> -
>
> Key: SPARK-1748
> URL: https://issues.apache.org/jira/browse/SPARK-1748
> Project: Spark
>  Issue Type: Test
>  Components: Build
>Affects Versions: 0.8.1
> Environment: spark standalone 
>Reporter: lxflyl
>
> I installed Spark in standalone mode, but I do not understand how to use 
> sbt to compile a Spark program.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1176) Adding port configuration for HttpBroadcast

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee resolved SPARK-1176.
--
   Resolution: Fixed
Fix Version/s: (was: 0.9.0)
   1.1.0

> Adding port configuration for HttpBroadcast
> ---
>
> Key: SPARK-1176
> URL: https://issues.apache.org/jira/browse/SPARK-1176
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Egor Pakhomov
>Priority: Minor
> Fix For: 1.1.0
>
>
> I run Spark in a big organization where, to open a port accessible to other 
> computers on the network, I need to create a ticket with DevOps, which can take 
> days. I can't have the port for some Spark service changing all the time. I 
> need the ability to configure this port.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1176) Adding port configuration for HttpBroadcast

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142454#comment-14142454
 ] 

Matthew Farrellee commented on SPARK-1176:
--

[~epakhomov] it looks like this was resolved by SPARK-2157. i'm going to close 
this, but please feel free to re-open if it is still an issue for you.
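
For anyone landing here later, a hedged sketch of pinning the broadcast port through 
SparkConf, assuming the spark.broadcast.port property that SPARK-2157 is understood 
to have introduced in 1.1 (the port number is arbitrary):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: fix the HttpBroadcast server to a single firewall-friendly port.
val conf = new SparkConf()
  .setAppName("fixed-broadcast-port")
  .set("spark.broadcast.port", "49152")  // assumed property name from SPARK-2157
val sc = new SparkContext(conf)
{code}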

> Adding port configuration for HttpBroadcast
> ---
>
> Key: SPARK-1176
> URL: https://issues.apache.org/jira/browse/SPARK-1176
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Egor Pakhomov
>Priority: Minor
> Fix For: 0.9.0
>
>
> I run Spark in a big organization where, to open a port accessible to other 
> computers on the network, I need to create a ticket with DevOps, which can take 
> days. I can't have the port for some Spark service changing all the time. I 
> need the ability to configure this port.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-1177) Allow SPARK_JAR to be set in system properties

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee closed SPARK-1177.

   Resolution: Fixed
Fix Version/s: (was: 0.9.0)

> Allow SPARK_JAR to be set in system properties
> --
>
> Key: SPARK-1177
> URL: https://issues.apache.org/jira/browse/SPARK-1177
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Egor Pakhomov
>Priority: Minor
>
> I'd like to be able to do the following from my Scala code:
>   System.setProperty("SPARK_YARN_APP_JAR", 
> SparkContext.jarOfClass(this.getClass).head)
>   System.setProperty("SPARK_JAR", 
> SparkContext.jarOfClass(SparkContext.getClass).head)
> and do nothing at the OS level.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1177) Allow SPARK_JAR to be set in system properties

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142447#comment-14142447
 ] 

Matthew Farrellee commented on SPARK-1177:
--

[~epakhomov] it looks like this has been resolved by other changes, for instance 
being able to use spark.yarn.jar. i'm going to close this, but feel free to 
re-open if you think it is still important.
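
As a rough illustration of the alternative mentioned above, the assembly location can 
be supplied through configuration rather than an OS-level variable; the property name 
comes from the comment, while the path is hypothetical:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: point YARN at the Spark assembly via spark.yarn.jar instead of SPARK_JAR.
val conf = new SparkConf()
  .set("spark.yarn.jar", "hdfs:///apps/spark/spark-assembly-1.1.0.jar")  // hypothetical path
val sc = new SparkContext(conf)
{code}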

> Allow SPARK_JAR to be set in system properties
> --
>
> Key: SPARK-1177
> URL: https://issues.apache.org/jira/browse/SPARK-1177
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Egor Pakhomov
>Priority: Minor
> Fix For: 0.9.0
>
>
> I'd like to be able to do the following from my Scala code:
>   System.setProperty("SPARK_YARN_APP_JAR", 
> SparkContext.jarOfClass(this.getClass).head)
>   System.setProperty("SPARK_JAR", 
> SparkContext.jarOfClass(SparkContext.getClass).head)
> and do nothing at the OS level.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1443) Unable to Access MongoDB GridFS data with Spark using mongo-hadoop API

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142446#comment-14142446
 ] 

Matthew Farrellee commented on SPARK-1443:
--

[~PavanKumarVarma] i hope you've been able to resolve your issue over the past 
5 months. since you'll get a better response asking on the spark user list than 
in jira, see http://spark.apache.org/community.html, i'm going to close this 
out.

> Unable to Access MongoDB GridFS data with Spark using mongo-hadoop API
> --
>
> Key: SPARK-1443
> URL: https://issues.apache.org/jira/browse/SPARK-1443
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, Java API, Spark Core
>Affects Versions: 0.9.0
> Environment: Java 1.7,Hadoop 2.2.0,Spark 0.9.0,Ubuntu 12.4,
>Reporter: Pavan Kumar Varma
>Priority: Critical
>  Labels: GridFS, MongoDB, Spark, hadoop2, java
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> I saved a 2GB pdf file into MongoDB using GridFS. Now I want to process that 
> GridFS collection data using the Java Spark MapReduce API. Previously I have 
> successfully processed MongoDB collections with Apache Spark using the 
> Mongo-Hadoop connector. Now I'm unable to process GridFS collections with the 
> following code.
> MongoConfigUtil.setInputURI(config, 
> "mongodb://localhost:27017/pdfbooks.fs.chunks" );
>  MongoConfigUtil.setOutputURI(config,"mongodb://localhost:27017/"+output );
>  JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config,
> com.mongodb.hadoop.MongoInputFormat.class, Object.class,
> BSONObject.class);
>  JavaRDD<String> words = mongoRDD.flatMap(new 
> FlatMapFunction<Tuple2<Object, BSONObject>,
>String>() {
>@Override
>public Iterable<String> call(Tuple2<Object, BSONObject> arg) {   
>System.out.println(arg._2.toString());
>...
> Please suggest/provide  better API methods to access MongoDB GridFS data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1443) Unable to Access MongoDB GridFS data with Spark using mongo-hadoop API

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee resolved SPARK-1443.
--
   Resolution: Done
Fix Version/s: (was: 0.9.0)

> Unable to Access MongoDB GridFS data with Spark using mongo-hadoop API
> --
>
> Key: SPARK-1443
> URL: https://issues.apache.org/jira/browse/SPARK-1443
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, Java API, Spark Core
>Affects Versions: 0.9.0
> Environment: Java 1.7,Hadoop 2.2.0,Spark 0.9.0,Ubuntu 12.4,
>Reporter: Pavan Kumar Varma
>Priority: Critical
>  Labels: GridFS, MongoDB, Spark, hadoop2, java
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> I saved a 2GB pdf file into MongoDB using GridFS. Now I want to process that 
> GridFS collection data using the Java Spark MapReduce API. Previously I have 
> successfully processed MongoDB collections with Apache Spark using the 
> Mongo-Hadoop connector. Now I'm unable to process GridFS collections with the 
> following code.
> MongoConfigUtil.setInputURI(config, 
> "mongodb://localhost:27017/pdfbooks.fs.chunks" );
>  MongoConfigUtil.setOutputURI(config,"mongodb://localhost:27017/"+output );
>  JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config,
> com.mongodb.hadoop.MongoInputFormat.class, Object.class,
> BSONObject.class);
>  JavaRDD<String> words = mongoRDD.flatMap(new 
> FlatMapFunction<Tuple2<Object, BSONObject>,
>String>() {
>@Override
>public Iterable<String> call(Tuple2<Object, BSONObject> arg) {   
>System.out.println(arg._2.toString());
>...
> Please suggest/provide  better API methods to access MongoDB GridFS data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3580) Add Consistent Method To Get Number of RDD Partitions Across Different Languages

2014-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142445#comment-14142445
 ] 

Apache Spark commented on SPARK-3580:
-

User 'mattf' has created a pull request for this issue:
https://github.com/apache/spark/pull/2478

> Add Consistent Method To Get Number of RDD Partitions Across Different 
> Languages
> 
>
> Key: SPARK-3580
> URL: https://issues.apache.org/jira/browse/SPARK-3580
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 1.1.0
>Reporter: Pat McDonough
>  Labels: starter
>
> Programmatically retrieving the number of partitions is not consistent 
> between python and scala. A consistent method should be defined and made 
> public across both languages.
> RDD.partitions.size is also used quite frequently throughout the internal 
> code, so that might be worth refactoring as well once the new method is 
> available.
> What we have today is below.
> In Scala:
> {code}
> scala> someRDD.partitions.size
> res0: Int = 30
> {code}
> In Python:
> {code}
> In [2]: someRDD.getNumPartitions()
> Out[2]: 30
> {code}
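
One hedged sketch of what a consistent Scala-side accessor could look like, mirroring 
the PySpark name; the enrichment below is a proposal for illustration, not the merged 
API:

{code}
import org.apache.spark.rdd.RDD

// Proposal sketch: expose the partition count under the same name as PySpark.
// The real change would add the method to RDD itself; an implicit class is used
// here only so the sketch can be tried from the spark-shell.
implicit class RDDPartitionCount[T](rdd: RDD[T]) {
  def getNumPartitions: Int = rdd.partitions.length  // same value as rdd.partitions.size
}

// usage: someRDD.getNumPartitions
{code}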



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3624) "Failed to find Spark assembly in /usr/share/spark/lib" for RELEASED debian packages

2014-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142412#comment-14142412
 ] 

Apache Spark commented on SPARK-3624:
-

User 'tzolov' has created a pull request for this issue:
https://github.com/apache/spark/pull/2477

> "Failed to find Spark assembly in /usr/share/spark/lib" for RELEASED debian 
> packages
> 
>
> Key: SPARK-3624
> URL: https://issues.apache.org/jira/browse/SPARK-3624
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Deploy
>Affects Versions: 1.1.0
>Reporter: Christian Tzolov
>Priority: Minor
>
> The compute-classpath.sh requires that for a 'RELEASED' package the Spark 
> assembly jar is accessible from a /lib folder.
> Currently the jdeb packaging (assembly module) bundles the assembly jar into 
> a folder called 'jars'. 
> The result is:
> /usr/share/spark/bin/spark-submit   --num-executors 10 --master 
> yarn-cluster   --class org.apache.spark.examples.SparkPi   
> /usr/share/spark/jars/spark-examples-1.1.0-hadoop2.2.0-gphd-3.0.1.0.jar 10
> ls: cannot access /usr/share/spark/lib: No such file or directory
> Failed to find Spark assembly in /usr/share/spark/lib
> You need to build Spark before running this program.
> A trivial solution is to rename the '${deb.install.path}/jars' 
> inside assembly/pom.xml to ${deb.install.path}/lib.
> Another less impactful (considering backward compatibility) solution is to 
> define a lib->jars symlink in the assembly/pom.xml



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3624) "Failed to find Spark assembly in /usr/share/spark/lib" for RELEASED debian packages

2014-09-21 Thread Christian Tzolov (JIRA)
Christian Tzolov created SPARK-3624:
---

 Summary: "Failed to find Spark assembly in /usr/share/spark/lib" 
for RELEASED debian packages
 Key: SPARK-3624
 URL: https://issues.apache.org/jira/browse/SPARK-3624
 Project: Spark
  Issue Type: Bug
  Components: Build, Deploy
Affects Versions: 1.1.0
Reporter: Christian Tzolov
Priority: Minor


The compute-classpath.sh requires that for a 'RELEASED' package the Spark 
assembly jar is accessible from a /lib folder.

Currently the jdeb packaging (assembly module) bundles the assembly jar into a 
folder called 'jars'. 

The result is:
/usr/share/spark/bin/spark-submit   --num-executors 10 --master yarn-cluster 
  --class org.apache.spark.examples.SparkPi   
/usr/share/spark/jars/spark-examples-1.1.0-hadoop2.2.0-gphd-3.0.1.0.jar 10
ls: cannot access /usr/share/spark/lib: No such file or directory
Failed to find Spark assembly in /usr/share/spark/lib
You need to build Spark before running this program.

A trivial solution is to rename the '${deb.install.path}/jars' 
inside assembly/pom.xml to ${deb.install.path}/lib.

Another less impactful (considering backward compatibility) solution is to 
define a lib->jars symlink in the assembly/pom.xml




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3620) Refactor config option handling code for spark-submit

2014-09-21 Thread Dale Richardson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dale Richardson updated SPARK-3620:
---
Description: 
I'm proposing it's time to refactor the configuration argument handling code in 
spark-submit. The code has grown organically in a short period of time, handles 
a pretty complicated logic flow, and is now pretty fragile. Some issues that 
have been identified:

1. Hand-crafted property file readers that do not support the property file 
format as specified in 
http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html#load(java.io.Reader)

2. ResolveURI not called on paths read from conf/prop files

3. inconsistent means of merging / overriding values from different sources 
(Some get overridden by file, others by manual settings of field on object, 
Some by properties)

4. Argument validation should be done after combining config files, system 
properties and command line arguments, 

5. Alternate conf file location not handled in shell scripts

6. Some options can only be passed as command line arguments

7. Defaults for options are hard-coded (and sometimes overridden multiple 
times) in many places throughout the code, e.g. master = local[*]

Initial proposal is to use typesafe conf to read in the config information and 
merge the various config sources

  was:
I'm proposing it's time to refactor the configuration argument handling code in 
spark-submit. The code has grown organically in a short period of time, handles 
a pretty complicated logic flow, and is now pretty fragile. Some issues that 
have been identified:

1. Hand-crafted property file readers that do not support the property file 
format as specified in 
http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html#load(java.io.Reader)

2. ResolveURI not called on paths read from conf/prop files

3. inconsistent means of merging / overriding values from different sources 
(Some get overridden by file, others by manual settings of field on object, 
Some by properties)

4. Argument validation should be done after combining config files, system 
properties and command line arguments, 

5. Alternate conf file location not handled in shell scripts

6. Some options can only be passed as command line arguments

7. Defaults for options are hard-coded (and sometimes overridden multiple 
times) in many places throughout the code, e.g. master = local[*]


> Refactor config option handling code for spark-submit
> -
>
> Key: SPARK-3620
> URL: https://issues.apache.org/jira/browse/SPARK-3620
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.0.0, 1.1.0
>Reporter: Dale Richardson
>Assignee: Dale Richardson
>Priority: Minor
>
> I'm proposing it's time to refactor the configuration argument handling code 
> in spark-submit. The code has grown organically in a short period of time, 
> handles a pretty complicated logic flow, and is now pretty fragile. Some 
> issues that have been identified:
> 1. Hand-crafted property file readers that do not support the property file 
> format as specified in 
> http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html#load(java.io.Reader)
> 2. ResolveURI not called on paths read from conf/prop files
> 3. inconsistent means of merging / overriding values from different sources 
> (Some get overridden by file, others by manual settings of field on object, 
> Some by properties)
> 4. Argument validation should be done after combining config files, system 
> properties and command line arguments, 
> 5. Alternate conf file location not handled in shell scripts
> 6. Some options can only be passed as command line arguments
> 7. Defaults for options are hard-coded (and sometimes overridden multiple 
> times) in many places throughout the code, e.g. master = local[*]
> Initial proposal is to use typesafe conf to read in the config information 
> and merge the various config sources



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3620) Refactor config option handling code for spark-submit

2014-09-21 Thread Dale Richardson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142394#comment-14142394
 ] 

Dale Richardson edited comment on SPARK-3620 at 9/21/14 9:30 AM:
-

There seems to have been discussion about moving to Typesafe Config and back 
again around version 0.9:
http://apache-spark-developers-list.1001551.n3.nabble.com/Moving-to-Typesafe-Config-td381.html


was (Author: tigerquoll):
Seems to be discussion about moving to typesafe config and back again at
http://apache-spark-developers-list.1001551.n3.nabble.com/Moving-to-Typesafe-Config-td381.html

> Refactor config option handling code for spark-submit
> -
>
> Key: SPARK-3620
> URL: https://issues.apache.org/jira/browse/SPARK-3620
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.0.0, 1.1.0
>Reporter: Dale Richardson
>Assignee: Dale Richardson
>Priority: Minor
>
> I'm proposing it's time to refactor the configuration argument handling code 
> in spark-submit. The code has grown organically in a short period of time, 
> handles a pretty complicated logic flow, and is now pretty fragile. Some 
> issues that have been identified:
> 1. Hand-crafted property file readers that do not support the property file 
> format as specified in 
> http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html#load(java.io.Reader)
> 2. ResolveURI not called on paths read from conf/prop files
> 3. inconsistent means of merging / overriding values from different sources 
> (Some get overridden by file, others by manual settings of field on object, 
> Some by properties)
> 4. Argument validation should be done after combining config files, system 
> properties and command line arguments, 
> 5. Alternate conf file location not handled in shell scripts
> 6. Some options can only be passed as command line arguments
> 7. Defaults for options are hard-coded (and sometimes overridden multiple 
> times) in many places throughout the code, e.g. master = local[*]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3620) Refactor config option handling code for spark-submit

2014-09-21 Thread Dale Richardson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142395#comment-14142395
 ] 

Dale Richardson commented on SPARK-3620:


Also some notes about config properties that do not have unique prefixes at
http://apache-spark-developers-list.1001551.n3.nabble.com/Config-properties-broken-in-master-td208.html

It seems the following options have non-unique prefixes, which means that some 
typesafe conf functionality may be broken
spark.locality.wait 
spark.locality.wait.node 
spark.locality.wait.process 
spark.locality.wait.rack 

spark.speculation 
spark.speculation.interval 
spark.speculation.multiplier 
spark.speculation.quantile 
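
A small sketch of why such prefixes clash under Typesafe Config, where a path cannot 
be both a leaf value and an object; the behavior shown is inferred from the HOCON 
duplicate-key rules and should be treated as an assumption to verify:

{code}
import com.typesafe.config.ConfigFactory

// Later keys override earlier ones unless both values are objects, so
// "spark.speculation" ends up as an object and its boolean value is lost.
val cfg = ConfigFactory.parseString(
  """
    |spark.speculation = true
    |spark.speculation.interval = 100
  """.stripMargin)

cfg.getInt("spark.speculation.interval")  // 100
cfg.getBoolean("spark.speculation")       // expected to fail: the path now holds an object
{code}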

> Refactor config option handling code for spark-submit
> -
>
> Key: SPARK-3620
> URL: https://issues.apache.org/jira/browse/SPARK-3620
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.0.0, 1.1.0
>Reporter: Dale Richardson
>Assignee: Dale Richardson
>Priority: Minor
>
> I'm proposing it's time to refactor the configuration argument handling code 
> in spark-submit. The code has grown organically in a short period of time, 
> handles a pretty complicated logic flow, and is now pretty fragile. Some 
> issues that have been identified:
> 1. Hand-crafted property file readers that do not support the property file 
> format as specified in 
> http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html#load(java.io.Reader)
> 2. ResolveURI not called on paths read from conf/prop files
> 3. inconsistent means of merging / overriding values from different sources 
> (Some get overridden by file, others by manual settings of field on object, 
> Some by properties)
> 4. Argument validation should be done after combining config files, system 
> properties and command line arguments, 
> 5. Alternate conf file location not handled in shell scripts
> 6. Some options can only be passed as command line arguments
> 7. Defaults for options are hard-coded (and sometimes overridden multiple 
> times) in many places throughout the code, e.g. master = local[*]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3620) Refactor config option handling code for spark-submit

2014-09-21 Thread Dale Richardson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142394#comment-14142394
 ] 

Dale Richardson commented on SPARK-3620:


Seems to be discussion about moving to typesafe config and back again at
http://apache-spark-developers-list.1001551.n3.nabble.com/Moving-to-Typesafe-Config-td381.html

> Refactor config option handling code for spark-submit
> -
>
> Key: SPARK-3620
> URL: https://issues.apache.org/jira/browse/SPARK-3620
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.0.0, 1.1.0
>Reporter: Dale Richardson
>Assignee: Dale Richardson
>Priority: Minor
>
> I'm proposing it's time to refactor the configuration argument handling code 
> in spark-submit. The code has grown organically in a short period of time, 
> handles a pretty complicated logic flow, and is now pretty fragile. Some 
> issues that have been identified:
> 1. Hand-crafted property file readers that do not support the property file 
> format as specified in 
> http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html#load(java.io.Reader)
> 2. ResolveURI not called on paths read from conf/prop files
> 3. inconsistent means of merging / overriding values from different sources 
> (Some get overridden by file, others by manual settings of field on object, 
> Some by properties)
> 4. Argument validation should be done after combining config files, system 
> properties and command line arguments, 
> 5. Alternate conf file location not handled in shell scripts
> 6. Some options can only be passed as command line arguments
> 7. Defaults for options are hard-coded (and sometimes overridden multiple 
> times) in many through-out the code e.g. master = local[*]
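>
> A hedged sketch of one possible precedence rule covering points 3, 4 and 7 
> (the names and structure below are made up, not existing spark-submit code): 
> resolve every option through a single, explicit chain so overrides behave 
> consistently and each default lives in exactly one place.
> {code}
> // Hypothetical resolution chain: command-line argument > properties file > built-in default.
> def resolve(option: String,
>             cliArgs: Map[String, String],
>             fileProps: Map[String, String],
>             defaults: Map[String, String]): Option[String] = {
>   cliArgs.get(option)
>     .orElse(fileProps.get(option))
>     .orElse(defaults.get(option))
> }
>
> // Example: the hard-coded default from point 7 is declared once.
> // resolve("spark.master", cliArgs, fileProps, Map("spark.master" -> "local[*]"))
> {code}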



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3623) Graph should support the checkpoint operation

2014-09-21 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-3623:
---
Summary: Graph should support the checkpoint operation  (was: GraphX does 
not support the checkpoint operation)

> Graph should support the checkpoint operation
> -
>
> Key: SPARK-3623
> URL: https://issues.apache.org/jira/browse/SPARK-3623
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Guoqiang Li
>Priority: Critical
>
> Consider the following code:
> {code}
> for (i <- 0 until totalIter) {
>   val previousCorpus = corpus
>   logInfo("Start Gibbs sampling (Iteration %d/%d)".format(i, totalIter))
>   val corpusTopicDist = collectTermTopicDist(corpus, globalTopicCounter, 
> sumTerms,
> numTerms, numTopics, alpha, beta).persist(storageLevel)
>   val corpusSampleTopics = sampleTopics(corpusTopicDist, 
> globalTopicCounter, sumTerms, numTerms,
> numTopics, alpha, beta).persist(storageLevel)
>   corpus = updateCounter(corpusSampleTopics, 
> numTopics).persist(storageLevel)
>   globalTopicCounter = collectGlobalCounter(corpus, numTopics)
>   assert(bsum(globalTopicCounter) == sumTerms)
>   previousCorpus.unpersistVertices()
>   corpusTopicDist.unpersistVertices()
>   corpusSampleTopics.unpersistVertices()
> }
> {code}
> Without a checkpoint operation, the following problems appear:
> 1. The lineage of the corpus RDD grows too deep.
> 2. The shuffle files grow too large.
> 3. Any server crash forces the algorithm to recompute from the beginning (a 
> workaround sketch follows below).
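>
> A possible workaround sketch until Graph itself exposes a checkpoint 
> operation (assumptions: sc.setCheckpointDir has already been called, and the 
> interval below is arbitrary), reusing corpus and i from the loop above: 
> periodically checkpoint the underlying vertex and edge RDDs to truncate the 
> lineage.
> {code}
> val checkpointInterval = 10  // assumed value, tune per workload
> if (i % checkpointInterval == 0) {
>   corpus.vertices.checkpoint()
>   corpus.edges.checkpoint()
>   // force materialization so the checkpoint files are actually written
>   corpus.vertices.count()
>   corpus.edges.count()
> }
> {code}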



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access

2014-09-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142383#comment-14142383
 ] 

Sean Owen commented on SPARK-3621:
--

My understanding is that this is fairly fundamentally not possible in Spark. 
The metadata and machinery necessary to operate on RDDs live with the driver, 
and RDDs are not accessible within transformations or actions. I'm interested 
in whether that is in fact true, how much of an issue it really is for 
Hive-on-Spark to use collect + broadcast, and whether these sorts of 
requirements can be met with join, cogroup, etc.
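
As a concrete (and purely illustrative) version of the join route, a minimal 
sketch with made-up data that expresses the same lookup as a plain RDD join 
instead of broadcasting one side:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair RDD functions in Spark 1.x

val sc = new SparkContext(new SparkConf().setAppName("join-sketch").setMaster("local[*]"))
val big   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))   // large side
val small = sc.parallelize(Seq(("a", "alpha"), ("b", "beta")))  // small side

// Shuffle-based join; neither RDD is collected to the driver.
val joined = big.join(small)  // RDD[(String, (Int, String))]
joined.collect().foreach(println)
{code}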

> Provide a way to broadcast an RDD (instead of just a variable made of the 
> RDD) so that a job can access
> ---
>
> Key: SPARK-3621
> URL: https://issues.apache.org/jira/browse/SPARK-3621
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.1.0
>Reporter: Xuefu Zhang
>
> In some cases, such as Hive's way of doing a map-side join, it would be 
> beneficial to allow the client program to broadcast RDDs rather than just 
> variables derived from those RDDs. Broadcasting a variable derived from RDDs 
> requires that all of the RDD data be collected to the driver and that the 
> variable be shipped to the cluster after it is built (sketched below). It 
> would perform better if the driver simply broadcast the RDDs and the 
> corresponding data were used directly in jobs (for example, to build hash 
> maps at the executors).
> Tez has a broadcast edge which can ship data from the previous stage to the 
> next stage without requiring driver-side processing.
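>
> A minimal sketch of the collect + broadcast pattern described above (made-up 
> data; reuses sc, big, small and the imports from the join sketch in the 
> comment above):
> {code}
> val smallMap = small.collectAsMap()    // all of `small` is pulled to the driver
> val bcSmall  = sc.broadcast(smallMap)  // then shipped back out to every executor
>
> // Build the hash-map lookup inside the tasks that scan the large RDD.
> val joined = big.mapPartitions { iter =>
>   val lookup = bcSmall.value
>   iter.flatMap { case (k, v) => lookup.get(k).map(d => (k, (v, d))) }
> }
> joined.collect().foreach(println)
> {code}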



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3623) GraphX does not support the checkpoint operation

2014-09-21 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-3623:
---
Description: 
Consider the following code:
{code}
for (i <- 0 until totalIter) {
  val previousCorpus = corpus
  logInfo("Start Gibbs sampling (Iteration %d/%d)".format(i, totalIter))

  val corpusTopicDist = collectTermTopicDist(corpus, globalTopicCounter, 
sumTerms,
numTerms, numTopics, alpha, beta).persist(storageLevel)

  val corpusSampleTopics = sampleTopics(corpusTopicDist, 
globalTopicCounter, sumTerms, numTerms,
numTopics, alpha, beta).persist(storageLevel)

  corpus = updateCounter(corpusSampleTopics, 
numTopics).persist(storageLevel)

  globalTopicCounter = collectGlobalCounter(corpus, numTopics)
  assert(bsum(globalTopicCounter) == sumTerms)
  previousCorpus.unpersistVertices()
  corpusTopicDist.unpersistVertices()
  corpusSampleTopics.unpersistVertices()
}
{code}
Without a checkpoint operation, the following problems appear:
1. The lineage of the corpus RDD grows too deep.
2. The shuffle files grow too large.
3. Any server crash forces the algorithm to recompute from the beginning.

  was:
Consider the following code:
{code}
for (i <- 0 until totalIter) {
  val previousCorpus = corpus
  logInfo("Start Gibbs sampling (Iteration %d/%d)".format(i, totalIter))

  val corpusTopicDist = collectTermTopicDist(corpus, globalTopicCounter, 
sumTerms,
numTerms, numTopics, alpha, beta).persist(storageLevel)

  val corpusSampleTopics = sampleTopics(corpusTopicDist, 
globalTopicCounter, sumTerms, numTerms,
numTopics, alpha, beta).persist(storageLevel)

  corpus = updateCounter(corpusSampleTopics, 
numTopics).persist(storageLevel)

  globalTopicCounter = collectGlobalCounter(corpus, numTopics)
  assert(bsum(globalTopicCounter) == sumTerms)
  previousCorpus.unpersistVertices()
  corpusTopicDist.unpersistVertices()
  corpusSampleTopics.unpersistVertices()
}
{code}
Without a checkpoint operation, the following problems appear:
1. The lineage of the corpus RDD grows too deep.
2. The shuffle files grow too large.


> GraphX does not support the checkpoint operation
> 
>
> Key: SPARK-3623
> URL: https://issues.apache.org/jira/browse/SPARK-3623
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Guoqiang Li
>Priority: Critical
>
> Consider the following code:
> {code}
> for (i <- 0 until totalIter) {
>   val previousCorpus = corpus
>   logInfo("Start Gibbs sampling (Iteration %d/%d)".format(i, totalIter))
>   val corpusTopicDist = collectTermTopicDist(corpus, globalTopicCounter, 
> sumTerms,
> numTerms, numTopics, alpha, beta).persist(storageLevel)
>   val corpusSampleTopics = sampleTopics(corpusTopicDist, 
> globalTopicCounter, sumTerms, numTerms,
> numTopics, alpha, beta).persist(storageLevel)
>   corpus = updateCounter(corpusSampleTopics, 
> numTopics).persist(storageLevel)
>   globalTopicCounter = collectGlobalCounter(corpus, numTopics)
>   assert(bsum(globalTopicCounter) == sumTerms)
>   previousCorpus.unpersistVertices()
>   corpusTopicDist.unpersistVertices()
>   corpusSampleTopics.unpersistVertices()
> }
> {code}
> Without a checkpoint operation, the following problems appear:
> 1. The lineage of the corpus RDD grows too deep.
> 2. The shuffle files grow too large.
> 3. Any server crash forces the algorithm to recompute from the beginning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3623) GraphX does not support the checkpoint operation

2014-09-21 Thread Guoqiang Li (JIRA)
Guoqiang Li created SPARK-3623:
--

 Summary: GraphX does not support the checkpoint operation
 Key: SPARK-3623
 URL: https://issues.apache.org/jira/browse/SPARK-3623
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 1.1.0, 1.0.2
Reporter: Guoqiang Li
Priority: Critical


Consider the following code:
{code}
for (i <- 0 until totalIter) {
  val previousCorpus = corpus
  logInfo("Start Gibbs sampling (Iteration %d/%d)".format(i, totalIter))

  val corpusTopicDist = collectTermTopicDist(corpus, globalTopicCounter, 
sumTerms,
numTerms, numTopics, alpha, beta).persist(storageLevel)

  val corpusSampleTopics = sampleTopics(corpusTopicDist, 
globalTopicCounter, sumTerms, numTerms,
numTopics, alpha, beta).persist(storageLevel)

  corpus = updateCounter(corpusSampleTopics, 
numTopics).persist(storageLevel)

  globalTopicCounter = collectGlobalCounter(corpus, numTopics)
  assert(bsum(globalTopicCounter) == sumTerms)
  previousCorpus.unpersistVertices()
  corpusTopicDist.unpersistVertices()
  corpusSampleTopics.unpersistVertices()
}
{code}
Without a checkpoint operation, the following problems appear:
1. The lineage of the corpus RDD grows too deep.
2. The shuffle files grow too large.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org