[jira] [Updated] (SPARK-3526) Docs section on data locality

2014-09-15 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-3526:
--
Summary: Docs section on data locality  (was: Section on data locality)

 Docs section on data locality
 -

 Key: SPARK-3526
 URL: https://issues.apache.org/jira/browse/SPARK-3526
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.0.2
Reporter: Andrew Ash

 Several threads on the mailing list have been about data locality and how to 
 interpret PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, etc.  Let's get some more 
 details in the docs on this concept so we can point future questions there.
 A couple people appreciated the below description of locality so it could be 
 a good starting point:
 {quote}
 The locality is how close the data is to the code that's processing it.  
 PROCESS_LOCAL means data is in the same JVM as the code that's running, so 
 it's really fast.  NODE_LOCAL might mean that the data is in HDFS on the same 
 node, or in another executor on the same node, so is a little slower because 
 the data has to travel across an IPC connection.  RACK_LOCAL is even slower 
 -- data is on a different server so needs to be sent over the network.
 Spark switches to lower locality levels when there's no unprocessed data on a 
 node that has idle CPUs.  In that situation you have two options: wait until 
 the busy CPUs free up so you can start another task that uses data on that 
 server, or start a new task on a farther away server that needs to bring data 
 from that remote place.  What Spark typically does is wait a bit in the hopes 
 that a busy CPU frees up.  Once that timeout expires, it starts moving the 
 data from far away to the free CPU.
 {quote}
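The "wait a bit, then fall back" behavior described above is governed by the locality wait settings, so the new docs section could point at them directly. A minimal conf/spark-defaults.conf sketch (values illustrative; 3000 ms is, as far as I recall, the shipped default):

{noformat}
# how long the scheduler waits for a slot at a better locality level before
# falling back one level (milliseconds in current releases)
spark.locality.wait          3000
# optional per-level overrides; they fall back to spark.locality.wait when unset
spark.locality.wait.process  3000
spark.locality.wait.node     3000
spark.locality.wait.rack     3000
{noformat}

Setting spark.locality.wait to 0 effectively disables the wait, so tasks launch immediately at whatever locality level is available.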






[jira] [Commented] (SPARK-3526) Docs section on data locality

2014-09-15 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133662#comment-14133662
 ] 

Andrew Ash commented on SPARK-3526:
---

Note: reports from users that reading from {{file://}} may be logged as 
{{PROCESS_LOCAL}} ?

 Docs section on data locality
 -

 Key: SPARK-3526
 URL: https://issues.apache.org/jira/browse/SPARK-3526
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.0.2
Reporter: Andrew Ash

 Several threads on the mailing list have been about data locality and how to 
 interpret PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, etc.  Let's get some more 
 details in the docs on this concept so we can point future questions there.
 A couple people appreciated the below description of locality so it could be 
 a good starting point:
 {quote}
 The locality is how close the data is to the code that's processing it.  
 PROCESS_LOCAL means data is in the same JVM as the code that's running, so 
 it's really fast.  NODE_LOCAL might mean that the data is in HDFS on the same 
 node, or in another executor on the same node, so is a little slower because 
 the data has to travel across an IPC connection.  RACK_LOCAL is even slower 
 -- data is on a different server so needs to be sent over the network.
 Spark switches to lower locality levels when there's no unprocessed data on a 
 node that has idle CPUs.  In that situation you have two options: wait until 
 the busy CPUs free up so you can start another task that uses data on that 
 server, or start a new task on a farther away server that needs to bring data 
 from that remote place.  What Spark typically does is wait a bit in the hopes 
 that a busy CPU frees up.  Once that timeout expires, it starts moving the 
 data from far away to the free CPU.
 {quote}






[jira] [Created] (SPARK-3527) Strip the physical plan message margin

2014-09-15 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-3527:


 Summary: Strip the physical plan message margin
 Key: SPARK-3527
 URL: https://issues.apache.org/jira/browse/SPARK-3527
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Priority: Trivial









[jira] [Commented] (SPARK-3527) Strip the physical plan message margin

2014-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133669#comment-14133669
 ] 

Apache Spark commented on SPARK-3527:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/2392

 Strip the physical plan message margin
 --

 Key: SPARK-3527
 URL: https://issues.apache.org/jira/browse/SPARK-3527
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Priority: Trivial








[jira] [Created] (SPARK-3528) Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL

2014-09-15 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-3528:
-

 Summary: Reading data from file:/// should be called NODE_LOCAL 
not PROCESS_LOCAL
 Key: SPARK-3528
 URL: https://issues.apache.org/jira/browse/SPARK-3528
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Ash


Note that reading from {{file:///.../pom.xml}} is called a PROCESS_LOCAL task

{noformat}
14/09/15 00:59:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
localhost, PROCESS_LOCAL, 1191 bytes)
14/09/15 00:59:13 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 
localhost, PROCESS_LOCAL, 1191 bytes)
14/09/15 00:59:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
14/09/15 00:59:13 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
14/09/15 00:59:13 INFO HadoopRDD: Input split: 
file:/Users/aash/git/spark/pom.xml:20862+20863
14/09/15 00:59:13 INFO HadoopRDD: Input split: 
file:/Users/aash/git/spark/pom.xml:0+20862
{noformat}

There is an outstanding TODO in {{HadoopRDD.scala}} that may be related:

{noformat}
  override def getPreferredLocations(split: Partition): Seq[String] = {
    // TODO: Filtering out "localhost" in case of file:// URLs
    val hadoopSplit = split.asInstanceOf[HadoopPartition]
    hadoopSplit.inputSplit.value.getLocations.filter(_ != "localhost")
  }
{noformat}
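A quick way to observe this from a shell (a sketch; the path is hypothetical): run a count over any {{file://}} URL and look at where the tasks land, either in driver log lines like the ones above or in the Locality Level column on the stage page of the web UI.

{noformat}
scala> val rdd = sc.textFile("file:///tmp/pom.xml")   // any local file:// path
scala> rdd.count()
// then grep the driver log for "Starting task" lines, or open http://localhost:4040
// and check the "Locality Level" column for the stage
{noformat}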






[jira] [Comment Edited] (SPARK-3526) Docs section on data locality

2014-09-15 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133662#comment-14133662
 ] 

Andrew Ash edited comment on SPARK-3526 at 9/15/14 8:14 AM:


Note: reports from users that reading from {{file://}} may be logged as 
{{PROCESS_LOCAL}} ?

Edit: repro'd and filed as SPARK-3528


was (Author: aash):
Note: reports from users that reading from {{file://}} may be logged as 
{{PROCESS_LOCAL}} ?

 Docs section on data locality
 -

 Key: SPARK-3526
 URL: https://issues.apache.org/jira/browse/SPARK-3526
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.0.2
Reporter: Andrew Ash

 Several threads on the mailing list have been about data locality and how to 
 interpret PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, etc.  Let's get some more 
 details in the docs on this concept so we can point future questions there.
 A couple people appreciated the below description of locality so it could be 
 a good starting point:
 {quote}
 The locality is how close the data is to the code that's processing it.  
 PROCESS_LOCAL means data is in the same JVM as the code that's running, so 
 it's really fast.  NODE_LOCAL might mean that the data is in HDFS on the same 
 node, or in another executor on the same node, so is a little slower because 
 the data has to travel across an IPC connection.  RACK_LOCAL is even slower 
 -- data is on a different server so needs to be sent over the network.
 Spark switches to lower locality levels when there's no unprocessed data on a 
 node that has idle CPUs.  In that situation you have two options: wait until 
 the busy CPUs free up so you can start another task that uses data on that 
 server, or start a new task on a farther away server that needs to bring data 
 from that remote place.  What Spark typically does is wait a bit in the hopes 
 that a busy CPU frees up.  Once that timeout expires, it starts moving the 
 data from far away to the free CPU.
 {quote}






[jira] [Updated] (SPARK-3528) Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL

2014-09-15 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-3528:
--
Description: 
Note that reading from {{file:///.../pom.xml}} is called a PROCESS_LOCAL task

{noformat}
spark> sc.textFile("pom.xml").count
...
14/09/15 00:59:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
localhost, PROCESS_LOCAL, 1191 bytes)
14/09/15 00:59:13 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 
localhost, PROCESS_LOCAL, 1191 bytes)
14/09/15 00:59:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
14/09/15 00:59:13 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
14/09/15 00:59:13 INFO HadoopRDD: Input split: 
file:/Users/aash/git/spark/pom.xml:20862+20863
14/09/15 00:59:13 INFO HadoopRDD: Input split: 
file:/Users/aash/git/spark/pom.xml:0+20862
{noformat}

There is an outstanding TODO in {{HadoopRDD.scala}} that may be related:

{noformat}
  override def getPreferredLocations(split: Partition): Seq[String] = {
    // TODO: Filtering out "localhost" in case of file:// URLs
    val hadoopSplit = split.asInstanceOf[HadoopPartition]
    hadoopSplit.inputSplit.value.getLocations.filter(_ != "localhost")
  }
{noformat}

  was:
Note that reading from {{file:///.../pom.xml}} is called a PROCESS_LOCAL task

{noformat}
14/09/15 00:59:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
localhost, PROCESS_LOCAL, 1191 bytes)
14/09/15 00:59:13 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 
localhost, PROCESS_LOCAL, 1191 bytes)
14/09/15 00:59:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
14/09/15 00:59:13 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
14/09/15 00:59:13 INFO HadoopRDD: Input split: 
file:/Users/aash/git/spark/pom.xml:20862+20863
14/09/15 00:59:13 INFO HadoopRDD: Input split: 
file:/Users/aash/git/spark/pom.xml:0+20862
{noformat}

There is an outstanding TODO in {{HadoopRDD.scala}} that may be related:

{noformat}
  override def getPreferredLocations(split: Partition): Seq[String] = {
    // TODO: Filtering out "localhost" in case of file:// URLs
    val hadoopSplit = split.asInstanceOf[HadoopPartition]
    hadoopSplit.inputSplit.value.getLocations.filter(_ != "localhost")
  }
{noformat}


 Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL
 

 Key: SPARK-3528
 URL: https://issues.apache.org/jira/browse/SPARK-3528
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Ash

 Note that reading from {{file:///.../pom.xml}} is called a PROCESS_LOCAL task
 {noformat}
 spark> sc.textFile("pom.xml").count
 ...
 14/09/15 00:59:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
 localhost, PROCESS_LOCAL, 1191 bytes)
 14/09/15 00:59:13 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 
 localhost, PROCESS_LOCAL, 1191 bytes)
 14/09/15 00:59:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
 14/09/15 00:59:13 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
 14/09/15 00:59:13 INFO HadoopRDD: Input split: 
 file:/Users/aash/git/spark/pom.xml:20862+20863
 14/09/15 00:59:13 INFO HadoopRDD: Input split: 
 file:/Users/aash/git/spark/pom.xml:0+20862
 {noformat}
 There is an outstanding TODO in {{HadoopRDD.scala}} that may be related:
 {noformat}
   override def getPreferredLocations(split: Partition): Seq[String] = {
     // TODO: Filtering out "localhost" in case of file:// URLs
     val hadoopSplit = split.asInstanceOf[HadoopPartition]
     hadoopSplit.inputSplit.value.getLocations.filter(_ != "localhost")
   }
 {noformat}






[jira] [Created] (SPARK-3529) Delete the temporal files after test exit

2014-09-15 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-3529:


 Summary: Delete the temporal files after test exit
 Key: SPARK-3529
 URL: https://issues.apache.org/jira/browse/SPARK-3529
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao
Priority: Minor









[jira] [Commented] (SPARK-3529) Delete the temporal files after test exit

2014-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133683#comment-14133683
 ] 

Apache Spark commented on SPARK-3529:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/2393

 Delete the temporal files after test exit
 -

 Key: SPARK-3529
 URL: https://issues.apache.org/jira/browse/SPARK-3529
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao
Priority: Minor








[jira] [Created] (SPARK-3530) Pipeline and Parameters

2014-09-15 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-3530:


 Summary: Pipeline and Parameters
 Key: SPARK-3530
 URL: https://issues.apache.org/jira/browse/SPARK-3530
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical


This part of the design doc is for pipelines and parameters. I put the design 
doc at

https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing

I will copy the proposed interfaces to this JIRA later. Some sample code can be 
viewed at: https://github.com/mengxr/spark-ml/

Please help review the design and post your comments here. Thanks!






[jira] [Commented] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle

2014-09-15 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133694#comment-14133694
 ] 

Saisai Shao commented on SPARK-2926:


Hey [~rxin], here is the branch rebased on your code 
(https://github.com/jerryshao/apache-spark/tree/sort-shuffle-read-new-netty), 
mind taking a look at it? Thanks a lot.

 Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
 --

 Key: SPARK-2926
 URL: https://issues.apache.org/jira/browse/SPARK-2926
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 1.1.0
Reporter: Saisai Shao
 Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test 
 Report(contd).pdf, Spark Shuffle Test Report.pdf


 Currently Spark has already integrated a sort-based shuffle write, which 
 greatly improves IO performance and reduces memory consumption when the 
 number of reducers is very large. On the reducer side, however, it still 
 adopts the hash-based shuffle reader implementation, which neglects the 
 ordering attributes of the map output data in some situations.
 Here we propose an MR-style, sort-merge-like shuffle reader for sort-based 
 shuffle to further improve its performance.
 Work-in-progress code and a performance test report will be posted later 
 when some unit test bugs are fixed.
 Any comments would be greatly appreciated. 
 Thanks a lot.
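For readers skimming the issue, the core of an MR-style read is a k-way merge over map outputs that are already sorted by key. A minimal sketch of that idea (illustration only, not the proposed SortShuffleReader; {{mergeSorted}} is a made-up helper):

{code}
import scala.collection.mutable

// Merge k iterators, each already sorted by key, into one sorted iterator.
def mergeSorted[K, V](parts: Seq[Iterator[(K, V)]])(implicit ord: Ordering[K]): Iterator[(K, V)] = {
  // min-heap keyed on the next (head) key of each non-empty input
  val heap = mutable.PriorityQueue.empty[BufferedIterator[(K, V)]](
    Ordering.by[BufferedIterator[(K, V)], K](_.head._1).reverse)
  parts.map(_.buffered).filter(_.hasNext).foreach(heap.enqueue(_))
  new Iterator[(K, V)] {
    def hasNext: Boolean = heap.nonEmpty
    def next(): (K, V) = {
      val it = heap.dequeue()
      val kv = it.next()
      if (it.hasNext) heap.enqueue(it)   // re-insert with its new head key
      kv
    }
  }
}
{code}

A merge like this only needs to buffer one record per map output at a time, which is roughly where the memory savings over a hash-based read would come from.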






[jira] [Closed] (SPARK-3521) Missing modules in 1.1.0 source distribution - cant be build with maven

2014-09-15 Thread Radim Kolar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Radim Kolar closed SPARK-3521.
--
   Resolution: Not a Problem
Fix Version/s: 1.1.1

The compile problem is fixed on the GitHub branch-1.1.

 Missing modules in 1.1.0 source distribution - cant be build with maven
 ---

 Key: SPARK-3521
 URL: https://issues.apache.org/jira/browse/SPARK-3521
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.1.0
Reporter: Radim Kolar
Priority: Minor
 Fix For: 1.1.1


 Modules {{bagel}}, {{mllib}}, {{flume-sink}} and {{flume}} are missing from 
 the source code distro, so Spark can't be built with Maven. It can't be built 
 by {{sbt/sbt}} either, due to another bug (_java.lang.IllegalStateException: 
 impossible to get artifacts when data has not been loaded. IvyNode = 
 org.slf4j#slf4j-api;1.6.1_).
 (hsn@sanatana:pts/6):work/spark-1.1.0% mvn -Pyarn -Phadoop-2.4 
 -Dhadoop.version=2.4.1 -DskipTests clean package
 [INFO] Scanning for projects...
 [ERROR] The build could not read 1 project - [Help 1]
 [ERROR]   
 [ERROR]   The project org.apache.spark:spark-parent:1.1.0 
 (/home/hsn/myports/spark11/work/spark-1.1.0/pom.xml) has 4 errors
 [ERROR] Child module /home/hsn/myports/spark11/work/spark-1.1.0/bagel of 
 /home/hsn/myports/spark11/work/spark-1.1.0/pom.xml does not exist
 [ERROR] Child module /home/hsn/myports/spark11/work/spark-1.1.0/mllib of 
 /home/hsn/myports/spark11/work/spark-1.1.0/pom.xml does not exist
 [ERROR] Child module 
 /home/hsn/myports/spark11/work/spark-1.1.0/external/flume of 
 /home/hsn/myports/spark11/work/spark-1.1.0/pom.xml does not exist
 [ERROR] Child module 
 /home/hsn/myports/spark11/work/spark-1.1.0/external/flume-sink/pom.xml of 
 /home/hsn/myports/spark11/work/spark-1.1.0/pom.xml does not exist






[jira] [Created] (SPARK-3531) select null from table would throw a MatchError

2014-09-15 Thread Adrian Wang (JIRA)
Adrian Wang created SPARK-3531:
--

 Summary: select null from table would throw a MatchError
 Key: SPARK-3531
 URL: https://issues.apache.org/jira/browse/SPARK-3531
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Adrian Wang


select null from src limit 1 will lead to a scala.MatchError
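A minimal repro sketch (assumptions: {{sc}} is an existing SparkContext and a Hive table named {{src}} exists, as in the report):

{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// reportedly throws scala.MatchError instead of returning a single NULL row
hiveContext.sql("select null from src limit 1").collect()
{code}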






[jira] [Commented] (SPARK-3531) select null from table would throw a MatchError

2014-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133728#comment-14133728
 ] 

Apache Spark commented on SPARK-3531:
-

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/2396

 select null from table would throw a MatchError
 ---

 Key: SPARK-3531
 URL: https://issues.apache.org/jira/browse/SPARK-3531
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Adrian Wang

 select null from src limit 1 will lead to a scala.MatchError






[jira] [Commented] (SPARK-3530) Pipeline and Parameters

2014-09-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133760#comment-14133760
 ] 

Sean Owen commented on SPARK-3530:
--

A few high-level questions:

Is this a rewrite of MLlib? I see the old code will be deprecated. I assume the 
algorithms will come along, but in a fairly different form. I think that's 
actually a good thing. But is this targeted at a 2.x release, or sooner?

How does this relate to MLI and MLbase? I had thought they would in theory 
handle things like grid-search, but haven't seen activity or mention of these 
in a while. Is this at all a merge of the two or is MLlib going to take over 
these concerns?

I don't think you will need or want to use this code, but the oryx project 
already has an implementation of grid search on Spark. At least another take on 
the API for such a thing to consider. 
https://github.com/OryxProject/oryx/tree/master/oryx-ml/src/main/java/com/cloudera/oryx/ml/param

Big +1 for parameter tuning. That belongs as a first-class citizen. I'm also 
intrigued by doing better than trying every possible combination of parameters 
separately, and maybe sharing partial results to speed up several models' 
training. Is this realistic for any parameters besides things like the number of 
iterations, which isn't really a hyperparameter? I don't know, for example, of ways to 
build N models with N different overfitting params and share some work. I would 
love to know that's possible. Good to design for it anyway.

I see mention of a Dataset abstraction, which I'm assuming contains some type 
information, like distinguishing categorical and numeric features. I think 
that's very good!

I've always found the 'pipeline' part hard to build. It's tempting to construct 
a framework for feature extraction. To some degree you can by providing 
transformations, 1-hot encoding, etc. But I think that a framework for 
understanding arbitrary databases and fields and so on quickly becomes too 
endlessly large a scope. Spark Core to me is already the right abstraction for 
upstream ETL of data before entering an ML framework.  I mention it just 
because it's in the first picture, but I don't see discussion of actually doing 
user/product attribute selection later. So maybe it's not meant to be part of 
the proposal. 

I'd certainly like to keep up more with your work here. This is a big step 
forward in making MLlib more relevant to production deployments rather than 
just pure algorithm implementations.
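To make the grid-search point concrete, a naive baseline in plain Scala might look like the sketch below (illustration only, not the proposed spark.ml API; {{trainModel}} and {{evaluate}} are dummy stand-ins for a real estimator and metric):

{code}
// Stand-ins for whatever estimator and evaluation metric would plug in here.
def trainModel(regParam: Double, stepSize: Double): (Double, Double) = (regParam, stepSize)
def evaluate(model: (Double, Double)): Double = -(model._1 + model._2)  // dummy score

// Enumerate the full Cartesian product, train one model per combination,
// and keep the best by validation score.
val regParams = Seq(0.01, 0.1, 1.0)
val stepSizes = Seq(0.1, 1.0)
val results = for (r <- regParams; s <- stepSizes) yield {
  val model = trainModel(r, s)
  ((r, s), evaluate(model))
}
val ((bestReg, bestStep), bestScore) = results.maxBy(_._2)
{code}

The interesting design question is how much better a pipeline API can do than this, for example by sharing work across combinations where the optimizer allows it.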

 Pipeline and Parameters
 ---

 Key: SPARK-3530
 URL: https://issues.apache.org/jira/browse/SPARK-3530
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical

 This part of the design doc is for pipelines and parameters. I put the design 
 doc at
 https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing
 I will copy the proposed interfaces to this JIRA later. Some sample code can 
 be viewed at: https://github.com/mengxr/spark-ml/
 Please help review the design and post your comments here. Thanks!






[jira] [Commented] (SPARK-2594) Add CACHE TABLE name AS SELECT ...

2014-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133776#comment-14133776
 ] 

Apache Spark commented on SPARK-2594:
-

User 'ravipesala' has created a pull request for this issue:
https://github.com/apache/spark/pull/2397

 Add CACHE TABLE name AS SELECT ...
 

 Key: SPARK-2594
 URL: https://issues.apache.org/jira/browse/SPARK-2594
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Michael Armbrust
Priority: Critical








[jira] [Created] (SPARK-3532) Spark On FreeBSD. Snappy used by torrent broadcast fails to load native libs.

2014-09-15 Thread Prashant Sharma (JIRA)
Prashant Sharma created SPARK-3532:
--

 Summary: Spark On FreeBSD. Snappy used by torrent broadcast fails 
to load native libs.
 Key: SPARK-3532
 URL: https://issues.apache.org/jira/browse/SPARK-3532
 Project: Spark
  Issue Type: Bug
Reporter: Prashant Sharma
Priority: Minor


While trying out Spark on FreeBSD, this seemed like the first blocker.

Workaround: in conf/spark-defaults.conf, set spark.broadcast.compress to false.







[jira] [Updated] (SPARK-3532) Spark On FreeBSD. Snappy used by torrent broadcast fails to load native libs.

2014-09-15 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-3532:
---
Description: 
While trying out Spark on FreeBSD, this seemed like the first blocker.

Workaround: in conf/spark-defaults.conf, set spark.broadcast.compress to false.
An even better workaround is to set:
spark.io.compression.codec  lzf
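In conf/spark-defaults.conf form, the two workarounds would look roughly like this (a sketch; pick one):

{noformat}
# disable compressed broadcasts entirely
spark.broadcast.compress      false

# or keep compression but avoid the Snappy codec
spark.io.compression.codec    lzf
{noformat}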


  was:
While trying out spark on freebsd, this seemed like first blocker. 

Workaround: In conf/spark-defaults.conf, Set spark.broadcast.compress  false



 Spark On FreeBSD. Snappy used by torrent broadcast fails to load native libs.
 -

 Key: SPARK-3532
 URL: https://issues.apache.org/jira/browse/SPARK-3532
 Project: Spark
  Issue Type: Bug
Reporter: Prashant Sharma
Priority: Minor

 While trying out Spark on FreeBSD, this seemed like the first blocker.
 Workaround: in conf/spark-defaults.conf, set spark.broadcast.compress to
 false.
 An even better workaround is to set:
 spark.io.compression.codec  lzf






[jira] [Commented] (SPARK-3532) Spark On FreeBSD. Snappy used by torrent broadcast fails to load native libs.

2014-09-15 Thread Radim Kolar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133821#comment-14133821
 ] 

Radim Kolar commented on SPARK-3532:


You need to grab the Snappy native library from the _snappy-java_ FreeBSD port; it's not 
included in the Maven Central JAR.

{quote}
(hsn@sanatana:pts/8):~% pkg info -l snappyjava
snappyjava-1.0.4.1_1:
/usr/local/lib/libsnappyjava.so
/usr/local/share/java/classes/snappy-java.jar
{quote}
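One way to make that port-provided native library visible to Spark (an untested assumption on my part, not verified on FreeBSD) is to put its directory on the library path in conf/spark-defaults.conf:

{noformat}
spark.driver.extraLibraryPath    /usr/local/lib
spark.executor.extraLibraryPath  /usr/local/lib
{noformat}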

 Spark On FreeBSD. Snappy used by torrent broadcast fails to load native libs.
 -

 Key: SPARK-3532
 URL: https://issues.apache.org/jira/browse/SPARK-3532
 Project: Spark
  Issue Type: Bug
Reporter: Prashant Sharma
Priority: Minor

 While trying out Spark on FreeBSD, this seemed like the first blocker.
 Workaround: in conf/spark-defaults.conf, set spark.broadcast.compress to
 false.
 An even better workaround is to set:
 spark.io.compression.codec  lzf






[jira] [Resolved] (SPARK-3410) The priority of shutdownhook for ApplicationMaster should not be integer literal

2014-09-15 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-3410.
--
   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: Kousuke Saruta

 The priority of shutdownhook for ApplicationMaster should not be integer 
 literal
 

 Key: SPARK-3410
 URL: https://issues.apache.org/jira/browse/SPARK-3410
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.1.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
Priority: Minor
 Fix For: 1.2.0


 In ApplicationMaster, the priority of the shutdown hook is set to the integer 
 literal 30, which is expected to be higher than the priority of 
 o.a.h.FileSystem's shutdown hook.
 In FileSystem, the priority of the shutdown hook is exposed as a public constant 
 named SHUTDOWN_HOOK_PRIORITY, so I think it's better to use this constant 
 for the priority of ApplicationMaster's shutdown hook.
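 A sketch of the suggested change (not the actual patch; the hook body is a placeholder), registering relative to Hadoop's constant instead of the literal 30:

{code}
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.util.ShutdownHookManager

ShutdownHookManager.get().addShutdownHook(new Runnable {
  override def run(): Unit = {
    // placeholder for the ApplicationMaster's actual cleanup (e.g. deleting the staging dir)
    println("shutting down")
  }
}, FileSystem.SHUTDOWN_HOOK_PRIORITY + 1)  // always just above FileSystem's own hook
{code}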






[jira] [Commented] (SPARK-3396) Change LogisticRegressionWithSGD's default regType to L2

2014-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133917#comment-14133917
 ] 

Apache Spark commented on SPARK-3396:
-

User 'BigCrunsh' has created a pull request for this issue:
https://github.com/apache/spark/pull/2398

 Change LogisticRegressionWithSGD's default regType to L2
 -

 Key: SPARK-3396
 URL: https://issues.apache.org/jira/browse/SPARK-3396
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Affects Versions: 1.1.0
Reporter: Xiangrui Meng
Assignee: Christoph Sawade

 The default updater is SimpleUpdater, which doesn't add any regularization.
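 For context, a sketch of how L2 has to be requested explicitly today against the 1.1-era API (assuming {{training}} is an RDD[LabeledPoint] that already exists):

{code}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.optimization.SquaredL2Updater

val lr = new LogisticRegressionWithSGD()
lr.optimizer
  .setNumIterations(100)
  .setRegParam(0.1)
  .setUpdater(new SquaredL2Updater)  // L2; the current default SimpleUpdater adds no regularization
val model = lr.run(training)
{code}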






[jira] [Commented] (SPARK-3528) Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL

2014-09-15 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133927#comment-14133927
 ] 

Nicholas Chammas commented on SPARK-3528:
-

[~aash] - How about for data read from S3? I see that being marked as 
{{PROCESS_LOCAL}} as well. 

{code}
 sc.textFile('s3n://...').count()
14/09/15 10:12:20 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
localhost, PROCESS_LOCAL, 1242 bytes)
14/09/15 10:12:20 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 
localhost, PROCESS_LOCAL, 1242 bytes)
{code}


 Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL
 

 Key: SPARK-3528
 URL: https://issues.apache.org/jira/browse/SPARK-3528
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Ash

 Note that reading from {{file:///.../pom.xml}} is called a PROCESS_LOCAL task
 {noformat}
 spark> sc.textFile("pom.xml").count
 ...
 14/09/15 00:59:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
 localhost, PROCESS_LOCAL, 1191 bytes)
 14/09/15 00:59:13 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 
 localhost, PROCESS_LOCAL, 1191 bytes)
 14/09/15 00:59:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
 14/09/15 00:59:13 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
 14/09/15 00:59:13 INFO HadoopRDD: Input split: 
 file:/Users/aash/git/spark/pom.xml:20862+20863
 14/09/15 00:59:13 INFO HadoopRDD: Input split: 
 file:/Users/aash/git/spark/pom.xml:0+20862
 {noformat}
 There is an outstanding TODO in {{HadoopRDD.scala}} that may be related:
 {noformat}
   override def getPreferredLocations(split: Partition): Seq[String] = {
     // TODO: Filtering out "localhost" in case of file:// URLs
     val hadoopSplit = split.asInstanceOf[HadoopPartition]
     hadoopSplit.inputSplit.value.getLocations.filter(_ != "localhost")
   }
 {noformat}






[jira] [Commented] (SPARK-3526) Docs section on data locality

2014-09-15 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133935#comment-14133935
 ] 

Nicholas Chammas commented on SPARK-3526:
-

FYI: Looks like the valid localities are [enumerated 
here|https://github.com/apache/spark/blob/cc14644460872efb344e8d895859d70213a40840/core/src/main/scala/org/apache/spark/scheduler/TaskLocality.scala#L25].
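From memory, that enumeration is roughly the following (the linked file is authoritative):

{code}
object TaskLocality extends Enumeration {
  // ordered from most local to least local
  val PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY = Value
}
{code}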

 Docs section on data locality
 -

 Key: SPARK-3526
 URL: https://issues.apache.org/jira/browse/SPARK-3526
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.0.2
Reporter: Andrew Ash
Assignee: Andrew Ash

 Several threads on the mailing list have been about data locality and how to 
 interpret PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, etc.  Let's get some more 
 details in the docs on this concept so we can point future questions there.
 A couple people appreciated the below description of locality so it could be 
 a good starting point:
 {quote}
 The locality is how close the data is to the code that's processing it.  
 PROCESS_LOCAL means data is in the same JVM as the code that's running, so 
 it's really fast.  NODE_LOCAL might mean that the data is in HDFS on the same 
 node, or in another executor on the same node, so is a little slower because 
 the data has to travel across an IPC connection.  RACK_LOCAL is even slower 
 -- data is on a different server so needs to be sent over the network.
 Spark switches to lower locality levels when there's no unprocessed data on a 
 node that has idle CPUs.  In that situation you have two options: wait until 
 the busy CPUs free up so you can start another task that uses data on that 
 server, or start a new task on a farther away server that needs to bring data 
 from that remote place.  What Spark typically does is wait a bit in the hopes 
 that a busy CPU frees up.  Once that timeout expires, it starts moving the 
 data from far away to the free CPU.
 {quote}






[jira] [Resolved] (SPARK-3470) Have JavaSparkContext implement Closeable/AutoCloseable

2014-09-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3470.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

 Have JavaSparkContext implement Closeable/AutoCloseable
 ---

 Key: SPARK-3470
 URL: https://issues.apache.org/jira/browse/SPARK-3470
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Shay Rojansky
Priority: Minor
 Fix For: 1.2.0


 After discussion in SPARK-2972, it seems like a good idea to allow Java 
 developers to use Java 7 automatic resource management with JavaSparkContext, 
 like so:
 {code:java}
 try (JavaSparkContext ctx = new JavaSparkContext(...)) {
return br.readLine();
 }
 {code}






[jira] [Commented] (SPARK-1895) Run tests on windows

2014-09-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133964#comment-14133964
 ] 

Sean Owen commented on SPARK-1895:
--

Can anyone still reproduce this? I know test temp file cleanup was improved in 
1.0.x, and am not sure I have heard of this since.

 Run tests on windows
 

 Key: SPARK-1895
 URL: https://issues.apache.org/jira/browse/SPARK-1895
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Windows
Affects Versions: 0.9.1
 Environment: spark-0.9.1-bin-hadoop1
Reporter: stribog
Priority: Trivial

 bin\pyspark python\pyspark\rdd.py
 Sometimes tests complete without error _.
 Last tests fail log:
 
 14/05/21 18:31:40 INFO Executor: Running task ID 321
 14/05/21 18:31:40 INFO Executor: Running task ID 324
 14/05/21 18:31:40 INFO Executor: Running task ID 322
 14/05/21 18:31:40 INFO Executor: Running task ID 323
 14/05/21 18:31:40 INFO PythonRDD: Times: total = 241, boot = 240, init = 1, 
 finish = 0
 14/05/21 18:31:40 INFO Executor: Serialized size of result for 324 is 607
 14/05/21 18:31:40 INFO Executor: Sending result for 324 directly to driver
 14/05/21 18:31:40 INFO Executor: Finished task ID 324
 14/05/21 18:31:40 INFO TaskSetManager: Finished TID 324 in 248 ms on 
 localhost (progress: 1/4)
 14/05/21 18:31:40 INFO DAGScheduler: Completed ResultTask(80, 3)
 14/05/21 18:31:40 INFO PythonRDD: Times: total = 518, boot = 516, init = 2, 
 finish = 0
 14/05/21 18:31:40 INFO Executor: Serialized size of result for 323 is 607
 14/05/21 18:31:40 INFO Executor: Sending result for 323 directly to driver
 14/05/21 18:31:40 INFO Executor: Finished task ID 323
 14/05/21 18:31:40 INFO TaskSetManager: Finished TID 323 in 528 ms on 
 localhost (progress: 2/4)
 14/05/21 18:31:40 INFO DAGScheduler: Completed ResultTask(80, 2)
 14/05/21 18:31:41 INFO PythonRDD: Times: total = 776, boot = 774, init = 2, 
 finish = 0
 14/05/21 18:31:41 INFO Executor: Serialized size of result for 322 is 607
 14/05/21 18:31:41 INFO Executor: Sending result for 322 directly to driver
 14/05/21 18:31:41 INFO Executor: Finished task ID 322
 14/05/21 18:31:41 INFO TaskSetManager: Finished TID 322 in 785 ms on 
 localhost (progress: 3/4)
 14/05/21 18:31:41 INFO DAGScheduler: Completed ResultTask(80, 1)
 14/05/21 18:31:41 INFO PythonRDD: Times: total = 1043, boot = 1042, init = 1, 
 finish = 0
 14/05/21 18:31:41 INFO Executor: Serialized size of result for 321 is 607
 14/05/21 18:31:41 INFO Executor: Sending result for 321 directly to driver
 14/05/21 18:31:41 INFO Executor: Finished task ID 321
 14/05/21 18:31:41 INFO TaskSetManager: Finished TID 321 in 1049 ms on 
 localhost (progress: 4/4)
 14/05/21 18:31:41 INFO DAGScheduler: Completed ResultTask(80, 0)
 14/05/21 18:31:41 INFO TaskSchedulerImpl: Removed TaskSet 80.0, whose tasks 
 have all completed, from pool
 14/05/21 18:31:41 INFO DAGScheduler: Stage 80 (top at doctest 
 __main__.RDD.top[0]:1) finished in 1,051 s
 14/05/21 18:31:41 INFO SparkContext: Job finished: top at doctest 
 __main__.RDD.top[0]:1, took 1.053832912 s
 14/05/21 18:31:41 INFO SparkContext: Starting job: top at doctest 
 __main__.RDD.top[1]:1
 14/05/21 18:31:41 INFO DAGScheduler: Got job 63 (top at doctest 
 __main__.RDD.top[1]:1) with 4 output partitions (allowLocal=false)
 14/05/21 18:31:41 INFO DAGScheduler: Final stage: Stage 81 (top at doctest 
 __main__.RDD.top[1]:1)
 14/05/21 18:31:41 INFO DAGScheduler: Parents of final stage: List()
 14/05/21 18:31:41 INFO DAGScheduler: Missing parents: List()
 14/05/21 18:31:41 INFO DAGScheduler: Submitting Stage 81 (PythonRDD[213] at 
 top at doctest __main__.RDD.top[1]:1), which has no missing parents
 14/05/21 18:31:41 INFO DAGScheduler: Submitting 4 missing tasks from Stage 81 
 (PythonRDD[213] at top at doctest __main__.RDD.top[1]:1)
 14/05/21 18:31:41 INFO TaskSchedulerImpl: Adding task set 81.0 with 4 tasks
 14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:0 as TID 325 on 
 executor localhost: localhost (PROCESS_LOCAL)
 14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:0 as 2594 bytes 
 in 0 ms
 14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:1 as TID 326 on 
 executor localhost: localhost (PROCESS_LOCAL)
 14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:1 as 2594 bytes 
 in 0 ms
 14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:2 as TID 327 on 
 executor localhost: localhost (PROCESS_LOCAL)
 14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:2 as 2594 bytes 
 in 0 ms
 14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:3 as TID 328 on 
 executor localhost: localhost (PROCESS_LOCAL)
 14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:3 as 2609 bytes 
 in 1 ms
 14/05/21 18:31:41 INFO Executor: Running task ID 326
 14/05/21 18:31:41 INFO Executor: 

[jira] [Resolved] (SPARK-1258) RDD.countByValue optimization

2014-09-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1258.
--
Resolution: Won't Fix

I'm taking the liberty of closing this, since this refers to an optimization 
using fastutil classes, which were removed from Spark. An equivalent 
optimization is employed now, using Spark's OpenHashMap.

 RDD.countByValue optimization
 -

 Key: SPARK-1258
 URL: https://issues.apache.org/jira/browse/SPARK-1258
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Jaroslav Kamenik
Priority: Trivial

 Class Object2LongOpenHashMap has a method add(key, incr) (addTo in the new 
 version) for incrementing the value assigned to the key. It should be faster 
 than the currently used map.put(v, map.getLong(v) + 1L).
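 To illustrate the pattern being compared (fastutil has since been removed from Spark, so this is only a sketch of the idea, not current Spark code):

{code}
import it.unimi.dsi.fastutil.objects.Object2LongOpenHashMap

val counts = new Object2LongOpenHashMap[String]()
counts.addTo("spark", 1L)                          // one lookup, increments in place
counts.put("spark", counts.getLong("spark") + 1L)  // the get-then-put pattern this issue wants to avoid
{code}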






[jira] [Commented] (SPARK-3506) 1.1.0-SNAPSHOT in docs for 1.1.0 under docs/latest

2014-09-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133976#comment-14133976
 ] 

Sean Owen commented on SPARK-3506:
--

Yeah, I imagine that can be touched up right now. For the future, I imagine the 
issue was just that the site was built from the branch before the release 
plugin upped the version and created the artifacts? So the site might be better 
built from the final released source artifact.

I imagine it's a release-process doc change, but I don't know where that lives.

 1.1.0-SNAPSHOT in docs for 1.1.0 under docs/latest
 --

 Key: SPARK-3506
 URL: https://issues.apache.org/jira/browse/SPARK-3506
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.1.0
Reporter: Jacek Laskowski
Priority: Trivial

 In https://spark.apache.org/docs/latest/ there are references to 
 1.1.0-SNAPSHOT:
 * This documentation is for Spark version 1.1.0-SNAPSHOT.
 * For the Scala API, Spark 1.1.0-SNAPSHOT uses Scala 2.10.
 It should be version 1.1.0, since that's the latest released version and the 
 header says so, too.






[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce

2014-09-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134096#comment-14134096
 ] 

Sean Owen commented on SPARK-2620:
--

FWIW, here is a mailing list comment that suggests 1.1 works with these case 
classes, although this is not a case where the REPL is being used:

http://apache-spark-user-list.1001560.n3.nabble.com/Compiler-issues-for-multiple-map-on-RDD-td14248.html

 case class cannot be used as key for reduce
 ---

 Key: SPARK-2620
 URL: https://issues.apache.org/jira/browse/SPARK-2620
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: reproduced on spark-shell local[4]
Reporter: Gerard Maas
Priority: Critical
  Labels: case-class, core

 Using a case class as a key doesn't seem to work properly on Spark 1.0.0
 A minimal example:
 case class P(name: String)
 val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
 sc.parallelize(ps).map(x => (x,1)).reduceByKey((x,y) => x+y).collect
 [Spark shell local mode] res: Array[(P, Int)] = Array((P(bob),1), 
 (P(bob),1), (P(abe),1), (P(charly),1))
 In contrast to the expected behavior, that should be equivalent to:
 sc.parallelize(ps).map(x => (x.name,1)).reduceByKey((x,y) => x+y).collect
 Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2))
 groupByKey and distinct also present the same behavior.






[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce

2014-09-15 Thread Daniel Siegmann (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134101#comment-14134101
 ] 

Daniel Siegmann commented on SPARK-2620:


I have tested the case in spark-shell on Spark 1.1.0. It is still broken.

 case class cannot be used as key for reduce
 ---

 Key: SPARK-2620
 URL: https://issues.apache.org/jira/browse/SPARK-2620
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: reproduced on spark-shell local[4]
Reporter: Gerard Maas
Priority: Critical
  Labels: case-class, core

 Using a case class as a key doesn't seem to work properly on Spark 1.0.0
 A minimal example:
 case class P(name: String)
 val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
 sc.parallelize(ps).map(x => (x,1)).reduceByKey((x,y) => x+y).collect
 [Spark shell local mode] res: Array[(P, Int)] = Array((P(bob),1), 
 (P(bob),1), (P(abe),1), (P(charly),1))
 In contrast to the expected behavior, that should be equivalent to:
 sc.parallelize(ps).map(x => (x.name,1)).reduceByKey((x,y) => x+y).collect
 Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2))
 groupByKey and distinct also present the same behavior.






[jira] [Comment Edited] (SPARK-2984) FileNotFoundException on _temporary directory

2014-09-15 Thread Gregory Phillips (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134073#comment-14134073
 ] 

Gregory Phillips edited comment on SPARK-2984 at 9/15/14 4:43 PM:
--

I'm running into this as well. But to respond to this theory:

{quote}
I think this may be related to spark.speculation. I think the error condition 
might manifest in this circumstance:
1) task T starts on an executor E1
2) it takes a long time, so task T' is started on another executor E2
3) T finishes in E1 so moves its data from _temporary to the final destination 
and deletes the _temporary directory during cleanup
4) T' finishes in E2 and attempts to move its data from _temporary, but those 
files no longer exist! exception
{quote}

Speculation is not necessary for this to occur. I am consistently running into 
this while testing some code against local without speculation where I am 
trying to download, manipulate and merge 2 sets of data from S3 and serialize 
the resulting RDD using saveAsTextFile back to S3:

{code}
Job aborted due to stage failure: Task 3.0:754 failed 1 times, most recent 
failure: Exception failure in TID 762 on host localhost: 
java.io.FileNotFoundException: 
s3n://bucket/_temporary/_attempt_201409151537__m_000754_762/part-00754.deflate:
 No such file or directory. 
org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:340)
 
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:165)
 
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)
 
org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
 org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109) 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:786)
 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:769)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) 
org.apache.spark.scheduler.Task.run(Task.scala:51) 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
java.lang.Thread.run(Thread.java:744) Driver stacktrace:
{code}

I'm happy to provide more information or help investigate further to figure 
this one out.

Edit: I forgot to mention that the file in question actually did exist on S3 
when I checked after receiving this exception.


was (Author: gphil):
I'm running into this as well. But to respond to this theory:

{quote}
I think this may be related to spark.speculation. I think the error condition 
might manifest in this circumstance:
1) task T starts on an executor E1
2) it takes a long time, so task T' is started on another executor E2
3) T finishes in E1 so moves its data from _temporary to the final destination 
and deletes the _temporary directory during cleanup
4) T' finishes in E2 and attempts to move its data from _temporary, but those 
files no longer exist! exception
{quote}

Speculation is not necessary for this to occur. I am consistently running into 
this while testing some code against local without speculation where I am 
trying to download, manipulate and merge 2 sets of data from S3 and serialize 
the resulting RDD using saveAsTextFile back to S3:

{code}
Job aborted due to stage failure: Task 3.0:754 failed 1 times, most recent 
failure: Exception failure in TID 762 on host localhost: 
java.io.FileNotFoundException: 
s3n://bucket/_temporary/_attempt_201409151537__m_000754_762/part-00754.deflate:
 No such file or directory. 
org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:340)
 
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:165)
 
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)
 
org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
 org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109) 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:786)
 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:769)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) 
org.apache.spark.scheduler.Task.run(Task.scala:51) 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
java.lang.Thread.run(Thread.java:744) Driver stacktrace:
{code}

I'm happy to provide more information or help investigate further to figure 
this one out.

 FileNotFoundException on _temporary directory
 

[jira] [Commented] (SPARK-2932) Move MasterFailureTest out of main source directory

2014-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134116#comment-14134116
 ] 

Apache Spark commented on SPARK-2932:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/2399

 Move MasterFailureTest out of main source directory
 -

 Key: SPARK-2932
 URL: https://issues.apache.org/jira/browse/SPARK-2932
 Project: Spark
  Issue Type: Task
  Components: Streaming
Reporter: Marcelo Vanzin
Priority: Trivial

 Currently, MasterFailureTest.scala lives in streaming/src/main, which means 
 it ends up in the published streaming jar.
 It's only used by other test code, and although it also provides a main() 
 entry point, that's also only usable for testing, so the code should probably 
 be moved to the test directory.






[jira] [Updated] (SPARK-1895) Run tests on windows

2014-09-15 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-1895:
--
Description: 
bin\pyspark python\pyspark\rdd.py

Sometimes tests complete without error _.
Last tests fail log:

{noformat}

14/05/21 18:31:40 INFO Executor: Running task ID 321
14/05/21 18:31:40 INFO Executor: Running task ID 324
14/05/21 18:31:40 INFO Executor: Running task ID 322
14/05/21 18:31:40 INFO Executor: Running task ID 323
14/05/21 18:31:40 INFO PythonRDD: Times: total = 241, boot = 240, init = 1, 
finish = 0
14/05/21 18:31:40 INFO Executor: Serialized size of result for 324 is 607
14/05/21 18:31:40 INFO Executor: Sending result for 324 directly to driver
14/05/21 18:31:40 INFO Executor: Finished task ID 324
14/05/21 18:31:40 INFO TaskSetManager: Finished TID 324 in 248 ms on localhost 
(progress: 1/4)
14/05/21 18:31:40 INFO DAGScheduler: Completed ResultTask(80, 3)
14/05/21 18:31:40 INFO PythonRDD: Times: total = 518, boot = 516, init = 2, 
finish = 0
14/05/21 18:31:40 INFO Executor: Serialized size of result for 323 is 607
14/05/21 18:31:40 INFO Executor: Sending result for 323 directly to driver
14/05/21 18:31:40 INFO Executor: Finished task ID 323
14/05/21 18:31:40 INFO TaskSetManager: Finished TID 323 in 528 ms on localhost 
(progress: 2/4)
14/05/21 18:31:40 INFO DAGScheduler: Completed ResultTask(80, 2)
14/05/21 18:31:41 INFO PythonRDD: Times: total = 776, boot = 774, init = 2, 
finish = 0
14/05/21 18:31:41 INFO Executor: Serialized size of result for 322 is 607
14/05/21 18:31:41 INFO Executor: Sending result for 322 directly to driver
14/05/21 18:31:41 INFO Executor: Finished task ID 322
14/05/21 18:31:41 INFO TaskSetManager: Finished TID 322 in 785 ms on localhost 
(progress: 3/4)
14/05/21 18:31:41 INFO DAGScheduler: Completed ResultTask(80, 1)
14/05/21 18:31:41 INFO PythonRDD: Times: total = 1043, boot = 1042, init = 1, 
finish = 0
14/05/21 18:31:41 INFO Executor: Serialized size of result for 321 is 607
14/05/21 18:31:41 INFO Executor: Sending result for 321 directly to driver
14/05/21 18:31:41 INFO Executor: Finished task ID 321
14/05/21 18:31:41 INFO TaskSetManager: Finished TID 321 in 1049 ms on localhost 
(progress: 4/4)
14/05/21 18:31:41 INFO DAGScheduler: Completed ResultTask(80, 0)
14/05/21 18:31:41 INFO TaskSchedulerImpl: Removed TaskSet 80.0, whose tasks 
have all completed, from pool
14/05/21 18:31:41 INFO DAGScheduler: Stage 80 (top at doctest 
__main__.RDD.top[0]:1) finished in 1,051 s
14/05/21 18:31:41 INFO SparkContext: Job finished: top at doctest 
__main__.RDD.top[0]:1, took 1.053832912 s
14/05/21 18:31:41 INFO SparkContext: Starting job: top at doctest 
__main__.RDD.top[1]:1
14/05/21 18:31:41 INFO DAGScheduler: Got job 63 (top at doctest 
__main__.RDD.top[1]:1) with 4 output partitions (allowLocal=false)
14/05/21 18:31:41 INFO DAGScheduler: Final stage: Stage 81 (top at doctest 
__main__.RDD.top[1]:1)
14/05/21 18:31:41 INFO DAGScheduler: Parents of final stage: List()
14/05/21 18:31:41 INFO DAGScheduler: Missing parents: List()
14/05/21 18:31:41 INFO DAGScheduler: Submitting Stage 81 (PythonRDD[213] at top 
at doctest __main__.RDD.top[1]:1), which has no missing parents
14/05/21 18:31:41 INFO DAGScheduler: Submitting 4 missing tasks from Stage 81 
(PythonRDD[213] at top at doctest __main__.RDD.top[1]:1)
14/05/21 18:31:41 INFO TaskSchedulerImpl: Adding task set 81.0 with 4 tasks
14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:0 as TID 325 on 
executor localhost: localhost (PROCESS_LOCAL)
14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:0 as 2594 bytes in 
0 ms
14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:1 as TID 326 on 
executor localhost: localhost (PROCESS_LOCAL)
14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:1 as 2594 bytes in 
0 ms
14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:2 as TID 327 on 
executor localhost: localhost (PROCESS_LOCAL)
14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:2 as 2594 bytes in 
0 ms
14/05/21 18:31:41 INFO TaskSetManager: Starting task 81.0:3 as TID 328 on 
executor localhost: localhost (PROCESS_LOCAL)
14/05/21 18:31:41 INFO TaskSetManager: Serialized task 81.0:3 as 2609 bytes in 
1 ms
14/05/21 18:31:41 INFO Executor: Running task ID 326
14/05/21 18:31:41 INFO Executor: Running task ID 328
14/05/21 18:31:41 INFO Executor: Running task ID 327
14/05/21 18:31:41 INFO Executor: Running task ID 325
14/05/21 18:31:41 INFO CacheManager: Partition rdd_212_3 not found, computing it
14/05/21 18:31:41 INFO MemoryStore: ensureFreeSpace(152) called with 
curMem=1120, maxMem=311387750
14/05/21 18:31:41 INFO MemoryStore: Block rdd_212_3 stored as values to memory 
(estimated size 152.0 B, free 297.0 MB)
14/05/21 18:31:41 INFO BlockManagerMasterActor$BlockManagerInfo: Added 
rdd_212_3 in memory on stribog-pc:37187 (size: 152.0 B, free: 297.0 MB)
14/05/21 18:31:41 INFO 

[jira] [Updated] (SPARK-1764) EOF reached before Python server acknowledged

2014-09-15 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-1764:
--
Description: 
I'm getting EOF reached before Python server acknowledged while using PySpark 
on Mesos. The error manifests itself in multiple ways. One is:

{noformat}
14/05/08 18:10:40 ERROR DAGSchedulerActorSupervisor: eventProcesserActor failed 
due to the error EOF reached before Python server acknowledged; shutting down 
SparkContext
{noformat}

And the other has a full stacktrace:

{noformat}
14/05/08 18:03:06 ERROR OneForOneStrategy: EOF reached before Python server 
acknowledged
org.apache.spark.SparkException: EOF reached before Python server acknowledged
at 
org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:416)
at 
org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:387)
at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:71)
at 
org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:279)
at 
org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:277)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.Accumulators$.add(Accumulators.scala:277)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:818)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1204)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{noformat}

This error causes the SparkContext to shut down. I have not been able to 
reliably reproduce this bug; it seems to happen randomly, but if you run enough 
tasks on a SparkContext it will happen eventually.

  was:
I'm getting EOF reached before Python server acknowledged while using PySpark 
on Mesos. The error manifests itself in multiple ways. One is:

14/05/08 18:10:40 ERROR DAGSchedulerActorSupervisor: eventProcesserActor failed 
due to the error EOF reached before Python server acknowledged; shutting down 
SparkContext

And the other has a full stacktrace:

14/05/08 18:03:06 ERROR OneForOneStrategy: EOF reached before Python server 
acknowledged
org.apache.spark.SparkException: EOF reached before Python server acknowledged
at 
org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:416)
at 
org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:387)
at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:71)
at 
org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:279)
at 
org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:277)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.Accumulators$.add(Accumulators.scala:277)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:818)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1204)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at 

[jira] [Updated] (SPARK-2586) Lack of information to figure out connection to Tachyon master is inactive/ down

2014-09-15 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-2586:
--
Description: 
When running Spark with Tachyon, if the connection to the Tachyon master is 
down (due to a network problem or because the master node is down), there is no 
clear log or error message to indicate it.

Here is a sample log from running the SparkTachyonPi example while it connects 
to Tachyon:

{noformat}
14/07/15 16:43:10 INFO Utils: Using Spark's default log4j profile: 
org/apache/spark/log4j-defaults.properties
14/07/15 16:43:10 WARN Utils: Your hostname, henry-pivotal.local resolves to a 
loopback address: 127.0.0.1; using 10.64.5.148 instead (on interface en5)
14/07/15 16:43:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another 
address
14/07/15 16:43:11 INFO SecurityManager: Changing view acls to: hsaputra
14/07/15 16:43:11 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(hsaputra)
14/07/15 16:43:11 INFO Slf4jLogger: Slf4jLogger started
14/07/15 16:43:11 INFO Remoting: Starting remoting
14/07/15 16:43:11 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sp...@office-5-148.pa.gopivotal.com:53203]
14/07/15 16:43:11 INFO Remoting: Remoting now listens on addresses: 
[akka.tcp://sp...@office-5-148.pa.gopivotal.com:53203]
14/07/15 16:43:11 INFO SparkEnv: Registering MapOutputTracker
14/07/15 16:43:11 INFO SparkEnv: Registering BlockManagerMaster
14/07/15 16:43:11 INFO DiskBlockManager: Created local directory at 
/var/folders/nv/nsr_3ysj0wgfq93fqp0rdt3wgp/T/spark-local-20140715164311-e63c
14/07/15 16:43:11 INFO ConnectionManager: Bound socket to port 53204 with id = 
ConnectionManagerId(office-5-148.pa.gopivotal.com,53204)
14/07/15 16:43:11 INFO MemoryStore: MemoryStore started with capacity 2.1 GB
14/07/15 16:43:11 INFO BlockManagerMaster: Trying to register BlockManager
14/07/15 16:43:11 INFO BlockManagerMasterActor: Registering block manager 
office-5-148.pa.gopivotal.com:53204 with 2.1 GB RAM
14/07/15 16:43:11 INFO BlockManagerMaster: Registered BlockManager
14/07/15 16:43:11 INFO HttpServer: Starting HTTP Server
14/07/15 16:43:11 INFO HttpBroadcast: Broadcast server started at 
http://10.64.5.148:53205
14/07/15 16:43:11 INFO HttpFileServer: HTTP File server directory is 
/var/folders/nv/nsr_3ysj0wgfq93fqp0rdt3wgp/T/spark-b2fb12ae-4608-4833-87b6-b335da00738e
14/07/15 16:43:11 INFO HttpServer: Starting HTTP Server
14/07/15 16:43:12 INFO SparkUI: Started SparkUI at 
http://office-5-148.pa.gopivotal.com:4040
2014-07-15 16:43:12.210 java[39068:1903] Unable to load realm info from 
SCDynamicStore
14/07/15 16:43:12 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
14/07/15 16:43:12 INFO SparkContext: Added JAR 
examples/target/scala-2.10/spark-examples-1.1.0-SNAPSHOT-hadoop2.4.0.jar at 
http://10.64.5.148:53206/jars/spark-examples-1.1.0-SNAPSHOT-hadoop2.4.0.jar 
with timestamp 1405467792813
14/07/15 16:43:12 INFO AppClient$ClientActor: Connecting to master 
spark://henry-pivotal.local:7077...
14/07/15 16:43:12 INFO SparkContext: Starting job: reduce at 
SparkTachyonPi.scala:43
14/07/15 16:43:12 INFO DAGScheduler: Got job 0 (reduce at 
SparkTachyonPi.scala:43) with 2 output partitions (allowLocal=false)
14/07/15 16:43:12 INFO DAGScheduler: Final stage: Stage 0(reduce at 
SparkTachyonPi.scala:43)
14/07/15 16:43:12 INFO DAGScheduler: Parents of final stage: List()
14/07/15 16:43:12 INFO DAGScheduler: Missing parents: List()
14/07/15 16:43:12 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[1] at map at 
SparkTachyonPi.scala:39), which has no missing parents
14/07/15 16:43:13 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 
(MappedRDD[1] at map at SparkTachyonPi.scala:39)
14/07/15 16:43:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/07/15 16:43:13 INFO SparkDeploySchedulerBackend: Connected to Spark cluster 
with app ID app-20140715164313-
14/07/15 16:43:13 INFO AppClient$ClientActor: Executor added: 
app-20140715164313-/0 on 
worker-20140715164009-office-5-148.pa.gopivotal.com-52519 
(office-5-148.pa.gopivotal.com:52519) with 8 cores
14/07/15 16:43:13 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20140715164313-/0 on hostPort office-5-148.pa.gopivotal.com:52519 with 
8 cores, 512.0 MB RAM
14/07/15 16:43:13 INFO AppClient$ClientActor: Executor updated: 
app-20140715164313-/0 is now RUNNING
14/07/15 16:43:15 INFO SparkDeploySchedulerBackend: Registered executor: 
Actor[akka.tcp://sparkexecu...@office-5-148.pa.gopivotal.com:53213/user/Executor#-423405256]
 with ID 0
14/07/15 16:43:15 INFO TaskSetManager: Re-computing pending task lists.
14/07/15 16:43:15 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor 
0: office-5-148.pa.gopivotal.com (PROCESS_LOCAL)
14/07/15 16:43:15 INFO 

[jira] [Commented] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles

2014-09-15 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134160#comment-14134160
 ] 

Andrew Ash commented on SPARK-1239:
---

For large statuses, would we expect that to exceed {{spark.akka.frameSize}} and 
cause the below exception?

{noformat}
2014-09-14T01:34:21.305 ERROR [spark-akka.actor.default-dispatcher-4] 
org.apache.spark.MapOutputTrackerMasterActor - Map output statuses were 
13920119 bytes which exceeds spark.akka.frameSize (10485760 bytes).
{noformat}

 Don't fetch all map output statuses at each reducer during shuffles
 ---

 Key: SPARK-1239
 URL: https://issues.apache.org/jira/browse/SPARK-1239
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Affects Versions: 1.0.2, 1.1.0
Reporter: Patrick Wendell
Assignee: Kostas Sakellis

 Instead we should modify the way we fetch map output statuses to take both a 
 mapper and a reducer - or we should just piggyback the statuses on each task. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles

2014-09-15 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134198#comment-14134198
 ] 

Patrick Wendell commented on SPARK-1239:


Yes, the current state of the art is to just increase the frame size.
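
For reference, a minimal sketch of that workaround (assuming a Spark 1.x 
application, where the value is interpreted in MB):

{code}
// Hypothetical sketch: raise spark.akka.frameSize (in MB) so that large map
// output status messages fit inside a single Akka frame.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("large-shuffle-job")      // hypothetical app name
  .set("spark.akka.frameSize", "64")    // default is 10 (MB); raise as needed

val sc = new SparkContext(conf)
{code}

The same property can also be set with {{--conf}} on spark-submit or in 
spark-defaults.conf.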

 Don't fetch all map output statuses at each reducer during shuffles
 ---

 Key: SPARK-1239
 URL: https://issues.apache.org/jira/browse/SPARK-1239
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Affects Versions: 1.0.2, 1.1.0
Reporter: Patrick Wendell
Assignee: Kostas Sakellis

 Instead we should modify the way we fetch map output statuses to take both a 
 mapper and a reducer - or we should just piggyback the statuses on each task. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3425) OpenJDK - when run with jvm 1.8, should not set MaxPermSize

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3425.

Resolution: Fixed

Fixed by:
https://github.com/apache/spark/pull/2301
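
Per the issue title, the idea is to skip the flag on Java 8 and later.  A 
hypothetical sketch of such a version check (the actual change is in the 
linked PR):

{code}
// Hypothetical sketch: only pass -XX:MaxPermSize to JVMs older than 1.8,
// where the permanent generation still exists.
val spec = sys.props.getOrElse("java.specification.version", "1.7")  // e.g. "1.7", "1.8"
val permGenOpt =
  if (spec.toDouble < 1.8) Seq("-XX:MaxPermSize=128m") else Seq.empty[String]
{code}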

 OpenJDK - when run with jvm 1.8, should not set MaxPermSize
 ---

 Key: SPARK-3425
 URL: https://issues.apache.org/jira/browse/SPARK-3425
 Project: Spark
  Issue Type: Improvement
Reporter: Matthew Farrellee
Assignee: Matthew Farrellee
Priority: Minor
 Fix For: 1.2.0


 In JVM 1.8.0, MaxPermSize is no longer supported.
 In spark stderr output, there would be a line of
 Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; 
 support was removed in 8.0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2377) Create a Python API for Spark Streaming

2014-09-15 Thread Jyotiska NK (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134210#comment-14134210
 ] 

Jyotiska NK edited comment on SPARK-2377 at 9/15/14 6:00 PM:
-

I have been watching the work going on in PR #11 for a while. Is there any way 
to contribute to this?


was (Author: jyotiska):
I have been watching the work going on PR #11. Is there any way to contribute 
to this?

 Create a Python API for Spark Streaming
 ---

 Key: SPARK-2377
 URL: https://issues.apache.org/jira/browse/SPARK-2377
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, Streaming
Reporter: Nicholas Chammas
Assignee: Kenichi Takagiwa

 [Spark 
 Streaming|http://spark.apache.org/docs/latest/streaming-programming-guide.html]
  currently offers APIs in Scala and Java. It would be great feature add to 
 have a Python API as well.
 This is probably a large task that will span many issues if undertaken. This 
 ticket should provide some place to track overall progress towards an initial 
 Python API for Spark Streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2377) Create a Python API for Spark Streaming

2014-09-15 Thread Jyotiska NK (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134210#comment-14134210
 ] 

Jyotiska NK commented on SPARK-2377:


I have been watching the work going on PR #11. Is there any way to contribute 
to this?

 Create a Python API for Spark Streaming
 ---

 Key: SPARK-2377
 URL: https://issues.apache.org/jira/browse/SPARK-2377
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, Streaming
Reporter: Nicholas Chammas
Assignee: Kenichi Takagiwa

 [Spark 
 Streaming|http://spark.apache.org/docs/latest/streaming-programming-guide.html]
  currently offers APIs in Scala and Java. It would be great feature add to 
 have a Python API as well.
 This is probably a large task that will span many issues if undertaken. This 
 ticket should provide some place to track overall progress towards an initial 
 Python API for Spark Streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2377) Create a Python API for Spark Streaming

2014-09-15 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134225#comment-14134225
 ] 

Matthew Farrellee commented on SPARK-2377:
--

It's a little tricky: you need to clone tdas's or giwa's repository, make 
changes on its master branch (it's far from the current Spark master), and 
submit pull requests to giwa or tdas.

IMHO, it'd be much simpler if the PR were tagged [WIP] and directed toward the 
apache/spark repo! (Please!)

 Create a Python API for Spark Streaming
 ---

 Key: SPARK-2377
 URL: https://issues.apache.org/jira/browse/SPARK-2377
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, Streaming
Reporter: Nicholas Chammas
Assignee: Kenichi Takagiwa

 [Spark 
 Streaming|http://spark.apache.org/docs/latest/streaming-programming-guide.html]
  currently offers APIs in Scala and Java. It would be great feature add to 
 have a Python API as well.
 This is probably a large task that will span many issues if undertaken. This 
 ticket should provide some place to track overall progress towards an initial 
 Python API for Spark Streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134246#comment-14134246
 ] 

Derrick Burns commented on SPARK-2308:
--

I have implemented MiniBatch KMeans in Spark.  

We do not need a special iterator type or random access to get the advantages 
of MiniBatch because the gain comes primarily from decreasing the number of 
distance calculations, and not from decreasing the number of points that are 
touched.  

MiniBatch is good if the number of points is dramatically larger than the 
number of clusters.  In that case, any sampling of points will impact a large 
number of clusters, leading to faster convergence. MiniBatch is less useful 
when the number of desired clusters is large.
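
For context, a rough sketch of one mini-batch round as described in this 
ticket (hypothetical code, assuming squared Euclidean distance and an existing 
RDD of points; a full implementation would typically update centers with a 
per-center learning rate rather than a plain mean):

{code}
// Hypothetical sketch: one mini-batch k-means round. Sample a fraction of the
// points, assign each sampled point to its nearest center, then replace each
// center with the mean of the sampled points assigned to it.
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object MiniBatchSketch extends Serializable {

  def squaredDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def miniBatchStep(points: RDD[Array[Double]],
                    centers: Array[Array[Double]],
                    fraction: Double,
                    seed: Long): Array[Array[Double]] = {
    val sums = points.sample(withReplacement = false, fraction, seed)
      .map(p => (centers.indices.minBy(i => squaredDist(p, centers(i))), (p, 1L)))
      .reduceByKey { case ((s1, n1), (s2, n2)) =>
        (s1.zip(s2).map { case (x, y) => x + y }, n1 + n2)
      }
      .collectAsMap()

    centers.indices.map { i =>
      sums.get(i) match {
        case Some((sum, n)) => sum.map(_ / n)  // mean of this center's mini-batch
        case None           => centers(i)      // no sampled point hit this cluster
      }
    }.toArray
  }
}
{code}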

A better approach is to track which clusters are dirty and which points are 
assigned to which clusters. Using this information, one can eliminate more and 
more distance calculations per round.  This leads to shorter and shorter 
rounds, and consequently faster convergence. 


 Add KMeans MiniBatch clustering algorithm to MLlib
 --

 Key: SPARK-2308
 URL: https://issues.apache.org/jira/browse/SPARK-2308
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Assignee: RJ Nowling
Priority: Minor
 Attachments: many_small_centers.pdf, uneven_centers.pdf


 Mini-batch is a version of KMeans that uses a randomly-sampled subset of the 
 data points in each iteration instead of the full set of data points, 
 improving performance (and in some cases, accuracy).  The mini-batch version 
 is compatible with the KMeans|| initialization algorithm currently 
 implemented in MLlib.
 I suggest adding KMeans Mini-batch as an alternative.
 I'd like this to be assigned to me.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib

2014-09-15 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134264#comment-14134264
 ] 

RJ Nowling commented on SPARK-2308:
---

It is true that we will save on the distance calculations for high dimensional 
data sets.  There is also work under way to improve sampling in Spark, so this 
will also benefit further from that.

Are you planning on creating a PR for your implementation?  It would be 
valuable for the community.  I closed mine due to the sampling issues.  But I'd 
be happy to review and test yours.

 Add KMeans MiniBatch clustering algorithm to MLlib
 --

 Key: SPARK-2308
 URL: https://issues.apache.org/jira/browse/SPARK-2308
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Assignee: RJ Nowling
Priority: Minor
 Attachments: many_small_centers.pdf, uneven_centers.pdf


 Mini-batch is a version of KMeans that uses a randomly-sampled subset of the 
 data points in each iteration instead of the full set of data points, 
 improving performance (and in some cases, accuracy).  The mini-batch version 
 is compatible with the KMeans|| initialization algorithm currently 
 implemented in MLlib.
 I suggest adding KMeans Mini-batch as an alternative.
 I'd like this to be assigned to me.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3534) Avoid running MLlib and Streaming tests when testing SQL PRs

2014-09-15 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-3534:
---

 Summary: Avoid running MLlib and Streaming tests when testing SQL 
PRs
 Key: SPARK-3534
 URL: https://issues.apache.org/jira/browse/SPARK-3534
 Project: Spark
  Issue Type: Bug
  Components: Project Infra, SQL
Reporter: Michael Armbrust
Priority: Blocker


We are bumping up against the 120 minute time limit for tests pretty regularly 
now.  Since we have decreased the number of shuffle partitions and upped the 
parallelism, I don't think there is much low hanging fruit to speed up the SQL 
tests. (The tests that are listed as taking 2-3 minutes are actually 100s of 
tests that I think are valuable).  Instead I propose we avoid running tests 
that we don't need to.

This will have the added benefit of eliminating failures in SQL due to flaky 
streaming tests.

Note that this won't fix the full builds that are run for every commit.  There 
I think we should just up the test timeout.

cc: [~joshrosen] [~pwendell]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3534) Avoid running MLlib and Streaming tests when testing SQL PRs

2014-09-15 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134290#comment-14134290
 ] 

Josh Rosen commented on SPARK-3534:
---

Looks like this has been proposed before: SPARK-1455

 Avoid running MLlib and Streaming tests when testing SQL PRs
 

 Key: SPARK-3534
 URL: https://issues.apache.org/jira/browse/SPARK-3534
 Project: Spark
  Issue Type: Bug
  Components: Project Infra, SQL
Reporter: Michael Armbrust
Priority: Blocker

 We are bumping up against the 120 minute time limit for tests pretty 
 regularly now.  Since we have decreased the number of shuffle partitions and 
 up-ed the parallelism I don't think there is much low hanging fruit to speed 
 up the SQL tests. (The tests that are listed as taking 2-3 minutes are 
 actually 100s of tests that I think are valuable).  Instead I propose we 
 avoid running tests that we don't need to.
 This will have the added benefit of eliminating failures in SQL due to flaky 
 streaming tests.
 Note that this won't fix the full builds that are run for every commit.  
 There I think we just just up the test timeout.
 cc: [~joshrosen] [~pwendell]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134291#comment-14134291
 ] 

Derrick Burns commented on SPARK-2308:
--

I have submitted several issues regarding the Spark K-Means clustering
implementation.  I would like to contribute my new K-Means implementation
to Spark. However, my implementation is a re-write, and not simply a
trivial pull-request.

Why did I re-write it?

The initial reason for rewriting the code is that the 1.0.2 (and 1.1.0)
implementation does not support the distance function that I wanted.  I use
the KL-Divergence measure, whereas yours uses the squared Euclidean
distance.  Unfortunately, the Euclidean metric is so deeply ingrained in
the 1.0.2 implementation that a major re-write was required.

My implementation has a pluggable distance function.  I wrote an
implementation of the squared Euclidean distance function (actually I wrote
two versions), and an implementation of KL-divergence. With this
abstraction, the core K-Means algorithm knows nothing about the distance
function.

The second reason is that I noticed that the 1.0.2 implementation is
rather slow because it recomputes all distances to all points on all
rounds.  Further, in my experiments on millions of points and thousands of
clusters with 100s of dimensions, the K-Means algorithm is CPU bound with
very little memory being used.

I observed that we can eliminate the vast majority of distance calculations
by maintaining, for each point, its closest cluster, the distance to that
cluster, and the round in which either of those values last changed, and, for
each cluster, the round in which it last moved. This is a modest amount of
additional space, but it results in a dramatic improvement in running time.
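
A rough sketch of that bookkeeping (hypothetical names, and only the 
reassignment step for a single point):

{code}
// Hypothetical sketch of the per-point bookkeeping described above.
// lastMoved(c) is the round in which center c last moved; a.round is the round
// in which this point's cached assignment was last refreshed.
case class Assignment(cluster: Int, dist: Double, round: Int)

def reassign(p: Array[Double],
             a: Assignment,
             centers: Array[Array[Double]],
             lastMoved: Array[Int],
             round: Int,
             distance: (Array[Double], Array[Double]) => Double): Assignment = {
  val curMoved = lastMoved(a.cluster) > a.round
  val curDist  = if (curMoved) distance(p, centers(a.cluster)) else a.dist
  // If the assigned center moved away from the point, any center may now be
  // closer, so scan them all; otherwise only centers that have moved since the
  // last refresh can possibly beat the cached distance.
  val scanAll  = curMoved && curDist > a.dist
  var best     = curDist
  var bestIdx  = a.cluster
  var i = 0
  while (i < centers.length) {
    if (i != a.cluster && (scanAll || lastMoved(i) > a.round)) {
      val d = distance(p, centers(i))
      if (d < best) { best = d; bestIdx = i }
    }
    i += 1
  }
  Assignment(bestIdx, best, round)
}
{code}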

I have tested my new implementation extensively on EC2 using  up to 16
c3.8xlarge worker machines.  The application is in financial services, so I
need the results to be right. :)

I call my implementation Tracking K-Means.  If you would entertain
including my implementation in a future Spark release, please let me know.




 Add KMeans MiniBatch clustering algorithm to MLlib
 --

 Key: SPARK-2308
 URL: https://issues.apache.org/jira/browse/SPARK-2308
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Assignee: RJ Nowling
Priority: Minor
 Attachments: many_small_centers.pdf, uneven_centers.pdf


 Mini-batch is a version of KMeans that uses a randomly-sampled subset of the 
 data points in each iteration instead of the full set of data points, 
 improving performance (and in some cases, accuracy).  The mini-batch version 
 is compatible with the KMeans|| initialization algorithm currently 
 implemented in MLlib.
 I suggest adding KMeans Mini-batch as an alternative.
 I'd like this to be assigned to me.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds

2014-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-1517:
--
Fix Version/s: (was: 1.1.0)
   1.2.0

We should revisit this for the 1.2.0 release cycle, since this would have 
solved the issue that we ran into with bumping the SNAPSHOT version before the 
1.1 artifacts were published on Maven.

 Publish nightly snapshots of documentation, maven artifacts, and binary builds
 --

 Key: SPARK-1517
 URL: https://issues.apache.org/jira/browse/SPARK-1517
 Project: Spark
  Issue Type: Improvement
  Components: Build, Project Infra
Reporter: Patrick Wendell
 Fix For: 1.2.0


 Should be pretty easy to do with Jenkins. The only thing I can think of that 
 would be tricky is to set up credentials so that jenkins can publish this 
 stuff somewhere on apache infra.
 Ideally we don't want to have to put a private key on every jenkins box 
 (since they are otherwise pretty stateless). One idea is to encrypt these 
 credentials with a passphrase and post them somewhere publicly visible. Then 
 the jenkins build can download the credentials provided we set a passphrase 
 in an environment variable in jenkins. There may be simpler solutions as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3104) Jenkins failing to test some PRs when asked to

2014-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3104.
---
Resolution: Cannot Reproduce

Resolving this as cannot reproduce for now, since Jenkins seems to have 
become responsive again.  Feel free to re-open if it begins experiencing 
issues.  We have an alternative to the pull request builder plugin ready to 
deploy if it breaks again.

 Jenkins failing to test some PRs when asked to
 --

 Key: SPARK-3104
 URL: https://issues.apache.org/jira/browse/SPARK-3104
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Reporter: Erik Erlandson

 I've seen a few PRs where Jenkins does not appear to be testing when 
 requested:
 https://github.com/apache/spark/pull/1964
 https://github.com/apache/spark/pull/1254
 https://github.com/apache/spark/pull/1839
 Maybe the Jenkins logs have a record of what's going wrong?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib

2014-09-15 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134301#comment-14134301
 ] 

RJ Nowling commented on SPARK-2308:
---

I'm not a committer but [~mengxr] is.  That said, I'm very happy to help in any 
way I can.

The issue of different distance metrics has come up on the mailing list -- a 
much-requested feature.  If you provide it as a PR, maybe others who are more 
familiar with the work to add additional distance metrics can comment and we, 
as a community, can move forward to get it included.

 Add KMeans MiniBatch clustering algorithm to MLlib
 --

 Key: SPARK-2308
 URL: https://issues.apache.org/jira/browse/SPARK-2308
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Assignee: RJ Nowling
Priority: Minor
 Attachments: many_small_centers.pdf, uneven_centers.pdf


 Mini-batch is a version of KMeans that uses a randomly-sampled subset of the 
 data points in each iteration instead of the full set of data points, 
 improving performance (and in some cases, accuracy).  The mini-batch version 
 is compatible with the KMeans|| initialization algorithm currently 
 implemented in MLlib.
 I suggest adding KMeans Mini-batch as an alternative.
 I'd like this to be assigned to me.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2232) Fix Jenkins tests in Maven

2014-09-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-2232.
---
Resolution: Fixed

This has been fixed; the Maven builds have now been green for a few days.

 Fix Jenkins tests in Maven
 --

 Key: SPARK-2232
 URL: https://issues.apache.org/jira/browse/SPARK-2232
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Reporter: Patrick Wendell
Priority: Blocker

 It appears Maven tests are failing under the newer Hadoop configurations. We 
 need to go through and make sure all the Spark master build configurations 
 are passing.
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3533) Add saveAsTextFileByKey() method to RDDs

2014-09-15 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-3533:

Affects Version/s: 1.1.0

 Add saveAsTextFileByKey() method to RDDs
 

 Key: SPARK-3533
 URL: https://issues.apache.org/jira/browse/SPARK-3533
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
Reporter: Nicholas Chammas

 Users often have a single RDD of key-value pairs that they want to save to 
 multiple locations based on the keys.
 For example, say I have an RDD like this:
 {code}
 >>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 
 'Frankie']).keyBy(lambda x: x[0])
 >>> a.collect()
 [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
 >>> a.keys().distinct().collect()
 ['B', 'F', 'N']
 {code}
 Now I want to write the RDD out to different paths depending on the keys, so 
 that I have one output directory per distinct key. Each output directory 
 could potentially have multiple {{part-}} files, one per RDD partition.
 So the output would look something like:
 {code}
 /path/prefix/B [/part-1, /part-2, etc]
 /path/prefix/F [/part-1, /part-2, etc]
 /path/prefix/N [/part-1, /part-2, etc]
 {code}
 Though it may be possible to do this with some combination of 
 {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the 
 {{MultipleTextOutputFormat}} output format class, it isn't straightforward. 
 It's not clear if it's even possible at all in PySpark.
 Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs 
 that makes it easy to save RDDs out to multiple locations at once.
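
For reference, the {{MultipleTextOutputFormat}} route mentioned above can be 
sketched roughly as follows (hypothetical, assuming an existing SparkContext 
{{sc}}; the RDDMultipleTextOutputFormat class name is made up):

{code}
// Hypothetical sketch of the saveAsHadoopFile + MultipleTextOutputFormat
// approach: route each record into a subdirectory named after its key.
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.SparkContext._

class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  // Drop the key from the output line; it is already encoded in the path.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()

  // e.g. key "N" and partition file "part-00000" become "N/part-00000"
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[String] + "/" + name
}

val pairs = sc.parallelize(Seq("Nick", "Nancy", "Bob", "Ben", "Frankie"))
  .keyBy(_.substring(0, 1))

pairs.saveAsHadoopFile("/path/prefix", classOf[String], classOf[String],
  classOf[RDDMultipleTextOutputFormat])
{code}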



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134339#comment-14134339
 ] 

Derrick Burns commented on SPARK-3219:
--

The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Point (P), Center (C), and 
Centroid.  Then, one
can implement a distance function Trait (called PointOps below) in a way 
that allows the implementer to pre-compute values for Point and Center, such as 
is hard-coded for the fast squared Euclidean distance function in the 1.0.2 
K-Means implementation.  Since the representation of Point and Center is 
abstracted, the implementer of the trait can use JBlas, Breeze, or whatever 
math library is preferred, again, without touching the generic K-Means 
implementation.

  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }

 K-Means clusterer should support Bregman distance functions
 ---

 Key: SPARK-3219
 URL: https://issues.apache.org/jira/browse/SPARK-3219
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns
Assignee: Derrick Burns

 The K-Means clusterer supports the Euclidean distance metric.  However, it is 
 rather straightforward to support Bregman 
 (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
 distance functions which would increase the utility of the clusterer 
 tremendously.
 I have modified the clusterer to support pluggable distance functions.  
 However, I notice that there are hundreds of outstanding pull requests.  If 
 someone is willing to work with me to sponsor the work through the process, I 
 will create a pull request.  Otherwise, I will just keep my own fork.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134339#comment-14134339
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:14 PM:
---

The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Point (P), Center (C), and 
Centroid.  Then, one
can implement a distance function Trait (called PointOps below) in a way 
that allows the implementer to pre-compute values for Point and Center, such as 
is hard-coded for the fast squared Euclidean distance function in the 1.0.2 
K-Means implementation.  Since the representation of Point and Center is 
abstracted, the implementer of the trait can use JBlas, Breeze, or whatever 
math library is preferred, again, without touching the generic K-Means 
implementation.

{code}
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}


was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Point (P), Center (C), and 
Centroid.  Then, one
can implement a distance function Trait (called PointOps below) in a way 
that allows the implementer to pre-compute values for Point and Center, such as 
is hard-coded for the fast squared Euclidean distance function in the 1.0.2 
K-Means implementation.  Since the representation of Point and Center is 
abstracted, the implementer of the trait can use JBlas, Breeze, or whatever 
math library is preferred, again, without touching the generic K-Means 
implementation.

  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }

 K-Means clusterer should support Bregman distance functions
 ---

 Key: SPARK-3219
 URL: https://issues.apache.org/jira/browse/SPARK-3219
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns
Assignee: Derrick Burns

 The K-Means clusterer supports the Euclidean distance metric.  However, it is 
 rather straightforward to support Bregman 
 (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
 distance functions which would increase the utility of the clusterer 
 tremendously.
 I have modified the clusterer to support pluggable distance functions.  
 However, I notice that there are hundreds of outstanding pull requests.  If 
 someone is willing to work with me to sponsor the work through the process, I 
 will create a pull request.  Otherwise, I will just keep my own fork.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134339#comment-14134339
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:16 PM:
---

The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Point (P), Center (C), and 
Centroid.  Then, one
can implement a distance function Trait (called PointOps below) in a way 
that allows the implementer to pre-compute values for Point and Center, such as 
is hard-coded for the fast squared Euclidean distance function in the 1.0.2 
K-Means implementation.  Since the representation of Point and Center is 
abstracted, the implementer of the trait can use JBlas, Breeze, or whatever 
math library is preferred, again, without touching the generic K-Means 
implementation. 

Additionally, one can abstract the Distance (Float or Double) and the user data 
point T. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}


was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Point (P), Center (C), and 
Centroid.  Then, one
can implement a distance function Trait (called PointOps below) in a way 
that allows the implementer to pre-compute values for Point and Center, such as 
is hard-coded for the fast squared Euclidean distance function in the 1.0.2 
K-Means implementation.  Since the representation of Point and Center is 
abstracted, the implementer of the trait can use JBlas, Breeze, or whatever 
math library is preferred, again, without touching the generic K-Means 
implementation. 

Additionally, one can abstract the Distance (Float or Double) and the user data 
point T. 

{code}

type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}

 K-Means clusterer should support Bregman distance functions
 ---

 Key: SPARK-3219
 URL: https://issues.apache.org/jira/browse/SPARK-3219
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns
Assignee: Derrick Burns

 The K-Means clusterer supports the Euclidean distance metric.  However, it is 
 rather straightforward to support Bregman 
 (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
 distance functions which would increase the utility of the clusterer 
 tremendously.
 I have modified the clusterer to support pluggable distance functions.  
 However, I notice that there are hundreds of outstanding pull requests.  If 
 someone is willing to work with me to sponsor the work through the process, I 
 will create a pull request.  Otherwise, I will just keep my own fork.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134339#comment-14134339
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:16 PM:
---

The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Point (P), Center (C), and 
Centroid.  Then, one
can implement a distance function Trait (called PointOps below) in a way 
that allows the implementer to pre-compute values for Point and Center, such as 
is hard-coded for the fast squared Euclidean distance function in the 1.0.2 
K-Means implementation.  Since the representation of Point and Center is 
abstracted, the implementer of the trait can use JBlas, Breeze, or whatever 
math library is preferred, again, without touching the generic K-Means 
implementation. 

Additionally, one can abstract the Distance (Float or Double) and the user data 
point T. 

{code}

type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}


was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Point (P), Center (C), and 
Centroid.  Then, one
can implement a distance function Trait (called PointOps below) in a way 
that allows the implementer to pre-compute values for Point and Center, such as 
is hard-coded for the fast squared Euclidean distance function in the 1.0.2 
K-Means implementation.  Since the representation of Point and Center is 
abstracted, the implementer of the trait can use JBlas, Breeze, or whatever 
math library is preferred, again, without touching the generic K-Means 
implementation. 

Additionally, one can abstract the Distance (Float or Double) and the user data 
point T. 

{code}
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}

 K-Means clusterer should support Bregman distance functions
 ---

 Key: SPARK-3219
 URL: https://issues.apache.org/jira/browse/SPARK-3219
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns
Assignee: Derrick Burns

 The K-Means clusterer supports the Euclidean distance metric.  However, it is 
 rather straightforward to support Bregman 
 (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
 distance functions which would increase the utility of the clusterer 
 tremendously.
 I have modified the clusterer to support pluggable distance functions.  
 However, I notice that there are hundreds of outstanding pull requests.  If 
 someone is willing to work with me to sponsor the work through the process, I 
 will create a pull request.  Otherwise, I will just keep my own fork.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134339#comment-14134339
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:15 PM:
---

The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Point (P), Center (C), and 
Centroid.  Then, one
can implement a distance function Trait (called PointOps below) in a way 
that allows the implementer to pre-compute values for Point and Center, such as 
is hard-coded for the fast squared Euclidean distance function in the 1.0.2 
K-Means implementation.  Since the representation of Point and Center is 
abstracted, the implementer of the trait can use JBlas, Breeze, or whatever 
math library is preferred, again, without touching the generic K-Means 
implementation. 

Additionally, one can abstract the Distance (Float or Double) and the user data 
point T. 

{code}
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}


was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Point (P), Center (C), and 
Centroid.  Then, one
can implement a distance function Trait (called PointOps below) in a way 
that allows the implementer to pre-compute values for Point and Center, such as 
is hard-coded for the fast squared Euclidean distance function in the 1.0.2 
K-Means implementation.  Since the representation of Point and Center is 
abstracted, the implementer of the trait can use JBlas, Breeze, or whatever 
math library is preferred, again, without touching the generic K-Means 
implementation.

{code}
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}

 K-Means clusterer should support Bregman distance functions
 ---

 Key: SPARK-3219
 URL: https://issues.apache.org/jira/browse/SPARK-3219
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns
Assignee: Derrick Burns

 The K-Means clusterer supports the Euclidean distance metric.  However, it is 
 rather straightforward to support Bregman 
 (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
 distance functions which would increase the utility of the clusterer 
 tremendously.
 I have modified the clusterer to support pluggable distance functions.  
 However, I notice that there are hundreds of outstanding pull requests.  If 
 someone is willing to work with me to sponsor the work through the process, I 
 will create a pull request.  Otherwise, I will just keep my own fork.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134339#comment-14134339
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:23 PM:
---

The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (e.g. Float or Double), T 
(the input data type of a point), P (a point as represented by the distance 
function), C (a cluster center as represented by the distance function), and 
Centroid.  

By separating the user type T from the types P (point) and C (center), one can 
do things like pre-compute values as is done with the Fast Euclidean distance 
which pre-computes magnitudes.  (With more complex distance functions such as 
the Kullback-Leibler function, one can pre-compute the log of the points.)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}
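As a rough illustration of the pre-computation this enables (a sketch only, with made-up names, not the actual implementation): a point or center representation can cache its squared norm so that the squared Euclidean distance reduces to a single dot product.
{code}
object FastEuclideanSketch {
  type Distance = Double

  // The squared norm is computed once at construction time and reused in every distance call.
  case class FastPoint(raw: Array[Distance], weight: Distance) {
    val normSquared: Distance = raw.map(x => x * x).sum
  }
  case class FastCenter(raw: Array[Distance], weight: Distance) {
    val normSquared: Distance = raw.map(x => x * x).sum
  }

  // ||p - c||^2 = ||p||^2 + ||c||^2 - 2 * (p dot c); the upperBound parameter could be
  // used for early termination, but this sketch simply ignores it.
  def distance(p: FastPoint, c: FastCenter, upperBound: Distance): Distance = {
    var dot = 0.0
    var i = 0
    while (i < p.raw.length) { dot += p.raw(i) * c.raw(i); i += 1 }
    math.max(0.0, p.normSquared + c.normSquared - 2.0 * dot)
  }
}
{code}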


was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (e.g. Float or Double), T 
(the input data type of a point), P (a point as represented by the distance 
function), C (a cluster center as represented by the distance function), and 
Centroid.  

By separating the user type T from the types P (point) and C (center), one can 
do things like pre-compute values as is done with the Fast Euclidean distance 
which pre-computes magnitudes.  (With more complex distance functions such as 
the Kullback-Leibler function, one can pre-compute the log of the points.)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}

 K-Means clusterer should support Bregman distance functions
 ---

 Key: SPARK-3219
 URL: https://issues.apache.org/jira/browse/SPARK-3219
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns
Assignee: Derrick Burns

 The K-Means clusterer supports the Euclidean distance metric.  However, it is 
 rather straightforward to support Bregman 
 (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
 distance functions which would increase the utility of the clusterer 
 tremendously.
 I have modified the clusterer to support pluggable distance functions.  
 However, I notice that there are hundreds of outstanding pull requests.  If 
 someone is willing to work with me to sponsor the work through the process, I 
 will create a pull request.  Otherwise, I will just keep my own fork.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134339#comment-14134339
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:23 PM:
---

The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (e.g. Float or Double), T 
(the input data type of a point), P (a point as represented by the distance 
function), C (a cluster center as represented by the distance function), and 
Centroid.  

By separating the user type T from the types P (point) and C (center), one can 
do things like pre-compute values as is done with the Fast Euclidean distance 
which pre-computes magnitudes.  (With more complex distance functions such as 
the Kullback-Leibler function, one can pre-compute the log of the points.)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}


was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Point (P), Center (C), and 
Centroid.  Then, one
can implement a distance function trait (called PointOps below) in a way 
that allows the implementer to pre-compute values for Point and Center, as 
is hard-coded for the fast squared Euclidean distance function in the 1.0.2 
K-Means implementation.  Since the representation of Point and Center is 
abstracted, the implementer of the trait can use JBlas, Breeze, or whatever 
math library is preferred, again, without touching the generic K-Means 
implementation. 

Additionally, one can abstract the Distance (Float or Double) and the user data 
point T. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}

 K-Means clusterer should support Bregman distance functions
 ---

 Key: SPARK-3219
 URL: https://issues.apache.org/jira/browse/SPARK-3219
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns
Assignee: Derrick Burns

 The K-Means clusterer supports the Euclidean distance metric.  However, it is 
 rather straightforward to support Bregman 
 (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
 distance functions which would increase the utility of the clusterer 
 tremendously.
 I have modified the clusterer to support pluggable distance functions.  
 However, I notice that there are hundreds of outstanding pull requests.  If 
 someone is willing to work with me to sponsor the work through the process, I 
 will create a pull request.  Otherwise, I will just keep my own fork.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134339#comment-14134339
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:23 PM:
---

The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (e.g. Float or Double), T 
(the input data type of a point), P (a point as represented by the distance 
function), C (a cluster center as represented by the distance function), and 
Centroid.  

By separating the user type T from the types P (point) and C (center), one can 
do things like pre-compute values (as is done with the Fast Euclidean distance 
in the 1.0.2 implementation that pre-computes magnitudes).  (With more complex 
distance functions such as the Kullback-Leibler function, one can pre-compute 
the log of the points.)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}


was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (e.g. Float or Double), T 
(the input data type of a point), P (a point as represented by the distance 
function), C (a cluster center as represented by the distance function), and 
Centroid.  

By separating the user type T from the types P (point) and C (center), one can 
do things like pre-compute values as is done with the Fast Euclidean distance 
which pre-computes magnitudes.  (With more complex distance functions such as 
the Kullback-Leibler function, one can pre-compute the log of the points.)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}

 K-Means clusterer should support Bregman distance functions
 ---

 Key: SPARK-3219
 URL: https://issues.apache.org/jira/browse/SPARK-3219
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns
Assignee: Derrick Burns

 The K-Means clusterer supports the Euclidean distance metric.  However, it is 
 rather straightforward to support Bregman 
 (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
 distance functions which would increase the utility of the clusterer 
 tremendously.
 I have modified the clusterer to support pluggable distance functions.  
 However, I notice that there are hundreds of outstanding pull requests.  If 
 someone is willing to work with me to sponsor the work through the process, I 
 will create a pull request.  Otherwise, I will just keep my own fork.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134339#comment-14134339
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:24 PM:
---

The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (e.g. Float or Double), T 
(the input data type of a point), P (a point as represented by the distance 
function), C (a cluster center as represented by the distance function), and 
Centroid.  

By separating the user type T from the types P (point) and C (center), one can 
do things like pre-compute values (as is done with the Fast Euclidean distance 
in the 1.0.2 implementation that pre-computes magnitudes).  (With more complex 
distance functions such as the Kullback-Leibler function, one can pre-compute 
the log of the points, which is too expensive to re-compute in the distance 
calculation!)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}


was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (e.g. Float or Double), T 
(the input data type of a point), P (a point as represented by the distance 
function), C (a cluster center as represented by the distance function), and 
Centroid.  

By separating the user type T from the types P (point) and C (center), one can 
do things like pre-compute values (as is done with the Fast Euclidean distance 
in the 1.0.2 implementation that pre-computes magnitudes).  (With more complex 
distance functions such as the Kullback-Leibler function, one can pre-compute 
the log of the points.)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}

 K-Means clusterer should support Bregman distance functions
 ---

 Key: SPARK-3219
 URL: https://issues.apache.org/jira/browse/SPARK-3219
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns
Assignee: Derrick Burns

 The K-Means clusterer supports the Euclidean distance metric.  However, it is 
 rather straightforward to support Bregman 
 (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
 distance functions which would increase the utility of the clusterer 
 tremendously.
 I have modified the clusterer to support pluggable distance functions.  
 However, I notice that there are hundreds of outstanding pull requests.  If 
 someone is willing to work with me to sponsor the work through the process, I 
 will create a pull request.  Otherwise, I will just keep my own fork.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134339#comment-14134339
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:26 PM:
---

The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (the type used to 
represent distance, such as Float or Double), T (the data type used for a point 
by the K-Means client), P (the data type used for a point by the distance 
function), C (the data type used for a cluster center by the distance 
function), and Centroid.  

By separating the user type T from the types P (point) and C (center), one can 
do things like pre-compute values (as is done with the Fast Euclidean distance 
in the 1.0.2 implementation that pre-computes magnitudes).  (With more complex 
distance functions such as the Kullback-Leibler function, one can pre-compute 
the log of the points, which is too expensive to re-compute in the distance 
calculation!)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}


was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (e.g. Float or Double), T 
(the input data type of a point), P (a point as represented by the distance 
function), C (a cluster center as represented by the distance function), and 
Centroid.  

By separating the user type T from the types P (point) and C (center), one can 
do things like pre-compute values (as is done with the Fast Euclidean distance 
in the 1.0.2 implementation that pre-computes magnitudes).  (With more complex 
distance functions such as the Kullback-Leibler function, one can pre-compute 
the log of the points, which is too expensive to re-compute in the distance 
calculation!)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}

 K-Means clusterer should support Bregman distance functions
 ---

 Key: SPARK-3219
 URL: https://issues.apache.org/jira/browse/SPARK-3219
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns
Assignee: Derrick Burns

 The K-Means clusterer supports the Euclidean distance metric.  However, it is 
 rather straightforward to support Bregman 
 (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
 distance functions which would increase the utility of the clusterer 
 tremendously.
 I have modified the clusterer to support pluggable distance functions.  
 However, I notice that there are hundreds of outstanding pull requests.  If 
 someone is willing to work with me to sponsor the work through the process, I 
 will create a pull request.  Otherwise, I will just keep my own fork.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134339#comment-14134339
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:26 PM:
---

The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (the type used to 
represent distance, such as Float or Double), T (the data type used for a point 
by the K-Means client), P (the data type used for a point by the distance 
function), C (the data type used for a cluster center by the distance 
function), and Centroid.  

By separating the user type T from the types P (point) and C (center), one can 
do things like pre-compute values (as is done with the Fast Euclidean distance 
in the 1.0.2 implementation that pre-computes magnitudes).  (With more complex 
distance functions such as the Kullback-Leibler function, one can pre-compute 
the log of the points, which is too expensive to re-compute in the distance 
calculation!)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }

  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}


was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to 
support interesting distance functions are: Distance (the type used to 
represent distance, such as Float or Double), T (the data type used for a point 
by the K-Means client), P (the data type used for a point by the distance 
function), C (the data type used for a cluster center by the distance 
function), and Centroid.  

By separating the user type T from the types P (point) and C (center), one can 
do things like pre-compute values (as is done with the Fast Euclidean distance 
in the 1.0.2 implementation that pre-computes magnitudes).  (With more complex 
distance functions such as the Kullback-Leibler function, one can pre-compute 
the log of the points, which is too expensive to re-compute in the distance 
calculation!)

Further, since the representation of point and center is abstracted, the 
implementer of the trait can use JBlas, Breeze, or whatever math library is 
preferred, again, without touching the generic K-Means implementation. 

{code}

  type Distance = Double

  trait FP[T] extends Serializable {
val weight: Distance
val index: Option[T]
val raw : Array[Distance]
  }
  trait PointOps[P <: FP[T], C <: FP[T], T] {
def distance(p: P, c: C, upperBound: Distance): Distance

def userToPoint(v: Array[Double], index: Option[T]): P

def centerToPoint(v: C): P

def pointToCenter(v: P): C

def centroidToCenter(v: Centroid): C

def centroidToPoint(v: Centroid): P

def centerMoved(v: P, w: C): Boolean

  }
{code}

 K-Means clusterer should support Bregman distance functions
 ---

 Key: SPARK-3219
 URL: https://issues.apache.org/jira/browse/SPARK-3219
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns
Assignee: Derrick Burns

 The K-Means clusterer supports the Euclidean distance metric.  However, it is 
 rather straightforward to support Bregman 
 (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
 distance functions which would increase the utility of the clusterer 
 tremendously.
 I have modified the clusterer to support pluggable distance functions.  
 However, I notice that there are hundreds of outstanding pull requests.  If 
 someone is willing to work with me to sponsor the work through the process, I 
 will create a pull request.  Otherwise, I will just keep my own fork.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-15 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120325#comment-14120325
 ] 

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:26 PM:
---

Great!

You can find my work here:
https://github.com/derrickburns/generalized-kmeans-clustering.git.

I should warn you that I rewrote much of the original Spark clusterer
because the original is too tightly coupled to using the Euclidean norm and
does not
allow one to identify efficiently which points belong to which clusters.  I
have tested this version extensively.

You will notice a package called com.rincaro.clusterer.metrics.  Please take
a look at the two files EuOps.scala and FastEuclideansOps.scala.   They
both implement the Euclidean norm. However, one is much faster than the
other by using the same algebraic transformations that the Spark version
uses.  This demonstrates that
it is possible to be efficient while not being tightly coupled.   One could
easily re-implement FastEuclideanOps using Breeze or Blas without affecting
the core Kmeans implementation.

Not included in this project is another distance function that I have
implemented: the Kullback-Leibler distance function, a.k.a. relative
entropy.  In my implementation, I also perform algebraic transformations to
expedite the computation, resulting in a distance computation that is even
faster than the fast Euclidean norm.

Let me know if this is useful to you.
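A small sketch of the log pre-computation idea mentioned here (illustrative only; this is not the code from the repository above):
{code}
object KLSketch {
  // Cache log(x_i) once per point so the inner distance loop never calls math.log.
  case class LogPoint(raw: Array[Double]) {
    require(raw.forall(_ > 0.0), "KL divergence assumes strictly positive coordinates")
    val logs: Array[Double] = raw.map(math.log)
  }

  // KL(p || c) = sum_i p_i * (log p_i - log c_i)
  def kl(p: LogPoint, c: LogPoint): Double = {
    var d = 0.0
    var i = 0
    while (i < p.raw.length) {
      d += p.raw(i) * (p.logs(i) - c.logs(i))
      i += 1
    }
    d
  }
}
{code}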






was (Author: derrickburns):
Great!

You can find my work here:
https://github.com/derrickburns/generalized-kmeans-clustering.git.

I should warn you that I rewrote much of the original Spark clusterer
because the original is too tightly coupled to using the Euclidean norm and
does not
allow one to identify efficiently which points belong to which clusters.  I
have tested this version extensively.

You will notice a package called com.rincaro.clusterer.metrics.  Please take
a look at the two files EuOps.scala and FastEuclideansOps.scala.   They
both implement the Euclidean norm. However, one is much faster than the
other by using the same algebraic transformations that the Spark version
uses.  This demonstrates that
it is possible to be efficient while not being tightly coupled.   One could
easily re-implement FastEuclideanOps using Breeze or Blas without affecting
the core Kmeans implementation.

Not included in this project is another distance function that I have
implemented: the Kullback-Leibler distance function, a.k.a. relative
entropy.  In my implementation, I also perform algebraic transformations to
expedite the computation, resulting in a distance computation that is even
faster than the fast Euclidean norm.

Let me know if this is useful to you.





 K-Means clusterer should support Bregman distance functions
 ---

 Key: SPARK-3219
 URL: https://issues.apache.org/jira/browse/SPARK-3219
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns
Assignee: Derrick Burns

 The K-Means clusterer supports the Euclidean distance metric.  However, it is 
 rather straightforward to support Bregman 
 (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
 distance functions which would increase the utility of the clusterer 
 tremendously.
 I have modified the clusterer to support pluggable distance functions.  
 However, I notice that there are hundreds of outstanding pull requests.  If 
 someone is willing to work with me to sponsor the work through the process, I 
 will create a pull request.  Otherwise, I will just keep my own fork.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3308) Ability to read JSON Arrays as tables

2014-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134385#comment-14134385
 ] 

Apache Spark commented on SPARK-3308:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/2400

 Ability to read JSON Arrays as tables
 -

 Key: SPARK-3308
 URL: https://issues.apache.org/jira/browse/SPARK-3308
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Yin Huai
Priority: Critical

 Right now we can only read JSON where each object is on its own line.  It 
 would be nice to be able to read top-level JSON arrays where each element 
 maps to a row.
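For illustration, a sketch of the two input shapes (the file name is hypothetical; {{jsonFile}} is the existing SQLContext method that expects the line-delimited form):
{code}
// Supported today: one JSON object per line
//   {"name": "a", "age": 1}
//   {"name": "b", "age": 2}
// Requested: a single top-level array whose elements each map to a row
//   [{"name": "a", "age": 1}, {"name": "b", "age": 2}]
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("json-demo").setMaster("local"))
val sqlContext = new SQLContext(sc)
val people = sqlContext.jsonFile("people.json")  // only works with the line-delimited form
people.printSchema()
{code}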



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3534) Avoid running MLlib and Streaming tests when testing SQL PRs

2014-09-15 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134399#comment-14134399
 ] 

Michael Armbrust commented on SPARK-3534:
-

Ah, yeah.  Good catch Josh.  I'll settle for a simpler version where we only 
fix SQL (since that is where the pain is being felt now).  Bonus points for 
implementing the full solution :)

 Avoid running MLlib and Streaming tests when testing SQL PRs
 

 Key: SPARK-3534
 URL: https://issues.apache.org/jira/browse/SPARK-3534
 Project: Spark
  Issue Type: Bug
  Components: Project Infra, SQL
Reporter: Michael Armbrust
Priority: Blocker

 We are bumping up against the 120 minute time limit for tests pretty 
 regularly now.  Since we have decreased the number of shuffle partitions and 
 upped the parallelism, I don't think there is much low-hanging fruit to speed 
 up the SQL tests. (The tests that are listed as taking 2-3 minutes are 
 actually 100s of tests that I think are valuable).  Instead I propose we 
 avoid running tests that we don't need to.
 This will have the added benefit of eliminating failures in SQL due to flaky 
 streaming tests.
 Note that this won't fix the full builds that are run for every commit.  
 There I think we should just up the test timeout.  
 cc: [~joshrosen] [~pwendell]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-09-15 Thread Pedro Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134445#comment-14134445
 ] 

Pedro Rodriguez commented on SPARK-1405:


Hi All. Just wanted to quickly introduce myself. I am an undergrad at UC 
Berkeley working in the AMPLab, in particular on LDA (a continuation of a 
grad class final project from last spring).

Generally speaking my focus will be to use one LDA implementation as a baseline 
(probably Joey's since it is fully distributed in all parts, particularly the 
token-topic matrix), write unit tests + test cases, and benchmark it at scale.

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Xusen Yin
  Labels: features
   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from a text corpus. Unlike the current machine learning algorithms 
 in MLlib, instead of using optimization algorithms such as gradient descent, 
 LDA uses expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), 
 and a Gibbs sampling core.
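For reference, a generic collapsed Gibbs sampling update for a single token might look like the sketch below (illustrative only, with made-up names; this is not the code in the PR):
{code}
import scala.util.Random

object GibbsSketch {
  // Resample the topic of one (doc, word) token given the current count matrices.
  // docTopicCounts: D x K, topicWordCounts: K x V, topicTotals: K.
  def sampleTopic(doc: Int, word: Int, oldTopic: Int,
                  docTopicCounts: Array[Array[Int]],
                  topicWordCounts: Array[Array[Int]],
                  topicTotals: Array[Int],
                  alpha: Double, beta: Double, vocabSize: Int,
                  rng: Random): Int = {
    val k = topicTotals.length
    // Remove the token's current assignment from the counts.
    docTopicCounts(doc)(oldTopic) -= 1
    topicWordCounts(oldTopic)(word) -= 1
    topicTotals(oldTopic) -= 1

    // Full conditional: p(z = t | rest) proportional to
    // (n_dt + alpha) * (n_tw + beta) / (n_t + V * beta)
    val weights = Array.tabulate(k) { t =>
      (docTopicCounts(doc)(t) + alpha) *
        (topicWordCounts(t)(word) + beta) / (topicTotals(t) + vocabSize * beta)
    }

    // Draw a new topic proportionally to the weights.
    var u = rng.nextDouble() * weights.sum
    var newTopic = 0
    while (newTopic < k - 1 && u > weights(newTopic)) {
      u -= weights(newTopic)
      newTopic += 1
    }

    // Add the new assignment back into the counts.
    docTopicCounts(doc)(newTopic) += 1
    topicWordCounts(newTopic)(word) += 1
    topicTotals(newTopic) += 1
    newTopic
  }
}
{code}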



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3535) Spark on Mesos not correctly setting heap overhead

2014-09-15 Thread Brenden Matthews (JIRA)
Brenden Matthews created SPARK-3535:
---

 Summary: Spark on Mesos not correctly setting heap overhead
 Key: SPARK-3535
 URL: https://issues.apache.org/jira/browse/SPARK-3535
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.1.0
Reporter: Brenden Matthews


Spark on Mesos does not account for any memory overhead.  The result is that tasks 
are OOM killed nearly 95% of the time.

Like with the Hadoop on Mesos project, Spark should set aside 15-25% of the 
executor memory for JVM overhead.

For example, see: 
https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java#L55-L63
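A rough sketch of the kind of calculation being proposed (the constants and names are illustrative, mirroring the Hadoop-on-Mesos ResourcePolicy linked above; this is not what the eventual patch does):
{code}
object MemoryOverheadSketch {
  val OverheadFraction = 0.15  // set aside 15% for stack, GC, and native allocations
  val MinOverheadMb = 384

  // Total memory to request from Mesos for an executor with the given heap size.
  def executorMemoryWithOverhead(executorMemoryMb: Int): Int =
    executorMemoryMb + math.max(MinOverheadMb, (executorMemoryMb * OverheadFraction).toInt)
}
{code}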



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3535) Spark on Mesos not correctly setting heap overhead

2014-09-15 Thread Timothy St. Clair (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134499#comment-14134499
 ] 

Timothy St. Clair commented on SPARK-3535:
--

Are you seeing this under fine-grained, coarse-grained, or both?  

 Spark on Mesos not correctly setting heap overhead
 --

 Key: SPARK-3535
 URL: https://issues.apache.org/jira/browse/SPARK-3535
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.1.0
Reporter: Brenden Matthews

 Spark on Mesos does not account for any memory overhead.  The result is that 
 tasks are OOM killed nearly 95% of the time.
 Like with the Hadoop on Mesos project, Spark should set aside 15-25% of the 
 executor memory for JVM overhead.
 For example, see: 
 https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java#L55-L63



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3535) Spark on Mesos not correctly setting heap overhead

2014-09-15 Thread Brenden Matthews (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134524#comment-14134524
 ] 

Brenden Matthews commented on SPARK-3535:
-

I'm seeing this in fine-grained mode. I confirmed that Spark sets the heap size 
equal to the task memory.

 Spark on Mesos not correctly setting heap overhead
 --

 Key: SPARK-3535
 URL: https://issues.apache.org/jira/browse/SPARK-3535
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.1.0
Reporter: Brenden Matthews

 Spark on Mesos does not account for any memory overhead.  The result is that 
 tasks are OOM killed nearly 95% of the time.
 Like with the Hadoop on Mesos project, Spark should set aside 15-25% of the 
 executor memory for JVM overhead.
 For example, see: 
 https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java#L55-L63



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3536) SELECT on empty parquet table throws exception

2014-09-15 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-3536:
---

 Summary: SELECT on empty parquet table throws exception
 Key: SPARK-3536
 URL: https://issues.apache.org/jira/browse/SPARK-3536
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust


Reported by [~matei].  Reproduce as follows:
{code}
scala> case class Data(i: Int)
defined class Data

scala> createParquetFile[Data]("testParquet")
scala> parquetFile("testParquet").count()
14/09/15 14:34:17 WARN scheduler.DAGScheduler: Creating new stage failed due to 
exception - job: 0
java.lang.NullPointerException
at 
org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:438)
at 
parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:344)
at 
org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3535) Spark on Mesos not correctly setting heap overhead

2014-09-15 Thread Brenden Matthews (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134533#comment-14134533
 ] 

Brenden Matthews commented on SPARK-3535:
-

I wrote a patch:

https://github.com/apache/spark/pull/2401

I'm not very familiar with Spark internals or conventions, so feel free to rip 
this apart.

 Spark on Mesos not correctly setting heap overhead
 --

 Key: SPARK-3535
 URL: https://issues.apache.org/jira/browse/SPARK-3535
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.1.0
Reporter: Brenden Matthews

 Spark on Mesos does not account for any memory overhead.  The result is that 
 tasks are OOM killed nearly 95% of the time.
 Like with the Hadoop on Mesos project, Spark should set aside 15-25% of the 
 executor memory for JVM overhead.
 For example, see: 
 https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java#L55-L63



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3537) Statistics for cached RDDs

2014-09-15 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-3537:
---

 Summary: Statistics for cached RDDs
 Key: SPARK-3537
 URL: https://issues.apache.org/jira/browse/SPARK-3537
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Michael Armbrust


Right now we only have limited statistics for hive tables.  We could easily 
collect this data when caching an RDD as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3538) Provide way for workers to log messages to driver's out/err

2014-09-15 Thread Matthew Farrellee (JIRA)
Matthew Farrellee created SPARK-3538:


 Summary: Provide way for workers to log messages to driver's 
out/err
 Key: SPARK-3538
 URL: https://issues.apache.org/jira/browse/SPARK-3538
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core, Spark Shell
Reporter: Matthew Farrellee
Priority: Minor


As part of SPARK-927 we encountered a use case for code running on a worker 
to be able to emit messages back to the driver. The communication channel is 
intended for trace/debug messages directed at the application's user (shell or app).
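One rough workaround sketch with the existing API (an accumulator that collects strings and is printed on the driver; note that re-run tasks may add duplicate messages, so this is only suitable for trace/debug output, not a real logging channel):
{code}
import org.apache.spark.{AccumulatorParam, SparkContext}

object WorkerLogSketch {
  object StringListAccum extends AccumulatorParam[List[String]] {
    def zero(initialValue: List[String]): List[String] = Nil
    def addInPlace(a: List[String], b: List[String]): List[String] = a ++ b
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "worker-log-demo")
    val messages = sc.accumulator(List.empty[String])(StringListAccum)
    sc.parallelize(1 to 4).foreach { i => messages += List("worker saw " + i) }
    // Accumulated messages are only visible on the driver, after the action completes.
    messages.value.foreach(println)
    sc.stop()
  }
}
{code}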



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3535) Spark on Mesos not correctly setting heap overhead

2014-09-15 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134592#comment-14134592
 ] 

Andrew Ash commented on SPARK-3535:
---

Why does the task need extra memory if the heap size equals the available task 
memory?  Filesystem cache?

 Spark on Mesos not correctly setting heap overhead
 --

 Key: SPARK-3535
 URL: https://issues.apache.org/jira/browse/SPARK-3535
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.1.0
Reporter: Brenden Matthews

 Spark on Mesos does not account for any memory overhead.  The result is that 
 tasks are OOM killed nearly 95% of the time.
 Like with the Hadoop on Mesos project, Spark should set aside 15-25% of the 
 executor memory for JVM overhead.
 For example, see: 
 https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java#L55-L63



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3535) Spark on Mesos not correctly setting heap overhead

2014-09-15 Thread Brenden Matthews (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134599#comment-14134599
 ] 

Brenden Matthews commented on SPARK-3535:
-

The JVM heap size does not include the stack size, GC overhead, or anything 
malloc'd by other libraries linked into the JVM.

 Spark on Mesos not correctly setting heap overhead
 --

 Key: SPARK-3535
 URL: https://issues.apache.org/jira/browse/SPARK-3535
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.1.0
Reporter: Brenden Matthews

 Spark on Mesos does not account for any memory overhead.  The result is that 
 tasks are OOM killed nearly 95% of the time.
 Like with the Hadoop on Mesos project, Spark should set aside 15-25% of the 
 executor memory for JVM overhead.
 For example, see: 
 https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java#L55-L63



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3518) Remove useless statement in JsonProtocol

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3518.

   Resolution: Fixed
Fix Version/s: 1.2.0
   1.1.1
 Assignee: Kousuke Saruta

Fixed by:
https://github.com/apache/spark/pull/2380

 Remove useless statement in JsonProtocol
 

 Key: SPARK-3518
 URL: https://issues.apache.org/jira/browse/SPARK-3518
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
Priority: Minor
 Fix For: 1.1.1, 1.2.0


 In org.apache.spark.util.JsonProtocol#taskInfoToJson, a variable named 
 accumUpdateMap is created as follows.
 {code}
 val accumUpdateMap = taskInfo.accumulables
 {code}
 But accumUpdateMap is never used and there is a 2nd invocation of 
 taskInfo.accumulables as follows.
 {code}
 ("Accumulables" -> 
 JArray(taskInfo.accumulables.map(accumulableInfoToJson).toList))
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2505) Weighted Regularizer

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2505:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Weighted Regularizer
 

 Key: SPARK-2505
 URL: https://issues.apache.org/jira/browse/SPARK-2505
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: DB Tsai
 Fix For: 1.2.0


 The current implementation of regularization in the linear models uses 
 `Updater`, and this design has a couple of issues:
 1) It penalizes all the weights, including the intercept. In a machine learning 
 training process, people typically don't penalize the intercept. 
 2) The `Updater` contains the logic of adaptive step sizes for gradient descent, 
 and we would like to clean it up by separating the regularization logic out 
 of the updater into a regularizer, so that in the LBFGS optimizer we don't need 
 the trick for getting the loss and gradient of the objective function.
 In this work, a weighted regularizer will be implemented, and users can 
 exclude the intercept or any weight from regularization by setting that term's 
 penalty weight to zero. Since the regularizer will return a tuple of loss 
 and gradient, the adaptive step size logic and the soft thresholding for L1 in 
 Updater will be moved to the SGD optimizer.
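A minimal sketch of the proposed interface (names are illustrative, not the actual MLlib API): each coefficient gets its own penalty weight, so setting the intercept's weight to zero excludes it, and the regularizer returns its (loss, gradient) contribution directly.
{code}
trait WeightedRegularizer {
  // Returns the (loss, gradient) contribution of the regularization term.
  def compute(coefficients: Array[Double]): (Double, Array[Double])
}

class WeightedL2(lambda: Double, penaltyWeights: Array[Double]) extends WeightedRegularizer {
  override def compute(coefficients: Array[Double]): (Double, Array[Double]) = {
    var loss = 0.0
    val gradient = new Array[Double](coefficients.length)
    var i = 0
    while (i < coefficients.length) {
      val w = coefficients(i)
      loss += 0.5 * lambda * penaltyWeights(i) * w * w
      gradient(i) = lambda * penaltyWeights(i) * w
      i += 1
    }
    (loss, gradient)
  }
}
{code}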



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2314) RDD actions are only overridden in Scala, not java or python

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2314:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 RDD actions are only overridden in Scala, not java or python
 

 Key: SPARK-2314
 URL: https://issues.apache.org/jira/browse/SPARK-2314
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0, 1.0.1
Reporter: Michael Armbrust
Assignee: Aaron Staple
  Labels: starter
 Fix For: 1.2.0, 1.0.3


 For example, collect and take().  We should keep these two in sync, or move 
 this code to SchemaRDDLike if possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3403:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
 -

 Key: SPARK-3403
 URL: https://issues.apache.org/jira/browse/SPARK-3403
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.2
 Environment: Setup: Windows 7, x64 libraries for netlib-java (as 
 described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and 
 MinGW64 precompiled dlls.
Reporter: Alexander Ulanov
 Fix For: 1.2.0

 Attachments: NativeNN.scala


 Code:
 val model = NaiveBayes.train(train)
 val predictionAndLabels = test.map { point =>
   val score = model.predict(point.features)
   (score, point.label)
 }
 predictionAndLabels.foreach(println)
 Result: 
 program crashes with: Process finished with exit code -1073741819 
 (0xC0000005) after displaying the first prediction



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2703) Make Tachyon related unit tests execute without deploying a Tachyon system locally.

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2703:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Make Tachyon related unit tests execute without deploying a Tachyon system 
 locally.
 ---

 Key: SPARK-2703
 URL: https://issues.apache.org/jira/browse/SPARK-2703
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Haoyuan Li
Assignee: Rong Gu
 Fix For: 1.2.0


 Use the LocalTachyonCluster class in tachyon-test.jar from the 0.5.0 release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3038) delete history server logs when there are too many logs

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3038:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 delete history server logs when there are too many logs 
 

 Key: SPARK-3038
 URL: https://issues.apache.org/jira/browse/SPARK-3038
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.1
Reporter: wangfei
 Fix For: 1.2.0


 Enhance the history server to delete logs automatically:
 1. use spark.history.deletelogs.enable to enable this function
 2. when the number of app logs is greater than spark.history.maxsavedapplication, delete 
 the older logs 
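A sketch of the proposed cleanup (the config names above are the ones proposed in this issue, not existing Spark settings; this code is illustrative only):
{code}
import java.io.File

object LogCleanupSketch {
  // Keep only the newest maxSavedApplications log files/directories in logDir.
  def cleanOldLogs(logDir: File, maxSavedApplications: Int): Unit = {
    def deleteRecursively(f: File): Unit = {
      Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
      f.delete()
    }
    val logs = Option(logDir.listFiles()).getOrElse(Array.empty[File]).sortBy(_.lastModified())
    logs.dropRight(maxSavedApplications).foreach(deleteRecursively)
  }
}
{code}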



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1832) Executor UI improvement suggestions

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1832:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Executor UI improvement suggestions
 ---

 Key: SPARK-1832
 URL: https://issues.apache.org/jira/browse/SPARK-1832
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.0.0
Reporter: Thomas Graves
 Fix For: 1.2.0


 I received some suggestions from a user for the /executors UI page to make it 
 more helpful. This gets more important when you have a really large number of 
 executors.
  Fill some of the cells with color in order to make it easier to absorb 
 the info, e.g.
 RED if Failed Tasks greater than 0 (maybe the more failed, the more intense 
 the red)
 GREEN if Active Tasks greater than 0 (maybe more intense the larger the 
 number)
 Possibly color code COMPLETE TASKS using various shades of blue (e.g., based 
 on the log(# completed))
 - if dark blue then write the value in white (same for the RED and GREEN above)
 Maybe mark the MASTER task somehow
  
 Report the TOTALS in each column (do this at the TOP so no need to scroll 
 to the bottom, or print both at top and bottom).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2167) spark-submit should return exit code based on failure/success

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2167:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 spark-submit should return exit code based on failure/success
 -

 Key: SPARK-2167
 URL: https://issues.apache.org/jira/browse/SPARK-2167
 Project: Spark
  Issue Type: New Feature
  Components: Deploy
Affects Versions: 1.0.0
Reporter: Thomas Graves
Assignee: Guoqiang Li
 Fix For: 1.2.0


 spark-submit script and Java class should exit with 0 for success and 
 non-zero with failure so that other command line tools and workflow managers 
 (like oozie) can properly tell if the spark app succeeded or failed.
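A hedged sketch of the behavior being asked for (not the actual SparkSubmit code): run the application and turn any failure into a non-zero process exit code.
{code}
object ExitCodeSketch {
  // Run the user's application body and exit 0 on success, 1 on any failure,
  // so shell scripts and workflow managers can detect the outcome.
  def runAndExit(app: () => Unit): Unit = {
    try {
      app()
      System.exit(0)
    } catch {
      case e: Throwable =>
        e.printStackTrace(System.err)
        System.exit(1)
    }
  }

  def main(args: Array[String]): Unit = runAndExit(() => println("application body here"))
}
{code}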



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2754) Document standalone-cluster mode now that it's working

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2754:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Document standalone-cluster mode now that it's working
 --

 Key: SPARK-2754
 URL: https://issues.apache.org/jira/browse/SPARK-2754
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.0.1
Reporter: Andrew Or
 Fix For: 1.2.0


 This was previously broken before SPARK-2260, so we (attempted to) remove all 
 documentation related to this mode. We should add it back now that we have 
 fixed it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2947) DAGScheduler resubmit the stage into an infinite loop

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2947:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 DAGScheduler resubmit the stage into an infinite loop
 -

 Key: SPARK-2947
 URL: https://issues.apache.org/jira/browse/SPARK-2947
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.0.2
Reporter: Guoqiang Li
Priority: Blocker
 Fix For: 1.2.0, 1.0.3


 The stage is resubmitted more than 5 times.
 This seems to be caused by {{FetchFailed.bmAddress}} being null.
 I don't know how to reproduce it.
 master log:
 {noformat}
 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:276 as 
 TID 52334 on executor 82: sanshan (PROCESS_LOCAL)
 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:276 as 
 3060 bytes in 0 ms
 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:277 as 
 TID 52335 on executor 78: tuan231 (PROCESS_LOCAL)
 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:277 as 
 3060 bytes in 0 ms
 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Lost TID 52199 (task 
 1.189:141)
 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
 failure from null
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
 DealCF.scala:215) for resubmision due to a fetch failure
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
 Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
 failure from null
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
 DealCF.scala:215) for resubmision due to a fetch failure
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
 Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
  -- 5 times ---
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
 DealCF.scala:215) for resubmision due to a fetch failure
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
 Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
 14/08/09 21:50:17 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 
 1.189, whose tasks have all completed, from pool 
 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Finished TID 1869 in 87398 
 ms on jilin (progress: 280/280)
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Completed ShuffleMapTask(2, 
 269)
 14/08/09 21:50:17 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 
 2.1, whose tasks have all completed, from pool 
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Stage 2 (flatMap at 
 DealCF.scala:207) finished in 129.544 s
 {noformat}
 worker log:
 {noformat}
 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_57 not found, 
 computing it
 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_191 not found, 
 computing it
 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
 task 18017
 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18017
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
 task 18151
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18151
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_86 not found, 
 computing it
 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_220 not found, 
 computing it
 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
 task 18285
 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18285
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
 task 18419
 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18419
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
 {noformat}
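 One way to break a loop like this is to cap how many times a stage may be 
 resubmitted for fetch failures and abort the job once the cap is hit. The sketch 
 below is illustrative only and not the DAGScheduler's actual code; the threshold 
 and helper names are made up:
 {code}
 import scala.collection.mutable

 val maxStageResubmissions = 5   // hypothetical threshold
 val fetchFailureCounts = mutable.Map.empty[Int, Int].withDefaultValue(0)

 def onFetchFailure(stageId: Int, resubmit: Int => Unit, abortJob: String => Unit): Unit = {
   fetchFailureCounts(stageId) += 1
   if (fetchFailureCounts(stageId) > maxStageResubmissions) {
     abortJob(s"Stage $stageId hit more than $maxStageResubmissions fetch failures; aborting")
   } else {
     resubmit(stageId)
   }
 }
 {code}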
 

[jira] [Updated] (SPARK-2638) Improve concurrency of fetching Map outputs

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2638:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Improve concurrency of fetching Map outputs
 ---

 Key: SPARK-2638
 URL: https://issues.apache.org/jira/browse/SPARK-2638
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: All
Reporter: Stephen Boesch
Assignee: Josh Rosen
Priority: Minor
  Labels: MapOutput, concurrency
 Fix For: 1.2.0

   Original Estimate: 0h
  Remaining Estimate: 0h

 This issue was noticed while perusing the MapOutputTracker source code. 
 Notice that the synchronization is on the containing {{fetching}} collection, 
 which makes ALL fetches wait if any fetch is occurring.
 The fix is to synchronize instead on the shuffleId (interned as a string to 
 ensure JVM-wide visibility).
 {code}
   def getServerStatuses(shuffleId: Int, reduceId: Int): Array[(BlockManagerId, Long)] = {
     val statuses = mapStatuses.get(shuffleId).orNull
     if (statuses == null) {
       logInfo("Don't have map outputs for shuffle " + shuffleId + ", fetching them")
       var fetchedStatuses: Array[MapStatus] = null
       fetching.synchronized {   // This is existing code
       // shuffleId.toString.intern.synchronized {  // New Code
         if (fetching.contains(shuffleId)) {
           // Someone else is fetching it; wait for them to be done
           while (fetching.contains(shuffleId)) {
             try {
               fetching.wait()
             } catch {
               case e: InterruptedException =>
             }
           }
 {code}
 This is only a small code change, but the test cases to prove (a) proper 
 functionality and (b) proper performance improvement are not so trivial.  
 For (b) it is not worthwhile to add a test case to the codebase. Instead I 
 have added a git project that demonstrates the concurrency/performance 
 improvement using the fine-grained approach. The GitHub project is at 
 https://github.com/javadba/scalatesting.git . Simply run {{sbt test}}. Note: 
 it is unclear how/where to include this ancillary testing/verification 
 information that will not be included in the git PR; I am open to any 
 suggestions, even as far as simply removing references to it.
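 A minimal sketch of the finer-grained locking idea (illustrative only; it uses a 
 per-shuffle lock object rather than interning the id as a String, which is an 
 alternative to the reporter's suggestion):
 {code}
 import java.util.concurrent.ConcurrentHashMap

 object ShuffleFetchLocks {
   private val locks = new ConcurrentHashMap[Int, AnyRef]()

   private def lockFor(shuffleId: Int): AnyRef = {
     val fresh = new AnyRef
     val existing = locks.putIfAbsent(shuffleId, fresh)
     if (existing == null) fresh else existing
   }

   // Fetches for the same shuffle still wait on each other, but fetches for
   // different shuffles no longer contend on one shared collection.
   def withShuffleLock[A](shuffleId: Int)(doFetch: => A): A =
     lockFor(shuffleId).synchronized(doFetch)
 }
 {code}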



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1911) Warn users if their assembly jars are not built with Java 6

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1911:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Warn users if their assembly jars are not built with Java 6
 ---

 Key: SPARK-1911
 URL: https://issues.apache.org/jira/browse/SPARK-1911
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.1.0
Reporter: Andrew Or
 Fix For: 1.2.0


 The root cause of the problem is detailed in: 
 https://issues.apache.org/jira/browse/SPARK-1520.
 In short, an assembly jar built with Java 7+ is not always accessible by 
 Python or other versions of Java (especially Java 6). If the assembly jar is 
 not built on the cluster itself, this problem may manifest itself in strange 
 exceptions that are not trivial to debug. This is an issue especially for 
 PySpark on YARN, which relies on the python files included within the 
 assembly jar.
 Currently we warn users only in make-distribution.sh, but most users build 
 the jars directly. At the very least we need to emphasize this in the docs 
 (currently missing entirely). The next step is to add a warning prompt in the 
 mvn scripts whenever Java 7+ is detected.
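 A sketch of the kind of check being proposed (the real warning would live in the 
 Maven/SBT build scripts, not in application code; treat this as illustration only):
 {code}
 object JavaVersionCheck {
   def main(args: Array[String]): Unit = {
     // java.specification.version is "1.6" on Java 6, "1.7" on Java 7, etc.
     val spec = sys.props.getOrElse("java.specification.version", "unknown")
     if (spec != "1.6") {
       Console.err.println(
         s"WARNING: building with Java $spec. Assemblies built with Java 7+ may not be " +
           "readable by Java 6 or by PySpark on YARN (see SPARK-1520).")
     }
   }
 }
 {code}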



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2793) Correctly lock directory creation in DiskBlockManager.getFile

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2793:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Correctly lock directory creation in DiskBlockManager.getFile
 -

 Key: SPARK-2793
 URL: https://issues.apache.org/jira/browse/SPARK-2793
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Matei Zaharia
 Fix For: 1.2.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2069) MIMA false positives (umbrella)

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2069:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 MIMA false positives (umbrella)
 ---

 Key: SPARK-2069
 URL: https://issues.apache.org/jira/browse/SPARK-2069
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.0
Reporter: Patrick Wendell
Assignee: Prashant Sharma
Priority: Critical
 Fix For: 1.2.0


 Since we started using MIMA more actively in core, we've been running into 
 situations where we get false positives. We should address these ASAP, as they 
 require adding manual excludes to our build files, which is pretty tedious.
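 For reference, a manual exclude looks roughly like the following (the problem 
 type and member name below are made-up placeholders, not a real entry):
 {code}
 import com.typesafe.tools.mima.core._

 // One entry in the list of excludes fed to MIMA's binary-compatibility checks.
 val excludes = Seq(
   ProblemFilters.exclude[MissingMethodProblem](
     "org.apache.spark.SomeClass.someRemovedMethod")
 )
 {code}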



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1830) Deploy failover, Make Persistence engine and LeaderAgent Pluggable.

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1830:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Deploy failover, Make Persistence engine and LeaderAgent Pluggable.
 ---

 Key: SPARK-1830
 URL: https://issues.apache.org/jira/browse/SPARK-1830
 Project: Spark
  Issue Type: New Feature
  Components: Deploy
Reporter: Prashant Sharma
 Fix For: 1.0.1, 1.2.0


 With the current code base it is difficult to plug in an external, 
 user-specified persistence engine or leader election agent. It would be good 
 to expose these as pluggable APIs.
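 A hypothetical sketch of what such an API could look like (names and signatures 
 here are illustrative, not Spark's actual deploy interfaces):
 {code}
 // A user-supplied implementation would persist and recover Master state.
 trait PersistenceEngine {
   def persist(name: String, obj: AnyRef): Unit
   def unpersist(name: String): Unit
   def read[T](prefix: String): Seq[T]
 }

 // A user-supplied implementation would notify the Master of leadership changes.
 trait LeaderElectionAgent {
   def electedLeader(): Unit
   def revokedLeadership(): Unit
 }
 {code}
 An implementation could then be selected through configuration, for example via an 
 assumed setting such as spark.deploy.recoveryMode pointing at a custom class.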



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2795) Improve DiskBlockObjectWriter API

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2795:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Improve DiskBlockObjectWriter API
 -

 Key: SPARK-2795
 URL: https://issues.apache.org/jira/browse/SPARK-2795
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Matei Zaharia
 Fix For: 1.2.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1706) Allow multiple executors per worker in Standalone mode

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1706:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Allow multiple executors per worker in Standalone mode
 --

 Key: SPARK-1706
 URL: https://issues.apache.org/jira/browse/SPARK-1706
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Reporter: Patrick Wendell
Assignee: Nan Zhu
 Fix For: 1.2.0


 Right now, if people want to launch multiple executors on each machine, they 
 need to start multiple standalone workers. This is not too difficult, but it 
 means you have extra JVMs sitting around.
 We should just allow users to set a number of cores they want per-executor in 
 standalone mode and then allow packing multiple executors on each node. This 
 would make standalone mode more consistent with YARN in the way you request 
 resources.
 It's not too big of a change as far as I can see. You'd need to:
 1. Introduce a configuration for how many cores you want per executor.
 2. Change the scheduling logic in Master.scala to take this into account.
 3. Change CoarseGrainedSchedulerBackend to not assume a 1-1 correspondence 
 between hosts and executors.
 And maybe modify a few other places.
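 A sketch of how the requested setting might look from the application side (the 
 per-executor-cores property is the proposal here and does not exist yet in 
 standalone mode; only the total-cores setting does):
 {code}
 import org.apache.spark.SparkConf

 val conf = new SparkConf()
   .setMaster("spark://master:7077")
   .setAppName("multi-executor-demo")
   .set("spark.cores.max", "16")      // total cores for the application
   .set("spark.executor.cores", "4")  // proposed: cores per executor, so the master
                                      // could pack several executors onto one worker
 {code}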



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1989) Exit executors faster if they get into a cycle of heavy GC

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1989:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Exit executors faster if they get into a cycle of heavy GC
 --

 Key: SPARK-1989
 URL: https://issues.apache.org/jira/browse/SPARK-1989
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Matei Zaharia
 Fix For: 1.2.0


 I've seen situations where an application is allocating too much memory 
 across its tasks + cache to proceed, but Java gets into a cycle where it 
 repeatedly runs full GCs, frees up a bit of the heap, and continues instead 
 of giving up. This then leads to timeouts and confusing error messages. It 
 would be better to crash with OOM sooner. The JVM has options to support 
 this: http://java.dzone.com/articles/tracking-excessive-garbage.
 The right solution would probably be:
 - Add some config options used by spark-submit to set -XX:GCTimeLimit and 
 -XX:GCHeapFreeLimit, with more conservative values than the defaults (e.g. 90% 
 time limit, 5% free limit)
 - Make sure we pass these into the Java options for executors in each 
 deployment mode
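 Expressed as executor JVM options, the proposed defaults would look roughly like 
 this (a sketch only; the values are the examples given above):
 {code}
 import org.apache.spark.SparkConf

 val conf = new SparkConf()
   // Make the JVM throw OutOfMemoryError once it spends ~90% of its time in GC while
   // recovering less than ~5% of the heap, instead of limping along indefinitely.
   .set("spark.executor.extraJavaOptions",
        "-XX:+UseGCOverheadLimit -XX:GCTimeLimit=90 -XX:GCHeapFreeLimit=5")
 {code}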



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1860) Standalone Worker cleanup should not clean up running executors

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1860:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Standalone Worker cleanup should not clean up running executors
 ---

 Key: SPARK-1860
 URL: https://issues.apache.org/jira/browse/SPARK-1860
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.0.0
Reporter: Aaron Davidson
Priority: Critical
 Fix For: 1.2.0


 With its default values, the standalone worker cleanup code cleans up all 
 application data every 7 days. This includes jars that were added to any 
 executors that happen to be running for longer than 7 days, hitting streaming 
 jobs especially hard.
 Executors' log/data folders should not be cleaned up while they are still 
 running. Until that is fixed, this behavior should not be enabled by default.
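 For context, these are the worker-side settings involved (reading them through 
 SparkConf here is only for illustration; on a real cluster they are normally set 
 via SPARK_WORKER_OPTS on each worker):
 {code}
 import org.apache.spark.SparkConf

 val conf = new SparkConf()
 val cleanupEnabled = conf.getBoolean("spark.worker.cleanup.enabled", false)
 // 7 days, the default TTL this issue is concerned with
 val appDataTtlSeconds = conf.getLong("spark.worker.cleanup.appDataTtl", 7 * 24 * 3600L)
 {code}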



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1924) Make local:/ scheme work in more deploy modes

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1924:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Make local:/ scheme work in more deploy modes
 -

 Key: SPARK-1924
 URL: https://issues.apache.org/jira/browse/SPARK-1924
 Project: Spark
  Issue Type: Sub-task
  Components: Deploy
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Priority: Minor
 Fix For: 1.2.0


 A resource marked {{local:/}} is assumed to be available on the local file 
 system of every node. In that case, no data should be copied over the 
 network. I tested different deploy modes in v1.0; right now we only support the 
 {{local:/}} scheme for the app jar and secondary jars in the following modes:
 1) local (jars are copied to the working directory)
 2) standalone client
 3) yarn client
 It doesn't work for --files or Python apps (--py-files and app.py). For the 
 next release, we could support more deploy modes.
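 For example, in the modes that do work today, a dependency already present on 
 every node can be referenced without shipping it (the path below is hypothetical; 
 assumes an existing SparkContext {{sc}}):
 {code}
 // No copying over the network: Spark resolves this path on each node's local
 // file system instead of distributing the jar from the driver.
 sc.addJar("local:/opt/libs/my-dependency.jar")
 {code}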



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1379) Calling .cache() on a SchemaRDD should do something more efficient than caching the individual row objects.

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1379:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Calling .cache() on a SchemaRDD should do something more efficient than 
 caching the individual row objects.
 ---

 Key: SPARK-1379
 URL: https://issues.apache.org/jira/browse/SPARK-1379
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Michael Armbrust
 Fix For: 1.2.0


 Since rows aren't black boxes, we could use {{InMemoryColumnarTableScan}}. This 
 would significantly reduce GC pressure on the workers.
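 A sketch of the difference from the user's point of view (assumes a Spark 1.x 
 SQLContext bound to {{sqlContext}}; the input path is hypothetical):
 {code}
 val people = sqlContext.jsonFile("hdfs:///data/people.json")
 people.registerTempTable("people")

 // Columnar, compressed in-memory caching via the SQL layer:
 sqlContext.cacheTable("people")

 // The path this issue targets: caching the SchemaRDD like a plain RDD keeps
 // individual row objects on the heap and puts more pressure on GC.
 // people.cache()
 {code}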



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1684) Merge script should standardize SPARK-XXX prefix

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1684:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Merge script should standardize SPARK-XXX prefix
 

 Key: SPARK-1684
 URL: https://issues.apache.org/jira/browse/SPARK-1684
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Minor
 Fix For: 1.2.0


 If users write {{[SPARK-XXX] Issue}}, {{SPARK-XXX. Issue}}, or {{SPARK XXX: 
 Issue}}, we should convert it to {{SPARK-XXX: Issue}}.
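 A minimal sketch of the normalization (the regex and helper are illustrative, not 
 the actual merge script, which is a Python script under dev/):
 {code}
 def standardizeJiraPrefix(title: String): String = {
   val jiraRef = """^\[?SPARK[- ](\d+)\]?[:.]?\s*(.*)""".r
   title.trim match {
     case jiraRef(num, rest) => s"SPARK-$num: $rest"
     case other              => other
   }
 }

 // standardizeJiraPrefix("[SPARK-1684] Issue")  == "SPARK-1684: Issue"
 // standardizeJiraPrefix("SPARK 1684: Issue")   == "SPARK-1684: Issue"
 {code}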



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1853) Show Streaming application code context (file, line number) in Spark Stages UI

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1853:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Show Streaming application code context (file, line number) in Spark Stages UI
 --

 Key: SPARK-1853
 URL: https://issues.apache.org/jira/browse/SPARK-1853
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.0.0
Reporter: Tathagata Das
Assignee: Mubarak Seyed
 Fix For: 1.2.0

 Attachments: Screen Shot 2014-07-03 at 2.54.05 PM.png


 Right now, the code context (file and line number) shown for streaming jobs 
 in the Stages UI is meaningless, as it refers to an internal DStream file and a 
 random line rather than the user application file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1201) Do not materialize partitions whenever possible in BlockManager

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1201:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Do not materialize partitions whenever possible in BlockManager
 ---

 Key: SPARK-1201
 URL: https://issues.apache.org/jira/browse/SPARK-1201
 Project: Spark
  Issue Type: New Feature
  Components: Block Manager, Spark Core
Reporter: Patrick Wendell
Assignee: Andrew Or
 Fix For: 1.2.0


 This is a slightly more complex version of SPARK-942, where we try to avoid 
 unrolling iterators in other situations where it is possible. SPARK-942 
 focused on the case where the DISK_ONLY storage level was used. There are 
 other cases though, such as when data is stored serialized in memory but 
 there is not enough memory left to store the RDD.
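 The general idea, sketched outside of Spark's actual BlockManager code: unroll an 
 iterator into memory only while an estimated size budget holds, and otherwise hand 
 the iterator through without materializing the whole partition.
 {code}
 def maybeUnroll[T](values: Iterator[T], maxBytes: Long, sizeOf: T => Long): Either[Iterator[T], Vector[T]] = {
   val buffered = Vector.newBuilder[T]
   var usedBytes = 0L
   var overBudget = false
   while (values.hasNext && !overBudget) {
     val v = values.next()
     usedBytes += sizeOf(v)
     buffered += v
     if (usedBytes > maxBytes) overBudget = true
   }
   if (overBudget) Left(buffered.result().iterator ++ values)  // hand back buffered items plus the rest
   else Right(buffered.result())                               // everything fit within the budget
 }
 {code}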



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2159) Spark shell exit() does not stop SparkContext

2014-09-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2159:
---
Fix Version/s: (was: 1.1.0)
   1.2.0

 Spark shell exit() does not stop SparkContext
 -

 Key: SPARK-2159
 URL: https://issues.apache.org/jira/browse/SPARK-2159
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Andrew Or
Priority: Minor
 Fix For: 1.2.0


 If you type {{exit()}} in the Spark shell, it is equivalent to Ctrl+C and does 
 not stop the SparkContext. This is a very common way to exit a shell, and it 
 would be good if it were equivalent to Ctrl+D instead, which does stop the 
 SparkContext.
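 Until that is the case, a workaround is to stop the context explicitly before 
 leaving (a sketch; assumes the shell-provided SparkContext is bound to {{sc}}):
 {code}
 sc.stop()   // shut down the SparkContext cleanly
 exit()      // then leave the shell; Ctrl+D already does both steps today
 {code}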



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


