[jira] [Closed] (SPARK-3738) InsertIntoHiveTable can't handle strings with "\n"

2014-09-29 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian closed SPARK-3738.
-
Resolution: Invalid

False alarm; it's caused by Hive's default SerDe, which uses '\n' as the record 
delimiter.
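A possible workaround (a sketch only, not a verified fix): keep the data out of 
Hive's default delimited TEXTFILE format. Binary container formats such as 
SequenceFile carry row boundaries out of band, so an embedded '\n' is not read 
as a record delimiter. Table and source names below are hypothetical.

{code}
// Hedged workaround sketch -- table/source names are hypothetical.
sql("DROP TABLE IF EXISTS z_seq")
sql("CREATE TABLE z_seq (s STRING) STORED AS SEQUENCEFILE")
sparkContext.parallelize(Str("a\nb") :: Nil, 1).registerTempTable("src")
sql("INSERT INTO TABLE z_seq SELECT s FROM src")
table("z_seq").count()  // expected: 1, since rows are not split on '\n'
{code}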

> InsertIntoHiveTable can't handle strings with "\n"
> --
>
> Key: SPARK-3738
> URL: https://issues.apache.org/jira/browse/SPARK-3738
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Cheng Lian
>Priority: Blocker
>
> Try the following snippet in {{sbt/sbt hive/console}} to reproduce:
> {code}
> sql("drop table if exists z")
> case class Str(s: String)
> sparkContext.parallelize(Str("a\nb") :: Nil, 1).saveAsTable("z")
> table("z").count()
> {code}
> The expected result is 1, but 2 is returned instead.






[jira] [Reopened] (SPARK-3734) DriverRunner should not read SPARK_HOME from submitter's environment

2014-09-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or reopened SPARK-3734:
--

> DriverRunner should not read SPARK_HOME from submitter's environment
> 
>
> Key: SPARK-3734
> URL: https://issues.apache.org/jira/browse/SPARK-3734
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.1.1, 1.2.0
>
>
> If you use {{spark-submit}} in {{cluster}} mode to submit a job to a Spark 
> Standalone cluster, and the {{JAVA_HOME}} environment variable is set on the 
> submitting machine, then DriverRunner will attempt to use the _submitter's_ 
> JAVA_HOME to launch the driver process (instead of the worker's JAVA_HOME), 
> which can cause the job to fail unless the submitter and worker have Java 
> installed in the same location.
> This has a pretty simple fix: read JAVA_HOME from {{sys.env}} instead of 
> {{command.environment}}; PR pending shortly.
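A minimal sketch of the proposed change (assuming the surrounding DriverRunner 
code; this is not the actual patch):

{code}
// Sketch: resolve JAVA_HOME from the *worker's* environment via sys.env,
// not from the serialized submitter command; fall back to `java` on the PATH.
val javaHome = sys.env.get("JAVA_HOME")                  // worker-local env
// val javaHome = command.environment.get("JAVA_HOME")  // old: submitter's env
val javaBin  = javaHome.map(_ + "/bin/java").getOrElse("java")
{code}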






[jira] [Closed] (SPARK-3734) DriverRunner should not read SPARK_HOME from submitter's environment

2014-09-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-3734.

  Resolution: Fixed
   Fix Version/s: 1.2.0
  1.1.1
Target Version/s: 1.1.1, 1.2.0

> DriverRunner should not read SPARK_HOME from submitter's environment
> 
>
> Key: SPARK-3734
> URL: https://issues.apache.org/jira/browse/SPARK-3734
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.1.1, 1.2.0
>
>
> If you use {{spark-submit}} in {{cluster}} mode to submit a job to a Spark 
> Standalone cluster, and the {{JAVA_HOME}} environment variable is set on the 
> submitting machine, then DriverRunner will attempt to use the _submitter's_ 
> JAVA_HOME to launch the driver process (instead of the worker's JAVA_HOME), 
> which can cause the job to fail unless the submitter and worker have Java 
> installed in the same location.
> This has a pretty simple fix: read JAVA_HOME from {{sys.env}} instead of 
> {{command.environment}}; PR pending shortly.






[jira] [Closed] (SPARK-3734) DriverRunner should not read SPARK_HOME from submitter's environment

2014-09-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-3734.

Resolution: Fixed

> DriverRunner should not read SPARK_HOME from submitter's environment
> 
>
> Key: SPARK-3734
> URL: https://issues.apache.org/jira/browse/SPARK-3734
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> If you use {{spark-submit}} in {{cluster}} mode to submit a job to a Spark 
> Standalone cluster, and the {{JAVA_HOME}} environment variable is set on the 
> submitting machine, then DriverRunner will attempt to use the _submitter's_ 
> JAVA_HOME to launch the driver process (instead of the worker's JAVA_HOME), 
> which can cause the job to fail unless the submitter and worker have Java 
> installed in the same location.
> This has a pretty simple fix: read JAVA_HOME from {{sys.env}} instead of 
> {{command.environment}}; PR pending shortly.






[jira] [Commented] (SPARK-3709) Executors don't always report broadcast block removal properly back to the driver

2014-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152845#comment-14152845
 ] 

Apache Spark commented on SPARK-3709:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/2591

> Executors don't always report broadcast block removal properly back to the 
> driver
> -
>
> Key: SPARK-3709
> URL: https://issues.apache.org/jira/browse/SPARK-3709
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Reynold Xin
>Priority: Blocker
>







[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running executors

2014-09-29 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152839#comment-14152839
 ] 

Aaron Davidson commented on SPARK-1860:
---

The Executor could clean up its own jars when it terminates normally; that 
seems fine. The impact of this seems limited, though, and it's a good idea to 
limit the scope of shutdown hooks as much as possible.

There are three classes of things to delete:
1. Shuffle files / block manager blocks -- large -- deleted by graceful 
Executor termination. Can be deleted immediately.
2. Uploaded jars / files -- usually small -- deleted by Worker cleanup. Can be 
deleted immediately.
3. Logs -- small to medium -- deleted by Worker cleanup. Should not be deleted 
immediately.

Number 1 is most critical in terms of impact on the system. Numbers 2 and 3 are 
of the same order of magnitude in size, so cleaning up 2 but not 3 is not 
expected to improve the system's stability by more than a factor of roughly 2x 
in the number of applications it can retain.

Note that the intentions of this particular JIRA are very simple: clean up 2 and 
3 for all executors several days after they have terminated, rather than after 
they have started. If you wish to expand the scope of the Worker or Executor 
cleanup, that should be covered in a separate JIRA (which is welcome -- I just 
want to make sure we're on the same page about this particular issue!).
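For reference, the standalone worker cleanup knobs under discussion (values 
below are illustrative; only the property names are taken from the standalone 
docs of this era):

{code}
# Set via SPARK_WORKER_OPTS on each worker -- illustrative values
-Dspark.worker.cleanup.enabled=true      # cleanup is off by default
-Dspark.worker.cleanup.interval=1800     # sweep every 30 minutes
-Dspark.worker.cleanup.appDataTtl=604800 # keep app dirs for 7 days
{code}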

> Standalone Worker cleanup should not clean up running executors
> ---
>
> Key: SPARK-1860
> URL: https://issues.apache.org/jira/browse/SPARK-1860
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.0.0
>Reporter: Aaron Davidson
>Priority: Blocker
>
> With its default settings, the standalone worker cleanup code cleans up all 
> application data every 7 days. This includes jars that were added to any 
> executors that happen to be running for longer than 7 days, hitting streaming 
> jobs especially hard.
> Executors' log/data folders should not be cleaned up while the executors are 
> still running. Until that is fixed, this behavior should not be enabled by 
> default.






[jira] [Updated] (SPARK-3740) Use a compressed bitmap to track zero sized blocks in HighlyCompressedMapStatus

2014-09-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-3740:
---
Labels: starter  (was: )

> Use a compressed bitmap to track zero sized blocks in 
> HighlyCompressedMapStatus
> ---
>
> Key: SPARK-3740
> URL: https://issues.apache.org/jira/browse/SPARK-3740
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>  Labels: starter
>
> HighlyCompressedMapStatus uses a single long to track the average block size. 
> However, if a stage has a lot of zero sized outputs, this leads to 
> inefficiency because executors would need to send requests to fetch zero 
> sized blocks.
> We can use a compressed bitmap to track the zero-sized blocks.
> See discussion in https://github.com/apache/spark/pull/2470
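A hedged sketch of the idea, using RoaringBitmap as one candidate 
compressed-bitmap implementation (the actual choice was still under discussion 
in the PR; names are illustrative):

{code}
import org.roaringbitmap.RoaringBitmap

// Sketch: mark empty blocks in a compressed bitmap and keep a single average
// size for the non-empty ones, instead of one size per block.
def summarize(sizes: Array[Long]): (RoaringBitmap, Long) = {
  val emptyBlocks = new RoaringBitmap()
  var total = 0L
  var nonEmpty = 0
  for (i <- sizes.indices) {
    if (sizes(i) == 0) emptyBlocks.add(i)
    else { total += sizes(i); nonEmpty += 1 }
  }
  val avgSize = if (nonEmpty > 0) total / nonEmpty else 0L
  (emptyBlocks, avgSize)  // reducers skip blocks with emptyBlocks.contains(id)
}
{code}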






[jira] [Created] (SPARK-3740) Use a compressed bitmap to track zero sized blocks in HighlyCompressedMapStatus

2014-09-29 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-3740:
--

 Summary: Use a compressed bitmap to track zero sized blocks in 
HighlyCompressedMapStatus
 Key: SPARK-3740
 URL: https://issues.apache.org/jira/browse/SPARK-3740
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Reporter: Reynold Xin


HighlyCompressedMapStatus uses a single long to track the average block size. 
However, if a stage has a lot of zero sized outputs, this leads to inefficiency 
because executors would need to send requests to fetch zero sized blocks.

We can use a compressed bitmap to track the zero-sized blocks.

See discussion in https://github.com/apache/spark/pull/2470






[jira] [Resolved] (SPARK-3613) Don't record the size of each shuffle block for large jobs

2014-09-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-3613.

   Resolution: Fixed
Fix Version/s: 1.2.0

> Don't record the size of each shuffle block for large jobs
> --
>
> Key: SPARK-3613
> URL: https://issues.apache.org/jira/browse/SPARK-3613
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.2.0
>
>
> MapStatus saves the size of each block (1 byte per block) for a particular 
> map task. This actually means the shuffle metadata is O(M*R), where M = num 
> maps and R = num reduces.
> If M is greater than a certain size, we should probably just send an average 
> size instead of a whole array.
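A sketch of the dispatch this suggests (type names and the 2000-block cutoff 
are illustrative, not the committed values):

{code}
// Sketch: keep exact per-block sizes for small maps; past a cutoff, store
// only the average so each map status is O(1) instead of O(R).
sealed trait MapStatusSketch
case class ExactSizes(sizes: Array[Long]) extends MapStatusSketch  // O(R)
case class AverageSize(avg: Long) extends MapStatusSketch          // O(1)

def compress(sizes: Array[Long], cutoff: Int = 2000): MapStatusSketch =
  if (sizes.length > cutoff) {
    val nonEmpty = sizes.count(_ > 0)
    AverageSize(if (nonEmpty > 0) sizes.sum / nonEmpty else 0L)
  } else ExactSizes(sizes)
{code}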






[jira] [Commented] (SPARK-3654) Implement all extended HiveQL statements/commands with a separate parser combinator

2014-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152790#comment-14152790
 ] 

Apache Spark commented on SPARK-3654:
-

User 'ravipesala' has created a pull request for this issue:
https://github.com/apache/spark/pull/2590

> Implement all extended HiveQL statements/commands with a separate parser 
> combinator
> ---
>
> Key: SPARK-3654
> URL: https://issues.apache.org/jira/browse/SPARK-3654
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Cheng Lian
>
> Statements and commands like {{SET}}, {{CACHE TABLE}} and {{ADD JAR}} etc. 
> are currently parsed in a quite hacky way, like this:
> {code}
> if (sql.trim.toLowerCase.startsWith("cache table")) {
>   sql.trim.toLowerCase.startsWith("cache table") match {
> ...
>   }
> }
> {code}
> It would be much better to add an extra parser combinator that parses these 
> syntax extensions first, and then falls back to the normal Hive parser.
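A minimal sketch of such a layer using Scala's standard parser combinators 
(names and return types are illustrative; the real version would produce 
logical plans and delegate unmatched statements to HiveQl.parseSql):

{code}
import scala.util.parsing.combinator.RegexParsers

// Sketch: try the Spark-specific extensions first; on failure, hand the
// statement to the regular Hive parser unchanged.
object ExtendedHiveQlParser extends RegexParsers {
  def cacheTable: Parser[String] = "(?i)CACHE\\s+TABLE".r ~> "\\w+".r ^^ { t => s"cache($t)" }
  def addJar: Parser[String]     = "(?i)ADD\\s+JAR".r ~> ".+".r       ^^ { p => s"addJar($p)" }
  def extension: Parser[String]  = cacheTable | addJar

  def parse(sql: String): String = parseAll(extension, sql) match {
    case Success(plan, _) => plan
    case _                => s"delegate to Hive parser: $sql"
  }
}
{code}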






[jira] [Commented] (SPARK-3654) Implement all extended HiveQL statements/commands with a separate parser combinator

2014-09-29 Thread Ravindra Pesala (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152789#comment-14152789
 ] 

Ravindra Pesala commented on SPARK-3654:


https://github.com/apache/spark/pull/2590

> Implement all extended HiveQL statements/commands with a separate parser 
> combinator
> ---
>
> Key: SPARK-3654
> URL: https://issues.apache.org/jira/browse/SPARK-3654
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Cheng Lian
>
> Statements and commands like {{SET}}, {{CACHE TABLE}} and {{ADD JAR}} etc. 
> are currently parsed in a quite hacky way, like this:
> {code}
> if (sql.trim.toLowerCase.startsWith("cache table")) {
>   sql.trim.toLowerCase.startsWith("cache table") match {
> ...
>   }
> }
> {code}
> It would be much better to add an extra parser combinator that parses these 
> syntax extensions first, and then falls back to the normal Hive parser.






[jira] [Updated] (SPARK-3709) Executors don't always report broadcast block removal properly back to the driver

2014-09-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-3709:
---
Target Version/s: 1.1.1, 1.2.0  (was: 1.2.0)

> Executors don't always report broadcast block removal properly back to the 
> driver
> -
>
> Key: SPARK-3709
> URL: https://issues.apache.org/jira/browse/SPARK-3709
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Reynold Xin
>Priority: Blocker
>







[jira] [Updated] (SPARK-3709) Executors don't always report broadcast block removal properly back to the driver

2014-09-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-3709:
---
Summary: Executors don't always report broadcast block removal properly 
back to the driver  (was: BroadcastSuite.Unpersisting 
rg.apache.spark.broadcast.BroadcastSuite.Unpersisting TorrentBroadcast is flaky 
)

> Executors don't always report broadcast block removal properly back to the 
> driver
> -
>
> Key: SPARK-3709
> URL: https://issues.apache.org/jira/browse/SPARK-3709
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Reynold Xin
>Priority: Blocker
>







[jira] [Updated] (SPARK-3568) Add metrics for ranking algorithms

2014-09-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3568:
-
Priority: Major  (was: Minor)
Target Version/s: 1.2.0
Shepherd: Xiangrui Meng

> Add metrics for ranking algorithms
> --
>
> Key: SPARK-3568
> URL: https://issues.apache.org/jira/browse/SPARK-3568
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Shuo Xiang
>Assignee: Shuo Xiang
>
> Include widely-used metrics for ranking algorithms, including:
>  - Mean Average Precision
>  - Precision@n: top-n precision
>  - Discounted cumulative gain (DCG) and NDCG
> This implementation attempts to create a new class called *RankingMetrics* 
> under *org.apache.spark.mllib.evaluation*, which accepts input (prediction 
> and label pairs) as *RDD[(Array[Double], Array[Double])]*. The following 
> methods will be implemented:
>  - *averagePrecision(position: Int): Double*: the precision@position
>  - *meanAveragePrecision*: the average of precision@n for all values of n
>  - *ndcg*
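As a sketch of what precision@k means over the proposed input type (the 
function name and implementation are illustrative, not the proposed code):

{code}
// precision@k for one (predicted ranking, ground-truth labels) pair.
def precisionAt(pred: Array[Double], labels: Array[Double], k: Int): Double = {
  val relevant = labels.toSet
  pred.take(k).count(relevant.contains).toDouble / k
}

// Aggregation over the proposed input type would then look roughly like:
// rdd: RDD[(Array[Double], Array[Double])]
// rdd.map { case (pred, lab) => precisionAt(pred, lab, 5) }.mean()
{code}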






[jira] [Commented] (SPARK-3709) BroadcastSuite.Unpersisting rg.apache.spark.broadcast.BroadcastSuite.Unpersisting TorrentBroadcast is flaky

2014-09-29 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152774#comment-14152774
 ] 

Reynold Xin commented on SPARK-3709:


Hanging driver stack trace
{code}
"pool-1-thread-1-ScalaTest-running-BroadcastSuite" prio=10 
tid=0x7f2114812000 nid=0xc8c in Object.wait() [0x7f20bb8fd000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:503)
at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73)
- locked <0x0007a2ff4bb8> (a org.apache.spark.scheduler.JobWaiter)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:512)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1087)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1104)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1118)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1132)
at org.apache.spark.rdd.RDD.collect(RDD.scala:775)
at 
org.apache.spark.broadcast.BroadcastSuite.testUnpersistBroadcast(BroadcastSuite.scala:291)
at 
org.apache.spark.broadcast.BroadcastSuite.org$apache$spark$broadcast$BroadcastSuite$$testUnpersistTorrentBroadcast(BroadcastSuite.scala:232)
at 
org.apache.spark.broadcast.BroadcastSuite$$anonfun$13.apply$mcV$sp(BroadcastSuite.scala:112)
at 
org.apache.spark.broadcast.BroadcastSuite$$anonfun$13.apply(BroadcastSuite.scala:112)
at 
org.apache.spark.broadcast.BroadcastSuite$$anonfun$13.apply(BroadcastSuite.scala:112)
at 
org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
at 
org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158)
at org.scalatest.Suite$class.withFixture(Suite.scala:1121)
at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559)
at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167)
at 
org.apache.spark.broadcast.BroadcastSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(BroadcastSuite.scala:26)
at 
org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
at 
org.apache.spark.broadcast.BroadcastSuite.runTest(BroadcastSuite.scala:26)
   ...
{code}

Executor log
{code}
14/09/29 20:35:57.254 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
14/09/29 20:35:57.502 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/09/29 20:35:57.716 INFO SecurityManager: Changing view acls to: root
14/09/29 20:35:57.717 INFO SecurityManager: Changing modify acls to: root
14/09/29 20:35:57.717 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
14/09/29 20:35:58.096 INFO Slf4jLogger: Slf4jLogger started
14/09/29 20:35:58.136 INFO Remoting: Starting remoting
14/09/29 20:35:58.279 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher@localhost:42339]
14/09/29 20:35:58.280 INFO Remoting: Remoting now listens on addresses: [akka.tcp://driverPropsFetcher@localhost:42339]
14/09/29 20:35:58.287 INFO Utils: Successfully started service 'driverPropsFetcher' on port 42339.
14/09/29 20:35:58.461 INFO SecurityManager: Changing view acls to: root
14/09/29 20:35:58.461 INFO SecurityManager: Changing modify acls to: root
14/09/29 20:35:58.462 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
14/09/29 20:35:58.466 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
14/09/29 20:35:58.467 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
14/09/29 20:35:58.493 INFO Slf4jLogger: Slf4jLogger started
14/09/29 20:35:58.499 INFO Remoting: Starting remoting
14/09/29 20:35:58.502 INFO Remoting: Remoting shut down
14/09/29 20:35:58.503 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
14/0
{code}

[jira] [Commented] (SPARK-3739) Too many splits for small source file in table scanning

2014-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152762#comment-14152762
 ] 

Apache Spark commented on SPARK-3739:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/2589

> Too many splits for small source file in table scanning
> ---
>
> Key: SPARK-3739
> URL: https://issues.apache.org/jira/browse/SPARK-3739
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Minor
>
> For table scanning, source file input splits should probably be based on HDFS 
> blocks rather than on the 'mapred.reducer.tasks' or 
> 'taskScheduler.defaultParallelism' settings; see 
> [http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-core/2.3.0-mr1-cdh5.1.0/org/apache/hadoop/mapred/FileInputFormat.java#203].
> Currently this seems to produce too many splits for small source files.
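For context, the linked FileInputFormat derives its split size roughly as below 
(a sketch of Hadoop's old-API computeSplitSize, not Spark code); since goalSize 
is the total input size divided by the requested number of splits, a small file 
can still be shattered into many tiny splits:

{code}
// Sketch of Hadoop's mapred FileInputFormat split sizing.
def computeSplitSize(goalSize: Long, minSize: Long, blockSize: Long): Long =
  math.max(minSize, math.min(goalSize, blockSize))
{code}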






[jira] [Commented] (SPARK-3738) InsertIntoHiveTable can't handle strings with "\n"

2014-09-29 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152765#comment-14152765
 ] 

Cheng Lian commented on SPARK-3738:
---

False alarm... it's because Hive's default SerDe uses '\n' as the record 
delimiter.

> InsertIntoHiveTable can't handle strings with "\n"
> --
>
> Key: SPARK-3738
> URL: https://issues.apache.org/jira/browse/SPARK-3738
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Cheng Lian
>Priority: Blocker
>
> Try the following snippet in {{sbt/sbt hive/console}} to reproduce:
> {code}
> sql("drop table if exists z")
> case class Str(s: String)
> sparkContext.parallelize(Str("a\nb") :: Nil, 1).saveAsTable("z")
> table("z").count()
> {code}
> The expected result is 1, but 2 is returned instead.






[jira] [Comment Edited] (SPARK-1860) Standalone Worker cleanup should not clean up running executors

2014-09-29 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152758#comment-14152758
 ] 

Matt Cheah edited comment on SPARK-1860 at 9/30/14 3:21 AM:


I agree we should focus the scope on cleaning up things that have successfully 
finished. Preserving state is beneficial in erroneous cases.

However, should it not be the case that when an Executor shuts down, it cleans 
up all of the files it created? As you stated, the Worker doesn't know where a 
particular Executor is storing its data, but the Executor should know where it 
is storing its own data, and be managing it and cleaning up when completed. 
This is regardless of the distinction between application data and shuffle data.

The Executor class has a record of the files and jars added through the 
SparkContext (currentFiles and currentJars fields) for that Executor's use, and 
these should naturally expire and be cleaned up when the Executor terminates. 
The top level application directories may still remain for a short time 
(Executors can only delete the subdirectory they work with) but the Worker can 
do a pass and remove empty directories that were all cleaned up by the 
completed Executor task.


was (Author: mcheah):
I agree we should focus the scope on cleaning up things that have successfully 
finished. Preserving state is beneficial in erroneous cases.

However, should it not be the case that when an Executor shuts down, it cleans 
up all of the files it created? As you stated, the Worker doesn't know where a 
particular Executor is storing its data, but the Executor should know where it 
is storing its own data, and be managing it and cleaning up when completed. 
This is regardless of the distinction between application data and shuffle data.

The Executor class has a record of the files and jars added through the 
SparkContext (currentFiles and currentJars fields) for that Executor's use, and 
these should naturally expire and be cleaned up when the Executor terminates.

> Standalone Worker cleanup should not clean up running executors
> ---
>
> Key: SPARK-1860
> URL: https://issues.apache.org/jira/browse/SPARK-1860
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.0.0
>Reporter: Aaron Davidson
>Priority: Blocker
>
> With its default settings, the standalone worker cleanup code cleans up all 
> application data every 7 days. This includes jars that were added to any 
> executors that happen to be running for longer than 7 days, hitting streaming 
> jobs especially hard.
> Executors' log/data folders should not be cleaned up while the executors are 
> still running. Until that is fixed, this behavior should not be enabled by 
> default.






[jira] [Created] (SPARK-3739) Too many splits for small source file in table scanning

2014-09-29 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-3739:


 Summary: Too many splits for small source file in table scanning
 Key: SPARK-3739
 URL: https://issues.apache.org/jira/browse/SPARK-3739
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao
Priority: Minor


For table scanning, source file input splits should probably be based on HDFS 
blocks rather than on the 'mapred.reducer.tasks' or 
'taskScheduler.defaultParallelism' settings; see 
[http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-core/2.3.0-mr1-cdh5.1.0/org/apache/hadoop/mapred/FileInputFormat.java#203].

Currently this seems to produce too many splits for small source files.






[jira] [Comment Edited] (SPARK-1860) Standalone Worker cleanup should not clean up running executors

2014-09-29 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152758#comment-14152758
 ] 

Matt Cheah edited comment on SPARK-1860 at 9/30/14 3:14 AM:


I agree we should focus the scope on cleaning up things that have successfully 
finished. Preserving state is beneficial in erroneous cases.

However, should it not be the case that when an Executor shuts down, it cleans 
up all of the files it created? As you stated, the Worker doesn't know where a 
particular Executor is storing its data, but the Executor should know where it 
is storing its own data, and be managing it and cleaning up when completed. 
This is regardless of the distinction between application data and shuffle data.

The Executor class has a record of the files and jars added through the 
SparkContext (currentFiles and currentJars fields) for that Executor's use, and 
these should naturally expire and be cleaned up when the Executor terminates.


was (Author: mcheah):
I agree we should focus the scope on cleaning up things that have successfully 
finished.

However, should it not be the case that when an Executor shuts down, it cleans 
up all of the files it created? As you stated, the Worker doesn't know where a 
particular Executor is storing its data, but the Executor should know where it 
is storing its own data, and be managing it and cleaning up when completed. 
This is regardless of the distinction between application data and shuffle data.

The Executor class has a record of the files and jars added through the 
SparkContext (currentFiles and currentJars fields) for that Executor's use, and 
these should naturally expire and be cleaned up when the Executor terminates.

> Standalone Worker cleanup should not clean up running executors
> ---
>
> Key: SPARK-1860
> URL: https://issues.apache.org/jira/browse/SPARK-1860
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.0.0
>Reporter: Aaron Davidson
>Priority: Blocker
>
> With its default settings, the standalone worker cleanup code cleans up all 
> application data every 7 days. This includes jars that were added to any 
> executors that happen to be running for longer than 7 days, hitting streaming 
> jobs especially hard.
> Executors' log/data folders should not be cleaned up while the executors are 
> still running. Until that is fixed, this behavior should not be enabled by 
> default.






[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running executors

2014-09-29 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152758#comment-14152758
 ] 

Matt Cheah commented on SPARK-1860:
---

I agree we should focus the scope on cleaning up things that have successfully 
finished.

However, should it not be the case that when an Executor shuts down, it cleans 
up all of the files it created? As you stated, the Worker doesn't know where a 
particular Executor is storing its data, but the Executor should know where it 
is storing its own data, and be managing it and cleaning up when completed. 
This is regardless of the distinction between application data and shuffle data.

The Executor class has a record of the files and jars added through the 
SparkContext (currentFiles and currentJars fields) for that Executor's use, and 
these should naturally expire and be cleaned up when the Executor terminates.

> Standalone Worker cleanup should not clean up running executors
> ---
>
> Key: SPARK-1860
> URL: https://issues.apache.org/jira/browse/SPARK-1860
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.0.0
>Reporter: Aaron Davidson
>Priority: Blocker
>
> With its default settings, the standalone worker cleanup code cleans up all 
> application data every 7 days. This includes jars that were added to any 
> executors that happen to be running for longer than 7 days, hitting streaming 
> jobs especially hard.
> Executors' log/data folders should not be cleaned up while the executors are 
> still running. Until that is fixed, this behavior should not be enabled by 
> default.






[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running executors

2014-09-29 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152744#comment-14152744
 ] 

Aaron Davidson commented on SPARK-1860:
---

Note that there are two separate forms of cleanup: application data cleanup 
(jars and logs) and shuffle data cleanup. Standalone Worker cleanup deals with 
the former, Executor termination handlers deal with the latter. The purpose is 
not to deal with executors that have terminated ungracefully, but to actually 
clean up old application directories.

Here the idea is that a Worker may be running for a very long time (weeks, 
months) and over time accumulates hundreds of application directories. We want 
to delete these directories several days after their applications terminate 
(today we clean them up whether or not they're terminated, which loses their 
jars and logs); after that point we presumably don't care about them anymore. 
We do not want to clean them up immediately after application termination.

The Worker performing shuffle data cleanup for ungracefully terminated 
Executors is not a bad idea, but it is a (smallish) feature unto itself, as the 
Worker does not currently know where a particular Executor is storing its data.

> Standalone Worker cleanup should not clean up running executors
> ---
>
> Key: SPARK-1860
> URL: https://issues.apache.org/jira/browse/SPARK-1860
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.0.0
>Reporter: Aaron Davidson
>Priority: Blocker
>
> With its default settings, the standalone worker cleanup code cleans up all 
> application data every 7 days. This includes jars that were added to any 
> executors that happen to be running for longer than 7 days, hitting streaming 
> jobs especially hard.
> Executors' log/data folders should not be cleaned up while the executors are 
> still running. Until that is fixed, this behavior should not be enabled by 
> default.






[jira] [Commented] (SPARK-3504) KMeans optimization: track distances and unmoved cluster centers across iterations

2014-09-29 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152730#comment-14152730
 ] 

Patrick Wendell commented on SPARK-3504:


I just updated the title to make it more descriptive. Not an expert on this 
part of the code!

> KMeans optimization: track distances and unmoved cluster centers across 
> iterations
> --
>
> Key: SPARK-3504
> URL: https://issues.apache.org/jira/browse/SPARK-3504
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.2
>Reporter: Derrick Burns
>
> The 1.0.2 implementation of the KMeans clusterer is VERY inefficient because 
> it recomputes all distances to all cluster centers on each iteration.  In later 
> iterations of Lloyd's algorithm, points don't change clusters and clusters 
> don't move.  
> By 1) tracking which clusters move and 2) tracking for each point which 
> cluster it belongs to and the distance to that cluster, one can avoid 
> recomputing distances in many cases with very little increase in memory 
> requirements. 
> I implemented this new algorithm and the results were fantastic. Using 16 
> c3.8xlarge machines on EC2,  the clusterer converged in 13 iterations on 
> 1,714,654 (182-dimensional) points and 20,000 clusters in 24 minutes. Here 
> are the running times for the first 7 rounds:
> 6 minutes and 42 seconds
> 7 minutes and 7 seconds
> 7 minutes 13 seconds
> 1 minute 18 seconds
> 30 seconds
> 18 seconds
> 12 seconds
> Without this improvement, all rounds would have taken roughly 7 minutes, 
> resulting in Lloyd's iterations taking  7 * 13 = 91 minutes. In other words, 
> this improvement resulted in a reduction of roughly 75% in running time with 
> no loss of accuracy.
> My implementation is a rewrite of the existing 1.0.2 implementation, not a 
> simple modification of it.  Please let me know if you are interested in this 
> new implementation.  
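A hedged sketch of the caching idea described above (not the author's 
implementation): if a point's assigned center did not move, distances to all 
unmoved centers are unchanged, so the new best assignment can only be the 
cached one or a center that moved.

{code}
// Sketch only -- the distance function and data layout are illustrative.
case class Assign(cluster: Int, dist: Double)

def reassign(p: Array[Double], centers: Array[Array[Double]],
             moved: Set[Int], prev: Assign,
             d: (Array[Double], Array[Double]) => Double): Assign = {
  if (!moved(prev.cluster)) {
    // Cached distance is still exact; only moved centers can beat it.
    val candidates = moved.iterator.map(i => Assign(i, d(p, centers(i))))
    (Iterator(prev) ++ candidates).minBy(_.dist)
  } else {
    // The assigned center moved: fall back to a full scan.
    centers.indices.map(i => Assign(i, d(p, centers(i)))).minBy(_.dist)
  }
}
{code}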






[jira] [Updated] (SPARK-3504) KMeans optimization: track distances and unmoved cluster centers across iterations

2014-09-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3504:
---
Summary: KMeans optimization: track distances and unmoved cluster centers 
across iterations  (was: KMeans clusterer is slow, can be sped up by 75%)

> KMeans optimization: track distances and unmoved cluster centers across 
> iterations
> --
>
> Key: SPARK-3504
> URL: https://issues.apache.org/jira/browse/SPARK-3504
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.2
>Reporter: Derrick Burns
>
> The 1.0.2 implementation of the KMeans clusterer is VERY inefficient because 
> it recomputes all distances to all cluster centers on each iteration.  In later 
> iterations of Lloyd's algorithm, points don't change clusters and clusters 
> don't move.  
> By 1) tracking which clusters move and 2) tracking for each point which 
> cluster it belongs to and the distance to that cluster, one can avoid 
> recomputing distances in many cases with very little increase in memory 
> requirements. 
> I implemented this new algorithm and the results were fantastic. Using 16 
> c3.8xlarge machines on EC2,  the clusterer converged in 13 iterations on 
> 1,714,654 (182-dimensional) points and 20,000 clusters in 24 minutes. Here 
> are the running times for the first 7 rounds:
> 6 minutes and 42 seconds
> 7 minutes and 7 seconds
> 7 minutes 13 seconds
> 1 minute 18 seconds
> 30 seconds
> 18 seconds
> 12 seconds
> Without this improvement, all rounds would have taken roughly 7 minutes, 
> resulting in Lloyd's iterations taking  7 * 13 = 91 minutes. In other words, 
> this improvement resulted in a reduction of roughly 75% in running time with 
> no loss of accuracy.
> My implementation is a rewrite of the existing 1.0.2 implementation, not a 
> simple modification of it.  Please let me know if you are interested in this 
> new implementation.  






[jira] [Updated] (SPARK-2548) JavaRecoverableWordCount is missing

2014-09-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2548:
---
Labels: starter  (was: )

> JavaRecoverableWordCount is missing
> ---
>
> Key: SPARK-2548
> URL: https://issues.apache.org/jira/browse/SPARK-2548
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Streaming
>Affects Versions: 0.9.2, 1.0.1
>Reporter: Xiangrui Meng
>Priority: Minor
>  Labels: starter
>
> JavaRecoverableWordCount was mentioned in the doc but not in the codebase. We 
> need to rewrite the example because the code was lost during the migration 
> from spark/spark-incubating to apache/spark.






[jira] [Updated] (SPARK-3736) Workers should reconnect to Master if disconnected

2014-09-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3736:
---
Priority: Critical  (was: Major)
Target Version/s: 1.2.0

> Workers should reconnect to Master if disconnected
> --
>
> Key: SPARK-3736
> URL: https://issues.apache.org/jira/browse/SPARK-3736
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Andrew Ash
>Priority: Critical
>
> In standalone mode, when a worker gets disconnected from the master for some 
> reason it never attempts to reconnect.  In this situation you have to bounce 
> the worker before it will reconnect to the master.
> The preferred alternative is to follow what Hadoop does -- when there's a 
> disconnect, attempt to reconnect at a particular interval until successful (I 
> think it repeats indefinitely every 10sec).
> This has been observed by:
> - [~pkolaczk] in 
> http://apache-spark-user-list.1001560.n3.nabble.com/Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td6240.html
> - [~romi-totango] in 
> http://apache-spark-user-list.1001560.n3.nabble.com/Re-Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td15335.html
> - [~aash]
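A schematic sketch of the suggested behavior, assuming a plain fixed-interval 
retry (the actual Worker would use its Akka scheduler; names are illustrative):

{code}
import java.util.concurrent.{Executors, TimeUnit}

// On disconnect, retry registration every 10 seconds until it succeeds.
def onDisconnected(tryRegister: () => Boolean): Unit = {
  val scheduler = Executors.newSingleThreadScheduledExecutor()
  val task = new Runnable {
    def run(): Unit = if (tryRegister()) scheduler.shutdown() // stop once reconnected
  }
  scheduler.scheduleWithFixedDelay(task, 0, 10, TimeUnit.SECONDS)
}
{code}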






[jira] [Created] (SPARK-3738) InsertIntoHiveTable can't handle strings with "\n"

2014-09-29 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-3738:
-

 Summary: InsertIntoHiveTable can't handle strings with "\n"
 Key: SPARK-3738
 URL: https://issues.apache.org/jira/browse/SPARK-3738
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Cheng Lian
Priority: Blocker


Try the following snippet in {{sbt/sbt hive/console}} to reproduce:
{code}
sql("drop table if exists z")
case class Str(s: String)
sparkContext.parallelize(Str("a\nb") :: Nil, 1).saveAsTable("z")
table("z").count()
{code}
The expected result is 1, but 2 is returned instead.






[jira] [Closed] (SPARK-3737) [Docs] Broken Link - Minor typo

2014-09-29 Thread Vincent Ohprecio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent Ohprecio closed SPARK-3737.
---
Resolution: Duplicate

Closed the PR; this issue can be closed as a duplicate. PR #2558 ('fix the 
"Building Spark" url'), opened two days ago, is the original.

> [Docs] Broken Link - Minor typo   
> 
>
> Key: SPARK-3737
> URL: https://issues.apache.org/jira/browse/SPARK-3737
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.1.0
>Reporter: Vincent Ohprecio
>Priority: Trivial
> Fix For: 1.1.1
>
>







[jira] [Commented] (SPARK-3737) [Docs] Broken Link - Minor typo

2014-09-29 Thread Vincent Ohprecio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152709#comment-14152709
 ] 

Vincent Ohprecio commented on SPARK-3737:
-

I have closed the PR. Similar to:

fix the "Building Spark" url #2558

> [Docs] Broken Link - Minor typo   
> 
>
> Key: SPARK-3737
> URL: https://issues.apache.org/jira/browse/SPARK-3737
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.1.0
>Reporter: Vincent Ohprecio
>Priority: Trivial
> Fix For: 1.1.1
>
>







[jira] [Commented] (SPARK-3737) [Docs] Broken Link - Minor typo

2014-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152703#comment-14152703
 ] 

Apache Spark commented on SPARK-3737:
-

User 'bigsnarfdude' has created a pull request for this issue:
https://github.com/apache/spark/pull/2587

> [Docs] Broken Link - Minor typo   
> 
>
> Key: SPARK-3737
> URL: https://issues.apache.org/jira/browse/SPARK-3737
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.1.0
>Reporter: Vincent Ohprecio
>Priority: Trivial
> Fix For: 1.1.1
>
>







[jira] [Created] (SPARK-3737) [Docs] Broken Link - Minor typo

2014-09-29 Thread Vincent Ohprecio (JIRA)
Vincent Ohprecio created SPARK-3737:
---

 Summary: [Docs] Broken Link - Minor typo   
 Key: SPARK-3737
 URL: https://issues.apache.org/jira/browse/SPARK-3737
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.1.0
Reporter: Vincent Ohprecio
Priority: Trivial
 Fix For: 1.1.1









[jira] [Updated] (SPARK-3729) Null-pointer when constructing a HiveContext when settings are present

2014-09-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3729:

Target Version/s: 1.2.0

> Null-pointer when constructing a HiveContext when settings are present
> --
>
> Key: SPARK-3729
> URL: https://issues.apache.org/jira/browse/SPARK-3729
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
>
> {code}
> java.lang.NullPointerException
>   at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:307)
>   at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:270)
>   at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:242)
>   at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
>   at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:78)
>   at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:78)
>   at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:76)
> ...
> {code}






[jira] [Resolved] (SPARK-3635) Find Strongly Connected Components with Graphx has a small bug

2014-09-29 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave resolved SPARK-3635.
---
   Resolution: Fixed
Fix Version/s: 1.1.1
   1.2.0

Issue resolved by pull request 2486
[https://github.com/apache/spark/pull/2486]

> Find Strongly Connected Components with Graphx has a small bug
> --
>
> Key: SPARK-3635
> URL: https://issues.apache.org/jira/browse/SPARK-3635
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.1.0
> Environment: VMWare, Centos 6.5
>Reporter: Oded Zimerman
>Priority: Trivial
> Fix For: 1.2.0, 1.1.1
>
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> The strongly connected components function (spark / graphx / src / main / 
> scala / org / apache / spark / graphx / lib / 
> StronglyConnectedComponents.scala) has a typo in the condition on line 78.
> I think the condition should be "if (e.srcAttr._1 < e.dstAttr._1)" instead of 
> "if (e.srcId < e.dstId)"






[jira] [Commented] (SPARK-3434) Distributed block matrix

2014-09-29 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152636#comment-14152636
 ] 

Xiangrui Meng commented on SPARK-3434:
--

[~shivaram] Could you post the design of the partitioning strategy for block 
matrices? I think we should have a 2D partitioner, which consists of a row 
partitioner and a column partitioner. A matrix with partitioner (p1, p2) can 
multiply a matrix with partitioner (p2, p3), resulting in a matrix with 
partitioner (p1, p3).
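One way to make the (p1, p2) x (p2, p3) compatibility concrete is a 2D grid 
partitioner keyed by (blockRow, blockCol); a hedged sketch, with illustrative 
names:

{code}
import org.apache.spark.Partitioner

// Sketch: "rows"/"cols" count block partitions, not matrix dimensions.
class GridPartitioner(val rows: Int, val cols: Int) extends Partitioner {
  def numPartitions: Int = rows * cols
  def getPartition(key: Any): Int = key match {
    case (i: Int, j: Int) => (i % rows) * cols + (j % cols)
    case _ => throw new IllegalArgumentException(s"unexpected key: $key")
  }
}
// A (p1, p2)-partitioned matrix times a (p2, p3)-partitioned one yields a
// (p1, p3)-partitioned result when the shared dimension's partitioning matches.
{code}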

> Distributed block matrix
> 
>
> Key: SPARK-3434
> URL: https://issues.apache.org/jira/browse/SPARK-3434
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>
> This JIRA is for discussing distributed matrices stored in block 
> sub-matrices. The main challenge is the partitioning scheme to allow adding 
> linear algebra operations in the future, e.g.:
> 1. matrix multiplication
> 2. matrix factorization (QR, LU, ...)
> Let's discuss the partitioning and storage and how they fit into the above 
> use cases.
> Questions:
> 1. Should it be backed by a single RDD that contains all of the sub-matrices 
> or many RDDs with each contains only one sub-matrix?






[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running executors

2014-09-29 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152621#comment-14152621
 ] 

Matt Cheah commented on SPARK-1860:
---

ExecutorRunner seems to have various cases corresponding to how the Executor 
exited. ExecutorRunner also creates the directory in fetchAndRunExecutor(). We 
can catch all of the exit cases there and delete the directory in any case.

In the case that the executor exited with a failure, however, it would be best 
to preserve the logs instead of blindly deleting the whole directory.

On that note, one other thought is that perhaps we actually want to preserve 
the directory entirely upon crash since preserving the state will allow us to 
better understand what happened, i.e. what jars and files were present and so 
on.

> Standalone Worker cleanup should not clean up running executors
> ---
>
> Key: SPARK-1860
> URL: https://issues.apache.org/jira/browse/SPARK-1860
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.0.0
>Reporter: Aaron Davidson
>Priority: Blocker
>
> With its default settings, the standalone worker cleanup code cleans up all 
> application data every 7 days. This includes jars that were added to any 
> executors that happen to be running for longer than 7 days, hitting streaming 
> jobs especially hard.
> Executors' log/data folders should not be cleaned up while the executors are 
> still running. Until that is fixed, this behavior should not be enabled by 
> default.






[jira] [Commented] (SPARK-3736) Workers should reconnect to Master if disconnected

2014-09-29 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152594#comment-14152594
 ] 

Andrew Ash commented on SPARK-3736:
---

I can't tell for sure but this is possibly related to SPARK-704 or SPARK-1771

> Workers should reconnect to Master if disconnected
> --
>
> Key: SPARK-3736
> URL: https://issues.apache.org/jira/browse/SPARK-3736
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Andrew Ash
>
> In standalone mode, when a worker gets disconnected from the master for some 
> reason it never attempts to reconnect.  In this situation you have to bounce 
> the worker before it will reconnect to the master.
> The preferred alternative is to follow what Hadoop does -- when there's a 
> disconnect, attempt to reconnect at a particular interval until successful (I 
> think it repeats indefinitely every 10sec).
> This has been observed by:
> - [~pkolaczk] in 
> http://apache-spark-user-list.1001560.n3.nabble.com/Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td6240.html
> - [~romi-totango] in 
> http://apache-spark-user-list.1001560.n3.nabble.com/Re-Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td15335.html
> - [~aash]






[jira] [Updated] (SPARK-3736) Workers should reconnect to Master if disconnected

2014-09-29 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-3736:
--
Affects Version/s: 1.1.0

> Workers should reconnect to Master if disconnected
> --
>
> Key: SPARK-3736
> URL: https://issues.apache.org/jira/browse/SPARK-3736
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Andrew Ash
>
> In standalone mode, when a worker gets disconnected from the master for some 
> reason it never attempts to reconnect.  In this situation you have to bounce 
> the worker before it will reconnect to the master.
> The preferred alternative is to follow what Hadoop does -- when there's a 
> disconnect, attempt to reconnect at a particular interval until successful (I 
> think it repeats indefinitely every 10sec).
> This has been observed by:
> - [~pkolaczk] in 
> http://apache-spark-user-list.1001560.n3.nabble.com/Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td6240.html
> - [~romi-totango] in 
> http://apache-spark-user-list.1001560.n3.nabble.com/Re-Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td15335.html
> - [~aash]






[jira] [Created] (SPARK-3736) Workers should reconnect to Master if disconnected

2014-09-29 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-3736:
-

 Summary: Workers should reconnect to Master if disconnected
 Key: SPARK-3736
 URL: https://issues.apache.org/jira/browse/SPARK-3736
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Andrew Ash


In standalone mode, when a worker gets disconnected from the master for some 
reason, it never attempts to reconnect.  In this situation you have to bounce 
the worker before it will reconnect to the master.

The preferred alternative is to follow what Hadoop does -- when there's a 
disconnect, attempt to reconnect at a particular interval until successful (I 
think it repeats indefinitely every 10 seconds).
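
A minimal sketch of that retry behavior, assuming a hypothetical 
{{tryRegister}} callback (this is not the actual Worker API):

{code}
// Illustrative sketch only; `tryRegister` is a hypothetical callback, not
// Spark's actual Worker API. Retries every 10 seconds until re-registered.
import java.util.concurrent.{Executors, ScheduledExecutorService, TimeUnit}

object ReconnectSketch {
  private val scheduler: ScheduledExecutorService =
    Executors.newSingleThreadScheduledExecutor()

  def onDisconnected(tryRegister: () => Boolean): Unit = {
    val attempt = new Runnable {
      def run(): Unit = if (tryRegister()) scheduler.shutdown()
    }
    // First attempt immediately, then every 10 seconds until successful.
    scheduler.scheduleWithFixedDelay(attempt, 0L, 10L, TimeUnit.SECONDS)
  }
}
{code}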

This has been observed by:

- [~pkolaczk] in 
http://apache-spark-user-list.1001560.n3.nabble.com/Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td6240.html
- [~romi-totango] in 
http://apache-spark-user-list.1001560.n3.nabble.com/Re-Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td15335.html
- [~aash]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running executors

2014-09-29 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152552#comment-14152552
 ] 

Andrew Ash commented on SPARK-1860:
---

Cleanup on executor shutdown is part of the solution (and should be done IMO) 
but not all of it.

In particular, it won't cover cases where an executor dies from an OOM, a kill 
-9, or any other unclean shutdown.  The complete solution would do the 
event-based cleanup on executor shutdown, and also run a periodic cleaner to 
remove directories left behind by unclean shutdowns.
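
For illustration, a sketch of what such a periodic cleaner might look like 
({{runningAppIds}} and {{ttlMs}} are assumed inputs, not the Worker's actual 
internals):

{code}
// Sketch only: periodically remove stale application directories, but never
// those belonging to executors the Worker still knows to be running.
import java.io.File

object WorkDirSweeper {
  private def deleteRecursively(f: File): Unit = {
    if (f.isDirectory) {
      Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
    }
    f.delete()
  }

  def sweep(workDir: File, runningAppIds: Set[String], ttlMs: Long): Unit = {
    val now = System.currentTimeMillis()
    for {
      dir <- Option(workDir.listFiles()).getOrElse(Array.empty[File])
      if dir.isDirectory
      if !runningAppIds.contains(dir.getName)  // skip apps with live executors
      if now - dir.lastModified() > ttlMs      // only clean old leftovers
    } deleteRecursively(dir)
  }
}
{code}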

> Standalone Worker cleanup should not clean up running executors
> ---
>
> Key: SPARK-1860
> URL: https://issues.apache.org/jira/browse/SPARK-1860
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.0.0
>Reporter: Aaron Davidson
>Priority: Blocker
>
> The default values of the standalone worker cleanup code cleanup all 
> application data every 7 days. This includes jars that were added to any 
> executors that happen to be running for longer than 7 days, hitting streaming 
> jobs especially hard.
> Executor's log/data folders should not be cleaned up if they're still 
> running. Until then, this behavior should not be enabled by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3685) Spark's local dir should accept only local paths

2014-09-29 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152526#comment-14152526
 ] 

Matthew Farrellee commented on SPARK-3685:
--

If you're going to go down this path, the best (I'd say correct) way to 
implement it is to have support from YARN: a way to tell YARN "I'm only going 
to need X, Y, Z resources from now on" without giving up the execution 
container. I bet there's a way to re-exec the JVM into a smaller form factor.

> Spark's local dir should accept only local paths
> 
>
> Key: SPARK-3685
> URL: https://issues.apache.org/jira/browse/SPARK-3685
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>
> When you try to set local dirs to "hdfs:/tmp/foo" it doesn't work. What it 
> will try to do is create a folder called "hdfs:" and put "tmp" inside it. 
> This is because in Util#getOrCreateLocalRootDirs we use java.io.File instead 
> of Hadoop's file system to parse this path. We also need to resolve the path 
> appropriately.
> This may not have an urgent use case, but it fails silently and does what is 
> least expected.
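
One possible shape of the fix, as a sketch (the helper name is hypothetical; 
the actual Utils code differs): parse the configured dir with Hadoop's Path so 
a non-local scheme is rejected instead of silently becoming a folder name.

{code}
// Hypothetical helper, not the actual Utils implementation: detect a scheme
// like "hdfs:" instead of treating it as a local directory named "hdfs:".
import org.apache.hadoop.fs.Path

object LocalDirCheck {
  def requireLocalDir(dir: String): String = {
    val scheme = new Path(dir).toUri.getScheme
    require(scheme == null || scheme == "file",
      s"spark.local.dir must be a local path, got: $dir")
    dir
  }
}
{code}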



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3694) Allow printing object graph of tasks/RDD's with a debug flag

2014-09-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3694:
---
Labels: starter  (was: )

> Allow printing object graph of tasks/RDD's with a debug flag
> 
>
> Key: SPARK-3694
> URL: https://issues.apache.org/jira/browse/SPARK-3694
> Project: Spark
>  Issue Type: Bug
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>  Labels: starter
>
> This would be useful for debugging extra references inside of RDDs.
> Here is an example for inspiration:
> http://ehcache.org/xref/net/sf/ehcache/pool/sizeof/ObjectGraphWalker.html
> We'd want to print this trace for both the RDD serialization inside of the 
> DAGScheduler and the task serialization in the TaskSetManager.
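
As a rough illustration of the idea (not the ehcache walker linked above, and 
not Spark code), a reflective walk that prints a reference trace might look 
like:

{code}
// Rough illustration only: follow declared (non-static, non-primitive)
// fields reflectively and print a reference trace, roughly what a debug
// flag on RDD/task serialization could emit. An identity set guards
// against cycles; superclass fields are ignored for brevity.
import java.lang.reflect.Modifier
import java.util.{Collections, IdentityHashMap}

object GraphTrace {
  def walk(root: AnyRef): Unit = {
    val seen = Collections.newSetFromMap(new IdentityHashMap[AnyRef, java.lang.Boolean]())
    def go(obj: AnyRef, path: String): Unit = {
      if (obj == null || !seen.add(obj)) return
      println(s"$path -> ${obj.getClass.getName}")
      for (f <- obj.getClass.getDeclaredFields
           if !f.getType.isPrimitive && !Modifier.isStatic(f.getModifiers)) {
        f.setAccessible(true)
        go(f.get(obj), s"$path.${f.getName}")
      }
    }
    go(root, "root")
  }
}
{code}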



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3734) DriverRunner should not read SPARK_HOME from submitter's environment

2014-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152505#comment-14152505
 ] 

Apache Spark commented on SPARK-3734:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/2586

> DriverRunner should not read SPARK_HOME from submitter's environment
> 
>
> Key: SPARK-3734
> URL: https://issues.apache.org/jira/browse/SPARK-3734
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> If you use {{spark-submit}} in {{cluster}} mode to submit a job to a Spark 
> Standalone cluster, and the {{JAVA_HOME}} environment variable is set on the 
> submitting machine, then DriverRunner will attempt to use the _submitter's_ 
> JAVA_HOME to launch the driver process (instead of the worker's JAVA_HOME), 
> which can cause the job to fail unless the submitter and worker have Java 
> installed in the same location.
> This has a pretty simple fix: read JAVA_HOME from {{sys.env}} instead of 
> {{command.environment}}; PR pending shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3734) DriverRunner should not read SPARK_HOME from submitter's environment

2014-09-29 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3734:
--
Description: 
If you use {{spark-submit}} in {{cluster}} mode to submit a job to a Spark 
Standalone cluster, and the {{JAVA_HOME}} environment variable is set on the 
submitting machine, then DriverRunner will attempt to use the _submitter's_ 
JAVA_HOME to launch the driver process (instead of the worker's JAVA_HOME), 
which can cause the job to fail unless the submitter and worker have Java 
installed in the same location.

This has a pretty simple fix: read JAVA_HOME from {{sys.env}} instead of 
{{command.environment}}; PR pending shortly.


  was:
If you use {{spark-submit}} in {{cluster}} mode to submit a job to a Spark 
Standalone cluster, and the {{JAVA_HOME}} environment variable is set on the 
submitting machine, then DriverRunner will attempt to use the _submitter's_ 
JAVA_HOME to launch the driver process (instead of the worker's JAVA_HOME), 
which can cause the job to fail unless the submitter and worker don't have Java 
installed in the same location.

This has a pretty simple fix: read JAVA_HOME from {{sys.env}} instead of 
{{command.environment}}; PR pending shortly.



> DriverRunner should not read SPARK_HOME from submitter's environment
> 
>
> Key: SPARK-3734
> URL: https://issues.apache.org/jira/browse/SPARK-3734
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> If you use {{spark-submit}} in {{cluster}} mode to submit a job to a Spark 
> Standalone cluster, and the {{JAVA_HOME}} environment variable is set on the 
> submitting machine, then DriverRunner will attempt to use the _submitter's_ 
> JAVA_HOME to launch the driver process (instead of the worker's JAVA_HOME), 
> which can cause the job to fail unless the submitter and worker have Java 
> installed in the same location.
> This has a pretty simple fix: read JAVA_HOME from {{sys.env}} instead of 
> {{command.environment}}; PR pending shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3735) Sending the factor directly or AtA based on the cost in ALS

2014-09-29 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-3735:


 Summary: Sending the factor directly or AtA based on the cost in 
ALS
 Key: SPARK-3735
 URL: https://issues.apache.org/jira/browse/SPARK-3735
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Reporter: Xiangrui Meng


It is common to have some super popular products in the dataset. In this case, 
sending many user factors to the target product block could be more expensive 
than sending the normal equation `\sum_i u_i u_i^T` and `\sum_i u_i r_{ij}` to 
the product block. The cost of sending a single factor is `k`, while the cost 
of sending a normal equation is much higher, `k * (k + 3) / 2`. However, if we 
use the normal equation for all products associated with a user, we don't need 
to send that user's factor at all.

Determining the optimal assignment is hard. But we could use a simple 
heuristic. Inside any rating block,

1) order the product ids by the number of user ids associated with them in desc 
order
2) starting from the most popular product, mark popular products as "use normal 
eq" and calculate the cost

Remember the best assignment that comes with the lowest cost and use it for 
computation.
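
A small sketch of that heuristic under simplifying assumptions (names are 
illustrative, and the extra saving from withholding a user factor entirely is 
ignored):

{code}
// Illustrative sketch, not MLlib code. Products are sorted by popularity
// (descending); the first t use the normal equation at cost k*(k+3)/2 each,
// the rest receive one user factor (cost k) per associated user.
object AlsCostHeuristic {
  def bestCutoff(numUsersPerProduct: Array[Int], k: Int): Int = {
    val sorted = numUsersPerProduct.sorted(Ordering[Int].reverse)
    val normalEqCost = k.toLong * (k + 3) / 2
    // suffixFactorCost(t) = cost of sending raw factors for products t..end
    val suffixFactorCost = sorted.scanRight(0L)((n, acc) => acc + n.toLong * k)
    val costs = (0 to sorted.length).map(t => t * normalEqCost + suffixFactorCost(t))
    costs.indexOf(costs.min)  // # of products that should use the normal equation
  }
}
{code}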



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3734) DriverRunner should not read SPARK_HOME from submitter's environment

2014-09-29 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-3734:
-

 Summary: DriverRunner should not read SPARK_HOME from submitter's 
environment
 Key: SPARK-3734
 URL: https://issues.apache.org/jira/browse/SPARK-3734
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.1.0, 1.2.0
Reporter: Josh Rosen
Assignee: Josh Rosen


If you use {{spark-submit}} in {{cluster}} mode to submit a job to a Spark 
Standalone cluster, and the {{JAVA_HOME}} environment variable is set on the 
submitting machine, then DriverRunner will attempt to use the _submitter's_ 
JAVA_HOME to launch the driver process (instead of the worker's JAVA_HOME), 
which can cause the job to fail unless the submitter and worker don't have Java 
installed in the same location.

This has a pretty simple fix: read JAVA_HOME from {{sys.env}} instead of 
{{command.environment}}; PR pending shortly.
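
A sketch of the described one-liner (simplified; the actual DriverRunner code 
differs):

{code}
// Simplified sketch of the fix above, not the actual DriverRunner code:
// resolve java from the worker's own environment rather than from the
// environment captured on the submitting machine.
val javaHome = sys.env.getOrElse("JAVA_HOME",
  sys.error("JAVA_HOME is not set on this worker"))
val javaCmd = s"$javaHome/bin/java"
{code}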




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3007) Add "Dynamic Partition" support to Spark Sql hive

2014-09-29 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-3007.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

https://github.com/apache/spark/commit/0bbe7faeffa17577ae8a33dfcd8c4c783db5c909

> Add "Dynamic Partition" support  to  Spark Sql hive
> ---
>
> Key: SPARK-3007
> URL: https://issues.apache.org/jira/browse/SPARK-3007
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: baishuo
> Fix For: 1.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running executors

2014-09-29 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152421#comment-14152421
 ] 

Matt Cheah commented on SPARK-1860:
---

Apologies for any naivety - this will be the first issue I tackle as a Spark 
contributor.

Mingyu and I had a short chat and we thought it would be reasonable for the 
Executor to simply clean up its own state when it shuts down. Is there anything 
preventing Executor.stop() from cleaning up the app directory it was using?

> Standalone Worker cleanup should not clean up running executors
> ---
>
> Key: SPARK-1860
> URL: https://issues.apache.org/jira/browse/SPARK-1860
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.0.0
>Reporter: Aaron Davidson
>Priority: Blocker
>
> The default values of the standalone worker cleanup code cleanup all 
> application data every 7 days. This includes jars that were added to any 
> executors that happen to be running for longer than 7 days, hitting streaming 
> jobs especially hard.
> Executor's log/data folders should not be cleaned up if they're still 
> running. Until then, this behavior should not be enabled by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()

2014-09-29 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152409#comment-14152409
 ] 

Thomas Graves edited comment on SPARK-3732 at 9/29/14 10:19 PM:


I understand your use case and the need for it, but at this point I don't 
think we want to say we support it without properly addressing the bigger 
picture.  The only publicly supported, non-deprecated API is the spark-submit 
script, which means we won't guarantee backwards compatibility on anything 
else. 

Note that we specifically discussed the Client being public on another PR, and 
it was decided that it isn't officially supported.   The only reason the object 
was left public was for backwards compatibility with how you used to start 
Spark on YARN with the spark-class script.  




was (Author: tgraves):
 I understand your usecase and need for it, but I think at this point I don't 
think we want to say we support it without properly addressing the bigger 
picture.  The only public supported non-deprecated api is via spark-submit 
script.  This means that we won't guarantee backwards compatibility on it. 

Note that we specifically discussed the Client beyond public on another PR and 
it was decided that it isn't officially supported.   The only reason the object 
was left public was for backwards compatibility with how you used to start 
spark on yarn with the spark-class script.  



> Yarn Client: Add option to NOT System.exit() at end of main()
> -
>
> Key: SPARK-3732
> URL: https://issues.apache.org/jira/browse/SPARK-3732
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Sotos Matzanas
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> We would like to add the ability to create and submit Spark jobs 
> programmatically via Scala/Java. We have found a way to hack this and submit 
> jobs via Yarn, but since 
> org.apache.spark.deploy.yarn.Client.main()
> exits with either 0 or 1 in the end, this will mean exit of our own program. 
> We would like to add an optional spark conf param to NOT exit at the end of 
> the main



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()

2014-09-29 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152409#comment-14152409
 ] 

Thomas Graves commented on SPARK-3732:
--

I understand your use case and the need for it, but at this point I don't 
think we want to say we support it without properly addressing the bigger 
picture.  The only publicly supported, non-deprecated API is the spark-submit 
script, which means we won't guarantee backwards compatibility on anything 
else. 

Note that we specifically discussed the Client being public on another PR, and 
it was decided that it isn't officially supported.   The only reason the object 
was left public was for backwards compatibility with how you used to start 
Spark on YARN with the spark-class script.  



> Yarn Client: Add option to NOT System.exit() at end of main()
> -
>
> Key: SPARK-3732
> URL: https://issues.apache.org/jira/browse/SPARK-3732
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Sotos Matzanas
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> We would like to add the ability to create and submit Spark jobs 
> programmatically via Scala/Java. We have found a way to hack this and submit 
> jobs via Yarn, but since 
> org.apache.spark.deploy.yarn.Client.main()
> exits with either 0 or 1 in the end, this will mean exit of our own program. 
> We would like to add an optional spark conf param to NOT exit at the end of 
> the main



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3709) BroadcastSuite.Unpersisting org.apache.spark.broadcast.BroadcastSuite.Unpersisting TorrentBroadcast is flaky

2014-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152377#comment-14152377
 ] 

Apache Spark commented on SPARK-3709:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/2585

> BroadcastSuite.Unpersisting 
> org.apache.spark.broadcast.BroadcastSuite.Unpersisting TorrentBroadcast is 
> flaky 
> 
>
> Key: SPARK-3709
> URL: https://issues.apache.org/jira/browse/SPARK-3709
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Reynold Xin
>Priority: Blocker
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3709) BroadcastSuite.Unpersisting org.apache.spark.broadcast.BroadcastSuite.Unpersisting TorrentBroadcast is flaky

2014-09-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-3709:
---
Assignee: Reynold Xin  (was: Cheng Lian)

> BroadcastSuite.Unpersisting 
> org.apache.spark.broadcast.BroadcastSuite.Unpersisting TorrentBroadcast is 
> flaky 
> 
>
> Key: SPARK-3709
> URL: https://issues.apache.org/jira/browse/SPARK-3709
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Reynold Xin
>Priority: Blocker
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR

2014-09-29 Thread Arun Ahuja (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152351#comment-14152351
 ] 

Arun Ahuja commented on SPARK-3630:
---

We have seen this issue as well:

{code}
com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to 
uncompress the chunk: PARSING_ERROR(2)
at com.esotericsoftware.kryo.io.Input.fill(Input.java:142)
at com.esotericsoftware.kryo.io.Input.require(Input.java:155)
at com.esotericsoftware.kryo.io.Input.readInt(Input.java:337)
at 
com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:109)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721)
at 
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)
at 
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
...
Caused by: java.io.IOException: failed to uncompress the chunk: PARSING_ERROR(2)
at 
org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:362)
at 
org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:159)
at org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142)
at com.esotericsoftware.kryo.io.Input.fill(Input.java:140)
{code}

This is with Spark 1.1, running on a Yarn cluster.  The issue seems to be 
fairly frequent but does not happen on every run.  

> Identify cause of Kryo+Snappy PARSING_ERROR
> ---
>
> Key: SPARK-3630
> URL: https://issues.apache.org/jira/browse/SPARK-3630
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Ash
>Assignee: Ankur Dave
>
> A recent GraphX commit caused non-deterministic exceptions in unit tests so 
> it was reverted (see SPARK-3400).
> Separately, [~aash] observed the same exception stacktrace in an 
> application-specific Kryo registrator:
> {noformat}
> com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to 
> uncompress the chunk: PARSING_ERROR(2)
> com.esotericsoftware.kryo.io.Input.fill(Input.java:142) 
> com.esotericsoftware.kryo.io.Input.require(Input.java:169) 
> com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) 
> com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127)
>  
> com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117)
>  
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109)
>  
> com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
>  
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> ...
> {noformat}
> This ticket is to identify the cause of the exception in the GraphX commit so 
> the faulty commit can be fixed and merged back into master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3717) DecisionTree, RandomForest: Partition by feature

2014-09-29 Thread Sung Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152347#comment-14152347
 ] 

Sung Chung commented on SPARK-3717:
---

I think that this would be great as an alternative option.

1. Partitioning by rows (as currently implemented) scales in # of rows.
2. Partitioning by features scales in # of features.

With good modularization, I think a lot of tree logic (splitting, building 
trees) could be shared among the different partitioning schemes.
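
As a quick sanity check of the communication estimates in the description 
quoted below, a rough sketch (constants taken from the description's example):

{code}
// Back-of-the-envelope check of the estimates in the issue description
// (illustrative only). The description counts 6 level-iterations for the
// feature-partitioned cost and 2^5 leaf-level nodes for the
// instance-partitioned cost.
object TreeCommCost {
  def featurePartitioned(levels: Int, m: Long, n: Long): Double =
    levels * (m * 8.0 + n)                   // D * (M * 8 + N)

  def instancePartitioned(depth: Int, m: Long, b: Long, c: Long): Double =
    math.pow(2, depth) * m * b * c * 8.0     // 2^D * M * B * C * 8

  // 2M instances, 3500 features, 100 bins, 5 classes, depth 5 (6 levels):
  val byFeature  = featurePartitioned(6, 3500L, 2000000L)   // ~1.2e7
  val byInstance = instancePartitioned(5, 3500L, 100L, 5L)  // ~4.5e8
}
{code}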

> DecisionTree, RandomForest: Partition by feature
> 
>
> Key: SPARK-3717
> URL: https://issues.apache.org/jira/browse/SPARK-3717
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> h1. Summary
> Currently, data are partitioned by row/instance for DecisionTree and 
> RandomForest.  This JIRA argues for partitioning by feature for training deep 
> trees.  This is especially relevant for random forests, which are often 
> trained to be deeper than single decision trees.
> h1. Details
> Dataset dimensions and the depth of the tree to be trained are the main 
> problem parameters determining whether it is better to partition features or 
> instances.  For random forests (training many deep trees), partitioning 
> features could be much better.
> Notation:
> * P = # workers
> * N = # instances
> * M = # features
> * D = depth of tree
> h2. Partitioning Features
> Algorithm sketch:
> * Each worker stores:
> ** a subset of columns (i.e., a subset of features).  If a worker stores 
> feature j, then the worker stores the feature value for all instances (i.e., 
> the whole column).
> ** all labels
> * Train one level at a time.
> * Invariants:
> ** Each worker stores a mapping: instance → node in current level
> * On each iteration:
> ** Each worker: For each node in level, compute (best feature to split, info 
> gain).
> ** Reduce (P x M) values to M values to find best split for each node.
> ** Workers who have features used in best splits communicate left/right for 
> relevant instances.  Gather total of N bits to master, then broadcast.
> * Total communication:
> ** Depth D iterations
> ** On each iteration, reduce to M values (~8 bytes each), broadcast N values 
> (1 bit each).
> ** Estimate: D * (M * 8 + N)
> h2. Partitioning Instances
> Algorithm sketch:
> * Train one group of nodes at a time.
> * Invariants:
>  * Each worker stores a mapping: instance → node
> * On each iteration:
> ** Each worker: For each instance, add to aggregate statistics.
> ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes)
> *** (“# classes” is for classification.  3 for regression)
> ** Reduce aggregate.
> ** Master chooses best split for each node in group and broadcasts.
> * Local training: Once all instances for a node fit on one machine, it can be 
> best to shuffle data and training subtrees locally.  This can mean shuffling 
> the entire dataset for each tree trained.
> * Summing over all iterations, reduce to total of:
> ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each)
> ** Estimate: 2^D * M * B * C * 8
> h2. Comparing Partitioning Methods
> Partitioning features cost < partitioning instances cost when:
> * D * (M * 8 + N) < 2^D * M * B * C * 8
> * D * N < 2^D * M * B * C * 8  (assuming D * M * 8 is small compared to the 
> right hand side)
> * N < [ 2^D * M * B * C * 8 ] / D
> Example: many instances:
> * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = 
> 5)
> * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7
> * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3733) Support for programmatically submitting Spark jobs

2014-09-29 Thread Sotos Matzanas (JIRA)
Sotos Matzanas created SPARK-3733:
-

 Summary: Support for programmatically submitting Spark jobs
 Key: SPARK-3733
 URL: https://issues.apache.org/jira/browse/SPARK-3733
 Project: Spark
  Issue Type: New Feature
Affects Versions: 1.1.0
Reporter: Sotos Matzanas


Right now it's impossible to programmatically submit Spark jobs via a Scala 
(or Java) API. We would like to see this supported in a future version of 
Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()

2014-09-29 Thread Sotos Matzanas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152323#comment-14152323
 ] 

Sotos Matzanas commented on SPARK-3732:
---

[~tgraves] this JIRA is the first step for us to move forward with our own 
(very limited for now) version of programmatic job submission. I can add 
another JIRA to address the bigger issue, but we would like to see this one 
resolved first. Once we are confident in our solution, we can contribute to 
the second JIRA. Let us know what you think.

> Yarn Client: Add option to NOT System.exit() at end of main()
> -
>
> Key: SPARK-3732
> URL: https://issues.apache.org/jira/browse/SPARK-3732
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Sotos Matzanas
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> We would like to add the ability to create and submit Spark jobs 
> programmatically via Scala/Java. We have found a way to hack this and submit 
> jobs via Yarn, but since 
> org.apache.spark.deploy.yarn.Client.main()
> exits with either 0 or 1 in the end, this will mean exit of our own program. 
> We would like to add an optional spark conf param to NOT exit at the end of 
> the main



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3708) Backticks aren't handled correctly in aliases

2014-09-29 Thread Ravindra Pesala (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152321#comment-14152321
 ] 

Ravindra Pesala commented on SPARK-3708:


I guess you are referring to HiveContext here, since there is no backtick 
support in SqlContext. I will work on this issue. Thank you.

> Backticks aren't handled correctly in aliases
> -
>
> Key: SPARK-3708
> URL: https://issues.apache.org/jira/browse/SPARK-3708
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Michael Armbrust
>
> Here's a failing test case:
> {code}
> sql("SELECT k FROM (SELECT `key` AS `k` FROM src) a")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()

2014-09-29 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152318#comment-14152318
 ] 

Marcelo Vanzin commented on SPARK-3732:
---

BTW, if the call is removed, it should be possible to do what you want more 
generically by calling {{SparkSubmit.main}} directly. That's still a little 
fishy, but it's a lot less fishy than calling Yarn-specific code directly.

> Yarn Client: Add option to NOT System.exit() at end of main()
> -
>
> Key: SPARK-3732
> URL: https://issues.apache.org/jira/browse/SPARK-3732
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Sotos Matzanas
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> We would like to add the ability to create and submit Spark jobs 
> programmatically via Scala/Java. We have found a way to hack this and submit 
> jobs via Yarn, but since 
> org.apache.spark.deploy.yarn.Client.main()
> exits with either 0 or 1 in the end, this will mean exit of our own program. 
> We would like to add an optional spark conf param to NOT exit at the end of 
> the main



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()

2014-09-29 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152309#comment-14152309
 ] 

Marcelo Vanzin commented on SPARK-3732:
---

Removing the call should work regardless; it's redundant, since the code will 
just exit normally anyway after that.

> Yarn Client: Add option to NOT System.exit() at end of main()
> -
>
> Key: SPARK-3732
> URL: https://issues.apache.org/jira/browse/SPARK-3732
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Sotos Matzanas
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> We would like to add the ability to create and submit Spark jobs 
> programmatically via Scala/Java. We have found a way to hack this and submit 
> jobs via Yarn, but since 
> org.apache.spark.deploy.yarn.Client.main()
> exits with either 0 or 1 in the end, this will mean exit of our own program. 
> We would like to add an optional spark conf param to NOT exit at the end of 
> the main



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()

2014-09-29 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152308#comment-14152308
 ] 

Thomas Graves commented on SPARK-3732:
--

I think you should just change the name of this JIRA to "add support for 
programmatically calling the Spark YARN Client".  As you have found, this isn't 
currently supported, and I wouldn't want to just remove the exit and say it's 
supported without thinking through the other implications.   


> Yarn Client: Add option to NOT System.exit() at end of main()
> -
>
> Key: SPARK-3732
> URL: https://issues.apache.org/jira/browse/SPARK-3732
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Sotos Matzanas
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> We would like to add the ability to create and submit Spark jobs 
> programmatically via Scala/Java. We have found a way to hack this and submit 
> jobs via Yarn, but since 
> org.apache.spark.deploy.yarn.Client.main()
> exits with either 0 or 1 in the end, this will mean exit of our own program. 
> We would like to add an optional spark conf param to NOT exit at the end of 
> the main



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()

2014-09-29 Thread Sotos Matzanas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152302#comment-14152302
 ] 

Sotos Matzanas commented on SPARK-3732:
---

We added the option as insurance for backward compatibility; removing the 
System.exit() call will obviously work unless somebody is explicitly checking 
the exit code from spark-submit.

> Yarn Client: Add option to NOT System.exit() at end of main()
> -
>
> Key: SPARK-3732
> URL: https://issues.apache.org/jira/browse/SPARK-3732
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Sotos Matzanas
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> We would like to add the ability to create and submit Spark jobs 
> programmatically via Scala/Java. We have found a way to hack this and submit 
> jobs via Yarn, but since 
> org.apache.spark.deploy.yarn.Client.main()
> exits with either 0 or 1 in the end, this will mean exit of our own program. 
> We would like to add an optional spark conf param to NOT exit at the end of 
> the main



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()

2014-09-29 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152294#comment-14152294
 ] 

Marcelo Vanzin commented on SPARK-3732:
---

I think that explicit System.exit() could just be removed. Exposing an option 
for this sounds like overkill.

[~tgraves] had some comments about that call in the past, though.

> Yarn Client: Add option to NOT System.exit() at end of main()
> -
>
> Key: SPARK-3732
> URL: https://issues.apache.org/jira/browse/SPARK-3732
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Sotos Matzanas
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> We would like to add the ability to create and submit Spark jobs 
> programmatically via Scala/Java. We have found a way to hack this and submit 
> jobs via Yarn, but since 
> org.apache.spark.deploy.yarn.Client.main()
> exits with either 0 or 1 in the end, this will mean exit of our own program. 
> We would like to add an optional spark conf param to NOT exit at the end of 
> the main



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()

2014-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152291#comment-14152291
 ] 

Apache Spark commented on SPARK-3732:
-

User 'smatzana' has created a pull request for this issue:
https://github.com/apache/spark/pull/2584

> Yarn Client: Add option to NOT System.exit() at end of main()
> -
>
> Key: SPARK-3732
> URL: https://issues.apache.org/jira/browse/SPARK-3732
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Sotos Matzanas
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> We would like to add the ability to create and submit Spark jobs 
> programmatically via Scala/Java. We have found a way to hack this and submit 
> jobs via Yarn, but since 
> org.apache.spark.deploy.yarn.Client.main()
> exits with either 0 or 1 in the end, this will mean exit of our own program. 
> We would like to add an optional spark conf param to NOT exit at the end of 
> the main



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3434) Distributed block matrix

2014-09-29 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152275#comment-14152275
 ] 

Reza Zadeh commented on SPARK-3434:
---

It looks like Shivaram Venkataraman from the AMPLab has started work on this. I 
will be meeting with him to see if we can reuse some of his work.

> Distributed block matrix
> 
>
> Key: SPARK-3434
> URL: https://issues.apache.org/jira/browse/SPARK-3434
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>
> This JIRA is for discussing distributed matrices stored in block 
> sub-matrices. The main challenge is the partitioning scheme to allow adding 
> linear algebra operations in the future, e.g.:
> 1. matrix multiplication
> 2. matrix factorization (QR, LU, ...)
> Let's discuss the partitioning and storage and how they fit into the above 
> use cases.
> Questions:
> 1. Should it be backed by a single RDD that contains all of the sub-matrices 
> or many RDDs, each containing only one sub-matrix?
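
For the single-RDD option, one purely hypothetical shape (not an MLlib API) 
could be:

{code}
// Purely hypothetical sketch of the single-RDD option, not an MLlib API:
// one RDD keyed by (rowBlock, colBlock); multiplication then becomes a join
// on matching inner block indices followed by local block multiplies.
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.rdd.RDD

class BlockMatrix(
    val blocks: RDD[((Int, Int), Matrix)],
    val rowsPerBlock: Int,
    val colsPerBlock: Int)
{code}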



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3717) DecisionTree, RandomForest: Partition by feature

2014-09-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-3717:
-
Description: 
h1. Summary

Currently, data are partitioned by row/instance for DecisionTree and 
RandomForest.  This JIRA argues for partitioning by feature for training deep 
trees.  This is especially relevant for random forests, which are often trained 
to be deeper than single decision trees.

h1. Details

Dataset dimensions and the depth of the tree to be trained are the main problem 
parameters determining whether it is better to partition features or instances. 
 For random forests (training many deep trees), partitioning features could be 
much better.

Notation:
* P = # workers
* N = # instances
* M = # features
* D = depth of tree

h2. Partitioning Features

Algorithm sketch:
* Each worker stores:
** a subset of columns (i.e., a subset of features).  If a worker stores 
feature j, then the worker stores the feature value for all instances (i.e., 
the whole column).
** all labels
* Train one level at a time.
* Invariants:
** Each worker stores a mapping: instance → node in current level
* On each iteration:
** Each worker: For each node in level, compute (best feature to split, info 
gain).
** Reduce (P x M) values to M values to find best split for each node.
** Workers who have features used in best splits communicate left/right for 
relevant instances.  Gather total of N bits to master, then broadcast.
* Total communication:
** Depth D iterations
** On each iteration, reduce to M values (~8 bytes each), broadcast N values (1 
bit each).
** Estimate: D * (M * 8 + N)

h2. Partitioning Instances

Algorithm sketch:
* Train one group of nodes at a time.
* Invariants:
 * Each worker stores a mapping: instance → node
* On each iteration:
** Each worker: For each instance, add to aggregate statistics.
** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes)
*** (“# classes” is for classification.  3 for regression)
** Reduce aggregate.
** Master chooses best split for each node in group and broadcasts.
* Local training: Once all instances for a node fit on one machine, it can be 
best to shuffle data and training subtrees locally.  This can mean shuffling 
the entire dataset for each tree trained.
* Summing over all iterations, reduce to total of:
** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each)
** Estimate: 2^D * M * B * C * 8

h2. Comparing Partitioning Methods

Partitioning features cost < partitioning instances cost when:
* D * (M * 8 + N) < 2^D * M * B * C * 8
* D * N < 2^D * M * B * C * 8  (assuming D * M * 8 is small compared to the 
right hand side)
* N < [ 2^D * M * B * C * 8 ] / D

Example: many instances:
* 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = 5)
* Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7
* Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8


  was:
h1. Summary

Currently, data are partitioned by row/instance for DecisionTree and 
RandomForest.  This JIRA argues for partitioning by feature for training deep 
trees.  This is especially relevant for random forests, which are often trained 
to be deeper than single decision trees.

h1. Details

Dataset dimensions and the depth of the tree to be trained are the main problem 
parameters determining whether it is better to partition features or instances. 
 For random forests (training many deep trees), partitioning features could be 
much better.

Notation:
* P = # workers
* N = # instances
* M = # features
* D = depth of tree

h2. Partitioning Features

Algorithm sketch:
* Train one level at a time.
* Invariants:
** Each worker stores a mapping: instance → node in current level
* On each iteration:
** Each worker: For each node in level, compute (best feature to split, info 
gain).
** Reduce (P x M) values to M values to find best split for each node.
** Workers who have features used in best splits communicate left/right for 
relevant instances.  Gather total of N bits to master, then broadcast.
* Total communication:
** Depth D iterations
** On each iteration, reduce to M values (~8 bytes each), broadcast N values (1 
bit each).
** Estimate: D * (M * 8 + N)

h2. Partitioning Instances

Algorithm sketch:
* Train one group of nodes at a time.
* Invariants:
 * Each worker stores a mapping: instance → node
* On each iteration:
** Each worker: For each instance, add to aggregate statistics.
** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes)
*** (“# classes” is for classification.  3 for regression)
** Reduce aggregate.
** Master chooses best split for each node in group and broadcasts.
* Local training: Once all instances for a node fit on one machine, it can be 
best to shuffle data and training subtrees locally.  This can mean shuffling 
the entire dataset for each tree trained.
* Summing over all iterations, reduce to total of:
** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each)
** Estimate: 2^D * M * B * C * 8

h2. Comparing Partitioning Methods

Partitioning features cost < partitioning instances cost when:
* D * (M * 8 + N) < 2^D * M * B * C * 8
* D * N < 2^D * M * B * C * 8  (assuming D * M * 8 is small compared to the 
right hand side)
* N < [ 2^D * M * B * C * 8 ] / D

Example: many instances:
* 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = 5)
* Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7
* Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


[jira] [Commented] (SPARK-3730) Anyone else having trouble building Spark recently

2014-09-29 Thread Anant Daksh Asthana (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152269#comment-14152269
 ] 

Anant Daksh Asthana commented on SPARK-3730:


Definitely not a Spark issue; I just thought someone on here might know a 
solution.




> Anyone else having trouble building Spark recently
> ---
>
> Key: SPARK-3730
> URL: https://issues.apache.org/jira/browse/SPARK-3730
> Project: Spark
>  Issue Type: Question
>Reporter: Anant Daksh Asthana
>Priority: Minor
>
> I get an assertion error in 
> spark/core/src/main/scala/org/apache/spark/HttpServer.scala while trying to 
> build.
> I am building using:
> mvn -Pyarn -Phadoop-2.3 -DskipTests -Phive clean package
> Here is the error I get: http://pastebin.com/Shi43r53



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1547) Add gradient boosting algorithm to MLlib

2014-09-29 Thread Manish Amde (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152251#comment-14152251
 ] 

Manish Amde commented on SPARK-1547:


Sure. I like your naming suggestion. 

I will rebase from the latest master now that the RF PR has been accepted.  I 
will create a WIP PR soon after (with tests and docs) so that we can discuss 
the code in greater detail.

> Add gradient boosting algorithm to MLlib
> 
>
> Key: SPARK-1547
> URL: https://issues.apache.org/jira/browse/SPARK-1547
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Manish Amde
>Assignee: Manish Amde
>
> This task requires adding the gradient boosting algorithm to Spark MLlib. The 
> implementation needs to adapt the gradient boosting algorithm to the scalable 
> tree implementation.
> The tasks involves:
> - Comparing the various tradeoffs and finalizing the algorithm before 
> implementation
> - Code implementation
> - Unit tests
> - Functional tests
> - Performance tests
> - Documentation
> [Ensembles design document (Google doc) | 
> https://docs.google.com/document/d/1J0Q6OP2Ggx0SOtlPgRUkwLASrAkUJw6m6EK12jRDSNg/]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()

2014-09-29 Thread Sotos Matzanas (JIRA)
Sotos Matzanas created SPARK-3732:
-

 Summary: Yarn Client: Add option to NOT System.exit() at end of 
main()
 Key: SPARK-3732
 URL: https://issues.apache.org/jira/browse/SPARK-3732
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.1.0
Reporter: Sotos Matzanas


We would like to add the ability to create and submit Spark jobs 
programmatically via Scala/Java. We have found a way to hack this and submit 
jobs via Yarn, but since 

org.apache.spark.deploy.yarn.Client.main()

exits with either 0 or 1 at the end, this also exits our own program. We would 
like to add an optional Spark conf param to NOT exit at the end of main().
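
To illustrate the problem (a sketch; the Client arguments shown are 
placeholders):

{code}
// Sketch of the problem described above; the arguments are placeholders.
// Control never returns to the caller, because Client.main ends in
// System.exit(0) or System.exit(1).
object SubmitFromCode {
  def main(args: Array[String]): Unit = {
    org.apache.spark.deploy.yarn.Client.main(
      Array("--jar", "my-app.jar", "--class", "com.example.MyApp"))
    println("never reached")  // the JVM has already exited
  }
}
{code}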



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3709) BroadcastSuite.Unpersisting org.apache.spark.broadcast.BroadcastSuite.Unpersisting TorrentBroadcast is flaky

2014-09-29 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152238#comment-14152238
 ] 

Reynold Xin commented on SPARK-3709:


Adding stack trace
{code}
[info] - Unpersisting TorrentBroadcast on executors only in distributed mode 
*** FAILED ***
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
1.0 (TID 17, localhost): java.io.IOException: sendMessageReliably failed with 
ACK that signalled a remote error
[info] 
org.apache.spark.network.nio.ConnectionManager$$anonfun$14.apply(ConnectionManager.scala:864)
[info] 
org.apache.spark.network.nio.ConnectionManager$$anonfun$14.apply(ConnectionManager.scala:856)
[info] 
org.apache.spark.network.nio.ConnectionManager$MessageStatus.markDone(ConnectionManager.scala:61)
[info] 
org.apache.spark.network.nio.ConnectionManager.org$apache$spark$network$nio$ConnectionManager$$handleMessage(ConnectionManager.scala:655)
[info] 
org.apache.spark.network.nio.ConnectionManager$$anon$10.run(ConnectionManager.scala:515)
[info] 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
[info] 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[info] java.lang.Thread.run(Thread.java:745)
[info] Driver stacktrace:
[info]   at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1192)
[info]   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1181)
[info]   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1180)
[info]   at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
[info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
[info]   at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1180)
[info]   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:695)
[info]   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:695)
[info]   at scala.Option.foreach(Option.scala:236)
[info]   at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:695)
[info]   ...
{code}

> BroadcastSuite.Unpersisting 
> org.apache.spark.broadcast.BroadcastSuite.Unpersisting TorrentBroadcast is 
> flaky 
> 
>
> Key: SPARK-3709
> URL: https://issues.apache.org/jira/browse/SPARK-3709
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Cheng Lian
>Priority: Blocker
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3685) Spark's local dir should accept only local paths

2014-09-29 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152228#comment-14152228
 ] 

Andrew Or commented on SPARK-3685:
--

Not sure if I fully understand what you mean. If I'm running an executor and I 
request 30G from the beginning, my application uses all of it to do computation 
and all is good. After I decommission the executor, I would like to keep 1G 
just to serve the shuffle files, but this can't be done easily because we would 
need to start a smaller JVM and a smaller container. (Yarn currently doesn't 
support resizing a container while it's running.) Either way we would need to 
transfer some state from the bigger JVM to the smaller JVM, and that adds some 
complexity to the design. The simplest alternative, then, is to write whatever 
state we need to an external location, terminate the executor JVM / container 
without starting a smaller one, and have a long-running external service serve 
these files.

One proposal here then is to write these shuffle files to a special location 
and have the Yarn NM shuffle service serve the files. This is an alternative to 
DFS shuffle that is, however, highly specific to Yarn. I am doing some initial 
prototyping of this (the Yarn shuffle) approach to see how this will pan out.

> Spark's local dir should accept only local paths
> 
>
> Key: SPARK-3685
> URL: https://issues.apache.org/jira/browse/SPARK-3685
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>
> When you try to set local dirs to "hdfs:/tmp/foo" it doesn't work. What it 
> will try to do is create a folder called "hdfs:" and put "tmp" inside it. 
> This is because in Util#getOrCreateLocalRootDirs we use java.io.File instead 
> of Hadoop's file system to parse this path. We also need to resolve the path 
> appropriately.
> This may not have an urgent use case, but it fails silently and does what is 
> least expected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3652) upgrade spark sql hive version to 0.13.1

2014-09-29 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152201#comment-14152201
 ] 

Zhan Zhang commented on SPARK-3652:
---

Supporting different Hive versions is required. I think you can send another 
PR to support the thrift server, which I didn't include in SPARK-2706 to limit 
the scope. 

> upgrade spark sql hive version to 0.13.1
> 
>
> Key: SPARK-3652
> URL: https://issues.apache.org/jira/browse/SPARK-3652
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: wangfei
>
> Spark SQL currently builds against Hive 0.12.0; compiling with 0.13.1 
> produces errors. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3166) Custom serialisers can't be shipped in application jars

2014-09-29 Thread Paul Wais (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152194#comment-14152194
 ] 

Paul Wais commented on SPARK-3166:
--

+1 this issue is related to: 
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-find-proto-buffer-class-error-with-RDD-lt-protobuf-gt-td14529.html

It looks like this PR places user jars on the executor's root classpath, which 
should fix the issue investigated in the above thread.

> Custom serialisers can't be shipped in application jars
> ---
>
> Key: SPARK-3166
> URL: https://issues.apache.org/jira/browse/SPARK-3166
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2
>Reporter: Graham Dennis
>
> Spark cannot currently use a custom serialiser that is shipped with the 
> application jar. Trying to do this causes a java.lang.ClassNotFoundException 
> when trying to instantiate the custom serialiser in the Executor processes. 
> This occurs because Spark attempts to instantiate the custom serialiser 
> before the application jar has been shipped to the Executor process. A 
> reproduction of the problem is available here: 
> https://github.com/GrahamDennis/spark-custom-serialiser
> I've verified this problem in Spark 1.0.2, and Spark master and 1.1 branches 
> as of August 21, 2014.  This issue is related to SPARK-2878, and my fix for 
> that issue (https://github.com/apache/spark/pull/1890) also solves this.  My 
> pull request was not merged because it adds the user jar to the Executor 
> processes' class path at launch time.  Such a significant change was thought 
> by [~rxin] to require more QA, and should be considered for inclusion in 1.2 
> at the earliest.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3731) RDD caching stops working in pyspark after some time

2014-09-29 Thread Milan Straka (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Milan Straka updated SPARK-3731:

Attachment: worker.log

Sample worker.log showing the problem. For example, consider rdd_1_1, which has 
size 46.3MB. At the beginning caching works, but then it stops -- the last time 
rdd_1_1 fails to fit into the cache, the following is reported:
{{14/09/29 21:53:10 WARN CacheManager: Not enough space to cache partition 
rdd_1_1 in memory! Free memory is 148908945 bytes.}}


> RDD caching stops working in pyspark after some time
> 
>
> Key: SPARK-3731
> URL: https://issues.apache.org/jira/browse/SPARK-3731
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 1.1.0
> Environment: Linux, 32bit, standalone mode
>Reporter: Milan Straka
> Attachments: worker.log
>
>
> Consider a file F which when loaded with sc.textFile and cached takes up 
> slightly more than half of free memory for RDD cache.
> When in PySpark the following is executed:
>   1) a = sc.textFile(F)
>   2) a.cache().count()
>   3) b = sc.textFile(F)
>   4) b.cache().count()
> and then the following is repeated (for example 10 times):
>   a) a.unpersist().cache().count()
>   b) b.unpersist().cache().count()
> after some time, there are no RDD cached in memory.
> Also, since that time, no other RDD ever gets cached (the worker always 
> reports something like "WARN CacheManager: Not enough space to cache 
> partition rdd_23_5 in memory! Free memory is 277478190 bytes.", even if 
> rdd_23_5 is ~50MB). The Executors tab of the Application Detail UI shows that 
> all executors have 0MB memory used (which is consistent with the CacheManager 
> warning).
> When doing the same in scala, everything works perfectly.
> I understand that this is a vague description, but I do not know how to 
> describe the problem better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3731) RDD caching stops working in pyspark after some time

2014-09-29 Thread Milan Straka (JIRA)
Milan Straka created SPARK-3731:
---

 Summary: RDD caching stops working in pyspark after some time
 Key: SPARK-3731
 URL: https://issues.apache.org/jira/browse/SPARK-3731
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
 Environment: Linux, 32bit, standalone mode
Reporter: Milan Straka


Consider a file F which when loaded with sc.textFile and cached takes up 
slightly more than half of free memory for RDD cache.

When in PySpark the following is executed:
  1) a = sc.textFile(F)
  2) a.cache().count()
  3) b = sc.textFile(F)
  4) b.cache().count()
and then the following is repeated (for example 10 times):
  a) a.unpersist().cache().count()
  b) b.unpersist().cache().count()
after some time, there are no RDD cached in memory.

Also, since that time, no other RDD ever gets cached (the worker always reports 
something like "WARN CacheManager: Not enough space to cache partition rdd_23_5 
in memory! Free memory is 277478190 bytes.", even if rdd_23_5 is ~50MB). The 
Executors tab of the Application Detail UI shows that all executors have 0MB 
memory used (which is consistent with the CacheManager warning).

When doing the same in scala, everything works perfectly.

I understand that this is a vague description, but I do not know how to describe 
the problem better.
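For comparison, a transcription of the same sequence in Scala (which, per the 
report, behaves correctly); `sc` is an existing SparkContext and `path` stands 
in for F:

{code}
val a = sc.textFile(path)
a.cache().count()
val b = sc.textFile(path)
b.cache().count()
for (_ <- 1 to 10) {
  a.unpersist().cache().count()
  b.unpersist().cache().count()
}
{code}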



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2693) Support for UDAF Hive Aggregates like PERCENTILE

2014-09-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2693:

Assignee: Ravindra Pesala

> Support for UDAF Hive Aggregates like PERCENTILE
> 
>
> Key: SPARK-2693
> URL: https://issues.apache.org/jira/browse/SPARK-2693
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Ravindra Pesala
>
> {code}
> SELECT MIN(field1), MAX(field2), AVG(field3), PERCENTILE(field4), 
> year,month,day FROM  raw_data_table  GROUP BY year, month, day
> MIN, MAX and AVG functions work fine for me, but with PERCENTILE, I get an 
> error as shown below.
> Exception in thread "main" java.lang.RuntimeException: No handler for udf 
> class org.apache.hadoop.hive.ql.udf.UDAFPercentile
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$.lookupFunction(hiveUdfs.scala:69)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$4$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:115)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$4$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:113)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
> {code}
> This aggregate extends UDAF, which we don't yet have a wrapper for.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2693) Support for UDAF Hive Aggregates like PERCENTILE

2014-09-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2693:

Priority: Critical  (was: Major)

> Support for UDAF Hive Aggregates like PERCENTILE
> 
>
> Key: SPARK-2693
> URL: https://issues.apache.org/jira/browse/SPARK-2693
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Ravindra Pesala
>Priority: Critical
>
> {code}
> SELECT MIN(field1), MAX(field2), AVG(field3), PERCENTILE(field4), 
> year,month,day FROM  raw_data_table  GROUP BY year, month, day
> MIN, MAX and AVG functions work fine for me, but with PERCENTILE, I get an 
> error as shown below.
> Exception in thread "main" java.lang.RuntimeException: No handler for udf 
> class org.apache.hadoop.hive.ql.udf.UDAFPercentile
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$.lookupFunction(hiveUdfs.scala:69)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$4$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:115)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$4$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:113)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
> {code}
> This aggregate extends UDAF, which we don't yet have a wrapper for.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3685) Spark's local dir should accept only local paths

2014-09-29 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152152#comment-14152152
 ] 

Matthew Farrellee commented on SPARK-3685:
--

The root of the resource problem is how resources are handed out: Yarn gives 
you a whole CPU, some amount of memory, some amount of network, and some amount 
of disk to work with. Your executor (like any program) uses different amounts 
of resources throughout its execution. At points in the execution the resource 
profile changes; call the demarcated regions "phases". So an executor may 
transition from a high-resource phase to a low-resource phase. In a 
low-resource phase, you may want to free up resources for other executors, but 
maintain enough to do basic operations (say: serve a shuffle file). This is a 
problem that should be solved by the resource manager. In my opinion, a 
solution within Spark that isn't facilitated by the RM is a workaround/hack and 
should be avoided. An example of an RM-facilitated solution might be a message 
the executor can send to Yarn to indicate that its resources can be freed, 
except for some minimum amount.

> Spark's local dir should accept only local paths
> 
>
> Key: SPARK-3685
> URL: https://issues.apache.org/jira/browse/SPARK-3685
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>
> When you try to set local dirs to "hdfs:/tmp/foo" it doesn't work. What it 
> will try to do is create a folder called "hdfs:" and put "tmp" inside it. 
> This is because in Util#getOrCreateLocalRootDirs we use java.io.File instead 
> of Hadoop's file system to parse this path. We also need to resolve the path 
> appropriately.
> This may not have an urgent use case, but it fails silently and does what is 
> least expected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3725) Link to building spark returns a 404

2014-09-29 Thread Anant Daksh Asthana (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anant Daksh Asthana closed SPARK-3725.
--
Resolution: Not a Problem

> Link to building spark returns a 404
> 
>
> Key: SPARK-3725
> URL: https://issues.apache.org/jira/browse/SPARK-3725
> Project: Spark
>  Issue Type: Documentation
>Reporter: Anant Daksh Asthana
>Priority: Minor
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> The README.md link to "Building Spark" returns a 404



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3730) Anyone else having trouble building spark recently

2014-09-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152140#comment-14152140
 ] 

Sean Owen commented on SPARK-3730:
--

(The profile is "hadoop-2.3" but that's not the issue.)
I have seen this too and it's a {{scalac}} bug as far as I can tell, as you can 
see from the stack trace. It's not a Spark issue.

> Anyone else having trouble building spark recently
> ---
>
> Key: SPARK-3730
> URL: https://issues.apache.org/jira/browse/SPARK-3730
> Project: Spark
>  Issue Type: Question
>Reporter: Anant Daksh Asthana
>Priority: Minor
>
> I get an assertion error in 
> spark/core/src/main/scala/org/apache/spark/HttpServer.scala while trying to 
> build.
> I am building using 
> mvn -Pyarn -PHadoop-2.3 -DskipTests -Phive clean package
> Here is the error I get: http://pastebin.com/Shi43r53



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3725) Link to building spark returns a 404

2014-09-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152138#comment-14152138
 ] 

Sean Owen commented on SPARK-3725:
--

No, that links to the raw markdown. Truly, the fix is to rebuild the site. The 
source is fine.

> Link to building spark returns a 404
> 
>
> Key: SPARK-3725
> URL: https://issues.apache.org/jira/browse/SPARK-3725
> Project: Spark
>  Issue Type: Documentation
>Reporter: Anant Daksh Asthana
>Priority: Minor
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> The README.md link to "Building Spark" returns a 404



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3730) Anyone else having trouble building spark recently

2014-09-29 Thread Anant Daksh Asthana (JIRA)
Anant Daksh Asthana created SPARK-3730:
--

 Summary: Anyone else having trouble building spark recently
 Key: SPARK-3730
 URL: https://issues.apache.org/jira/browse/SPARK-3730
 Project: Spark
  Issue Type: Question
Reporter: Anant Daksh Asthana
Priority: Minor


I get an assertion error in 
spark/core/src/main/scala/org/apache/spark/HttpServer.scala while trying to 
build.
I am building using 
mvn -Pyarn -PHadoop-2.3 -DskipTests -Phive clean package

Here is the error I get: http://pastebin.com/Shi43r53



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3729) Null-pointer when constructing a HiveContext when settings are present

2014-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152113#comment-14152113
 ] 

Apache Spark commented on SPARK-3729:
-

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/2583

> Null-pointer when constructing a HiveContext when settings are present
> --
>
> Key: SPARK-3729
> URL: https://issues.apache.org/jira/browse/SPARK-3729
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
>
> {code}
> java.lang.NullPointerException
>   at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:307)
>   at 
> org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:270)
>   at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:242)
>   at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
>   at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:78)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:78)
>   at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:76)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1547) Add gradient boosting algorithm to MLlib

2014-09-29 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152101#comment-14152101
 ] 

Joseph K. Bradley commented on SPARK-1547:
--

This will be great to have!  The WIP code and the list of to-do items look good 
to me.

Small comment: For the losses, it would be good to rename "residual" to either 
"pseudoresidual" (following Friedman's paper) or to "lossGradient" (which is 
more literal/accurate).  It would also be nice to have the loss classes compute 
the loss itself, so that we can compute that at the end (and later track it 
along the way).
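To make the renaming suggestion concrete, a minimal sketch of what such a loss 
class could expose (hypothetical names, not the WIP code):

{code}
trait Loss {
  /** Gradient of the loss w.r.t. the prediction; its negation is
    * Friedman's pseudoresidual. */
  def gradient(prediction: Double, label: Double): Double
  /** The loss itself, so it can be computed at the end and tracked along the way. */
  def computeError(prediction: Double, label: Double): Double
}

object SquaredError extends Loss {
  override def gradient(prediction: Double, label: Double): Double =
    2.0 * (prediction - label)
  override def computeError(prediction: Double, label: Double): Double = {
    val err = prediction - label
    err * err
  }
}
{code}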


> Add gradient boosting algorithm to MLlib
> 
>
> Key: SPARK-1547
> URL: https://issues.apache.org/jira/browse/SPARK-1547
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Manish Amde
>Assignee: Manish Amde
>
> This task requires adding the gradient boosting algorithm to Spark MLlib. The 
> implementation needs to adapt the gradient boosting algorithm to the scalable 
> tree implementation.
> The tasks involves:
> - Comparing the various tradeoffs and finalizing the algorithm before 
> implementation
> - Code implementation
> - Unit tests
> - Functional tests
> - Performance tests
> - Documentation
> [Ensembles design document (Google doc) | 
> https://docs.google.com/document/d/1J0Q6OP2Ggx0SOtlPgRUkwLASrAkUJw6m6EK12jRDSNg/]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-3729) Null-pointer when constructing a HiveContext when settings are present

2014-09-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-3729:
---

Assignee: Michael Armbrust

> Null-pointer when constructing a HiveContext when settings are present
> --
>
> Key: SPARK-3729
> URL: https://issues.apache.org/jira/browse/SPARK-3729
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
>
> {code}
> java.lang.NullPointerException
>   at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:307)
>   at 
> org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:270)
>   at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:242)
>   at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
>   at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:78)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:78)
>   at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:76)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3729) Null-pointer when constructing a HiveContext when settings are present

2014-09-29 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-3729:
---

 Summary: Null-pointer when constructing a HiveContext when 
settings are present
 Key: SPARK-3729
 URL: https://issues.apache.org/jira/browse/SPARK-3729
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Priority: Blocker


{code}
java.lang.NullPointerException
at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:307)
at 
org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:270)
at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:242)
at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:78)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:78)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:76)
...
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-922) Update Spark AMI to Python 2.7

2014-09-29 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14150695#comment-14150695
 ] 

Andrew Davidson edited comment on SPARK-922 at 9/29/14 7:05 PM:


Here is how I am launching iPython notebook. I am running as the ec2-user:
IPYTHON_OPTS="notebook --pylab inline --no-browser --port=7000" 
$SPARK_HOME/bin/pyspark

Below are all the upgrade commands I ran.

I ran into a small problem: the ipython magic %matplotlib inline creates an 
error; you can work around this by commenting it out.

Andy

yum install -y pssh
yum install -y python27 python27-devel
pssh -h /root/spark-ec2/slaves yum install -y python27 python27-devel
wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | 
python27
pssh -h /root/spark-ec2/slaves "wget 
https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27"
easy_install-2.7 pip
pssh -h /root/spark-ec2/slaves easy_install-2.7 pip
pip2.7 install numpy
pssh -t0 -h /root/spark-ec2/slaves pip2.7 install numpy
pip2.7 install ipython[all]
printf "\n# Set Spark Python version\nexport 
PYSPARK_PYTHON=/usr/bin/python2.7\n" >> /root/spark/conf/spark-env.sh
source /root/spark/conf/spark-env.sh




was (Author: aedwip):

here is how I am launching iPython notebook. I am running as the ec2-user
IPYTHON_OPTS="notebook --pylab inline --no-browser --port=7000" 
$SPARK_HOME/bin/pyspark

Bellow are all the upgrade commands I ran 

Any idea what I missed?

Andy

yum install -y pssh
yum install -y python27 python27-devel
pssh -h /root/spark-ec2/slaves yum install -y python27 python27-devel
wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | 
python27
pssh -h /root/spark-ec2/slaves "wget 
https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27"
easy_install-2.7 pip
pssh -h /root/spark-ec2/slaves easy_install-2.7 pip
pip2.7 install numpy
pssh -t0 -h /root/spark-ec2/slaves pip2.7 install numpy
pip2.7 install ipython[all]
printf "\n# Set Spark Python version\nexport 
PYSPARK_PYTHON=/usr/bin/python2.7\n" >> /root/spark/conf/spark-env.sh
source /root/spark/conf/spark-env.sh



> Update Spark AMI to Python 2.7
> --
>
> Key: SPARK-922
> URL: https://issues.apache.org/jira/browse/SPARK-922
> Project: Spark
>  Issue Type: Task
>  Components: EC2, PySpark
>Affects Versions: 0.9.0, 0.9.1, 1.0.0
>Reporter: Josh Rosen
> Fix For: 1.2.0
>
>
> Many Python libraries only support Python 2.7+, so we should make Python 2.7 
> the default Python on the Spark AMIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-922) Update Spark AMI to Python 2.7

2014-09-29 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14150695#comment-14150695
 ] 

Andrew Davidson edited comment on SPARK-922 at 9/29/14 7:03 PM:



Here is how I am launching iPython notebook. I am running as the ec2-user:
IPYTHON_OPTS="notebook --pylab inline --no-browser --port=7000" 
$SPARK_HOME/bin/pyspark

Below are all the upgrade commands I ran.

Any idea what I missed?

Andy

yum install -y pssh
yum install -y python27 python27-devel
pssh -h /root/spark-ec2/slaves yum install -y python27 python27-devel
wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | 
python27
pssh -h /root/spark-ec2/slaves "wget 
https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27"
easy_install-2.7 pip
pssh -h /root/spark-ec2/slaves easy_install-2.7 pip
pip2.7 install numpy
pssh -t0 -h /root/spark-ec2/slaves pip2.7 install numpy
pip2.7 install ipython[all]
printf "\n# Set Spark Python version\nexport 
PYSPARK_PYTHON=/usr/bin/python2.7\n" >> /root/spark/conf/spark-env.sh
source /root/spark/conf/spark-env.sh




was (Author: aedwip):
I must have missed something. I am running iPython notebook over a ssh tunnel. 
I am still running using the old version. I made sure to

export PYSPARK_PYTHON=python2.7

I also tried
export PYSPARK_PYTHON=/usr/bin/python2.7\

import IPython
print IPython.sys_info()
{'commit_hash': '858d539',
 'commit_source': 'installation',
 'default_encoding': 'UTF-8',
 'ipython_path': 
'/usr/lib/python2.6/site-packages/ipython-0.13.2-py2.6.egg/IPython',
 'ipython_version': '0.13.2',
 'os_name': 'posix',
 'platform': 'Linux-3.4.37-40.44.amzn1.x86_64-x86_64-with-glibc2.2.5',
 'sys_executable': '/usr/bin/python2.6',
 'sys_platform': 'linux2',
 'sys_version': '2.6.9 (unknown, Sep 13 2014, 00:25:11) \n[GCC 4.8.2 20140120 
(Red Hat 4.8.2-16)]'}


here is how I am launching iPython notebook. I am running as the ec2-user
IPYTHON_OPTS="notebook --pylab inline --no-browser --port=7000" 
$SPARK_HOME/bin/pyspark

Bellow are all the upgrade commands I ran 

Any idea what I missed?

Andy

yum install -y pssh
yum install -y python27 python27-devel
pssh -h /root/spark-ec2/slaves yum install -y python27 python27-devel
wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | 
python27
pssh -h /root/spark-ec2/slaves "wget 
https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27"
easy_install-2.7 pip
pssh -h /root/spark-ec2/slaves easy_install-2.7 pip
pip2.7 install numpy
pssh -t0 -h /root/spark-ec2/slaves pip2.7 install numpy
pip2.7 install ipython[all]
printf "\n# Set Spark Python version\nexport 
PYSPARK_PYTHON=/usr/bin/python2.7\n" >> /root/spark/conf/spark-env.sh
source /root/spark/conf/spark-env.sh



> Update Spark AMI to Python 2.7
> --
>
> Key: SPARK-922
> URL: https://issues.apache.org/jira/browse/SPARK-922
> Project: Spark
>  Issue Type: Task
>  Components: EC2, PySpark
>Affects Versions: 0.9.0, 0.9.1, 1.0.0
>Reporter: Josh Rosen
> Fix For: 1.2.0
>
>
> Many Python libraries only support Python 2.7+, so we should make Python 2.7 
> the default Python on the Spark AMIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3685) Spark's local dir should accept only local paths

2014-09-29 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152070#comment-14152070
 ] 

Andrew Or commented on SPARK-3685:
--

Yeah, there will be a design doc soon on the possible solutions for dealing with 
shuffles. Note that one of the main motivations for doing this is to free up 
containers in Yarn when an application is not using them, so maintaining a pool 
of executor containers does not achieve what we want. Also, DFS shuffle is only 
one of the solutions we will consider, but we probably won't end up relying on 
it because of the overhead it adds (i.e. we'll probably need a different 
solution down the road either way).

It could be a warning, but I think an exception is appropriate here because the 
user clearly thinks their shuffle files are going into HDFS when they're not. 
Also, the fact that it fails fast means the user knows Spark won't do what 
he/she wants before even a single shuffle file is written. Either way, I don't 
feel strongly about this.
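A hypothetical sketch of what such a fail-fast check could look like (the 
helper name is made up; this is not Spark's actual code):

{code}
import java.net.URI

// Reject any configured local dir that carries a non-local URI scheme.
def validateLocalDirs(localDirs: String): Unit = {
  localDirs.split(",").map(_.trim).filter(_.nonEmpty).foreach { dir =>
    val scheme = new URI(dir).getScheme
    require(scheme == null || scheme == "file",
      s"spark.local.dir must point to local paths, got: $dir")
  }
}

validateLocalDirs("/tmp/spark")     // ok
validateLocalDirs("hdfs:/tmp/foo")  // throws IllegalArgumentException
{code}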



> Spark's local dir should accept only local paths
> 
>
> Key: SPARK-3685
> URL: https://issues.apache.org/jira/browse/SPARK-3685
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>
> When you try to set local dirs to "hdfs:/tmp/foo" it doesn't work. What it 
> will try to do is create a folder called "hdfs:" and put "tmp" inside it. 
> This is because in Util#getOrCreateLocalRootDirs we use java.io.File instead 
> of Hadoop's file system to parse this path. We also need to resolve the path 
> appropriately.
> This may not have an urgent use case, but it fails silently and does what is 
> least expected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3725) Link to building spark returns a 404

2014-09-29 Thread Anant Daksh Asthana (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152061#comment-14152061
 ] 

Anant Daksh Asthana commented on SPARK-3725:


Would this pull request be a good idea? 
https://github.com/apache/spark/pull/2582

> Link to building spark returns a 404
> 
>
> Key: SPARK-3725
> URL: https://issues.apache.org/jira/browse/SPARK-3725
> Project: Spark
>  Issue Type: Documentation
>Reporter: Anant Daksh Asthana
>Priority: Minor
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> The README.md link to "Building Spark" returns a 404



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3725) Link to building spark returns a 404

2014-09-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152050#comment-14152050
 ] 

Sean Owen commented on SPARK-3725:
--

Yes, of course; it's already in the repo and has been for a while. It was just 
renamed, with a redirect from the old URL. But that update hasn't hit the 
public site yet.

> Link to building spark returns a 404
> 
>
> Key: SPARK-3725
> URL: https://issues.apache.org/jira/browse/SPARK-3725
> Project: Spark
>  Issue Type: Documentation
>Reporter: Anant Daksh Asthana
>Priority: Minor
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> The README.md link to "Building Spark" returns a 404



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3725) Link to building spark returns a 404

2014-09-29 Thread Anant Daksh Asthana (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152047#comment-14152047
 ] 

Anant Daksh Asthana commented on SPARK-3725:


Would it make sense to add a "Building Spark" document in the repo? That would 
make the documentation easier to find, and anyone who has the source would have 
the docs as well.

> Link to building spark returns a 404
> 
>
> Key: SPARK-3725
> URL: https://issues.apache.org/jira/browse/SPARK-3725
> Project: Spark
>  Issue Type: Documentation
>Reporter: Anant Daksh Asthana
>Priority: Minor
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> The README.md link to "Building Spark" returns a 404



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3725) Link to building spark returns a 404

2014-09-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152045#comment-14152045
 ] 

Sean Owen commented on SPARK-3725:
--

The link is correct in the doc source, but the public site needs to be rebuilt 
to get the new page referenced by README.md.

> Link to building spark returns a 404
> 
>
> Key: SPARK-3725
> URL: https://issues.apache.org/jira/browse/SPARK-3725
> Project: Spark
>  Issue Type: Documentation
>Reporter: Anant Daksh Asthana
>Priority: Minor
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> The README.md link to "Building Spark" returns a 404



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3728) RandomForest: Learn models too large to store in memory

2014-09-29 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-3728:


 Summary: RandomForest: Learn models too large to store in memory
 Key: SPARK-3728
 URL: https://issues.apache.org/jira/browse/SPARK-3728
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley


Proposal: Write trees to disk as they are learned.

RandomForest currently uses a FIFO queue, which means training all trees at 
once via breadth-first search.  Using a FILO queue would encourage the code to 
finish one tree before moving on to new ones.  This would allow the code to 
write trees to disk as they are learned.

Note: It would also be possible to write nodes to disk as they are learned 
using a FIFO queue, once the example-node mapping is cached [JIRA].  The 
[Sequoia Forest package]() does this.  However, it could be useful to learn 
trees progressively, so that future functionality such as early stopping 
(training fewer trees than expected) could be supported.
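A minimal sketch of the FIFO-vs-FILO point (hypothetical names): with a stack 
(FILO), tree 0's nodes are exhausted before tree 1 starts, so a finished tree 
could be flushed to disk before the next one begins training:

{code}
import scala.collection.mutable

case class NodeTask(treeIndex: Int, depth: Int)

val stack = mutable.Stack[NodeTask]()
stack.push(NodeTask(1, 0)) // tree 1's root, pushed first (processed last)
stack.push(NodeTask(0, 0)) // tree 0's root, on top (processed first)
while (stack.nonEmpty) {
  val task = stack.pop()
  println(s"training tree ${task.treeIndex}, node at depth ${task.depth}")
  if (task.depth < 2) {
    // "Split" the node: push both children so the same tree keeps going.
    stack.push(NodeTask(task.treeIndex, task.depth + 1))
    stack.push(NodeTask(task.treeIndex, task.depth + 1))
  }
}
{code}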




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality

2014-09-29 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-3727:


 Summary: DecisionTree, RandomForest: More prediction functionality
 Key: SPARK-3727
 URL: https://issues.apache.org/jira/browse/SPARK-3727
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley


DecisionTree and RandomForest currently predict the most likely label for 
classification and the mean for regression.  Other info about predictions would 
be useful.

For classification: estimated probability of each possible label
For regression: variance of estimate

RandomForest could also create aggregate predictions in multiple ways:
* Predict mean or median value for regression.
* Compute variance of estimates (across all trees) for both classification and 
regression.
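A sketch of the aggregation just listed (hypothetical helper): mean, median, 
and variance of the per-tree predictions for a single test point:

{code}
def aggregatePredictions(treePredictions: Seq[Double]): (Double, Double, Double) = {
  require(treePredictions.nonEmpty, "need at least one tree")
  val n = treePredictions.size
  val mean = treePredictions.sum / n
  val median = treePredictions.sorted.apply(n / 2) // upper median for even n
  val variance = treePredictions.map(p => math.pow(p - mean, 2)).sum / n
  (mean, median, variance)
}
{code}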




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3726) RandomForest: Support for bootstrap options

2014-09-29 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-3726:


 Summary: RandomForest: Support for bootstrap options
 Key: SPARK-3726
 URL: https://issues.apache.org/jira/browse/SPARK-3726
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley
Priority: Minor


RandomForest uses BaggedPoint to simulate bootstrapped samples of the data.  
The expected size of each sample is the same as the original data (sampling 
rate = 1.0), and sampling is done with replacement.  Adding support for other 
sampling rates and for sampling without replacement would be useful.
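A sketch of the two sampling modes (hypothetical helper; the Poisson draw uses 
Knuth's algorithm, which is suitable for small rates): with replacement, a 
point's weight in one sample is Poisson(rate); without replacement, it is 1 
with probability `rate`, else 0:

{code}
import scala.util.Random

def poisson(lambda: Double, rng: Random): Int = {
  val l = math.exp(-lambda)
  var k = 0
  var p = 1.0
  do { k += 1; p *= rng.nextDouble() } while (p > l)
  k - 1
}

def sampleWeight(rate: Double, withReplacement: Boolean, rng: Random): Int =
  if (withReplacement) poisson(rate, rng)
  else if (rng.nextDouble() < rate) 1 else 0
{code}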




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3725) Link to building spark returns a 404

2014-09-29 Thread Anant Daksh Asthana (JIRA)
Anant Daksh Asthana created SPARK-3725:
--

 Summary: Link to building spark returns a 404
 Key: SPARK-3725
 URL: https://issues.apache.org/jira/browse/SPARK-3725
 Project: Spark
  Issue Type: Documentation
Reporter: Anant Daksh Asthana
Priority: Minor


The README.md link to "Building Spark" returns a 404



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2331) SparkContext.emptyRDD has wrong return type

2014-09-29 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152018#comment-14152018
 ] 

Patrick Wendell edited comment on SPARK-2331 at 9/29/14 6:34 PM:
-

Yeah we could have made this a wider type in the public signature. However, it 
is not possible to do that while maintaining compatibility (others may be 
relying on this returning an EmptyRDD).

This issue is not related to covariance, because here the type parameter is 
always String. So the compiler actually does understand that EmptyRDD[String] 
is a subtype of RDD[String].

The issue with the original example is that the Scala compiler will always try 
to infer the narrowest type it can. So in a foldLeft expression it will by 
default assume the resulting type is EmptyRDD unless you upcast it to a more 
general type, like you are doing. And the union operation requires an exact 
type match on the two RDDs, including the type parameter.


was (Author: pwendell):
Yeah we could have made this a wider type in the public signature. However, it 
is not possible to do that while maintaining compatibility (others may be 
relying on this returning an EmptyRDD).

For now though you can safely cast it to work around this:

{code}
scala> sc.emptyRDD[String].asInstanceOf[RDD[String]]
res7: org.apache.spark.rdd.RDD[String] = EmptyRDD[3] at emptyRDD at :14
{code}

> SparkContext.emptyRDD has wrong return type
> ---
>
> Key: SPARK-2331
> URL: https://issues.apache.org/jira/browse/SPARK-2331
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Ian Hummel
>
> The return type for SparkContext.emptyRDD is EmptyRDD[T].
> It should be RDD[T].  That means you have to add extra type annotations on 
> code like the below (which creates a union of RDDs over some subset of paths 
> in a folder)
> val rdds = Seq("a", "b", "c").foldLeft[RDD[String]](sc.emptyRDD[String]) { 
> (rdd, path) ⇒
>   rdd.union(sc.textFile(path))
> }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3724) RandomForest: More options for feature subset size

2014-09-29 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-3724:


 Summary: RandomForest: More options for feature subset size
 Key: SPARK-3724
 URL: https://issues.apache.org/jira/browse/SPARK-3724
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley
Priority: Minor


RandomForest currently supports using a few values for the number of features 
to sample per node: all, sqrt, log2, etc.  It should support any given value 
(to allow model search).

Proposal: If the parameter for specifying the number of features per node is 
not recognized (as “all”, “sqrt”, etc.), then it will be parsed as a numerical 
value.  The value should be either (a) a real value in [0,1] specifying the 
fraction of features in each subset or (b) an integer value specifying the 
number of features in each subset.
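A sketch of the proposed parsing rule (hypothetical helper, not MLlib's API): 
fall through the named options, then treat a bare integer as a count and a real 
value in [0,1] as a fraction of the total number of features:

{code}
def featureSubsetSize(param: String, numFeatures: Int): Int = param match {
  case "all"  => numFeatures
  case "sqrt" => math.ceil(math.sqrt(numFeatures)).toInt
  case "log2" => math.ceil(math.log(numFeatures) / math.log(2)).toInt
  case s if s.matches("""\d+""") => s.toInt // (b) number of features per subset
  case s => // (a) fraction of the features in each subset
    val fraction = s.toDouble
    require(fraction >= 0.0 && fraction <= 1.0, s"bad feature subset size: $s")
    math.ceil(fraction * numFeatures).toInt
}
{code}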




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2331) SparkContext.emptyRDD has wrong return type

2014-09-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2331.

Resolution: Won't Fix

> SparkContext.emptyRDD has wrong return type
> ---
>
> Key: SPARK-2331
> URL: https://issues.apache.org/jira/browse/SPARK-2331
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Ian Hummel
>
> The return type for SparkContext.emptyRDD is EmptyRDD[T].
> It should be RDD[T].  That means you have to add extra type annotations on 
> code like the below (which creates a union of RDDs over some subset of paths 
> in a folder)
> val rdds = Seq("a", "b", "c").foldLeft[RDD[String]](sc.emptyRDD[String]) { 
> (rdd, path) ⇒
>   rdd.union(sc.textFile(path))
> }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3723) DecisionTree, RandomForest: Add more instrumentation

2014-09-29 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-3723:


 Summary: DecisionTree, RandomForest: Add more instrumentation
 Key: SPARK-3723
 URL: https://issues.apache.org/jira/browse/SPARK-3723
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley
Priority: Minor


Some simple instrumentation would help advanced users understand performance, 
and to check whether parameters (such as maxMemoryInMB) need to be tuned.

Most important instrumentation (simple):
* min, avg, max nodes per group
* number of groups (passes over data)

More advanced instrumentation:
* For each tree (or averaged over trees), training set accuracy after training 
each level.  This would be useful for visualizing learning behavior (to 
convince oneself that model selection was being done correctly).




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2331) SparkContext.emptyRDD has wrong return type

2014-09-29 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152018#comment-14152018
 ] 

Patrick Wendell commented on SPARK-2331:


Yeah we could have made this a wider type in the public signature. However, it 
is not possible to do that while maintaining compatibility (others may be 
relying on this returning an EmptyRDD).

For now though you can safely cast it to work around this:

{code}
scala> sc.emptyRDD[String].asInstanceOf[RDD[String]]
res7: org.apache.spark.rdd.RDD[String] = EmptyRDD[3] at emptyRDD at :14
{code}

> SparkContext.emptyRDD has wrong return type
> ---
>
> Key: SPARK-2331
> URL: https://issues.apache.org/jira/browse/SPARK-2331
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Ian Hummel
>
> The return type for SparkContext.emptyRDD is EmptyRDD[T].
> It should be RDD[T].  That means you have to add extra type annotations on 
> code like the below (which creates a union of RDDs over some subset of paths 
> in a folder)
> val rdds = Seq("a", "b", "c").foldLeft[RDD[String]](sc.emptyRDD[String]) { 
> (rdd, path) ⇒
>   rdd.union(sc.textFile(path))
> }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


