[jira] [Commented] (SPARK-4338) Remove yarn-alpha support

2014-11-11 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206104#comment-14206104
 ] 

Sandy Ryza commented on SPARK-4338:
---

Planning to take a stab at this

 Remove yarn-alpha support
 -

 Key: SPARK-4338
 URL: https://issues.apache.org/jira/browse/SPARK-4338
 Project: Spark
  Issue Type: Sub-task
  Components: YARN
Reporter: Sandy Ryza





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4338) Remove yarn-alpha support

2014-11-11 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-4338:
-

 Summary: Remove yarn-alpha support
 Key: SPARK-4338
 URL: https://issues.apache.org/jira/browse/SPARK-4338
 Project: Spark
  Issue Type: Sub-task
  Components: YARN
Reporter: Sandy Ryza






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3647) Shaded Guava patch causes access issues with package private classes

2014-11-11 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-3647:
--
Fix Version/s: 1.2.0

 Shaded Guava patch causes access issues with package private classes
 

 Key: SPARK-3647
 URL: https://issues.apache.org/jira/browse/SPARK-3647
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
Priority: Critical
 Fix For: 1.2.0


 The patch that introduced shading to Guava (SPARK-2848) tried to maintain 
 backwards compatibility in the Java API by not relocating the Optional 
 class. That causes problems when that class references package private 
 members in the Absent and Present classes, which are now in a different 
 package:
 {noformat}
 Exception in thread main java.lang.IllegalAccessError: tried to access 
 class org.spark-project.guava.common.base.Present from class 
 com.google.common.base.Optional
   at com.google.common.base.Optional.of(Optional.java:86)
   at 
 org.apache.spark.api.java.JavaUtils$.optionToOptional(JavaUtils.scala:25)
   at 
 org.apache.spark.api.java.JavaSparkContext.getSparkHome(JavaSparkContext.scala:542)
 {noformat}
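 For reference, the call at the bottom of that stack trace can be reached with 
 something as small as the sketch below (the app name and local master are 
 placeholders; whether Optional.of is actually hit depends on spark.home being 
 set):
 {code}
 import org.apache.spark.SparkConf
 import org.apache.spark.api.java.JavaSparkContext

 // Minimal path into JavaUtils.optionToOptional: getSparkHome returns a Guava
 // Optional, and with the partially relocated Guava classes Optional.of can
 // throw the IllegalAccessError shown above.
 val jsc = new JavaSparkContext(
   new SparkConf().setAppName("guava-optional-repro").setMaster("local"))
 println(jsc.getSparkHome())
 jsc.stop()
 {code}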



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3647) Shaded Guava patch causes access issues with package private classes

2014-11-11 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206112#comment-14206112
 ] 

Andrew Ash commented on SPARK-3647:
---

Based on poking at the git repo below I'm marking with a fix version of 1.2.0 
(the next release on branch-1.2)

{noformat}
aash@aash-mbp ~/git/spark$ git log origin/master | grep SPARK-3647
[SPARK-3647] Add more exceptions to Guava relocation.
Closes #2496 from vanzin/SPARK-3647 and squashes the following commits:
84f58d7 [Marcelo Vanzin] [SPARK-3647] Add more exceptions to Guava 
relocation.
aash@aash-mbp ~/git/spark$ git log origin/branch-1.0 | grep SPARK-3647
aash@aash-mbp ~/git/spark$ git log origin/branch-1.1 | grep SPARK-3647
aash@aash-mbp ~/git/spark$ git log origin/branch-1.2 | grep SPARK-3647
[SPARK-3647] Add more exceptions to Guava relocation.
Closes #2496 from vanzin/SPARK-3647 and squashes the following commits:
84f58d7 [Marcelo Vanzin] [SPARK-3647] Add more exceptions to Guava 
relocation.
aash@aash-mbp ~/git/spark$
{noformat}

 Shaded Guava patch causes access issues with package private classes
 

 Key: SPARK-3647
 URL: https://issues.apache.org/jira/browse/SPARK-3647
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
Priority: Critical
 Fix For: 1.2.0


 The patch that introduced shading to Guava (SPARK-2848) tried to maintain 
 backwards compatibility in the Java API by not relocating the Optional 
 class. That causes problems when that class references package private 
 members in the Absent and Present classes, which are now in a different 
 package:
 {noformat}
 Exception in thread main java.lang.IllegalAccessError: tried to access 
 class org.spark-project.guava.common.base.Present from class 
 com.google.common.base.Optional
   at com.google.common.base.Optional.of(Optional.java:86)
   at 
 org.apache.spark.api.java.JavaUtils$.optionToOptional(JavaUtils.scala:25)
   at 
 org.apache.spark.api.java.JavaSparkContext.getSparkHome(JavaSparkContext.scala:542)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4339) Use configuration instead of constant

2014-11-11 Thread DoingDone9 (JIRA)
DoingDone9 created SPARK-4339:
-

 Summary: Use configuration instead of constant
 Key: SPARK-4339
 URL: https://issues.apache.org/jira/browse/SPARK-4339
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: DoingDone9
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-572) Forbid update of static mutable variables

2014-11-11 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206143#comment-14206143
 ] 

Andrew Ash commented on SPARK-572:
--

Static mutable variables are now a standard way of having code run on a 
per-executor basis.

To run code per entry you can use map(), and per partition you can use 
mapPartitions(), but to run it per executor you need static variables or 
initializers. If, for example, you want to open a connection to another data 
storage system and write all of an executor's data into that system, a static 
connection object is the common way to do that.
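To make that concrete, here is a minimal sketch of the pattern, assuming a 
made-up DataStoreClient class in place of a real client library (everything 
except the Spark calls is illustrative):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Stand-in for a real client to an external data store; substitute your own.
class DataStoreClient(host: String) {
  def write(record: String): Unit = println(s"[$host] $record")
}

// Static holder: the lazy val is initialized at most once per executor JVM,
// the first time a task on that executor touches it.
object DataStoreConnection {
  lazy val client = new DataStoreClient("store-host:9999")
}

object PerExecutorInitExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("per-executor-init").setMaster("local[2]"))
    sc.parallelize(1 to 1000).foreachPartition { records =>
      // Runs once per partition, but all partitions on the same executor
      // share the single connection held by the object above.
      records.foreach(r => DataStoreConnection.client.write(r.toString))
    }
    sc.stop()
  }
}
{code}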

I would propose closing this ticket as Won't Fix.  Using this technique is 
confusing, but prohibiting it is difficult and introduces additional roadblocks 
to Spark power users.

cc [~rxin]

 Forbid update of static mutable variables
 -

 Key: SPARK-572
 URL: https://issues.apache.org/jira/browse/SPARK-572
 Project: Spark
  Issue Type: Improvement
Reporter: tjhunter

 Consider the following piece of code:
 <pre>
 object Foo {
  var xx = -1
  def main() {
xx = 1
val sc = new SparkContext(...)
sc.broadcast(xx)
sc.parallelize(0 to 10).map(i={ ... xx ...})
  }
 }
 </pre>
 Can you guess the value of xx? It is 1 when you use the local scheduler and 
 -1 when you use the mesos scheduler. Given the complications, it should 
 probably just be forbidden for now...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4339) Use configuration instead of constant

2014-11-11 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-4339:
--
Description: fixedPoint limits the max number of iterations; it should be 
configurable.

 Use configuration instead of constant
 -

 Key: SPARK-4339
 URL: https://issues.apache.org/jira/browse/SPARK-4339
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: DoingDone9
Priority: Minor

 fixedPoint limits the max number of iterations; it should be configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-572) Forbid update of static mutable variables

2014-11-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-572.
-
Resolution: Won't Fix

Closing this as won't fix since it is very hard to enforce and we do abuse it 
to run stateful computation.

 Forbid update of static mutable variables
 -

 Key: SPARK-572
 URL: https://issues.apache.org/jira/browse/SPARK-572
 Project: Spark
  Issue Type: Improvement
Reporter: tjhunter

 Consider the following piece of code:
 <pre>
 object Foo {
  var xx = -1
  def main() {
xx = 1
val sc = new SparkContext(...)
sc.broadcast(xx)
sc.parallelize(0 to 10).map(i={ ... xx ...})
  }
 }
 </pre>
 Can you guess the value of xx? It is 1 when you use the local scheduler and 
 -1 when you use the mesos scheduler. Given the complications, it should 
 probably just be forbidden for now...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4339) Make fixedPoint Configurable in Analyzer.scala

2014-11-11 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-4339:
--
Summary: Make fixedPoint Configurable in Analyzer.scala  (was: Use 
configuration instead of constant)

 Make fixedPoint Configurable in Analyzer.scala
 --

 Key: SPARK-4339
 URL: https://issues.apache.org/jira/browse/SPARK-4339
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: DoingDone9
Priority: Minor

 fixedPoint  limits the max number of iterations,it should be Configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4339) Make fixedPoint Configurable in Analyzer.scala

2014-11-11 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-4339:
--
Description: 
fixedPoint limits the max number of iterations; it should be configurable. But 
it is a constant in Analyzer.scala.


  was: fixedPoint limits the max number of iterations; it should be configurable.


 Make fixedPoint Configurable in Analyzer.scala
 --

 Key: SPARK-4339
 URL: https://issues.apache.org/jira/browse/SPARK-4339
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: DoingDone9
Priority: Minor

 fixedPoint limits the max number of iterations; it should be configurable. But 
 it is a constant in Analyzer.scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4339) Make fixedPoint Configurable in Analyzer

2014-11-11 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-4339:
--
Summary: Make fixedPoint Configurable in Analyzer  (was: Make fixedPoint 
Configurable in Analyzer.scala)

 Make fixedPoint Configurable in Analyzer
 

 Key: SPARK-4339
 URL: https://issues.apache.org/jira/browse/SPARK-4339
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: DoingDone9
Priority: Minor

 fixedPoint limits the max number of iterations; it should be configurable. But 
 it is a constant in Analyzer.scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4339) Make fixedPoint Configurable in Analyzer

2014-11-11 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-4339:
--
Description: 
fixedPoint limits the max number of iterations; it should be configurable. But 
it is a constant in Analyzer.scala, like val fixedPoint = FixedPoint(100).


  was:
fixedPoint limits the max number of iterations; it should be configurable. But 
it is a constant in Analyzer.scala, like val fixedPoint = FixedPoint(100).



 Make fixedPoint Configurable in Analyzer
 

 Key: SPARK-4339
 URL: https://issues.apache.org/jira/browse/SPARK-4339
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: DoingDone9
Priority: Minor

 fixedPoint limits the max number of iterations; it should be configurable. But 
 it is a constant in Analyzer.scala, like val fixedPoint = 
 FixedPoint(100).
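For illustration, the shape of the requested change might look like the sketch 
below; the MiniAnalyzer class and the config key spark.sql.analyzer.maxIterations 
are made up for the example and are not the real catalyst Analyzer:

{code}
// Before: the limit is a hard-coded constant, roughly
//   val fixedPoint = FixedPoint(100)
// After (sketch): read the limit from configuration, defaulting to 100.
case class FixedPoint(maxIterations: Int)

class MiniAnalyzer(conf: Map[String, String]) {
  val fixedPoint = FixedPoint(
    conf.getOrElse("spark.sql.analyzer.maxIterations", "100").toInt)
}

// Usage: an analyzer built with a custom iteration limit.
val analyzer = new MiniAnalyzer(Map("spark.sql.analyzer.maxIterations" -> "50"))
println(analyzer.fixedPoint.maxIterations)  // 50
{code}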



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4339) Make fixedPoint Configurable in Analyzer

2014-11-11 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-4339:
--
Description: 
fixedPoint limits the max number of iterations; it should be configurable. 
But it is a constant in Analyzer.scala, like val fixedPoint = 
FixedPoint(100).


  was:
fixedPoint limits the max number of iterations; it should be configurable. But 
it is a constant in Analyzer.scala, like val fixedPoint = FixedPoint(100).



 Make fixedPoint Configurable in Analyzer
 

 Key: SPARK-4339
 URL: https://issues.apache.org/jira/browse/SPARK-4339
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: DoingDone9
Priority: Minor

 fixedPoint limits the max number of iterations; it should be configurable. 
 But it is a constant in Analyzer.scala, like val fixedPoint = 
 FixedPoint(100).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4339) Make fixedPoint Configurable in Analyzer

2014-11-11 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-4339:
--
Description: 
fixedPoint limits the max number of iterations; it should be configurable. But 
it is a constant in Analyzer.scala, like val fixedPoint = FixedPoint(100).


  was:
fixedPoint limits the max number of iterations; it should be configurable. But 
it is a constant in Analyzer.scala.



 Make fixedPoint Configurable in Analyzer
 

 Key: SPARK-4339
 URL: https://issues.apache.org/jira/browse/SPARK-4339
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: DoingDone9
Priority: Minor

 fixedPoint limits the max number of iterations; it should be configurable. But 
 it is a constant in Analyzer.scala, like val fixedPoint = FixedPoint(100).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-632) Akka system names need to be normalized (since they are case-sensitive)

2014-11-11 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206158#comment-14206158
 ] 

Andrew Ash commented on SPARK-632:
--

// link moved to 
http://doc.akka.io/docs/akka/current/additional/faq.html#what-is-the-name-of-a-remote-actor

I believe having the hostname change case will still break Spark.  But after a 
search of the dev and user mailing lists over the past year I haven't seen any 
other users with this issue.

A potential fix could be to call .toLower on the hostname in the Akka string 
across the cluster, but it's a little dirty to make this assumption everywhere.

Technically [hostnames ARE case 
insensitive|http://serverfault.com/questions/261341/is-the-hostname-case-sensitive]
 so Spark's behavior is wrong, but the issue is in the underlying Akka library. 
This is the same underlying behavior where Akka requires that hostnames match 
exactly as well -- you can't use an IP address to refer to an Akka system 
listening on a hostname -- SPARK-625.

Until Akka handles differently-cased hostnames, I think this can only be fixed 
with an ugly workaround.

Possibly relevant Akka issues:
- https://github.com/akka/akka/issues/15990
- https://github.com/akka/akka/issues/15007

My preference would be to close this as Won't Fix until it's raised again as 
a problem from the community.

cc [~rxin]

 Akka system names need to be normalized (since they are case-sensitive)
 ---

 Key: SPARK-632
 URL: https://issues.apache.org/jira/browse/SPARK-632
 Project: Spark
  Issue Type: Bug
Reporter: Matt Massie

 The system name of the Akka full path is case-sensitive (see 
 http://akka.io/faq/#what_is_the_name_of_a_remote_actor).
 Since DNS names are case-insensitive and we're using them in the system 
 name, we need to normalize them (e.g. make them all lowercase).  Otherwise, 
 users will find the workers will not be able to connect with the master 
 even though the URI appears to be correct.
 For example, Berkeley DNS occasionally uses names like foo.Berkley.EDU. If I 
 used foo.berkeley.edu as the master address, the workers would write to 
 their logs that they are connecting to foo.berkeley.edu but fail to. They 
 never show up in the master UI.  If I use the foo.Berkeley.EDU address, 
 everything works as it should.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4340) add java opts argument substitute to avoid gc log overwritten

2014-11-11 Thread Haitao Yao (JIRA)
Haitao Yao created SPARK-4340:
-

 Summary: add java opts argument substitute to avoid gc log 
overwritten
 Key: SPARK-4340
 URL: https://issues.apache.org/jira/browse/SPARK-4340
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Haitao Yao
Priority: Minor


In standalone mode, if multiple executors are assigned to one host, the GC log 
will be overwritten. So I added {{CORE_ID}}, {{EXECUTOR_ID}} and {{APP_ID}} 
substitutions to support configuring the log path with APP_ID.

Here's the pull request:
https://github.com/apache/spark/pull/3205
Thanks.
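For illustration, the kind of configuration this change is meant to enable might 
look like the sketch below; the {{APP_ID}}/{{EXECUTOR_ID}} placeholder 
substitution is what the pull request proposes, not something Spark already 
does, so the exact syntax is an assumption:

{code}
import org.apache.spark.SparkConf

// Without per-executor substitution, every executor on a host writes to the
// same -Xloggc file and the GC logs overwrite each other.
val conf = new SparkConf()
  .setAppName("gc-log-per-executor")
  .set("spark.executor.extraJavaOptions",
    "-XX:+PrintGCDetails -Xloggc:/var/log/spark/gc-{{APP_ID}}-{{EXECUTOR_ID}}.log")
{code}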



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-689) Task will crash when setting SPARK_WORKER_CORES > 128

2014-11-11 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-689:
-
Description: 
When I set SPARK_WORKER_CORES > 128 (for example 200) and run a job in 
standalone mode that allocates 200 tasks on one worker node, the tasks crash 
(it seems that the worker core count has been hard-coded).

{noformat}
13/02/07 11:25:02 ERROR StandaloneExecutorBackend: Task 
spark.executor.Executor$TaskRunner@5367839e rejected from 
java.util.concurrent.ThreadPoolExecutor@30f224d9[Running, pool size = 128, 
active threads = 128, queued tasks = 0, completed tasks = 0]
java.util.concurrent.RejectedExecutionException: Task 
spark.executor.Executor$TaskRunner@5367839e rejected from 
java.util.concurrent.ThreadPoolExecutor@30f224d9[Running, pool size = 128, 
active threads = 128, queued tasks = 0, completed tasks = 0]
at 
java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2013)
at 
java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:816)
at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1337)
at spark.executor.Executor.launchTask(Executor.scala:59)
at 
spark.executor.StandaloneExecutorBackend$$anonfun$receive$1.apply(StandaloneExecutorBackend.scala:57)
at 
spark.executor.StandaloneExecutorBackend$$anonfun$receive$1.apply(StandaloneExecutorBackend.scala:46)
at akka.actor.Actor$class.apply(Actor.scala:318)
at 
spark.executor.StandaloneExecutorBackend.apply(StandaloneExecutorBackend.scala:17)
at akka.actor.ActorCell.invoke(ActorCell.scala:626)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:197)
at akka.dispatch.Mailbox.run(Mailbox.scala:179)
at 
akka.dispatch.ForkJoinExecutorConfigurator$MailboxExecutionTask.exec(AbstractDispatcher.scala:516)
at akka.jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:259)
at akka.jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975)
at akka.jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
at akka.jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
13/02/07 11:25:02 INFO StandaloneExecutorBackend: Connecting to master: 
akka://spark@10.0.2.19:60882/user/StandaloneScheduler
13/02/07 11:25:02 INFO StandaloneExecutorBackend: Got assigned task 1929
13/02/07 11:25:02 INFO Executor: launch taskId: 1929
13/02/07 11:25:02 ERROR StandaloneExecutorBackend: 
java.lang.NullPointerException
at spark.executor.Executor.launchTask(Executor.scala:59)
at 
spark.executor.StandaloneExecutorBackend$$anonfun$receive$1.apply(StandaloneExecutorBackend.scala:57)
at 
spark.executor.StandaloneExecutorBackend$$anonfun$receive$1.apply(StandaloneExecutorBackend.scala:46)
at akka.actor.Actor$class.apply(Actor.scala:318)
at 
spark.executor.StandaloneExecutorBackend.apply(StandaloneExecutorBackend.scala:17)
at akka.actor.ActorCell.invoke(ActorCell.scala:626)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:197)
at akka.dispatch.Mailbox.run(Mailbox.scala:179)
at 
akka.dispatch.ForkJoinExecutorConfigurator$MailboxExecutionTask.exec(AbstractDispatcher.scala:516)
at akka.jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:259)
at akka.jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975)
at akka.jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
at akka.jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
13/02/07 11:25:02 INFO StandaloneExecutorBackend: Connecting to master: 
akka://spark@10.0.2.19:60882/user/StandaloneScheduler
13/02/07 11:25:02 INFO StandaloneExecutorBackend: Got assigned task 1930
13/02/07 11:25:02 INFO Executor: launch taskId: 1930
13/02/07 11:25:02 ERROR StandaloneExecutorBackend: 
java.lang.NullPointerException
at spark.executor.Executor.launchTask(Executor.scala:59)
at 
spark.executor.StandaloneExecutorBackend$$anonfun$receive$1.apply(StandaloneExecutorBackend.scala:57)
at 
spark.executor.StandaloneExecutorBackend$$anonfun$receive$1.apply(StandaloneExecutorBackend.scala:46)
at akka.actor.Actor$class.apply(Actor.scala:318)
at 
spark.executor.StandaloneExecutorBackend.apply(StandaloneExecutorBackend.scala:17)
at akka.actor.ActorCell.invoke(ActorCell.scala:626)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:197)
at akka.dispatch.Mailbox.run(Mailbox.scala:179)
at 
akka.dispatch.ForkJoinExecutorConfigurator$MailboxExecutionTask.exec(AbstractDispatcher.scala:516)
at akka.jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:259)
at akka.jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975)
at akka.jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
at akka.jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

[jira] [Commented] (SPARK-650) Add a setup hook API for running initialization code on each executor

2014-11-11 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206182#comment-14206182
 ] 

Andrew Ash commented on SPARK-650:
--

As mentioned in SPARK-572 static classes' initialization methods are being 
abused to perform this functionality.

[~matei] do you still feel that a per-executor initialization function is a 
hook that Spark should expose in its public API?

 Add a setup hook API for running initialization code on each executor
 ---

 Key: SPARK-650
 URL: https://issues.apache.org/jira/browse/SPARK-650
 Project: Spark
  Issue Type: New Feature
Reporter: Matei Zaharia
Priority: Minor

 Would be useful to configure things like reporting libraries



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-632) Akka system names need to be normalized (since they are case-sensitive)

2014-11-11 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206197#comment-14206197
 ] 

Reynold Xin commented on SPARK-632:
---

Sounds good. In the future we might roll our own RPC rather than using Actor 
for RPC. I think the current RPC library built for the shuffle service is 
already ok with case insensitive hostnames.

 Akka system names need to be normalized (since they are case-sensitive)
 ---

 Key: SPARK-632
 URL: https://issues.apache.org/jira/browse/SPARK-632
 Project: Spark
  Issue Type: Bug
Reporter: Matt Massie

 The system name of the Akka full path is case-sensitive (see 
 http://akka.io/faq/#what_is_the_name_of_a_remote_actor).
 Since DNS names are case-insensitive and we're using them in the system 
 name, we need to normalize them (e.g. make them all lowercase).  Otherwise, 
 users will find the workers will not be able to connect with the master 
 even though the URI appears to be correct.
 For example, Berkeley DNS occasionally uses names like foo.Berkley.EDU. If I 
 used foo.berkeley.edu as the master address, the workers would write to 
 their logs that they are connecting to foo.berkeley.edu but fail to. They 
 never show up in the master UI.  If I use the foo.Berkeley.EDU address, 
 everything works as it should.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-632) Akka system names need to be normalized (since they are case-sensitive)

2014-11-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-632.
-
Resolution: Fixed

 Akka system names need to be normalized (since they are case-sensitive)
 ---

 Key: SPARK-632
 URL: https://issues.apache.org/jira/browse/SPARK-632
 Project: Spark
  Issue Type: Bug
Reporter: Matt Massie

 The system name of the Akka full path is case-sensitive (see 
 http://akka.io/faq/#what_is_the_name_of_a_remote_actor).
 Since DNS names are case-insensitive and we're using them in the system 
 name, we need to normalize them (e.g. make them all lowercase).  Otherwise, 
 users will find the workers will not be able to connect with the master 
 even though the URI appears to be correct.
 For example, Berkeley DNS occasionally uses names like foo.Berkley.EDU. If I 
 used foo.berkeley.edu as the master address, the workers would write to 
 their logs that they are connecting to foo.berkeley.edu but fail to. They 
 never show up in the master UI.  If I use the foo.Berkeley.EDU address, 
 everything works as it should.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4341) Spark need to set num-executors automatically

2014-11-11 Thread Hong Shen (JIRA)
Hong Shen created SPARK-4341:


 Summary: Spark need to set num-executors automatically
 Key: SPARK-4341
 URL: https://issues.apache.org/jira/browse/SPARK-4341
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Hong Shen


A MapReduce job can set the number of map tasks automatically, but in Spark we 
have to set num-executors, executor memory and cores. It's difficult for users 
to set these args, especially for users who want to use Spark SQL. So when the 
user hasn't set num-executors, Spark should set num-executors automatically 
according to the input partitions.
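For context, these are the knobs the report refers to; a minimal sketch of 
setting them by hand (values are examples, and spark.executor.instances as a 
config key applies to YARN deployments):

{code}
import org.apache.spark.SparkConf

// Today the user has to size these explicitly; the request is for Spark to
// derive sensible values from the input partitions when they are unset.
val conf = new SparkConf()
  .setAppName("manual-executor-sizing")
  .set("spark.executor.instances", "10")  // --num-executors
  .set("spark.executor.memory", "4g")     // --executor-memory
  .set("spark.executor.cores", "2")       // --executor-cores
{code}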



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4341) Spark need to set num-executors automatically

2014-11-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206216#comment-14206216
 ] 

Sean Owen commented on SPARK-4341:
--

I don't agree with this. How would Spark know a priori the number and spec of 
machines you have? How would it know how to balance its desire to grab it all 
against yours not to commit everything to Spark?

MapReduce does *not* set the amount of resource it consumes on the host machine 
automatically. This is up to the administrator.

The number of map tasks in a job is set by MR, but that's different. Spark does 
the same thing already since it uses the same InputSplits. MR does not set the 
number of reducers.

 Spark need to set num-executors automatically
 -

 Key: SPARK-4341
 URL: https://issues.apache.org/jira/browse/SPARK-4341
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Hong Shen

 A MapReduce job can set the number of map tasks automatically, but in Spark we 
 have to set num-executors, executor memory and cores. It's difficult for users 
 to set these args, especially for users who want to use Spark SQL. So when the 
 user hasn't set num-executors, Spark should set num-executors automatically 
 according to the input partitions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4314) Exception when textFileStream attempts to read deleted _COPYING_ file

2014-11-11 Thread maji2014 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

maji2014 updated SPARK-4314:

Description: 
[Reproduce]
 1. Run the HdfsWordCount example, i.e. ssc.textFileStream(args(0))
 2. Upload a file to HDFS (the reason follows)
 3. The exception below is thrown.

[Exception stack]
 14/11/10 01:21:19 DEBUG Client: IPC Client (842425021) connection to 
master/192.168.84.142:9000 from ocdc sending #13
 14/11/10 01:21:19 ERROR JobScheduler: Error generating jobs for time 
1415611274000 ms
 org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
not exist: hdfs://master:9000/user/spark/200.COPYING
 at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285)
 at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340)
 at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at 
org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:125)
 at 
org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:124)
 at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD(FileInputDStream.scala:124)
 at 
org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:83)
 at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
 at 
org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
 at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
 at 
org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
 at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
 at 
org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40)
 at 
org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at scala.collection.AbstractTraversable.map(Traversable.scala:105)
 at 
org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:40)
 at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
 at 
org.apache.spark.streaming.dstream.ShuffledDStream.compute(ShuffledDStream.scala:41)
 at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
 at 
org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
 at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
 at 
org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
 at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
 at 
org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
 at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
 at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
 at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
 at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
 at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
 at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
 at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:115)
 at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:221)
 at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:221)
 at scala.util.Try$.apply(Try.scala:161)
 at 
org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:221)
 at 
org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:165)
 at 

[jira] [Issue Comment Deleted] (SPARK-4314) Exception when textFileStream attempts to read deleted _COPYING_ file

2014-11-11 Thread maji2014 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

maji2014 updated SPARK-4314:

Comment: was deleted

(was: Yes, not all of these intermediate states are caught. 
I want to add the following code to the defaultFilter method in 
FileInputDStream. Any suggestions?)
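A sketch of the kind of filter the deleted comment alludes to (the function name 
is illustrative; the real defaultFilter in FileInputDStream may differ):

{code}
import org.apache.hadoop.fs.Path

// Skip hidden files and files still being copied into HDFS; `hadoop fs -put`
// writes to a temporary name ending in _COPYING_ until the copy finishes.
def candidateFileFilter(path: Path): Boolean = {
  val name = path.getName
  !name.startsWith(".") && !name.endsWith("._COPYING_")
}
{code}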

 Exception when textFileStream attempts to read deleted _COPYING_ file
 -

 Key: SPARK-4314
 URL: https://issues.apache.org/jira/browse/SPARK-4314
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: maji2014

 [Reproduce]
  1. Run the HdfsWordCount example, i.e. ssc.textFileStream(args(0))
  2. Upload a file to HDFS (the reason follows)
  3. The exception below is thrown.
 [Exception stack]
  14/11/10 01:21:19 DEBUG Client: IPC Client (842425021) connection to 
 master/192.168.84.142:9000 from ocdc sending #13
  14/11/10 01:21:19 ERROR JobScheduler: Error generating jobs for time 
 1415611274000 ms
  org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
 not exist: hdfs://master:9000/user/spark/200.COPYING
  at 
 org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285)
  at 
 org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340)
  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
  at 
 org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:125)
  at 
 org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:124)
  at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at 
 org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD(FileInputDStream.scala:124)
  at 
 org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:83)
  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
  at 
 org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
  at 
 org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
  at 
 org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40)
  at 
 org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40)
  at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.immutable.List.foreach(List.scala:318)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
  at 
 org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:40)
  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
  at 
 org.apache.spark.streaming.dstream.ShuffledDStream.compute(ShuffledDStream.scala:41)
  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
  at 
 org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
  at 
 org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291)
  at 
 org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
  at 
 org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
  at 
 org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
  at 
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
  at 
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
  at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
  at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
  at 
 

[jira] [Resolved] (SPARK-4295) [External]Exception throws in SparkSinkSuite although all test cases pass

2014-11-11 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-4295.
--
   Resolution: Fixed
Fix Version/s: 1.2.0
   1.1.1

 [External]Exception throws in SparkSinkSuite although all test cases pass
 -

 Key: SPARK-4295
 URL: https://issues.apache.org/jira/browse/SPARK-4295
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.1.0
Reporter: maji2014
Priority: Minor
 Fix For: 1.1.1, 1.2.0


 [reproduce]
 Run test suite normally, after the first test case, all other test cases 
 throw javax.management.InstanceAlreadyExistsException: 
 org.apache.flume.channel:type=null 
 [exception stack]
 exception as followings:
 14/11/07 00:24:51 ERROR MonitoredCounterGroup: Failed to register monitored 
 counter group for type: CHANNEL, name: null
 javax.management.InstanceAlreadyExistsException: 
 org.apache.flume.channel:type=null
   at com.sun.jmx.mbeanserver.Repository.addMBean(Repository.java:437)
   at 
 com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerWithRepository(DefaultMBeanServerInterceptor.java:1898)
   at 
 com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerDynamicMBean(DefaultMBeanServerInterceptor.java:966)
   at 
 com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerObject(DefaultMBeanServerInterceptor.java:900)
   at 
 com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerMBean(DefaultMBeanServerInterceptor.java:324)
   at 
 com.sun.jmx.mbeanserver.JmxMBeanServer.registerMBean(JmxMBeanServer.java:522)
   at 
 org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:108)
   at 
 org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:88)
   at org.apache.flume.channel.MemoryChannel.start(MemoryChannel.java:345)
   at 
 org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$2.apply$mcV$sp(SparkSinkSuite.scala:63)
   at 
 org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$2.apply(SparkSinkSuite.scala:61)
   at 
 org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$2.apply(SparkSinkSuite.scala:61)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
   at org.scalatest.Suite$class.run(Suite.scala:1424)
   at 
 org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
   at org.scalatest.FunSuite.run(FunSuite.scala:1555)
   at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:55)
   at 
 org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2563)
   at 
 

[jira] [Resolved] (SPARK-2492) KafkaReceiver minor changes to align with Kafka 0.8

2014-11-11 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-2492.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

 KafkaReceiver minor changes to align with Kafka 0.8 
 

 Key: SPARK-2492
 URL: https://issues.apache.org/jira/browse/SPARK-2492
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.0.0
Reporter: Saisai Shao
Assignee: Saisai Shao
Priority: Minor
 Fix For: 1.2.0


 Update to delete Zookeeper metadata when Kafka's parameter 
 auto.offset.reset is set to smallest, which aligns with Kafka 0.8's 
 ConsoleConsumer.
 Also use the API Kafka offers instead of using zkClient directly.
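A sketch of the consumer configuration being referred to, passed through 
KafkaUtils.createStream (the Zookeeper address, group id and topic are 
placeholders):

{code}
import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

// auto.offset.reset=smallest is the setting whose Zookeeper metadata handling
// this change aligns with Kafka 0.8's ConsoleConsumer.
def smallestOffsetStream(ssc: StreamingContext) = {
  val kafkaParams = Map(
    "zookeeper.connect" -> "zk-host:2181",
    "group.id" -> "example-group",
    "auto.offset.reset" -> "smallest")
  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Map("events" -> 1), StorageLevel.MEMORY_AND_DISK_SER_2)
}
{code}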



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4341) Spark need to set num-executors automatically

2014-11-11 Thread Hong Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206263#comment-14206263
 ] 

Hong Shen commented on SPARK-4341:
--

There must be some relation between inputSplits, num-executors and Spark 
parallelism. For example, if the number of inputSplits (which determines the 
partitions of the input RDD) is less than num-executors, it leads to a waste of 
resources; if inputSplits is much bigger than num-executors, the job will take 
a long time. The same holds between num-executors and Spark parallelism. So if 
we want Spark to be widely used, these should be set by Spark automatically.

 Spark need to set num-executors automatically
 -

 Key: SPARK-4341
 URL: https://issues.apache.org/jira/browse/SPARK-4341
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Hong Shen

 A MapReduce job can set the number of map tasks automatically, but in Spark we 
 have to set num-executors, executor memory and cores. It's difficult for users 
 to set these args, especially for users who want to use Spark SQL. So when the 
 user hasn't set num-executors, Spark should set num-executors automatically 
 according to the input partitions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4326) unidoc is broken on master

2014-11-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206265#comment-14206265
 ] 

Sean Owen commented on SPARK-4326:
--

Hm. {{hashInt}} isn't in Guava 11, but is in 12. This leads me to believe that 
unidoc is picking up Guava 11 from Hadoop, and not Guava 14 from Spark since 
it's shaded. I would like to phone a friend: [~vanzin]

 unidoc is broken on master
 --

 Key: SPARK-4326
 URL: https://issues.apache.org/jira/browse/SPARK-4326
 Project: Spark
  Issue Type: Bug
  Components: Build, Documentation
Affects Versions: 1.3.0
Reporter: Xiangrui Meng

 On master, `jekyll build` throws the following error:
 {code}
 [error] 
 /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/AppendOnlyMap.scala:205:
  value hashInt is not a member of com.google.common.hash.HashFunction
 [error]   private def rehash(h: Int): Int = 
 Hashing.murmur3_32().hashInt(h).asInt()
 [error]  ^
 [error] 
 /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala:426:
  value limit is not a member of object com.google.common.io.ByteStreams
 [error] val bufferedStream = new 
 BufferedInputStream(ByteStreams.limit(fileStream, end - start))
 [error]  ^
 [error] 
 /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala:558:
  value limit is not a member of object com.google.common.io.ByteStreams
 [error] val bufferedStream = new 
 BufferedInputStream(ByteStreams.limit(fileStream, end - start))
 [error]  ^
 [error] 
 /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala:261:
  value hashInt is not a member of com.google.common.hash.HashFunction
 [error]   private def hashcode(h: Int): Int = 
 Hashing.murmur3_32().hashInt(h).asInt()
 [error]^
 [error] 
 /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/Utils.scala:37:
  type mismatch;
 [error]  found   : java.util.Iterator[T]
 [error]  required: Iterable[?]
 [error] collectionAsScalaIterable(ordering.leastOf(asJavaIterator(input), 
 num)).iterator
 [error]  ^
 [error] 
 /Users/meng/src/spark/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala:421:
  value putAll is not a member of 
 com.google.common.cache.Cache[org.apache.hadoop.fs.FileStatus,parquet.hadoop.Footer]
 [error]   footerCache.putAll(newFooters)
 [error]   ^
 [warn] 
 /Users/meng/src/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/parquet/FakeParquetSerDe.scala:34:
  @deprecated now takes two arguments; see the scaladoc.
 [warn] @deprecated(No code should depend on FakeParquetHiveSerDe as it is 
 only intended as a  +
 [warn]  ^
 [info] No documentation generated with unsucessful compiler run
 [warn] two warnings found
 [error] 6 errors found
 [error] (spark/scalaunidoc:doc) Scaladoc generation failed
 [error] Total time: 48 s, completed Nov 10, 2014 1:31:01 PM
 {code}
 It doesn't happen on branch-1.2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4340) add java opts argument substitute to avoid gc log overwritten

2014-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206286#comment-14206286
 ] 

Apache Spark commented on SPARK-4340:
-

User 'haitaoyao' has created a pull request for this issue:
https://github.com/apache/spark/pull/3205

 add java opts argument substitute to avoid gc log overwritten
 -

 Key: SPARK-4340
 URL: https://issues.apache.org/jira/browse/SPARK-4340
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Haitao Yao
Priority: Minor

 In standalone mode, if multiple executors are assigned to one host, the GC log 
 will be overwritten. So I added {{CORE_ID}}, {{EXECUTOR_ID}} and {{APP_ID}} 
 substitutions to support configuring the log path with APP_ID.
 Here's the pull request:
 https://github.com/apache/spark/pull/3205
 Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4341) Spark need to set num-executors automatically

2014-11-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206288#comment-14206288
 ] 

Sean Owen commented on SPARK-4341:
--

Yes, but the executors live as long as the app does. The app may invoke lots 
of operations, large and small, each with a different number of partitions. It 
is not like MR, where one MR job executes one map and one reduce. Too many 
splits do not waste resources; they mean you incur the overhead of launching 
more tasks, but that's relatively small.

Concretely, how do you propose to set this automatically?

 Spark need to set num-executors automatically
 -

 Key: SPARK-4341
 URL: https://issues.apache.org/jira/browse/SPARK-4341
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Hong Shen

 A MapReduce job can set the number of map tasks automatically, but in Spark we 
 have to set num-executors, executor memory and cores. It's difficult for users 
 to set these args, especially for users who want to use Spark SQL. So when the 
 user hasn't set num-executors, Spark should set num-executors automatically 
 according to the input partitions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4342) connection ack timeout improvement, replace Timer with ScheudledExecutor...

2014-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206305#comment-14206305
 ] 

Apache Spark commented on SPARK-4342:
-

User 'haitaoyao' has created a pull request for this issue:
https://github.com/apache/spark/pull/3207

 connection ack timeout improvement, replace Timer with ScheudledExecutor...
 ---

 Key: SPARK-4342
 URL: https://issues.apache.org/jira/browse/SPARK-4342
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Haitao Yao

 Replace java.util.Timer with a ScheduledExecutorService and use the message id 
 directly in the task.
 For details, see the mailing list.
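A sketch of the direction being described: a ScheduledExecutorService in place 
of java.util.Timer, with the message id captured directly by the timeout task 
(the names and callback shape are assumptions; see the pull request for the real 
change):

{code}
import java.util.concurrent.{Executors, ScheduledExecutorService, TimeUnit}

val ackTimeoutScheduler: ScheduledExecutorService =
  Executors.newSingleThreadScheduledExecutor()

// Schedule one timeout task per message; the id is captured by the closure,
// so no separate lookup is needed when the timeout fires.
def scheduleAckTimeout(messageId: Long, timeoutSecs: Long)(onTimeout: Long => Unit): Unit = {
  ackTimeoutScheduler.schedule(new Runnable {
    override def run(): Unit = onTimeout(messageId)
  }, timeoutSecs, TimeUnit.SECONDS)
}

// Usage: report message 42 as timed out if no ack arrives within 60 seconds.
scheduleAckTimeout(42L, 60L)(id => println(s"no ack for message $id"))
{code}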



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2055) bin$ ./run-example is bad. must run SPARK_HOME$ bin/run-example. look at the file run-example at line 54.

2014-11-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2055.
--
Resolution: Invalid

I think this is obsolete, or at least quite unclear. The script explicitly 
controls the directory from which spark-submit is run, at about line 54.

 bin$ ./run-example is bad.  must run SPARK_HOME$ bin/run-example. look at the 
 file run-example at line 54.
 --

 Key: SPARK-2055
 URL: https://issues.apache.org/jira/browse/SPARK-2055
 Project: Spark
  Issue Type: Improvement
  Components: Examples
Affects Versions: 1.0.0
Reporter: Peerless.feng





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1858) Update third-party Hadoop distros doc to list more distros

2014-11-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206353#comment-14206353
 ] 

Sean Owen commented on SPARK-1858:
--

Same, this is one of two issues left under the stale-sounding 
https://issues.apache.org/jira/browse/SPARK-1351 . Are any more distros on the 
radar that aren't on the page?

 Update third-party Hadoop distros doc to list more distros
 

 Key: SPARK-1858
 URL: https://issues.apache.org/jira/browse/SPARK-1858
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Reporter: Matei Zaharia
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1564) Add JavaScript into Javadoc to turn ::Experimental:: and such into badges

2014-11-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206352#comment-14206352
 ] 

Sean Owen commented on SPARK-1564:
--

This is one of two issues left under the stale-sounding 
https://issues.apache.org/jira/browse/SPARK-1351 . Can this be turned loose 
into a floating issue for 1.3+ or marked WontFix?

 Add JavaScript into Javadoc to turn ::Experimental:: and such into badges
 -

 Key: SPARK-1564
 URL: https://issues.apache.org/jira/browse/SPARK-1564
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Reporter: Matei Zaharia
Assignee: Andrew Or
Priority: Minor
 Fix For: 1.2.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2092) This is a test issue

2014-11-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2092.
--
Resolution: Not a Problem

Patrick, can these test issues be closed? I'm just looking over old issues. This 
looks like an old test.

 This is a test issue
 

 Key: SPARK-2092
 URL: https://issues.apache.org/jira/browse/SPARK-2092
 Project: Spark
  Issue Type: New Feature
Reporter: Test
Assignee: Test





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1943) Testing use of target version field

2014-11-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1943.
--
Resolution: Not a Problem

Patrick, can these test issues be closed? I'm just looking over old issues. This 
looks like an old test.

 Testing use of target version field
 ---

 Key: SPARK-1943
 URL: https://issues.apache.org/jira/browse/SPARK-1943
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Patrick Wendell





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4196) Streaming + checkpointing + saveAsNewAPIHadoopFiles = NotSerializableException for Hadoop Configuration

2014-11-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4196:
-
Summary: Streaming + checkpointing + saveAsNewAPIHadoopFiles = 
NotSerializableException for Hadoop Configuration  (was: Streaming + 
checkpointing yields NotSerializableException for Hadoop Configuration from 
saveAsNewAPIHadoopFiles ?)

More info. The problem is that {{CheckpointWriter}} serializes the 
{{DStreamGraph}} when checkpointing is enabled. In the case of, for example, 
{{saveAsNewAPIHadoopFiles}}, this includes a {{ForEachDStream}} with a 
reference to a Hadoop {{Configuration}}.

This isn't a problem without checkpointing because Spark is not going to need 
to serialize this {{ForEachDStream}} closure to execute it in general. But it 
does to checkpoint it.

Does that make sense? I'm not sure what to do, but this is a significant problem 
for me: I can't see a workaround that makes streaming with saving Hadoop files 
and checkpointing work.


Here's a cobbled-together test that shows the problem:

{code}
  test("recovery with save to HDFS stream") {
    // Set up the streaming context and input streams
    val testDir = Utils.createTempDir()
    val outDir = Utils.createTempDir()
    var ssc = new StreamingContext(master, framework, Seconds(1))
    ssc.checkpoint(checkpointDir)
    val fileStream = ssc.textFileStream(testDir.toString)
    for (i <- Seq(1, 2, 3)) {
      Files.write(i + "\n", new File(testDir, i.toString), Charset.forName("UTF-8"))
      // wait to make sure that the file is written such that it gets shown in
      // the file listings
    }

    val reducedStream = fileStream.map(x => (x, x)).saveAsNewAPIHadoopFiles(
      outDir.toURI.toString,
      "saveAsNewAPIHadoopFilesTest",
      classOf[Text],
      classOf[Text],
      classOf[TextOutputFormat[Text, Text]],
      ssc.sparkContext.hadoopConfiguration)

    ssc.start()
    ssc.awaitTermination(5000)
    ssc.stop()

    val checkpointDirFile = new File(checkpointDir)
    assert(outDir.listFiles().length > 0)
    assert(checkpointDirFile.listFiles().length == 1)
    assert(checkpointDirFile.listFiles()(0).listFiles().length > 0)

    Utils.deleteRecursively(testDir)
    Utils.deleteRecursively(outDir)
  }
{code}

You'll see the {{NotSerializableException}} clearly if you hack 
{{Checkpoint.write()}}:

{code}
  def write(checkpoint: Checkpoint) {
val bos = new ByteArrayOutputStream()
val zos = compressionCodec.compressedOutputStream(bos)
val oos = new ObjectOutputStream(zos)
try {
  oos.writeObject(checkpoint)
} catch {
      case e: Exception =>
e.printStackTrace()
throw e
}
...
{code}
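One way around the serialization boundary described above is to wrap the Configuration so that Java serialization delegates to Hadoop's own Writable support. The sketch below is illustrative only; it is not the fix adopted in Spark, and the class name is an assumption:

{code}
import java.io.{IOException, ObjectInputStream, ObjectOutputStream}

import org.apache.hadoop.conf.Configuration

// Wrap a Configuration so closures that must be serialized (e.g. for
// checkpointing) carry it via write()/readFields() instead of failing
// with NotSerializableException.
class SerializableConfiguration(@transient var value: Configuration)
  extends Serializable {

  @throws[IOException]
  private def writeObject(out: ObjectOutputStream): Unit = {
    out.defaultWriteObject()
    value.write(out)             // Configuration implements Writable
  }

  @throws[IOException]
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    value = new Configuration(false)
    value.readFields(in)
  }
}
{code}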

 Streaming + checkpointing + saveAsNewAPIHadoopFiles = 
 NotSerializableException for Hadoop Configuration
 ---

 Key: SPARK-4196
 URL: https://issues.apache.org/jira/browse/SPARK-4196
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.1.0
Reporter: Sean Owen

 I am reasonably sure there is some issue here in Streaming and that I'm not 
 missing something basic, but not 100%. I went ahead and posted it as a JIRA 
 to track, since it's come up a few times before without resolution, and right 
 now I can't get checkpointing to work at all.
 When Spark Streaming checkpointing is enabled, I see a 
 NotSerializableException thrown for a Hadoop Configuration object, and it 
 seems like it is not one from my user code.
 Before I post my particular instance see 
 http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3c1408135046777-12202.p...@n3.nabble.com%3E
  for another occurrence.
 I was also on customer site last week debugging an identical issue with 
 checkpointing in a Scala-based program and they also could not enable 
 checkpointing without hitting exactly this error.
 The essence of my code is:
 {code}
 final JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
 JavaStreamingContextFactory streamingContextFactory =
     new JavaStreamingContextFactory() {
       @Override
       public JavaStreamingContext create() {
         return new JavaStreamingContext(sparkContext, new Duration(batchDurationMS));
       }
     };
 streamingContext = JavaStreamingContext.getOrCreate(
     checkpointDirString, sparkContext.hadoopConfiguration(),
     streamingContextFactory, false);
 streamingContext.checkpoint(checkpointDirString);
 {code}
 It yields:
 {code}
 2014-10-31 14:29:00,211 ERROR OneForOneStrategy:66
 org.apache.hadoop.conf.Configuration
 - field (class 
 org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$9,
 name: conf$2, type: class org.apache.hadoop.conf.Configuration)
 - object (class
 

[jira] [Updated] (SPARK-1825) Windows Spark fails to work with Linux YARN

2014-11-11 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ángel Álvarez updated SPARK-1825:
-
Attachment: SPARK-1825.patch

Is it really necessary to change the file ExecutorRunnableUtil.scala?

I only changed the file ClientBase.scala and it (apparently) works for 
Spark 1.1.

In order to make it work, you'll have to add the following configuration:

- Program arguments: --master yarn-cluster
- VM arguments: -Dspark.app-submission.cross-platform=true



 Windows Spark fails to work with Linux YARN
 ---

 Key: SPARK-1825
 URL: https://issues.apache.org/jira/browse/SPARK-1825
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: Taeyun Kim
 Fix For: 1.2.0

 Attachments: SPARK-1825.patch


 Windows Spark fails to work with Linux YARN.
 This is a cross-platform problem.
 This error occurs when 'yarn-client' mode is used.
 (yarn-cluster/yarn-standalone mode was not tested.)
 On YARN side, Hadoop 2.4.0 resolved the issue as follows:
 https://issues.apache.org/jira/browse/YARN-1824
 But Spark YARN module does not incorporate the new YARN API yet, so problem 
 persists for Spark.
 First, the following source files should be changed:
 - /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala
 - 
 /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnableUtil.scala
 Change is as follows:
 - Replace .$() with .$$()
 - Replace File.pathSeparator for Environment.CLASSPATH.name to 
 ApplicationConstants.CLASS_PATH_SEPARATOR (import 
 org.apache.hadoop.yarn.api.ApplicationConstants is required for this)
 Unless the above are applied, launch_container.sh will contain invalid shell 
 script statements(since they will contain Windows-specific separators), and 
 job will fail.
 Also, the following symptom should also be fixed (I could not find the 
 relevant source code):
 - SPARK_HOME environment variable is copied straight to launch_container.sh. 
 It should be changed to the path format for the server OS, or, the better, a 
 separate environment variable or a configuration variable should be created.
 - The '%HADOOP_MAPRED_HOME%' string still exists in launch_container.sh after 
 the above change is applied. Maybe I missed a few lines.
 I'm not sure whether this is all, since I'm new to both Spark and YARN.
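For illustration, a hedged sketch of the two substitutions the report describes, using the Hadoop 2.4+ YARN API; the object and method names here are assumptions for the sketch, and the real changes live in ClientBase.scala and ExecutorRunnableUtil.scala:

{code}
import org.apache.hadoop.yarn.api.ApplicationConstants
import org.apache.hadoop.yarn.api.ApplicationConstants.Environment

// Build a classpath entry that expands on the *cluster* OS rather than the
// submitting OS: Environment.CLASSPATH.$$() and CLASS_PATH_SEPARATOR are
// resolved by the NodeManager at container launch time.
object CrossPlatformClasspath {
  def appClasspath(extraJars: Seq[String]): String =
    (Environment.CLASSPATH.$$() +: extraJars)
      .mkString(ApplicationConstants.CLASS_PATH_SEPARATOR)

  def main(args: Array[String]): Unit =
    println(appClasspath(Seq("__spark__.jar", "__app__.jar")))
}
{code}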



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1493) Apache RAT excludes don't work with file path (instead of file name)

2014-11-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1493.
--
Resolution: Fixed

Looks like this was fixed? There are no paths of this form in .rat-excludes now.

 Apache RAT excludes don't work with file path (instead of file name)
 

 Key: SPARK-1493
 URL: https://issues.apache.org/jira/browse/SPARK-1493
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Reporter: Patrick Wendell
  Labels: starter

 Right now the way we do RAT checks, it doesn't work if you try to exclude:
 /path/to/file.ext
 you have to just exclude
 file.ext



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4301) StreamingContext should not allow start() to be called after calling stop()

2014-11-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206406#comment-14206406
 ] 

Sean Owen commented on SPARK-4301:
--

Sorry, a bit late, but I noticed this is pretty related to 
https://issues.apache.org/jira/browse/SPARK-2645 which discusses calling stop() 
twice.

 StreamingContext should not allow start() to be called after calling stop()
 ---

 Key: SPARK-4301
 URL: https://issues.apache.org/jira/browse/SPARK-4301
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0
Reporter: Josh Rosen
Assignee: Josh Rosen
 Fix For: 1.1.1, 1.2.0, 1.0.3


 In Spark 1.0.0+, calling {{stop()}} on a StreamingContext that has not been 
 started is a no-op which has no side-effects.  This allows users to call 
 {{stop()}} on a fresh StreamingContext followed by {{start()}}.  I believe 
 that this almost always indicates an error and is not behavior that we should 
 support.  Since we don't allow {{start() stop() start()}} then I don't think 
 it makes sense to allow {{stop() start()}}.
 The current behavior can lead to resource leaks when StreamingContext 
 constructs its own SparkContext: if I call {{stop(stopSparkContext=True)}}, 
 then I expect StreamingContext's underlying SparkContext to be stopped 
 irrespective of whether the StreamingContext has been started.  This is 
 useful when writing unit test fixtures.
 Prior discussions:
 - https://github.com/apache/spark/pull/3053#discussion-diff-19710333R490
 - https://github.com/apache/spark/pull/3121#issuecomment-61927353
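For illustration, a minimal sketch of the start/stop state machine being argued for; the class and enum names are hypothetical, not the actual StreamingContext implementation:

{code}
// A tiny state machine that rejects start() after stop().
object ContextState extends Enumeration {
  val Initialized, Started, Stopped = Value
}

class GuardedContext {
  @volatile private var state = ContextState.Initialized

  def start(): Unit = state match {
    case ContextState.Initialized => state = ContextState.Started
    case ContextState.Started     => // already running: no-op
    case ContextState.Stopped =>
      throw new IllegalStateException("Context has been stopped; create a new one")
  }

  def stop(): Unit = state = ContextState.Stopped
}
{code}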



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2725) Add instructions about how to build with Hive to building-with-maven.md

2014-11-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2725.
--
Resolution: Fixed

The current {{building-spark.md}} has Hive-related build instructions. This was 
fixed along the way, it seems.

 Add instructions about how to build with Hive to building-with-maven.md
 ---

 Key: SPARK-2725
 URL: https://issues.apache.org/jira/browse/SPARK-2725
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Reporter: Matei Zaharia





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2720) spark-examples should depend on HBase modules for HBase 0.96+

2014-11-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2720.
--
Resolution: Duplicate

Basically subsumed by SPARK-1297, which was resolved.

 spark-examples should depend on HBase modules for HBase 0.96+
 -

 Key: SPARK-2720
 URL: https://issues.apache.org/jira/browse/SPARK-2720
 Project: Spark
  Issue Type: Task
Reporter: Ted Yu
Priority: Minor

 With this change:
 {code}
 diff --git a/pom.xml b/pom.xml
 index 93ef3b9..092430a 100644
 --- a/pom.xml
 +++ b/pom.xml
 @@ -122,7 +122,7 @@
  hadoop.version1.0.4/hadoop.version
  protobuf.version2.4.1/protobuf.version
  yarn.version${hadoop.version}/yarn.version
 -hbase.version0.94.6/hbase.version
 +hbase.version0.98.4/hbase.version
  zookeeper.version3.4.5/zookeeper.version
  hive.version0.12.0/hive.version
  parquet.version1.4.3/parquet.version
 {code}
 I got:
 {code}
 [ERROR] Failed to execute goal on project spark-examples_2.10: Could not 
 resolve dependencies for project 
 org.apache.spark:spark-examples_2.10:jar:1.1.0-SNAPSHOT: Could not find 
 artifact org.apache.hbase:hbase:jar:0.98.4 in maven-repo 
 (http://repo.maven.apache.org/maven2) - [Help 1]
 {code}
 To build against HBase 0.96+, spark-examples needs to specify HBase modules 
 (hbase-client, etc) in dependencies - possibly using a new profile.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2732) Update build script to Tachyon 0.5.0

2014-11-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2732.
--
Resolution: Duplicate

Dupe of SPARK-2702, and resolved anyway.

 Update build script to Tachyon 0.5.0
 

 Key: SPARK-2732
 URL: https://issues.apache.org/jira/browse/SPARK-2732
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Henry Saputra

 Update Maven pom.xml and sbt script to use Tachyon 0.5.0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2733) Update make-distribution.sh to download Tachyon 0.5.0

2014-11-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2733.
--
Resolution: Duplicate

Also roughly a duplicate of SPARK-2702, like its parent, and also resolved.

 Update make-distribution.sh to download Tachyon 0.5.0
 -

 Key: SPARK-2733
 URL: https://issues.apache.org/jira/browse/SPARK-2733
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Henry Saputra

 Need to update make-distribution.sh to download Tachyon 0.5.0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2819) Difficult to turn on intercept with linear models

2014-11-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206415#comment-14206415
 ] 

Sean Owen commented on SPARK-2819:
--

Is this still in play? The convenience methods can't cover every possible 
combination of params, or else they merely duplicate the constructors in a 
complicated way.

 Difficult to turn on intercept with linear models
 -

 Key: SPARK-2819
 URL: https://issues.apache.org/jira/browse/SPARK-2819
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Sandy Ryza

 If I want to train a logistic regression model with default parameters and 
 include an intercept, I can run:
 val alg = new LogisticRegressionWithSGD()
 alg.setIntercept(true)
 alg.run(data)
 but if I want to set a parameter like numIterations, I need to use
 LogisticRegressionWithSGD.train(data, 50)
 and have no opportunity to turn on the intercept.
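For reference, the non-static path already allows both settings together; a short sketch against the MLlib 1.x API, with data assumed to be an RDD[LabeledPoint]:

{code}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Configure iterations via the exposed optimizer and still turn on the
// intercept, instead of going through the static train(...) helpers.
def trainWithIntercept(data: RDD[LabeledPoint]) = {
  val alg = new LogisticRegressionWithSGD()
  alg.optimizer.setNumIterations(50)   // what train(data, 50) would have set
  alg.setIntercept(true)               // not reachable through train(...)
  alg.run(data)
}
{code}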



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2982) Glitch of spark streaming

2014-11-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2982.
--
  Resolution: Invalid
Target Version/s:   (was: 1.0.2)

 Glitch of spark streaming
 -

 Key: SPARK-2982
 URL: https://issues.apache.org/jira/browse/SPARK-2982
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.0.0
Reporter: dai zhiyuan
 Attachments: cpu.png, io.png, network.png


 Spark Streaming task startup times are tightly clustered. This causes glitches 
 (spikes) in network and CPU usage, while the CPU and network sit idle much of 
 the rest of the time, which is wasteful of system resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2121) Not fully cached when there is enough memory in ALS

2014-11-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2121.
--
Resolution: Not a Problem

 Not fully cached when there is enough memory in ALS
 ---

 Key: SPARK-2121
 URL: https://issues.apache.org/jira/browse/SPARK-2121
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, MLlib, Spark Core
Affects Versions: 1.0.0
Reporter: Shuo Xiang

 While factorizing a large matrix using the latest Alternating Least Squares 
 (ALS) in MLlib, the Spark UI shows that Spark fails to cache all the 
 partitions of some RDDs even though memory is sufficient. Please see [this 
 post](http://apache-spark-user-list.1001560.n3.nabble.com/Not-fully-cached-when-there-is-enough-memory-tt7429.html)
  for screenshots. This may cause subsequent job failures while executing 
 `userOut.count()` or `productsOut.count()`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3228) When DStream save RDD to hdfs , don't create directory and empty file if there are no data received from source in the batch duration .

2014-11-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3228.
--
Resolution: Won't Fix

Given PR discussion, sounds like a WontFix

 When DStream save RDD to hdfs , don't create directory and empty file if 
 there are no data received from source in the batch duration .
 ---

 Key: SPARK-3228
 URL: https://issues.apache.org/jira/browse/SPARK-3228
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Leo

 When I use DStream to save files to HDFS, it will create a directory and an 
 empty file named _SUCCESS for each job generated in the batch duration.
 But if there is no data from the source for a long time, and the duration is 
 very short (e.g. 10s), it will create many directories and empty files in 
 HDFS.
 I don't think that is necessary. So I want to modify DStream's methods 
 saveAsObjectFiles and saveAsTextFiles so that they create the directory and 
 files only when the RDD's partition size is > 0.
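For illustration, a user-side workaround sketch (the function name is hypothetical): only write a batch when it actually has data, instead of letting saveAsTextFiles create empty output directories.

{code}
import org.apache.spark.streaming.dstream.DStream

def saveNonEmpty(stream: DStream[String], prefix: String): Unit = {
  stream.foreachRDD { (rdd, time) =>
    if (rdd.take(1).nonEmpty) {                       // cheap emptiness probe
      rdd.saveAsTextFile(s"$prefix-${time.milliseconds}")
    }
  }
}
{code}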



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3624) Failed to find Spark assembly in /usr/share/spark/lib for RELEASED debian packages

2014-11-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3624.
--
Resolution: Won't Fix

PR discussion seems to have several votes for WontFix

 Failed to find Spark assembly in /usr/share/spark/lib for RELEASED debian 
 packages
 

 Key: SPARK-3624
 URL: https://issues.apache.org/jira/browse/SPARK-3624
 Project: Spark
  Issue Type: Bug
  Components: Build, Deploy
Affects Versions: 1.1.0
Reporter: Christian Tzolov
Priority: Minor

 The compute-classpath.sh script requires that, for a 'RELEASED' package, the Spark 
 assembly jar be accessible from a <spark home>/lib folder.
 Currently the jdeb packaging (assembly module) bundles the assembly jar into 
 a folder called 'jars'. 
 The result is :
 /usr/share/spark/bin/spark-submit   --num-executors 10--master 
 yarn-cluster   --class org.apache.spark.examples.SparkPi   
 /usr/share/spark/jars/spark-examples-1.1.0-hadoop2.2.0-gphd-3.0.1.0.jar 10
 ls: cannot access /usr/share/spark/lib: No such file or directory
 Failed to find Spark assembly in /usr/share/spark/lib
 You need to build Spark before running this program.
 A trivial solution is to rename the '<prefix>${deb.install.path}/jars</prefix>' 
 inside assembly/pom.xml to '<prefix>${deb.install.path}/lib</prefix>'.
 Another, less impactful (considering backward compatibility) solution is to 
 define a lib-jars symlink in the assembly/pom.xml.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1317) sbt doesn't work for building Spark programs

2014-11-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1317.
--
Resolution: Not a Problem

 sbt doesn't work for building Spark programs
 

 Key: SPARK-1317
 URL: https://issues.apache.org/jira/browse/SPARK-1317
 Project: Spark
  Issue Type: Bug
  Components: Build, Documentation
Affects Versions: 0.9.0
Reporter: Diana Carroll

 I don't know if this is a doc bug or a product bug, because I don't know how 
 it is supposed to work.
 The Spark quick start guide page has a section that walks you through 
 creating a standalone Spark app in Scala.  I think the instructions worked 
 in 0.8.1 but I can't get them to work in 0.9.0.
 The instructions have you create a directory structure in the canonical sbt 
 format, but do not tell you where to locate this directory.  However, after 
 setting up the structure, the tutorial then instructs you to use the command 
 {code}sbt/sbt package{code}
 which implies that the working directory must be SPARK_HOME.
 I tried it both ways: creating a mysparkapp directory right in SPARK_HOME 
 and creating it in my home directory.  Neither worked, with different results:
 - if I create a mysparkapp directory as instructed in SPARK_HOME, cd to 
 SPARK_HOME and run the command sbt/sbt package as specified, it packages ALL 
 of Spark...but does not build my own app.
 - if I create a mysparkapp directory elsewhere, cd to that directory, and 
 run the command there, I get an error:
 {code}
 $SPARK_HOME/sbt/sbt package
 awk: cmd. line:1: fatal: cannot open file `./project/build.properties' for 
 reading (No such file or directory)
 Attempting to fetch sbt
 /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or 
 directory
 /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or 
 directory
 Our attempt to download sbt locally to sbt/sbt-launch-.jar failed. Please 
 install sbt manually from http://www.scala-sbt.org/
 {code}
 So, either:
 1: the Spark distribution of sbt can only be used to build Spark itself, not 
 you own code...in which case the quick start guide is wrong, and should 
 instead say that users should install sbt separately
 OR
 2: the Spark distribution of sbt CAN be used, with proper configuration, in 
 which case that configuration should be documented (I wasn't able to figure 
 it out, but I didn't try that hard either)
 OR
 3: the Spark distribution of sbt is *supposed* to be able to build Spark 
 apps, but is configured incorrectly in the product, in which case there's a 
 product bug rather than a doc bug
 Although this is not a show-stopper, because the obvious workaround is to 
 simply install sbt separately, I think at least updating the docs is pretty 
 high priority, because most people learning Spark start with that Quick Start 
 page, which doesn't work.
 (If it's doc issue #1, let me know, and I'll fix the docs myself.  :-) )



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1463) cleanup unnecessary dependency jars in the spark assembly jars

2014-11-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1463.
--
   Resolution: Fixed
Fix Version/s: (was: 1.0.0)

Fixed at some point, it seems. No longer in the project.

 cleanup unnecessary dependency jars in the spark assembly jars
 --

 Key: SPARK-1463
 URL: https://issues.apache.org/jira/browse/SPARK-1463
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 0.9.0
Reporter: Jenny MA
Priority: Minor
  Labels: easyfix

 There are a couple of GPL/LGPL-based dependencies included in the final 
 assembly jar which are not used by the Spark runtime. I identified the 
 following libraries; we can provide a fix in assembly/pom.xml:
 <exclude>com.google.code.findbugs:*</exclude>
 <exclude>org.acplt:oncrpc:*</exclude>
 <exclude>glassfish:*</exclude>
 <exclude>com.cenqua.clover:clover:*</exclude>
 <exclude>org.glassfish:*</exclude>
 <exclude>org.glassfish.grizzly:*</exclude>
 <exclude>org.glassfish.gmbal:*</exclude>
 <exclude>org.glassfish.external:*</exclude>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4343) Mima considers protected API methods for exclusion from binary checks.

2014-11-11 Thread Prashant Sharma (JIRA)
Prashant Sharma created SPARK-4343:
--

 Summary: Mima considers protected API methods for exclusion from 
binary checks. 
 Key: SPARK-4343
 URL: https://issues.apache.org/jira/browse/SPARK-4343
 Project: Spark
  Issue Type: Bug
Reporter: Prashant Sharma


Related SPARK-4335



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4343) Mima considers protected API methods for exclusion from binary checks.

2014-11-11 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206472#comment-14206472
 ] 

Prashant Sharma commented on SPARK-4343:


I am not sure if it's the desired behaviour. From my understanding this might be 
mostly okay.

 Mima considers protected API methods for exclusion from binary checks. 
 ---

 Key: SPARK-4343
 URL: https://issues.apache.org/jira/browse/SPARK-4343
 Project: Spark
  Issue Type: Bug
Reporter: Prashant Sharma

 Related SPARK-4335



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4034) get java.lang.NoClassDefFoundError: com/google/common/util/concurrent/ThreadFactoryBuilder in idea

2014-11-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206505#comment-14206505
 ] 

Sean Owen commented on SPARK-4034:
--

I use IDEA and I have never encountered this. From the PR discussion I'm not 
clear that the proposed change won't disrupt the rest of the build. I 
suggest closing?

 get  java.lang.NoClassDefFoundError: 
 com/google/common/util/concurrent/ThreadFactoryBuilder  in idea
 

 Key: SPARK-4034
 URL: https://issues.apache.org/jira/browse/SPARK-4034
 Project: Spark
  Issue Type: Bug
Reporter: baishuo





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1216) Add a OneHotEncoder for handling categorical features

2014-11-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206526#comment-14206526
 ] 

Sean Owen commented on SPARK-1216:
--

[~sandyr] This is basically https://issues.apache.org/jira/browse/SPARK-4081 
and Joseph has a PR for it now?

 Add a OneHotEncoder for handling categorical features
 -

 Key: SPARK-1216
 URL: https://issues.apache.org/jira/browse/SPARK-1216
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 0.9.0
Reporter: Sandy Pérez González
Assignee: Sandy Ryza

 It would be nice to add something to MLLib to make it easy to do one-of-K 
 encoding of categorical features.
 Something like:
 http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
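For illustration, a hand-rolled one-of-K encoding sketch in Scala; this is not an MLlib API, just the shape of the transformation being requested:

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.rdd.RDD

// Encode a categorical string column as sparse indicator vectors.
def oneHot(column: RDD[String]) = {
  val categories = column.distinct().collect().sorted
  val index = categories.zipWithIndex.toMap          // category -> position
  val k = categories.length
  column.map(c => Vectors.sparse(k, Array(index(c)), Array(1.0)))
}
{code}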



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-3624) Failed to find Spark assembly in /usr/share/spark/lib for RELEASED debian packages

2014-11-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-3624:
--

I jumped the gun; per Mark's comments this still deserves some resolution.

 Failed to find Spark assembly in /usr/share/spark/lib for RELEASED debian 
 packages
 

 Key: SPARK-3624
 URL: https://issues.apache.org/jira/browse/SPARK-3624
 Project: Spark
  Issue Type: Bug
  Components: Build, Deploy
Affects Versions: 1.1.0
Reporter: Christian Tzolov
Priority: Minor

 The compute-classpath.sh script requires that, for a 'RELEASED' package, the Spark 
 assembly jar be accessible from a <spark home>/lib folder.
 Currently the jdeb packaging (assembly module) bundles the assembly jar into 
 a folder called 'jars'. 
 The result is :
 /usr/share/spark/bin/spark-submit   --num-executors 10--master 
 yarn-cluster   --class org.apache.spark.examples.SparkPi   
 /usr/share/spark/jars/spark-examples-1.1.0-hadoop2.2.0-gphd-3.0.1.0.jar 10
 ls: cannot access /usr/share/spark/lib: No such file or directory
 Failed to find Spark assembly in /usr/share/spark/lib
 You need to build Spark before running this program.
 A trivial solution is to rename the '<prefix>${deb.install.path}/jars</prefix>' 
 inside assembly/pom.xml to '<prefix>${deb.install.path}/lib</prefix>'.
 Another, less impactful (considering backward compatibility) solution is to 
 define a lib-jars symlink in the assembly/pom.xml.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1830) Deploy failover, Make Persistence engine and LeaderAgent Pluggable.

2014-11-11 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson resolved SPARK-1830.
---
   Resolution: Fixed
Fix Version/s: (was: 1.2.0)
   (was: 1.0.1)
   1.3.0
 Assignee: Prashant Sharma

 Deploy failover, Make Persistence engine and LeaderAgent Pluggable.
 ---

 Key: SPARK-1830
 URL: https://issues.apache.org/jira/browse/SPARK-1830
 Project: Spark
  Issue Type: New Feature
  Components: Deploy
Reporter: Prashant Sharma
Assignee: Prashant Sharma
 Fix For: 1.3.0


 With the current code base it is difficult to plug in an external, user-specified 
 Persistence Engine or Election Agent. It would be good to expose this as 
 a pluggable API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4344) spark.yarn.user.classpath.first is undocumented

2014-11-11 Thread Arun Ahuja (JIRA)
Arun Ahuja created SPARK-4344:
-

 Summary: spark.yarn.user.classpath.first is undocumented
 Key: SPARK-4344
 URL: https://issues.apache.org/jira/browse/SPARK-4344
 Project: Spark
  Issue Type: Documentation
Affects Versions: 1.1.0
Reporter: Arun Ahuja
Priority: Trivial


spark.yarn.user.classpath.first is not documented, while 
spark.files.userClassPathFirst is, and its documentation does not point to the 
corresponding YARN parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4344) spark.yarn.user.classpath.first is undocumented

2014-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206722#comment-14206722
 ] 

Apache Spark commented on SPARK-4344:
-

User 'arahuja' has created a pull request for this issue:
https://github.com/apache/spark/pull/3209

 spark.yarn.user.classpath.first is undocumented
 ---

 Key: SPARK-4344
 URL: https://issues.apache.org/jira/browse/SPARK-4344
 Project: Spark
  Issue Type: Documentation
Affects Versions: 1.1.0
Reporter: Arun Ahuja
Priority: Trivial

 spark.yarn.user.classpath.first is not documented, while 
 spark.files.userClassPathFirst is, and its documentation does not point to the 
 corresponding YARN parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4345) Spark SQL Hive throws exception when drop a none-exist table

2014-11-11 Thread Alex Liu (JIRA)
Alex Liu created SPARK-4345:
---

 Summary: Spark SQL Hive throws exception when drop a none-exist 
table
 Key: SPARK-4345
 URL: https://issues.apache.org/jira/browse/SPARK-4345
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Alex Liu
Priority: Minor


When dropping a non-existent Hive table, an exception is thrown.
log
{code}
scala> val t = hql("drop table if exists test_table");
warning: there were 1 deprecation warning(s); re-run with -deprecation for 
details
t: org.apache.spark.sql.SchemaRDD = 
SchemaRDD[13] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
DropTable test_table, true

scala> val t = hql("drop table if exists test_table");
warning: there were 1 deprecation warning(s); re-run with -deprecation for 
details
14/11/11 10:21:49 ERROR Hive: NoSuchObjectException(message:default.test_table 
table not found)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(HiveMetaStore.java:1373)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:103)
at com.sun.proxy.$Proxy14.get_table(Unknown Source)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:854)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
at com.sun.proxy.$Proxy15.getTable(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:892)
at org.apache.hadoop.hive.ql.exec.DDLTask.dropTable(DDLTask.java:3276)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:277)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:298)
at 
org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:272)
at 
org.apache.spark.sql.hive.execution.DropTable.sideEffectResult$lzycompute(commands.scala:65)
at 
org.apache.spark.sql.hive.execution.DropTable.sideEffectResult(commands.scala:63)
at 
org.apache.spark.sql.hive.execution.DropTable.execute(commands.scala:71)
at 
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360)
at 
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360)
at 
org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:103)
at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:106)
at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:110)
at 
$line28.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:69)
at 
$line28.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:74)
at 
$line28.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:76)
at 
$line28.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:78)
at 
$line28.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:80)
  

[jira] [Created] (SPARK-4346) YarnClientSchedulerBack.asyncMonitorApplication should be common with Client.monitorApplication

2014-11-11 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-4346:


 Summary: YarnClientSchedulerBack.asyncMonitorApplication should be 
common with Client.monitorApplication
 Key: SPARK-4346
 URL: https://issues.apache.org/jira/browse/SPARK-4346
 Project: Spark
  Issue Type: Improvement
Reporter: Thomas Graves


The YarnClientSchedulerBackend.asyncMonitorApplication routine should move into 
ClientBase and be made common with monitorApplication.  Make sure stop is 
handled properly.

See discussion on https://github.com/apache/spark/pull/3143



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4282) Stopping flag in YarnClientSchedulerBackend should be volatile

2014-11-11 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-4282.
--
   Resolution: Fixed
Fix Version/s: 1.3.0
   1.2.0
 Assignee: Kousuke Saruta

 Stopping flag in YarnClientSchedulerBackend should be volatile
 --

 Key: SPARK-4282
 URL: https://issues.apache.org/jira/browse/SPARK-4282
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
 Fix For: 1.2.0, 1.3.0


 In YarnClientSchedulerBackend, a variable named stopping is used as a flag and 
 is accessed by multiple threads, so it should be volatile.
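For illustration, a minimal sketch of the pattern (class and method names are hypothetical): without @volatile, the monitoring thread may never observe the write made by the thread that calls stop().

{code}
class Backend {
  @volatile private var stopping = false      // visible across threads

  def stop(): Unit = { stopping = true }

  def monitorLoop(): Unit = {
    while (!stopping) {
      Thread.sleep(100)                       // poll application state
    }
  }
}
{code}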



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4347) GradientBoostingSuite takes more than 1 minute to finish

2014-11-11 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-4347:


 Summary: GradientBoostingSuite takes more than 1 minute to finish
 Key: SPARK-4347
 URL: https://issues.apache.org/jira/browse/SPARK-4347
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Xiangrui Meng


On a MacBook Pro:

{code}
[info] GradientBoostingSuite:
[info] - Regression with continuous features: SquaredError (22 seconds, 875 
milliseconds)
[info] - Regression with continuous features: Absolute Error (25 seconds, 652 
milliseconds)
[info] - Binary classification with continuous features: Log Loss (26 seconds, 
604 milliseconds)
{code}

Maybe we can reduce the size of test data and make it faster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4169) [Core] Locale dependent code

2014-11-11 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4169:
-
Assignee: Niklas Wilcke

 [Core] Locale dependent code
 

 Key: SPARK-4169
 URL: https://issues.apache.org/jira/browse/SPARK-4169
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
 Environment: Debian, Locale: de_DE
Reporter: Niklas Wilcke
Assignee: Niklas Wilcke
  Labels: patch, test
 Fix For: 1.1.1, 1.2.0

   Original Estimate: 0.25h
  Remaining Estimate: 0.25h

 With a non-English locale the method isBindCollision in
 core/src/main/scala/org/apache/spark/util/Utils.scala
 doesn't work because it checks the exception message, which is locale 
 dependent.
 The test suite 
 core/src/test/scala/org/apache/spark/util/UtilsSuite.scala
 also contains a locale-dependent test, "string formatting of time durations", 
 which relies on a decimal separator that is locale dependent.
 I created a pull request on GitHub to solve this issue.
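For illustration, one locale-independent approach is to decide "address already in use" from the exception type (walking nested causes) instead of matching a translated message string; this is a sketch, not the actual patch:

{code}
import java.net.BindException

def isBindCollision(exception: Throwable): Boolean = exception match {
  case null             => false
  case _: BindException => true                      // type check, not message text
  case other            => isBindCollision(other.getCause)
}
{code}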



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3682) Add helpful warnings to the UI

2014-11-11 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206891#comment-14206891
 ] 

Kay Ousterhout commented on SPARK-3682:
---

Some of the metrics you mentioned fall under the additional metrics that are 
hidden by default; as part of this, it might be nice to automatically show a 
metric as part of warning a user that the value is problematic.

 Add helpful warnings to the UI
 --

 Key: SPARK-3682
 URL: https://issues.apache.org/jira/browse/SPARK-3682
 Project: Spark
  Issue Type: New Feature
  Components: Web UI
Affects Versions: 1.1.0
Reporter: Sandy Ryza
 Attachments: SPARK-3682Design.pdf


 Spark has a zillion configuration options and a zillion different things that 
 can go wrong with a job.  Improvements like incremental and better metrics 
 and the proposed spark replay debugger provide more insight into what's going 
 on under the covers.  However, it's difficult for non-advanced users to 
 synthesize this information and understand where to direct their attention. 
 It would be helpful to have some sort of central location on the UI users 
 could go to that would provide indications about why an app/job is failing or 
 performing poorly.
 Some helpful messages that we could provide:
 * Warn that the tasks in a particular stage are spending a long time in GC.
 * Warn that spark.shuffle.memoryFraction does not fit inside the young 
 generation.
 * Warn that tasks in a particular stage are very short, and that the number 
 of partitions should probably be decreased.
 * Warn that tasks in a particular stage are spilling a lot, and that the 
 number of partitions should probably be increased.
 * Warn that a cached RDD that gets a lot of use does not fit in memory, and a 
 lot of time is being spent recomputing it.
 To start, probably two kinds of warnings would be most helpful.
 * Warnings at the app level that report on misconfigurations, issues with the 
 general health of executors.
 * Warnings at the job level that indicate why a job might be performing 
 slowly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4347) GradientBoostingSuite takes more than 1 minute to finish

2014-11-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4347:
-
Assignee: Manish Amde

 GradientBoostingSuite takes more than 1 minute to finish
 

 Key: SPARK-4347
 URL: https://issues.apache.org/jira/browse/SPARK-4347
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Xiangrui Meng
Assignee: Manish Amde

 On a MacBook Pro:
 {code}
 [info] GradientBoostingSuite:
 [info] - Regression with continuous features: SquaredError (22 seconds, 875 
 milliseconds)
 [info] - Regression with continuous features: Absolute Error (25 seconds, 652 
 milliseconds)
 [info] - Binary classification with continuous features: Log Loss (26 
 seconds, 604 milliseconds)
 {code}
 Maybe we can reduce the size of test data and make it faster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2205) Unnecessary exchange operators in a join on multiple tables with the same join key.

2014-11-11 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207031#comment-14207031
 ] 

Yin Huai commented on SPARK-2205:
-

Just a note to myself: it will be good to also look at whether outputPartitioning in 
other physical operators is properly set. For example, the outputPartitioning 
in LeftSemiJoinHash is using the default UnknownPartitioning.

 Unnecessary exchange operators in a join on multiple tables with the same 
 join key.
 ---

 Key: SPARK-2205
 URL: https://issues.apache.org/jira/browse/SPARK-2205
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Minor

 {code}
 hql(select * from src x join src y on (x.key=y.key) join src z on 
 (y.key=z.key))
 SchemaRDD[1] at RDD at SchemaRDD.scala:100
 == Query Plan ==
 Project [key#4:0,value#5:1,key#6:2,value#7:3,key#8:4,value#9:5]
  HashJoin [key#6], [key#8], BuildRight
   Exchange (HashPartitioning [key#6], 200)
HashJoin [key#4], [key#6], BuildRight
 Exchange (HashPartitioning [key#4], 200)
  HiveTableScan [key#4,value#5], (MetastoreRelation default, src, 
 Some(x)), None
 Exchange (HashPartitioning [key#6], 200)
  HiveTableScan [key#6,value#7], (MetastoreRelation default, src, 
 Some(y)), None
   Exchange (HashPartitioning [key#8], 200)
HiveTableScan [key#8,value#9], (MetastoreRelation default, src, Some(z)), 
 None
 {code}
 However, this is fine...
 {code}
 hql(select * from src x join src y on (x.key=y.key) join src z on 
 (x.key=z.key))
 res5: org.apache.spark.sql.SchemaRDD = 
 SchemaRDD[5] at RDD at SchemaRDD.scala:100
 == Query Plan ==
 Project [key#26:0,value#27:1,key#28:2,value#29:3,key#30:4,value#31:5]
  HashJoin [key#26], [key#30], BuildRight
   HashJoin [key#26], [key#28], BuildRight
Exchange (HashPartitioning [key#26], 200)
 HiveTableScan [key#26,value#27], (MetastoreRelation default, src, 
 Some(x)), None
Exchange (HashPartitioning [key#28], 200)
 HiveTableScan [key#28,value#29], (MetastoreRelation default, src, 
 Some(y)), None
   Exchange (HashPartitioning [key#30], 200)
HiveTableScan [key#30,value#31], (MetastoreRelation default, src, 
 Some(z)), None
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4205) Timestamp and Date objects with comparison operators

2014-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207037#comment-14207037
 ] 

Apache Spark commented on SPARK-4205:
-

User 'culler' has created a pull request for this issue:
https://github.com/apache/spark/pull/3210

 Timestamp and Date objects with comparison operators
 

 Key: SPARK-4205
 URL: https://issues.apache.org/jira/browse/SPARK-4205
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Marc Culler
 Fix For: 1.1.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1739) Close PR's after period of inactivity

2014-11-11 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207112#comment-14207112
 ] 

Josh Rosen commented on SPARK-1739:
---

My proposal would be to have SparkQA post a comment in the PR that mentions the 
component maintainers.  This could happen once a PR sits inactive or unreviewed 
for more than X days.  I can do this myself, but I'm kind of overloaded with 
other work so this is going to be a low priority.  I'd welcome pull requests 
for this, though: https://github.com/databricks/spark-pr-dashboard

 Close PR's after period of inactivity
 -

 Key: SPARK-1739
 URL: https://issues.apache.org/jira/browse/SPARK-1739
 Project: Spark
  Issue Type: Task
  Components: Project Infra
Reporter: Patrick Wendell
Assignee: Josh Rosen

 Sometimes PR's get abandoned if people aren't responsive to feedback or it 
 just falls to a lower priority. We should automatically close stale PR's in 
 order to keep the queue from growing infinitely.
 I think we just want to do this with a friendly message that says "This seems 
 inactive; please re-open this if you are interested in contributing the 
 patch." We should also explicitly ping any reviewers (via @-mentioning them) 
 and ask them to provide feedback one way or the other, for instance, if the 
 feature is being rejected.
 This will help us avoid letting features slip through the cracks by forcing 
 some action when there is no activity after 30 days. Also, it's ASF policy 
 that we should really be tracking our feature backlog and prioritization in 
 JIRA and only be using Github for active reviews.
 I don't think we should close it if there was _no_ feedback from any reviewer 
 - in that case we should leave it open (we should be providing at least some 
 feedback on all incoming patches).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4326) unidoc is broken on master

2014-11-11 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207156#comment-14207156
 ] 

Marcelo Vanzin commented on SPARK-4326:
---

Funny that it doesn't happen on 1.2 since the dependency mess should be the 
same in both. I'll try this out when I'm done with some other tests, to see if 
I can figure it out.

 unidoc is broken on master
 --

 Key: SPARK-4326
 URL: https://issues.apache.org/jira/browse/SPARK-4326
 Project: Spark
  Issue Type: Bug
  Components: Build, Documentation
Affects Versions: 1.3.0
Reporter: Xiangrui Meng

 On master, `jekyll build` throws the following error:
 {code}
 [error] 
 /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/AppendOnlyMap.scala:205:
  value hashInt is not a member of com.google.common.hash.HashFunction
 [error]   private def rehash(h: Int): Int = 
 Hashing.murmur3_32().hashInt(h).asInt()
 [error]  ^
 [error] 
 /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala:426:
  value limit is not a member of object com.google.common.io.ByteStreams
 [error] val bufferedStream = new 
 BufferedInputStream(ByteStreams.limit(fileStream, end - start))
 [error]  ^
 [error] 
 /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala:558:
  value limit is not a member of object com.google.common.io.ByteStreams
 [error] val bufferedStream = new 
 BufferedInputStream(ByteStreams.limit(fileStream, end - start))
 [error]  ^
 [error] 
 /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala:261:
  value hashInt is not a member of com.google.common.hash.HashFunction
 [error]   private def hashcode(h: Int): Int = 
 Hashing.murmur3_32().hashInt(h).asInt()
 [error]^
 [error] 
 /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/Utils.scala:37:
  type mismatch;
 [error]  found   : java.util.Iterator[T]
 [error]  required: Iterable[?]
 [error] collectionAsScalaIterable(ordering.leastOf(asJavaIterator(input), 
 num)).iterator
 [error]  ^
 [error] 
 /Users/meng/src/spark/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala:421:
  value putAll is not a member of 
 com.google.common.cache.Cache[org.apache.hadoop.fs.FileStatus,parquet.hadoop.Footer]
 [error]   footerCache.putAll(newFooters)
 [error]   ^
 [warn] 
 /Users/meng/src/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/parquet/FakeParquetSerDe.scala:34:
  @deprecated now takes two arguments; see the scaladoc.
 [warn] @deprecated("No code should depend on FakeParquetHiveSerDe as it is 
 only intended as a " +
 [warn]  ^
 [info] No documentation generated with unsucessful compiler run
 [warn] two warnings found
 [error] 6 errors found
 [error] (spark/scalaunidoc:doc) Scaladoc generation failed
 [error] Total time: 48 s, completed Nov 10, 2014 1:31:01 PM
 {code}
 It doesn't happen on branch-1.2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4348) pyspark.mllib.random conflicts with random module

2014-11-11 Thread Davies Liu (JIRA)
Davies Liu created SPARK-4348:
-

 Summary: pyspark.mllib.random conflicts with random module
 Key: SPARK-4348
 URL: https://issues.apache.org/jira/browse/SPARK-4348
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 1.1.0, 1.2.0
Reporter: Davies Liu
Priority: Blocker


There are conflicts in two cases:

1. The random module is used by pyspark.mllib.feature; if the first entry of 
sys.path is not '', then the hack in pyspark/__init__.py will fail to fix the 
conflict.

2. When running tests in mllib/xxx.py, '' should be popped off sys.path before 
importing anything, or the tests will fail.

The first case is not fully fixed for users; it will introduce problems in some 
cases, such as:

{code}
import sys
sys.path.insert(0, PATH_OF_MODULE)
import pyspark
# using Word2Vec will then fail
{code}

I'd like to rename mllib/random.py to mllib/_random.py, then in mllib/__init__.py

{code}
import pyspark.mllib._random as random
{code}


cc [~mengxr] [~dorx]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4345) Spark SQL Hive throws exception when drop a none-exist table

2014-11-11 Thread Alex Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207171#comment-14207171
 ] 

Alex Liu commented on SPARK-4345:
-

Swallow the NoSuchObjectException when dropping a non-existent Hive table. Pull 
request: https://github.com/apache/spark/pull/3211
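
A hedged, self-contained sketch of that approach (not the code in the pull 
request): catch the metastore's NoSuchObjectException and treat it as a no-op 
when the statement carries IF EXISTS. dropTableInHive is an illustrative 
stand-in for whatever runs the Hive DDL, not an actual Spark API.

{code}
import org.apache.hadoop.hive.metastore.api.NoSuchObjectException

def dropTableIfExists(tableName: String, dropTableInHive: String => Unit): Unit =
  try {
    dropTableInHive(s"DROP TABLE $tableName")
  } catch {
    // The table is already gone; with IF EXISTS semantics there is nothing to do.
    case _: NoSuchObjectException => ()
  }
{code}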

 Spark SQL Hive throws exception when drop a none-exist table
 

 Key: SPARK-4345
 URL: https://issues.apache.org/jira/browse/SPARK-4345
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Alex Liu
Priority: Minor

 When drop a none-exist hive table, an exception is thrown.
 log
 {code}
 scala> val t = hql("drop table if exists test_table");
 warning: there were 1 deprecation warning(s); re-run with -deprecation for 
 details
 t: org.apache.spark.sql.SchemaRDD = 
 SchemaRDD[13] at RDD at SchemaRDD.scala:103
 == Query Plan ==
 == Physical Plan ==
 DropTable test_table, true
 scala> val t = hql("drop table if exists test_table");
 warning: there were 1 deprecation warning(s); re-run with -deprecation for 
 details
 14/11/11 10:21:49 ERROR Hive: 
 NoSuchObjectException(message:default.test_table table not found)
   at 
 org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(HiveMetaStore.java:1373)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:103)
   at com.sun.proxy.$Proxy14.get_table(Unknown Source)
   at 
 org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:854)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
   at com.sun.proxy.$Proxy15.getTable(Unknown Source)
   at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950)
   at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:892)
   at org.apache.hadoop.hive.ql.exec.DDLTask.dropTable(DDLTask.java:3276)
   at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:277)
   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
   at 
 org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
   at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414)
   at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192)
   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020)
   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
   at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:298)
   at 
 org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:272)
   at 
 org.apache.spark.sql.hive.execution.DropTable.sideEffectResult$lzycompute(commands.scala:65)
   at 
 org.apache.spark.sql.hive.execution.DropTable.sideEffectResult(commands.scala:63)
   at 
 org.apache.spark.sql.hive.execution.DropTable.execute(commands.scala:71)
   at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360)
   at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360)
   at 
 org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
   at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:103)
   at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:106)
   at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:110)
   at 
 $line28.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:69)
   at 
 $line28.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:74)
   at 
 $line28.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:76)
   at 
 

[jira] [Commented] (SPARK-4326) unidoc is broken on master

2014-11-11 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207177#comment-14207177
 ] 

Nicholas Chammas commented on SPARK-4326:
-

Side question: Should we be (or are we already) regularly building the docs to 
catch these problems at PR/review time?

 unidoc is broken on master
 --

 Key: SPARK-4326
 URL: https://issues.apache.org/jira/browse/SPARK-4326
 Project: Spark
  Issue Type: Bug
  Components: Build, Documentation
Affects Versions: 1.3.0
Reporter: Xiangrui Meng

 On master, `jekyll build` throws the following error:
 {code}
 [error] 
 /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/AppendOnlyMap.scala:205:
  value hashInt is not a member of com.google.common.hash.HashFunction
 [error]   private def rehash(h: Int): Int = 
 Hashing.murmur3_32().hashInt(h).asInt()
 [error]  ^
 [error] 
 /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala:426:
  value limit is not a member of object com.google.common.io.ByteStreams
 [error] val bufferedStream = new 
 BufferedInputStream(ByteStreams.limit(fileStream, end - start))
 [error]  ^
 [error] 
 /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala:558:
  value limit is not a member of object com.google.common.io.ByteStreams
 [error] val bufferedStream = new 
 BufferedInputStream(ByteStreams.limit(fileStream, end - start))
 [error]  ^
 [error] 
 /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala:261:
  value hashInt is not a member of com.google.common.hash.HashFunction
 [error]   private def hashcode(h: Int): Int = 
 Hashing.murmur3_32().hashInt(h).asInt()
 [error]^
 [error] 
 /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/Utils.scala:37:
  type mismatch;
 [error]  found   : java.util.Iterator[T]
 [error]  required: Iterable[?]
 [error] collectionAsScalaIterable(ordering.leastOf(asJavaIterator(input), 
 num)).iterator
 [error]  ^
 [error] 
 /Users/meng/src/spark/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala:421:
  value putAll is not a member of 
 com.google.common.cache.Cache[org.apache.hadoop.fs.FileStatus,parquet.hadoop.Footer]
 [error]   footerCache.putAll(newFooters)
 [error]   ^
 [warn] 
 /Users/meng/src/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/parquet/FakeParquetSerDe.scala:34:
  @deprecated now takes two arguments; see the scaladoc.
 [warn] @deprecated("No code should depend on FakeParquetHiveSerDe as it is 
 only intended as a " +
 [warn]  ^
 [info] No documentation generated with unsucessful compiler run
 [warn] two warnings found
 [error] 6 errors found
 [error] (spark/scalaunidoc:doc) Scaladoc generation failed
 [error] Total time: 48 s, completed Nov 10, 2014 1:31:01 PM
 {code}
 It doesn't happen on branch-1.2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4349) Spark driver hangs on sc.parallelize() if exception is thrown during serialization

2014-11-11 Thread Matt Cheah (JIRA)
Matt Cheah created SPARK-4349:
-

 Summary: Spark driver hangs on sc.parallelize() if exception is 
thrown during serialization
 Key: SPARK-4349
 URL: https://issues.apache.org/jira/browse/SPARK-4349
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Matt Cheah
 Fix For: 1.3.0


Executing the following in the Spark Shell will lead to the Spark Shell hanging 
after a stack trace is printed. The serializer is set to the Kryo serializer.
{code}
scala> import com.esotericsoftware.kryo.io.Input
import com.esotericsoftware.kryo.io.Input

scala> import com.esotericsoftware.kryo.io.Output
import com.esotericsoftware.kryo.io.Output

scala> class MyKryoSerializable extends 
com.esotericsoftware.kryo.KryoSerializable { def write (kryo: 
com.esotericsoftware.kryo.Kryo, output: Output) { throw new 
com.esotericsoftware.kryo.KryoException; } ; def read (kryo: 
com.esotericsoftware.kryo.Kryo, input: Input) { throw new 
com.esotericsoftware.kryo.KryoException; } }
defined class MyKryoSerializable

scala> sc.parallelize(Seq(new MyKryoSerializable, new 
MyKryoSerializable)).collect
{code}

A stack trace is printed during serialization as expected, but another stack 
trace is printed afterwards, indicating that the driver can't recover:

{code}
14/11/11 14:10:03 ERROR OneForOneStrategy: actor name [ExecutorActor] is not 
unique!
akka.actor.PostRestartException: exception post restart (class 
java.io.IOException)
at 
akka.actor.dungeon.FaultHandling$$anonfun$6.apply(FaultHandling.scala:249)
at 
akka.actor.dungeon.FaultHandling$$anonfun$6.apply(FaultHandling.scala:247)
at 
akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1.applyOrElse(FaultHandling.scala:302)
at 
akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1.applyOrElse(FaultHandling.scala:297)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at 
akka.actor.dungeon.FaultHandling$class.finishRecreate(FaultHandling.scala:247)
at 
akka.actor.dungeon.FaultHandling$class.faultRecreate(FaultHandling.scala:76)
at akka.actor.ActorCell.faultRecreate(ActorCell.scala:369)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:459)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: akka.actor.InvalidActorNameException: actor name [ExecutorActor] is 
not unique!
at 
akka.actor.dungeon.ChildrenContainer$NormalChildrenContainer.reserve(ChildrenContainer.scala:130)
at akka.actor.dungeon.Children$class.reserveChild(Children.scala:77)
at akka.actor.ActorCell.reserveChild(ActorCell.scala:369)
at akka.actor.dungeon.Children$class.makeChild(Children.scala:202)
at akka.actor.dungeon.Children$class.attachChild(Children.scala:42)
at akka.actor.ActorCell.attachChild(ActorCell.scala:369)
at akka.actor.ActorSystemImpl.actorOf(ActorSystem.scala:552)
at org.apache.spark.executor.Executor.init(Executor.scala:97)
at 
org.apache.spark.scheduler.local.LocalActor.init(LocalBackend.scala:53)
at 
org.apache.spark.scheduler.local.LocalBackend$$anonfun$start$1.apply(LocalBackend.scala:96)
at 
org.apache.spark.scheduler.local.LocalBackend$$anonfun$start$1.apply(LocalBackend.scala:96)
at akka.actor.TypedCreatorFunctionConsumer.produce(Props.scala:343)
at akka.actor.Props.newActor(Props.scala:252)
at akka.actor.ActorCell.newActor(ActorCell.scala:552)
at 
akka.actor.dungeon.FaultHandling$class.finishRecreate(FaultHandling.scala:234)
... 11 more
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4349) Spark driver hangs on sc.parallelize() if exception is thrown during serialization

2014-11-11 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207185#comment-14207185
 ] 

Matt Cheah commented on SPARK-4349:
---

I'm investigating this now. Someone can assign it to me.

 Spark driver hangs on sc.parallelize() if exception is thrown during 
 serialization
 --

 Key: SPARK-4349
 URL: https://issues.apache.org/jira/browse/SPARK-4349
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Matt Cheah
 Fix For: 1.3.0


 Executing the following in the Spark Shell will lead to the Spark Shell 
 hanging after a stack trace is printed. The serializer is set to the Kryo 
 serializer.
 {code}
 scala> import com.esotericsoftware.kryo.io.Input
 import com.esotericsoftware.kryo.io.Input
 scala> import com.esotericsoftware.kryo.io.Output
 import com.esotericsoftware.kryo.io.Output
 scala> class MyKryoSerializable extends 
 com.esotericsoftware.kryo.KryoSerializable { def write (kryo: 
 com.esotericsoftware.kryo.Kryo, output: Output) { throw new 
 com.esotericsoftware.kryo.KryoException; } ; def read (kryo: 
 com.esotericsoftware.kryo.Kryo, input: Input) { throw new 
 com.esotericsoftware.kryo.KryoException; } }
 defined class MyKryoSerializable
 scala> sc.parallelize(Seq(new MyKryoSerializable, new 
 MyKryoSerializable)).collect
 {code}
 A stack trace is printed during serialization as expected, but another stack 
 trace is printed afterwards, indicating that the driver can't recover:
 {code}
 14/11/11 14:10:03 ERROR OneForOneStrategy: actor name [ExecutorActor] is not 
 unique!
 akka.actor.PostRestartException: exception post restart (class 
 java.io.IOException)
   at 
 akka.actor.dungeon.FaultHandling$$anonfun$6.apply(FaultHandling.scala:249)
   at 
 akka.actor.dungeon.FaultHandling$$anonfun$6.apply(FaultHandling.scala:247)
   at 
 akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1.applyOrElse(FaultHandling.scala:302)
   at 
 akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1.applyOrElse(FaultHandling.scala:297)
   at 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
   at 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
   at 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
   at 
 akka.actor.dungeon.FaultHandling$class.finishRecreate(FaultHandling.scala:247)
   at 
 akka.actor.dungeon.FaultHandling$class.faultRecreate(FaultHandling.scala:76)
   at akka.actor.ActorCell.faultRecreate(ActorCell.scala:369)
   at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:459)
   at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
   at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 Caused by: akka.actor.InvalidActorNameException: actor name [ExecutorActor] 
 is not unique!
   at 
 akka.actor.dungeon.ChildrenContainer$NormalChildrenContainer.reserve(ChildrenContainer.scala:130)
   at akka.actor.dungeon.Children$class.reserveChild(Children.scala:77)
   at akka.actor.ActorCell.reserveChild(ActorCell.scala:369)
   at akka.actor.dungeon.Children$class.makeChild(Children.scala:202)
   at akka.actor.dungeon.Children$class.attachChild(Children.scala:42)
   at akka.actor.ActorCell.attachChild(ActorCell.scala:369)
   at akka.actor.ActorSystemImpl.actorOf(ActorSystem.scala:552)
   at org.apache.spark.executor.Executor.init(Executor.scala:97)
   at 
 org.apache.spark.scheduler.local.LocalActor.init(LocalBackend.scala:53)
   at 
 org.apache.spark.scheduler.local.LocalBackend$$anonfun$start$1.apply(LocalBackend.scala:96)
   at 
 org.apache.spark.scheduler.local.LocalBackend$$anonfun$start$1.apply(LocalBackend.scala:96)
   at akka.actor.TypedCreatorFunctionConsumer.produce(Props.scala:343)
   at akka.actor.Props.newActor(Props.scala:252)
   at akka.actor.ActorCell.newActor(ActorCell.scala:552)
   at 
 akka.actor.dungeon.FaultHandling$class.finishRecreate(FaultHandling.scala:234)
   ... 11 more
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (SPARK-4348) pyspark.mllib.random conflicts with random module

2014-11-11 Thread Doris Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207199#comment-14207199
 ] 

Doris Xin commented on SPARK-4348:
--

I fully support this. It took a lot of hacking just to override the default 
random module in Python, and it wasn't clear if the override was the ideal 
solution.

 pyspark.mllib.random conflicts with random module
 -

 Key: SPARK-4348
 URL: https://issues.apache.org/jira/browse/SPARK-4348
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 1.1.0, 1.2.0
Reporter: Davies Liu
Priority: Blocker

 There are conflicts in two cases:
 1. The random module is used by pyspark.mllib.feature; if the first entry of 
 sys.path is not '', then the hack in pyspark/__init__.py will fail to fix the 
 conflict.
 2. When running tests in mllib/xxx.py, '' should be popped off sys.path before 
 importing anything, or the tests will fail.
 The first case is not fully fixed for users; it will introduce problems in some 
 cases, such as:
 {code}
 import sys
 sys.path.insert(0, PATH_OF_MODULE)
 import pyspark
 # using Word2Vec will then fail
 {code}
 I'd like to rename mllib/random.py to mllib/_random.py, then in 
 mllib/__init__.py
 {code}
 import pyspark.mllib._random as random
 {code}
 cc [~mengxr] [~dorx]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4092) Input metrics don't work for coalesce()'d RDD's

2014-11-11 Thread Kostas Sakellis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207190#comment-14207190
 ] 

Kostas Sakellis commented on SPARK-4092:


[~aash], yes this should solve a superset of the same problems that SPARK-2630 
aims to fix. I say superset because https://github.com/apache/spark/pull/3120 
also includes a similar fix when hadoop 2.5 is used with the bytesReadCallback. 


 Input metrics don't work for coalesce()'d RDD's
 ---

 Key: SPARK-4092
 URL: https://issues.apache.org/jira/browse/SPARK-4092
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Kostas Sakellis
Priority: Critical

 In every case where we set input metrics (from both Hadoop and block storage) 
 we currently assume that exactly one input partition is computed within the 
 task. This is not a correct assumption in the general case. The main example 
 in the current API is coalesce(), but user-defined RDD's could also be 
 affected.
 To deal with the most general case, we would need to support the notion of a 
 single task having multiple input sources. A more surgical and less general 
 fix is to simply go to HadoopRDD and check if there are already inputMetrics 
 defined for the task with the same type. If there are, then merge in the 
 new data rather than blowing away the old one.
 This wouldn't cover case where, e.g. a single task has input from both 
 on-disk and in-memory blocks. It _would_ cover the case where someone calls 
 coalesce on a HadoopRDD... which is more common.
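
A hedged, self-contained sketch of the "merge rather than blow away" idea 
described above; InputSource and InputStats are illustrative stand-ins for 
Spark's internal metrics types, not the real API.

{code}
object InputSource extends Enumeration { val Hadoop, Memory, Disk = Value }

case class InputStats(source: InputSource.Value, var bytesRead: Long)

// Called once per read within a task: if metrics of the same source type already
// exist, accumulate into them instead of replacing them.
def recordRead(existing: Option[InputStats], source: InputSource.Value, bytes: Long): InputStats =
  existing match {
    case Some(stats) if stats.source == source =>
      stats.bytesRead += bytes
      stats
    case _ =>
      InputStats(source, bytes)
  }
{code}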



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4350) aggregate doesn't make copies of zeroValue in local mode

2014-11-11 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-4350:


 Summary: aggregate doesn't make copies of zeroValue in local mode
 Key: SPARK-4350
 URL: https://issues.apache.org/jira/browse/SPARK-4350
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0, 1.0.2, 1.2.0
Reporter: Xiangrui Meng
Priority: Critical


RDD.aggregate makes a copy of zeroValue to collect the task result. However, it 
doesn't make copies of zeroValue for each partition. In local mode, this causes 
race conditions.
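
A minimal sketch of the kind of call that can hit this, assuming a spark-shell 
session where {{sc}} is available: aggregate with a mutable zeroValue. If the 
same zeroValue instance ends up shared by partitions running concurrently in 
local mode, the seqOp calls can interleave and corrupt the result.

{code}
import scala.collection.mutable.ArrayBuffer

val rdd = sc.parallelize(1 to 100, 10)
val collected = rdd.aggregate(ArrayBuffer.empty[Int])(
  (buf, x) => { buf += x; buf },   // seqOp mutates the (possibly shared) buffer
  (b1, b2) => { b1 ++= b2; b1 }    // combOp merges per-partition buffers
)
{code}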



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-2269) Clean up and add unit tests for resourceOffers in MesosSchedulerBackend

2014-11-11 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-2269.

  Resolution: Fixed
   Fix Version/s: 1.2.0
Target Version/s: 1.2.0

 Clean up and add unit tests for resourceOffers in MesosSchedulerBackend
 ---

 Key: SPARK-2269
 URL: https://issues.apache.org/jira/browse/SPARK-2269
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.1.0
Reporter: Patrick Wendell
Assignee: Tim Chen
 Fix For: 1.2.0


 This function could be simplified a bit. We could re-write it without 
 offerableIndices or creating the mesosTasks array as large as the offer list. 
 There is a lot of logic around making sure you get the correct index into 
 mesosTasks and offers, really we should just build mesosTasks directly from 
 the offers we get back. To associate the tasks we are launching with the 
 offers we can just create a hashMap from the slaveId to the original offer.
 The basic logic of the function is that you take the mesos offers, convert 
 them to spark offers, then convert the results back.
 One reason I think it might be designed as it is now is to deal with the case 
 where Mesos gives multiple offers for a single slave. I checked directly with 
 the Mesos team and they said this won't ever happen, you'll get at most one 
 offer per mesos slave within a set of offers.
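
A hedged sketch of the simplification described above; MesosOffer, WorkerOffer 
and TaskDesc are illustrative placeholders, not the actual scheduler types. A 
slaveId-to-offer map replaces the index bookkeeping, relying on the guarantee 
of at most one offer per slave.

{code}
case class MesosOffer(slaveId: String, cores: Int)
case class WorkerOffer(slaveId: String, cores: Int)
case class TaskDesc(slaveId: String, name: String)

def resourceOffers(
    offers: Seq[MesosOffer],
    schedule: Seq[WorkerOffer] => Seq[TaskDesc]): Map[String, (MesosOffer, Seq[TaskDesc])] = {
  // At most one offer per slave, so slaveId is a valid key.
  val offersBySlave = offers.map(o => o.slaveId -> o).toMap
  // Convert Mesos offers to Spark offers, let the scheduler pick tasks, then
  // associate each launched task with the original offer it came from.
  val tasks = schedule(offers.map(o => WorkerOffer(o.slaveId, o.cores)))
  tasks.groupBy(_.slaveId).map { case (slaveId, ts) => slaveId -> (offersBySlave(slaveId), ts) }
}
{code}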



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4228) Save a SchemaRDD in JSON format

2014-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207262#comment-14207262
 ] 

Apache Spark commented on SPARK-4228:
-

User 'dwmclary' has created a pull request for this issue:
https://github.com/apache/spark/pull/3213

 Save a SchemaRDD in JSON format
 ---

 Key: SPARK-4228
 URL: https://issues.apache.org/jira/browse/SPARK-4228
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Yin Huai
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4228) Save a SchemaRDD in JSON format

2014-11-11 Thread Dan McClary (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207261#comment-14207261
 ] 

Dan McClary commented on SPARK-4228:


Pull request here: https://github.com/apache/spark/pull/3213

 Save a SchemaRDD in JSON format
 ---

 Key: SPARK-4228
 URL: https://issues.apache.org/jira/browse/SPARK-4228
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Yin Huai
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3066) Support recommendAll in matrix factorization model

2014-11-11 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207298#comment-14207298
 ] 

Debasish Das commented on SPARK-3066:
-

[~mengxr] I am testing recommendAllUsers and recommendAllProducts API and I 
will add the code to RankingMetrics PR:
https://github.com/apache/spark/pull/3098

I have not used level-3 BLAS yet since we should be able to re-use the 
DistributedMatrix API that's coming online (here all the matrices are dense). 
I used ideas 1 and 2, and I also added a skipRatings parameter to the API (with 
it you can skip the ratings that each user has already provided; for the 
validation I basically skip the training set).

Example API:

def recommendAllUsers(num: Int, skipUserRatings: RDD[Rating]) = {
  val skipUsers = skipUserRatings.map { x => ((x.user, x.product), x.rating) }
  val productVectors = productFeatures.collect
  recommend(productVectors, userFeatures, num, skipUsers)
}

def recommendAllProducts(num: Int, skipProductRatings: RDD[Rating]) = {
  val skipProducts = skipProductRatings.map { x => ((x.product, x.user), x.rating) }
  val userVectors = userFeatures.collect
  recommend(userVectors, productFeatures, num, skipProducts)
}
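
For context, a hedged sketch of what a recommend helper along the lines of 
ideas 1) and 3) from the description might look like: broadcast the collected 
factors of one side, score by dot product, and keep the top-k per entity. The 
names and types are illustrative, and the skipRatings handling is omitted.

{code}
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

def recommend(
    collected: Array[(Int, Array[Double])],      // e.g. all product factors, collected and broadcast
    distributed: RDD[(Int, Array[Double])],      // e.g. user factors, still distributed
    num: Int): RDD[(Int, Array[(Int, Double)])] = {
  val bc = distributed.sparkContext.broadcast(collected)
  distributed.mapValues { features =>
    bc.value
      .map { case (id, other) => (id, features.zip(other).map { case (a, b) => a * b }.sum) }
      .sortBy(-_._2)
      .take(num)
  }
}
{code}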

 Support recommendAll in matrix factorization model
 --

 Key: SPARK-3066
 URL: https://issues.apache.org/jira/browse/SPARK-3066
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng

 ALS returns a matrix factorization model, which we can use to predict ratings 
 for individual queries as well as small batches. In practice, users may want 
 to compute top-k recommendations offline for all users. It is very expensive 
 but a common problem. We can do some optimization like
 1) collect one side (either user or product) and broadcast it as a matrix
 2) use level-3 BLAS to compute inner products
 3) use Utils.takeOrdered to find top-k



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4157) Task input statistics incomplete when a task reads from multiple locations

2014-11-11 Thread Charles Reiss (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Reiss resolved SPARK-4157.
--
Resolution: Duplicate

 Task input statistics incomplete when a task reads from multiple locations
 --

 Key: SPARK-4157
 URL: https://issues.apache.org/jira/browse/SPARK-4157
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Charles Reiss
Priority: Minor

 SPARK-1683 introduced tracking of filesystem reads for tasks, but the 
 tracking code assumes that each task reads from exactly one file/cache block, 
 and replaces any prior InputMetrics object for a task after each read.
 But, for example, a task computing a shuffle-less join (input RDDs are 
 prepartitioned by key) may read two or more cached dependency RDD blocks from 
 cache. In this case, the displayed input size will be for whichever 
 dependency was requested last.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4349) Spark driver hangs on sc.parallelize() if exception is thrown during serialization

2014-11-11 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207392#comment-14207392
 ] 

Matt Cheah commented on SPARK-4349:
---

Investigation showed that the DAGScheduler may not catch unserializable tasks, 
while the TaskSetManager assumes that serialization exceptions are caught in 
the DAGScheduler.

What happens is that in DAGScheduler.submitMissingTasks, a Seq of tasks is 
created and only the first task in the set is proactively serialized to check 
for exceptions. However, with parallel collection partitions and the code I 
provided above, the first task can be serialized because its partition has an 
empty array for values, while other tasks in the set may hold the actual data 
that cannot be serialized.

I'm not sure what the best way to go forward is. Proactively serializing all of 
the tasks is too expensive.
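
A self-contained illustration of the gap described above, using plain Java 
serialization rather than Spark's closure serializer: a check that serializes 
only the first task can pass even when a later task in the same set would fail.

{code}
import java.io.{ByteArrayOutputStream, IOException, ObjectOutputStream}

def canSerialize(task: AnyRef): Boolean =
  try {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    out.writeObject(task)
    out.close()
    true
  } catch {
    case _: IOException => false   // NotSerializableException is an IOException
  }

// True exactly when the "check only the first task" heuristic gives a false pass.
def firstTaskCheckMisses(tasks: Seq[AnyRef]): Boolean =
  canSerialize(tasks.head) && !tasks.forall(canSerialize)
{code}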

 Spark driver hangs on sc.parallelize() if exception is thrown during 
 serialization
 --

 Key: SPARK-4349
 URL: https://issues.apache.org/jira/browse/SPARK-4349
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Matt Cheah
 Fix For: 1.3.0


 Executing the following in the Spark Shell will lead to the Spark Shell 
 hanging after a stack trace is printed. The serializer is set to the Kryo 
 serializer.
 {code}
 scala> import com.esotericsoftware.kryo.io.Input
 import com.esotericsoftware.kryo.io.Input
 scala> import com.esotericsoftware.kryo.io.Output
 import com.esotericsoftware.kryo.io.Output
 scala> class MyKryoSerializable extends 
 com.esotericsoftware.kryo.KryoSerializable { def write (kryo: 
 com.esotericsoftware.kryo.Kryo, output: Output) { throw new 
 com.esotericsoftware.kryo.KryoException; } ; def read (kryo: 
 com.esotericsoftware.kryo.Kryo, input: Input) { throw new 
 com.esotericsoftware.kryo.KryoException; } }
 defined class MyKryoSerializable
 scala> sc.parallelize(Seq(new MyKryoSerializable, new 
 MyKryoSerializable)).collect
 {code}
 A stack trace is printed during serialization as expected, but another stack 
 trace is printed afterwards, indicating that the driver can't recover:
 {code}
 14/11/11 14:10:03 ERROR OneForOneStrategy: actor name [ExecutorActor] is not 
 unique!
 akka.actor.PostRestartException: exception post restart (class 
 java.io.IOException)
   at 
 akka.actor.dungeon.FaultHandling$$anonfun$6.apply(FaultHandling.scala:249)
   at 
 akka.actor.dungeon.FaultHandling$$anonfun$6.apply(FaultHandling.scala:247)
   at 
 akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1.applyOrElse(FaultHandling.scala:302)
   at 
 akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1.applyOrElse(FaultHandling.scala:297)
   at 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
   at 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
   at 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
   at 
 akka.actor.dungeon.FaultHandling$class.finishRecreate(FaultHandling.scala:247)
   at 
 akka.actor.dungeon.FaultHandling$class.faultRecreate(FaultHandling.scala:76)
   at akka.actor.ActorCell.faultRecreate(ActorCell.scala:369)
   at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:459)
   at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
   at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 Caused by: akka.actor.InvalidActorNameException: actor name [ExecutorActor] 
 is not unique!
   at 
 akka.actor.dungeon.ChildrenContainer$NormalChildrenContainer.reserve(ChildrenContainer.scala:130)
   at akka.actor.dungeon.Children$class.reserveChild(Children.scala:77)
   at akka.actor.ActorCell.reserveChild(ActorCell.scala:369)
   at akka.actor.dungeon.Children$class.makeChild(Children.scala:202)
   at akka.actor.dungeon.Children$class.attachChild(Children.scala:42)
   at akka.actor.ActorCell.attachChild(ActorCell.scala:369)
   at akka.actor.ActorSystemImpl.actorOf(ActorSystem.scala:552)
   at org.apache.spark.executor.Executor.init(Executor.scala:97)
   at 
 

[jira] [Updated] (SPARK-4322) Analysis incorrectly rejects accessing grouping fields

2014-11-11 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4322:

Assignee: Cheng Lian

 Analysis incorrectly rejects accessing grouping fields
 --

 Key: SPARK-4322
 URL: https://issues.apache.org/jira/browse/SPARK-4322
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Cheng Lian
Priority: Blocker

 {code}
 sqlContext.jsonRDD(sc.parallelize("""{"a": {"b": [{"c": 1}]}}""" :: 
 Nil)).registerTempTable("data")
 sqlContext.sql("SELECT a.b[0].c FROM data GROUP BY a.b[0].c").collect()
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4258) NPE with new Parquet Filters

2014-11-11 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4258:

Assignee: Cheng Lian

 NPE with new Parquet Filters
 

 Key: SPARK-4258
 URL: https://issues.apache.org/jira/browse/SPARK-4258
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Cheng Lian
Priority: Blocker

 {code}
 Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
 Task 0 in stage 21.0 failed 4 times, most recent failure: Lost task 0.3 in 
 stage 21.0 (TID 160, ip-10-0-247-144.us-west-2.compute.internal): 
 java.lang.NullPointerException: 
 parquet.io.api.Binary$ByteArrayBackedBinary.compareTo(Binary.java:206)
 parquet.io.api.Binary$ByteArrayBackedBinary.compareTo(Binary.java:162)
 
 parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:100)
 
 parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47)
 parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
 
 parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:210)
 
 parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47)
 parquet.filter2.predicate.Operators$Or.accept(Operators.java:302)
 
 parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:201)
 
 parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47)
 parquet.filter2.predicate.Operators$And.accept(Operators.java:290)
 
 parquet.filter2.statisticslevel.StatisticsFilter.canDrop(StatisticsFilter.java:52)
 parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:46)
 parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
 
 parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
 
 parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
 
 parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
 
 parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
 {code}
 This occurs when reading parquet data encoded with the older version of the 
 library for TPC-DS query 34.  Will work on coming up with a smaller 
 reproduction



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4347) GradientBoostingSuite takes more than 1 minute to finish

2014-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207435#comment-14207435
 ] 

Apache Spark commented on SPARK-4347:
-

User 'manishamde' has created a pull request for this issue:
https://github.com/apache/spark/pull/3214

 GradientBoostingSuite takes more than 1 minute to finish
 

 Key: SPARK-4347
 URL: https://issues.apache.org/jira/browse/SPARK-4347
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Xiangrui Meng
Assignee: Manish Amde

 On a MacBook Pro:
 {code}
 [info] GradientBoostingSuite:
 [info] - Regression with continuous features: SquaredError (22 seconds, 875 
 milliseconds)
 [info] - Regression with continuous features: Absolute Error (25 seconds, 652 
 milliseconds)
 [info] - Binary classification with continuous features: Log Loss (26 
 seconds, 604 milliseconds)
 {code}
 Maybe we can reduce the size of test data and make it faster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4351) Record cacheable RDD reads and display RDD miss rates

2014-11-11 Thread Charles Reiss (JIRA)
Charles Reiss created SPARK-4351:


 Summary: Record cacheable RDD reads and display RDD miss rates
 Key: SPARK-4351
 URL: https://issues.apache.org/jira/browse/SPARK-4351
 Project: Spark
  Issue Type: Improvement
Reporter: Charles Reiss
Priority: Minor


Currently, when Spark fails to keep an RDD cached, there is little visibility 
to the user (beyond performance effects), especially if the user is not reading 
executor logs. We could expose this information to the Web UI and the event log 
like we do for RDD storage information by reporting RDD reads and their results 
with task metrics.

From this, live computation of RDD miss rates is straightforward, and 
information in the event log would enable more complicated post-hoc analyses.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4338) Remove yarn-alpha support

2014-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207450#comment-14207450
 ] 

Apache Spark commented on SPARK-4338:
-

User 'sryza' has created a pull request for this issue:
https://github.com/apache/spark/pull/3215

 Remove yarn-alpha support
 -

 Key: SPARK-4338
 URL: https://issues.apache.org/jira/browse/SPARK-4338
 Project: Spark
  Issue Type: Sub-task
  Components: YARN
Reporter: Sandy Ryza





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests

2014-11-11 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-4352:
-

 Summary: Incorporate locality preferences in dynamic allocation 
requests
 Key: SPARK-4352
 URL: https://issues.apache.org/jira/browse/SPARK-4352
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Sandy Ryza






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2014-11-11 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207463#comment-14207463
 ] 

Yu Ishikawa commented on SPARK-2429:


[~rnowling] Sorry for commenting again. Could you tell me what you think about 
the new function to cut a dendrogram?
(For example, you might think we don't need the new function, or that we should 
look for an advantage over KMeans from another point of view.)

1. This algorithm doesn't have an advantage over KMeans in assignment time.
2. The new function generates another model by cutting the dendrogram at a given 
height, without re-training with different parameters.

Thanks,

 Hierarchical Implementation of KMeans
 -

 Key: SPARK-2429
 URL: https://issues.apache.org/jira/browse/SPARK-2429
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Assignee: Yu Ishikawa
Priority: Minor
 Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
 Result of Benchmarking a Hierarchical Clustering.pdf, 
 benchmark-result.2014-10-29.html, benchmark2.html


 Hierarchical clustering algorithms are widely used and would make a nice 
 addition to MLlib.  Clustering algorithms are useful for determining 
 relationships between clusters as well as offering faster assignment. 
 Discussion on the dev list suggested the following possible approaches:
 * Top down, recursive application of KMeans
 * Reuse DecisionTree implementation with different objective function
 * Hierarchical SVD
 It was also suggested that support for distance metrics other than Euclidean 
 such as negative dot or cosine are necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4348) pyspark.mllib.random conflicts with random module

2014-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207511#comment-14207511
 ] 

Apache Spark commented on SPARK-4348:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/3216

 pyspark.mllib.random conflicts with random module
 -

 Key: SPARK-4348
 URL: https://issues.apache.org/jira/browse/SPARK-4348
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 1.1.0, 1.2.0
Reporter: Davies Liu
Priority: Blocker

 There are conflicts in two cases:
 1. The random module is used by pyspark.mllib.feature; if the first entry of 
 sys.path is not '', then the hack in pyspark/__init__.py will fail to fix the 
 conflict.
 2. When running tests in mllib/xxx.py, '' should be popped off sys.path before 
 importing anything, or the tests will fail.
 The first case is not fully fixed for users; it will introduce problems in some 
 cases, such as:
 {code}
 import sys
 sys.path.insert(0, PATH_OF_MODULE)
 import pyspark
 # using Word2Vec will then fail
 {code}
 I'd like to rename mllib/random.py to mllib/_random.py, then in 
 mllib/__init__.py
 {code}
 import pyspark.mllib._random as random
 {code}
 cc [~mengxr] [~dorx]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4348) pyspark.mllib.random conflicts with random module

2014-11-11 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207529#comment-14207529
 ] 

Davies Liu commented on SPARK-4348:
---

After some experiments, I found it's harder than expected; it still needs some 
hack to make it work (see the PR), but I think this hack is safer than before:

1. The rand.py module will not overwrite the default random module, so it's safe 
to run mllib/xxx.py without hacking, and we also don't need a hack to use random 
inside the mllib package.

2. The RandomModuleHook is only installed when the user tries to import 
'pyspark.mllib', and it only works for 'pyspark.mllib.random'.

Note: In order to use the default random module, we need 'from __future__ import 
absolute_import' in the caller module. Without this, 'import random' can be 
resolved as 'from pyspark.mllib import random'. So there is a bug in master 
(Word2Vec)

 pyspark.mllib.random conflicts with random module
 -

 Key: SPARK-4348
 URL: https://issues.apache.org/jira/browse/SPARK-4348
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 1.1.0, 1.2.0
Reporter: Davies Liu
Priority: Blocker

 There are conflicts in two cases:
 1. The random module is used by pyspark.mllib.feature; if the first entry of 
 sys.path is not '', then the hack in pyspark/__init__.py will fail to fix the 
 conflict.
 2. When running tests in mllib/xxx.py, '' should be popped off sys.path before 
 importing anything, or the tests will fail.
 The first case is not fully fixed for users; it will introduce problems in some 
 cases, such as:
 {code}
 import sys
 sys.path.insert(0, PATH_OF_MODULE)
 import pyspark
 # using Word2Vec will then fail
 {code}
 I'd like to rename mllib/random.py to mllib/_random.py, then in 
 mllib/__init__.py
 {code}
 import pyspark.mllib._random as random
 {code}
 cc [~mengxr] [~dorx]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4038) Outlier Detection Algorithm for MLlib

2014-11-11 Thread Ashutosh Trivedi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207550#comment-14207550
 ] 

Ashutosh Trivedi commented on SPARK-4038:
-

A similar issue was opened for Mahout: 
https://issues.apache.org/jira/browse/MAHOUT-384

[~sowen] What are your thoughts? I saw you were helping with the patch there. 
Please see the following for the discussion on it:

http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-td8880.htm


 Outlier Detection Algorithm for MLlib
 -

 Key: SPARK-4038
 URL: https://issues.apache.org/jira/browse/SPARK-4038
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Ashutosh Trivedi
Priority: Minor

 The aim of this JIRA is to discuss which parallel outlier detection 
 algorithms can be included in MLlib. 
 The one I am familiar with is Attribute Value Frequency (AVF). It 
 scales linearly with the number of data points and attributes, and relies on 
 a single data scan. It is not distance based and is well suited for categorical 
 data. In the original paper a parallel version is also given, which is not 
 complicated to implement. I am working on the implementation and will soon 
 submit the initial code for review.
 Here is the Link for the paper
 http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4410382
 As pointed out by Xiangrui in discussion 
 http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-td8880.html
 There are other algorithms as well. Let's discuss which will be more 
 general and more easily parallelized.
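
A hedged sketch of AVF scoring as described above (names are illustrative): a 
first pass counts how often each (attribute, value) pair occurs, then each row 
is scored by the summed frequency of its values; the lowest-scoring rows are 
the outlier candidates (the usual 1/m normalization only rescales the ranking).

{code}
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

def avfScores(data: RDD[Array[String]]): RDD[(Array[String], Long)] = {
  // Frequency table of (attribute index, value) pairs, built in one pass.
  val counts = data
    .flatMap(_.zipWithIndex.map { case (value, attr) => ((attr, value), 1L) })
    .reduceByKey(_ + _)
    .collectAsMap()
  val bc = data.sparkContext.broadcast(counts)
  // Score each row; sorting ascending by score surfaces the outlier candidates.
  data.map(row => (row, row.zipWithIndex.map { case (value, attr) => bc.value((attr, value)) }.sum))
}
{code}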




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4353) Delete the val that never used

2014-11-11 Thread DoingDone9 (JIRA)
DoingDone9 created SPARK-4353:
-

 Summary: Delete the val that never used
 Key: SPARK-4353
 URL: https://issues.apache.org/jira/browse/SPARK-4353
 Project: Spark
  Issue Type: Wish
Reporter: DoingDone9
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4353) Delete the val that never used

2014-11-11 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-4353:
--
Issue Type: Improvement  (was: Wish)

 Delete the val that never used
 --

 Key: SPARK-4353
 URL: https://issues.apache.org/jira/browse/SPARK-4353
 Project: Spark
  Issue Type: Improvement
Reporter: DoingDone9
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4353) Delete the val that never used

2014-11-11 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-4353:
--
Component/s: SQL

 Delete the val that never used
 --

 Key: SPARK-4353
 URL: https://issues.apache.org/jira/browse/SPARK-4353
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: DoingDone9
Priority: Minor

 dbName in Catalog is never used, as in: val (dbName, tblName) = 
 processDatabaseAndTableName(databaseName, tableName); tables -= tblName. I 
 think it should be removed; this could instead be val tblName = 
 processDatabaseAndTableName(databaseName, tableName)._2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4353) Delete the val that never used

2014-11-11 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-4353:
--
Description: dbName in Catalog is never used, as in: val (dbName, 
tblName) = processDatabaseAndTableName(databaseName, tableName); tables -= 
tblName. I think it should be removed; this could instead be val tblName = 
processDatabaseAndTableName(databaseName, tableName)._2
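
A hedged, self-contained before/after sketch of that suggestion; the two 
definitions stand in for Catalog's processDatabaseAndTableName and tables 
members and are only illustrative.

{code}
def processDatabaseAndTableName(db: String, table: String): (String, String) =
  (db.toLowerCase, table.toLowerCase)
val tables = scala.collection.mutable.Map[String, AnyRef]("events" -> new Object)

// Before: dbName is bound but never used.
val (dbName, tblName) = processDatabaseAndTableName("default", "Events")
tables -= tblName

// After: bind only what is used (._2 as suggested; a wildcard pattern works too).
val tblNameOnly = processDatabaseAndTableName("default", "Events")._2
tables -= tblNameOnly
{code}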

 Delete the val that never used
 --

 Key: SPARK-4353
 URL: https://issues.apache.org/jira/browse/SPARK-4353
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: DoingDone9
Priority: Minor

 dbName in Catalog is never used, as in: val (dbName, tblName) = 
 processDatabaseAndTableName(databaseName, tableName); tables -= tblName. I 
 think it should be removed; this could instead be val tblName = 
 processDatabaseAndTableName(databaseName, tableName)._2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4351) Record cacheable RDD reads and display RDD miss rates

2014-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207566#comment-14207566
 ] 

Apache Spark commented on SPARK-4351:
-

User 'woggle' has created a pull request for this issue:
https://github.com/apache/spark/pull/3218

 Record cacheable RDD reads and display RDD miss rates
 -

 Key: SPARK-4351
 URL: https://issues.apache.org/jira/browse/SPARK-4351
 Project: Spark
  Issue Type: Improvement
Reporter: Charles Reiss
Priority: Minor

 Currently, when Spark fails to keep an RDD cached, there is little visibility 
 to the user (beyond performance effects), especially if the user is not 
 reading executor logs. We could expose this information to the Web UI and the 
 event log like we do for RDD storage information by reporting RDD reads and 
 their results with task metrics.
 From this, live computation of RDD miss rates is straightforward, and 
 information in the event log would enable more complicated post-hoc analyses.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR

2014-11-11 Thread Ryan Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207568#comment-14207568
 ] 

Ryan Williams commented on SPARK-3630:
--

I'm seeing many Snappy {{FAILED_TO_UNCOMPRESS(5)}} and {{PARSING_ERROR(2)}} 
errors. I just built Spark yesterday off of 
[227488d|https://github.com/apache/spark/commit/227488d], so I expected that to 
have picked up some of the fixes detailed in this thread. I am running on a 
Yarn cluster whose 100 nodes have kernel 2.6.32, so in a few of these attempts I 
used {{spark.file.transferTo=false}} and still saw these errors.

Here are some notes about some of my runs, along with the stdout I got:
* 1000 partitions, {{spark.file.transferTo=false}}: 
[stdout|https://www.dropbox.com/s/141keqpojucfbai/logs.1000?dl=0]. This was my 
latest run; it took a while to get to my reduceByKeyLocally stage, and 
immediately upon finishing the preceding stage it emitted ~190K 
{{FetchFailure}}s over ~200 attempts of the stage in about one minute, followed 
by some Snappy errors and the job shutting down.
* 2000 partitions, {{spark.file.transferTo=false}}: 
[stdout|https://www.dropbox.com/s/jr1dsldodq4rvbz/logs.2000?dl=0]. This one had 
~150 FetchFailures out of the gate, 
seemingly ran fine for ~8 mins, then had a futures timeout, seemingly ran fine 
for another ~17m, then got to my reduceByKeyLocally stage and died from Snappy 
errors.
* 2000 partitions, {{spark.file.transferTo=true}}: 
[stdout|https://www.dropbox.com/s/9n24ffcdq0j43ue/logs.2000.tt?dl=0]. Before 
running the above two, I was hoping that {{spark.file.transferTo=false}} was 
going to fix my problems, so I ran this to see whether 2000 partitions was the 
determining factor in the Snappy errors happening, as [~joshrosen] suggested in 
this thread. No such luck! ~15 FetchFailures right away, ran fine for 24mins, 
got to reduceByKeyLocally phase, Snappy-failed and died.
* these and other stdout logs can be found 
[here|https://www.dropbox.com/sh/pn0bik3tvy73wfi/AAByFlQVJ3QUOqiKYKXt31RGa?dl=0]

In all of these I was running on a dataset (~170GB) that should be easily 
handled by my cluster (5TB RAM total), and in fact I successfully ran this job 
against this dataset last night using a Spark 1.1 build. That job was dying of 
FetchFailures when I tried to run against a larger dataset (~300GB), and I 
thought maybe I needed shuffle sorting or external shuffle service, or other 
1.2.0 goodies, so I've been trying to run with 1.2.0 but can't get anything to 
finish.

This job reads a file in from hadoop, coalesces to the number of partitions 
I've asked for, and does a {{flatMap}}, a {{reduceByKey}}, a map, and a 
{{reduceByKeyLocally}}. I am pretty confident that the {{Map}} I'm 
materializing onto the driver in the {{reduceByKeyLocally}} is a reasonable 
size; it's a {{Map[Long, Long]}} with about 40K entries, and I've actually 
successfully run this job on this data to materialize that exact map at 
different points this week, as I mentioned before. Something causes this job to 
die almost immediately upon starting the {{reduceByKeyLocally}} phase, however, 
usually just with Snappy errors, but with a preponderance of FetchFailures 
preceding them in my last attempt.

Let me know what other information I can provide that might be useful. Thanks!

 Identify cause of Kryo+Snappy PARSING_ERROR
 ---

 Key: SPARK-3630
 URL: https://issues.apache.org/jira/browse/SPARK-3630
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 1.1.0, 1.2.0
Reporter: Andrew Ash
Assignee: Josh Rosen

 A recent GraphX commit caused non-deterministic exceptions in unit tests so 
 it was reverted (see SPARK-3400).
 Separately, [~aash] observed the same exception stacktrace in an 
 application-specific Kryo registrator:
 {noformat}
 com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to 
 uncompress the chunk: PARSING_ERROR(2)
 com.esotericsoftware.kryo.io.Input.fill(Input.java:142) 
 com.esotericsoftware.kryo.io.Input.require(Input.java:169) 
 com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) 
 com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) 
 com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127)
  
 com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117)
  
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) 
 com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109)
  
 com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
  
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 ...
 {noformat}
 This ticket is to identify the cause of the exception 
