[jira] [Commented] (SPARK-5649) Throw exception when can not apply datatype cast

2015-02-13 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319724#comment-14319724
 ] 

Michael Armbrust commented on SPARK-5649:
-

https://github.com/apache/spark/pull/4558

 Throw exception when can not apply datatype cast
 

 Key: SPARK-5649
 URL: https://issues.apache.org/jira/browse/SPARK-5649
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: wangfei
 Fix For: 1.3.0


 Throw an exception when a datatype cast cannot be applied, to inform the user 
 of the cast issue in their SQL. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5649) Throw exception when can not apply datatype cast

2015-02-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5649.
-
   Resolution: Fixed
Fix Version/s: 1.3.0
 Assignee: wangfei

 Throw exception when can not apply datatype cast
 

 Key: SPARK-5649
 URL: https://issues.apache.org/jira/browse/SPARK-5649
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: wangfei
Assignee: wangfei
 Fix For: 1.3.0


 Throw an exception when a datatype cast cannot be applied, to inform the user 
 of the cast issue in their SQL. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5795) api.java.JavaPairDStream.saveAsNewAPIHadoopFiles may not friendly to java

2015-02-13 Thread Littlestar (JIRA)
Littlestar created SPARK-5795:
-

 Summary: api.java.JavaPairDStream.saveAsNewAPIHadoopFiles may not 
friendly to java
 Key: SPARK-5795
 URL: https://issues.apache.org/jira/browse/SPARK-5795
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.1
Reporter: Littlestar


import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

The following code can't compile in Java:

JavaPairDStream<Integer, Integer> rs = ...
rs.saveAsNewAPIHadoopFiles("prefix", "txt", Integer.class, Integer.class, 
TextOutputFormat.class, jobConf);

but similar code in JavaPairRDD works OK:

JavaPairRDD<String, String> counts = ...
counts.saveAsNewAPIHadoopFile("out", Text.class, Text.class, 
TextOutputFormat.class, jobConf);

Maybe the current definition

  def saveAsNewAPIHadoopFiles(
      prefix: String,
      suffix: String,
      keyClass: Class[_],
      valueClass: Class[_],
      outputFormatClass: Class[_ <: NewOutputFormat[_, _]],
      conf: Configuration = new Configuration) {
    dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass, 
      outputFormatClass, conf)
  }

should be changed to

  def saveAsNewAPIHadoopFiles[F <: NewOutputFormat[_, _]](
      prefix: String,
      suffix: String,
      keyClass: Class[_],
      valueClass: Class[_],
      outputFormatClass: Class[F],
      conf: Configuration = new Configuration) {
    dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass, 
      outputFormatClass, conf)
  }








--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5795) api.java.JavaPairDStream.saveAsNewAPIHadoopFiles may not friendly to java

2015-02-13 Thread Littlestar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319769#comment-14319769
 ] 

Littlestar commented on SPARK-5795:
---

org.apache.spark.api.java.JavaPairRDD[K, V]
{noformat}
  /** Output the RDD to any Hadoop-supported file system. */
  def saveAsHadoopFile[F <: OutputFormat[_, _]](
      path: String,
      keyClass: Class[_],
      valueClass: Class[_],
      outputFormatClass: Class[F],
      conf: JobConf) {
    rdd.saveAsHadoopFile(path, keyClass, valueClass, outputFormatClass, conf)
  }

  /** Output the RDD to any Hadoop-supported file system. */
  def saveAsHadoopFile[F <: OutputFormat[_, _]](
      path: String,
      keyClass: Class[_],
      valueClass: Class[_],
      outputFormatClass: Class[F]) {
    rdd.saveAsHadoopFile(path, keyClass, valueClass, outputFormatClass)
  }

  /** Output the RDD to any Hadoop-supported file system, compressing with the
   *  supplied codec. */
  def saveAsHadoopFile[F <: OutputFormat[_, _]](
      path: String,
      keyClass: Class[_],
      valueClass: Class[_],
      outputFormatClass: Class[F],
      codec: Class[_ <: CompressionCodec]) {
    rdd.saveAsHadoopFile(path, keyClass, valueClass, outputFormatClass, codec)
  }

  /** Output the RDD to any Hadoop-supported file system. */
  def saveAsNewAPIHadoopFile[F <: NewOutputFormat[_, _]](
      path: String,
      keyClass: Class[_],
      valueClass: Class[_],
      outputFormatClass: Class[F],
      conf: Configuration) {
    rdd.saveAsNewAPIHadoopFile(path, keyClass, valueClass, outputFormatClass,
      conf)
  }

  /**
   * Output the RDD to any Hadoop-supported storage system, using
   * a Configuration object for that storage system.
   */
  def saveAsNewAPIHadoopDataset(conf: Configuration) {
    rdd.saveAsNewAPIHadoopDataset(conf)
  }

  /** Output the RDD to any Hadoop-supported file system. */
  def saveAsNewAPIHadoopFile[F <: NewOutputFormat[_, _]](
      path: String,
      keyClass: Class[_],
      valueClass: Class[_],
      outputFormatClass: Class[F]) {
    rdd.saveAsNewAPIHadoopFile(path, keyClass, valueClass, outputFormatClass)
  }
{noformat}

org.apache.spark.streaming.api.java.JavaPairDStream[K, V]

{noformat}
  /**
   * Save each RDD in `this` DStream as a Hadoop file. The file name at each
   * batch interval is generated based on `prefix` and `suffix`:
   * "prefix-TIME_IN_MS.suffix".
   */
  def saveAsHadoopFiles[F <: OutputFormat[K, V]](prefix: String, suffix: String) {
    dstream.saveAsHadoopFiles(prefix, suffix)
  }

  /**
   * Save each RDD in `this` DStream as a Hadoop file. The file name at each
   * batch interval is generated based on `prefix` and `suffix`:
   * "prefix-TIME_IN_MS.suffix".
   */
  def saveAsHadoopFiles(
      prefix: String,
      suffix: String,
      keyClass: Class[_],
      valueClass: Class[_],
      outputFormatClass: Class[_ <: OutputFormat[_, _]]) {
    dstream.saveAsHadoopFiles(prefix, suffix, keyClass, valueClass,
      outputFormatClass)
  }

  /**
   * Save each RDD in `this` DStream as a Hadoop file. The file name at each
   * batch interval is generated based on `prefix` and `suffix`:
   * "prefix-TIME_IN_MS.suffix".
   */
  def saveAsHadoopFiles(
      prefix: String,
      suffix: String,
      keyClass: Class[_],
      valueClass: Class[_],
      outputFormatClass: Class[_ <: OutputFormat[_, _]],
      conf: JobConf) {
    dstream.saveAsHadoopFiles(prefix, suffix, keyClass, valueClass,
      outputFormatClass, conf)
  }

  /**
   * Save each RDD in `this` DStream as a Hadoop file. The file name at each
   * batch interval is generated based on `prefix` and `suffix`:
   * "prefix-TIME_IN_MS.suffix".
   */
  def saveAsNewAPIHadoopFiles[F <: NewOutputFormat[K, V]](prefix: String,
      suffix: String) {
    dstream.saveAsNewAPIHadoopFiles(prefix, suffix)
  }

  /**
   * Save each RDD in `this` DStream as a Hadoop file. The file name at each
   * batch interval is generated based on `prefix` and `suffix`:
   * "prefix-TIME_IN_MS.suffix".
   */
  def saveAsNewAPIHadoopFiles(
      prefix: String,
      suffix: String,
      keyClass: Class[_],
      valueClass: Class[_],
      outputFormatClass: Class[_ <: NewOutputFormat[_, _]]) {
    dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass,
      outputFormatClass)
  }

  /**
   * Save each RDD in `this` DStream as a Hadoop file. The file name at each
   * batch interval is generated based on `prefix` and `suffix`:
   * "prefix-TIME_IN_MS.suffix".
   */
  def saveAsNewAPIHadoopFiles(
      prefix: String,
      suffix: String,
      keyClass: Class[_],
      valueClass: Class[_],
      outputFormatClass: Class[_ <: NewOutputFormat[_, _]],
      conf: Configuration = new Configuration) {
    dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass,
      outputFormatClass, conf)
  }
{noformat}

 api.java.JavaPairDStream.saveAsNewAPIHadoopFiles may not friendly to java
 

[jira] [Commented] (SPARK-3785) Support off-loading computations to a GPU

2015-02-13 Thread Sam Halliday (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319777#comment-14319777
 ] 

Sam Halliday commented on SPARK-3785:
-

Hi all, just joining the thread :-)

I'm the author of netlib-java. I recommend watching my ScalaX talk 
http://fommil.github.io/scalax14/#/ for anybody who hasn't seen it yet. I talk 
about beyond-CPU acceleration in the last few slides (just after the Breeze 
examples).

In my decade of industrial experience with these things, the GPU is a *lot* 
faster than the CPU for large matrix operations, but slower for smaller ones 
(1000 elements or less). Typically, operations that are highly parallelisable, 
such as matrix multiplication, have a constant time cost rather than linear in 
number of elements.

However, the big problem with GPUs is memory management. If you have a problem 
that you're happy to solve entirely on the GPU, you're going to get great 
performance at the cost of less portability... a major consideration for a JVM 
based application. The trick is minimising how much data you need to transmit 
between the traditional CPU memory space and the GPU memory space. And further 
optimisations can be obtained by using the GPU profilers that come with the 
card.

It is for this reason that GPU-backed implementations of BLAS/LAPACK can only 
match, but not surpass, the performance of Intel MKL. There exist BLAS-LIKE and 
LAPACK-LIKE implementations for GPUs (e.g. cuBLAS, clBLAS) but they can only be 
used when you hold pointers to the GPU memory regions and are not good for use 
from Java/Scala (unless you are using macros/code generators to really generate 
native code).

I have links with FPGA companies and I'd love to see a full BLAS implementation 
using that custom hardware... but it's such a mammoth task the FPGA 
implementors (not me) would need to be funded to do it.

I am very hopeful about the cutting edge commodity tech coming from Intel/AMD 
(e.g. APUs) which allow CPU and GPU to share the memory region. I would love to 
buy one of these machines and write a minimal BLAS implementation to do some 
benchmarks and see if we can get GPU performance without the memory transfer 
overhead. My project https://github.com/fommil/multiblas (which was abandoned 
until the tech caught up) would be a perfect place to do this and would involve 
only runtime changes for Spark users to benefit. But, to be honest, I'd 
probably need funding to turn my attention to this because I've got a few other 
personal priorities at the moment.

I've heard the Raspberry Pi has such a shared region. It might be interesting 
to use it as a cheapo dev environment.

 Support off-loading computations to a GPU
 -

 Key: SPARK-3785
 URL: https://issues.apache.org/jira/browse/SPARK-3785
 Project: Spark
  Issue Type: Brainstorming
  Components: MLlib
Reporter: Thomas Darimont
Priority: Minor

 Are there any plans to adding support for off-loading computations to the 
 GPU, e.g. via an open-cl binding? 
 http://www.jocl.org/
 https://code.google.com/p/javacl/
 http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5795) api.java.JavaPairDStream.saveAsNewAPIHadoopFiles may not friendly to java

2015-02-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5795:
-
Priority: Minor  (was: Major)

When you say it doesn't compile, you should show the compilation error, 
although I think I know what it is. There's a workaround, but I agree we can 
look at fixing it. If it breaks binary compatibility, it would have to wait 
until later.
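
For reference, one workaround pattern (an assumption on the editor's part, not
taken from this thread): the Scala existential surfaces in Java as
Class<? extends OutputFormat<?, ?>>, which javac will not infer from
TextOutputFormat.class, but an explicit unchecked cast satisfies it. A minimal
sketch, reusing the reporter's rs and jobConf:

{code}
// Hedged sketch of a possible workaround against the 1.2.1 API quoted below;
// rs (JavaPairDStream<Integer, Integer>) and jobConf are the reporter's values.
@SuppressWarnings("unchecked")
Class<? extends org.apache.hadoop.mapreduce.OutputFormat<?, ?>> ofClass =
    (Class<? extends org.apache.hadoop.mapreduce.OutputFormat<?, ?>>) (Class<?>)
        org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.class;
rs.saveAsNewAPIHadoopFiles("prefix", "txt", Integer.class, Integer.class,
    ofClass, jobConf);
{code}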

 api.java.JavaPairDStream.saveAsNewAPIHadoopFiles may not friendly to java
 -

 Key: SPARK-5795
 URL: https://issues.apache.org/jira/browse/SPARK-5795
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.1
Reporter: Littlestar
Priority: Minor

 import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
 The following code can't compile in Java:
 JavaPairDStream<Integer, Integer> rs = ...
 rs.saveAsNewAPIHadoopFiles("prefix", "txt", Integer.class, Integer.class, 
 TextOutputFormat.class, jobConf);
 but similar code in JavaPairRDD works OK:
 JavaPairRDD<String, String> counts = ...
 counts.saveAsNewAPIHadoopFile("out", Text.class, Text.class, 
 TextOutputFormat.class, jobConf);
 
 Maybe the current definition
   def saveAsNewAPIHadoopFiles(
       prefix: String,
       suffix: String,
       keyClass: Class[_],
       valueClass: Class[_],
       outputFormatClass: Class[_ <: NewOutputFormat[_, _]],
       conf: Configuration = new Configuration) {
     dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass, 
       outputFormatClass, conf)
   }
 should be changed to
   def saveAsNewAPIHadoopFiles[F <: NewOutputFormat[_, _]](
       prefix: String,
       suffix: String,
       keyClass: Class[_],
       valueClass: Class[_],
       outputFormatClass: Class[F],
       conf: Configuration = new Configuration) {
     dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass, 
       outputFormatClass, conf)
   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5728) MQTTStreamSuite leaves behind ActiveMQ database files

2015-02-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5728:
-
Fix Version/s: 1.2.2

 MQTTStreamSuite leaves behind ActiveMQ database files
 -

 Key: SPARK-5728
 URL: https://issues.apache.org/jira/browse/SPARK-5728
 Project: Spark
  Issue Type: Bug
  Components: Streaming, Tests
Affects Versions: 1.2.1
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Trivial
 Fix For: 1.3.0, 1.2.2


 I've seen this several times and finally wanted to fix it: 
 {{MQTTStreamSuite}} uses a local ActiveMQ broker, which creates a working dir 
 for its database, called {{activemq}}, in the {{external/mqtt}} directory. This 
 doesn't get cleaned up, at least not reliably for me. It's trivial to 
 set it to use a temp directory, which the test harness does clean up.
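
For what it's worth, a minimal sketch of that change (using ActiveMQ's
BrokerService API; the temp-dir prefix is an assumption, not the actual patch):

{code}
// Hedged sketch: point the embedded broker's database at a temp directory
// that the test harness cleans up, instead of external/mqtt/activemq.
java.nio.file.Path dataDir = java.nio.file.Files.createTempDirectory("activemq-data");
org.apache.activemq.broker.BrokerService broker = new org.apache.activemq.broker.BrokerService();
broker.setDataDirectory(dataDir.toString());
{code}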



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4832) some other processes might take the daemon pid

2015-02-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4832.
--
   Resolution: Fixed
Fix Version/s: 1.2.2
   1.3.0

Issue resolved by pull request 3683
[https://github.com/apache/spark/pull/3683]

 some other processes might take the daemon pid
 --

 Key: SPARK-4832
 URL: https://issues.apache.org/jira/browse/SPARK-4832
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Reporter: Tao Wang
Priority: Minor
 Fix For: 1.3.0, 1.2.2


 Some other processes might use the pid saved in the pid file. In that case we 
 should ignore it and launch the daemons.
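
A minimal sketch of the liveness check this implies (Linux-specific; the pid
value and the /proc check are illustrative assumptions, not Spark's actual
script logic):

{code}
// Hedged sketch: a stale pid file should only block the launch if the pid
// still belongs to a live process.
long pid = 12345; // value read from the daemon's pid file (hypothetical)
boolean alive = new java.io.File("/proc/" + pid).exists();
if (!alive) {
    System.out.println("pid file is stale; safe to launch the daemon");
}
{code}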



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5726) Hadamard Vector Product Transformer

2015-02-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319839#comment-14319839
 ] 

Sean Owen commented on SPARK-5726:
--

Go ahead and change it; my guess is that Xiangrui is OK with that too, but he 
can comment as well.

 Hadamard Vector Product Transformer
 ---

 Key: SPARK-5726
 URL: https://issues.apache.org/jira/browse/SPARK-5726
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Reporter: Octavian Geagla
Assignee: Octavian Geagla

 I originally posted my idea here: 
 http://apache-spark-developers-list.1001551.n3.nabble.com/Any-interest-in-weighting-VectorTransformer-which-does-component-wise-scaling-td10265.html
 A draft of this feature is implemented, documented, and tested already.  Code 
 is on a branch on my fork here: 
 https://github.com/ogeagla/spark/compare/spark-mllib-weighting
 I'm curious if there is any interest in this feature, in which case I'd 
 appreciate some feedback.  One thing that might be useful is an example/test 
 case using the transformer within the ML pipeline, since there are not any 
 examples which use Vectors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4832) some other processes might take the daemon pid

2015-02-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4832:
-
Assignee: Tao Wang

 some other processes might take the daemon pid
 --

 Key: SPARK-4832
 URL: https://issues.apache.org/jira/browse/SPARK-4832
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Reporter: Tao Wang
Assignee: Tao Wang
Priority: Minor
 Fix For: 1.3.0, 1.2.2


 Some other processes might use the pid saved in the pid file. In that case we 
 should ignore it and launch the daemons.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4631) Add real unit test for MQTT

2015-02-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4631:
-
Target Version/s:   (was: 1.3.0)
   Fix Version/s: 1.2.2

 Add real unit test for MQTT 
 

 Key: SPARK-4631
 URL: https://issues.apache.org/jira/browse/SPARK-4631
 Project: Spark
  Issue Type: Test
  Components: Streaming
Reporter: Tathagata Das
Priority: Critical
 Fix For: 1.3.0, 1.2.2


 A real unit test that actually transfers data to ensure that the MQTTUtil is 
 functional



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-02-13 Thread Dr. Christian Betz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319948#comment-14319948
 ] 

Dr. Christian Betz commented on SPARK-5081:
---

From SPARK-5715:
I see a *factor-four performance loss* in my Spark jobs when migrating from 
Spark 1.1.0 to Spark 1.2.0 or 1.2.1.

Also, I see an *increase in the size of shuffle writes*, which is also reported 
by Kevin Jung on the mailing list: 
http://apache-spark-user-list.1001560.n3.nabble.com/Shuffle-write-increases-in-spark-1-2-tt20894.html
 

Together with this I experience a *huge number of disk spills*.



I'm experiencing these with my job under the following circumstances: 

* Spark 1.2.0 with Sort-based Shuffle 
* Spark 1.2.0 with Hash-based Shuffle 
* Spark 1.2.1 with Sort-based Shuffle 

All three combinations show the same behavior, which contrasts with Spark 
1.1.0. 
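
(For context, not from the report: the shuffle implementation is selected with
the spark.shuffle.manager setting, and "sort" became the default in 1.2.0. A
minimal sketch of pinning it explicitly:)

{code}
// Hedged sketch: forcing a particular shuffle manager in Spark 1.2.x.
org.apache.spark.SparkConf conf = new org.apache.spark.SparkConf()
    .setAppName("shuffle-comparison")
    .set("spark.shuffle.manager", "hash"); // or "sort", the 1.2.0 default
{code}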

In Spark 1.1.0, my job runs for about an hour, in Spark 1.2.x it runs for 
almost four hours. Configuration is identical otherwise - I only added 
org.apache.spark.scheduler.CompressedMapStatus to the Kryo registrator for 
Spark 1.2.0 to cope with https://issues.apache.org/jira/browse/SPARK-5102. 


As a consequence (I think, but causality might be different) I see lots and 
lots of disk spills. 

I cannot provide a small test case, but maybe the log entries for a single 
worker thread can help someone investigate on this. (See below.) 


I will also open up an issue, if nobody stops me by providing an answer ;) 

Any help will be greatly appreciated, because otherwise I'm stuck with Spark 
1.1.0, as quadrupling runtime is not an option. 

Sincerely, 

Chris 



2015-02-09T14:06:06.328+01:00 INFO org.apache.spark.executor.Executor Running 
task 9.0 in stage 18.0 (TID 300) Executor task launch worker-18 
2015-02-09T14:06:06.351+01:00 INFO org.apache.spark.CacheManager Partition 
rdd_35_9 not found, computing it Executor task launch worker-18 
2015-02-09T14:06:06.351+01:00 INFO 
org.apache.spark.storage.ShuffleBlockFetcherIterator Getting 10 non-empty 
blocks out of 10 blocks Executor task launch worker-18 
2015-02-09T14:06:06.351+01:00 INFO 
org.apache.spark.storage.ShuffleBlockFetcherIterator Started 0 remote fetches 
in 0 ms Executor task launch worker-18 
2015-02-09T14:06:07.396+01:00 INFO org.apache.spark.storage.MemoryStore 
ensureFreeSpace(2582904) called with curMem=300174944, maxMe... Executor task 
launch worker-18 
2015-02-09T14:06:07.397+01:00 INFO org.apache.spark.storage.MemoryStore Block 
rdd_35_9 stored as bytes in memory (estimated size 2.5... Executor task launch 
worker-18 
2015-02-09T14:06:07.398+01:00 INFO org.apache.spark.storage.BlockManagerMaster 
Updated info of block rdd_35_9 Executor task launch worker-18 
2015-02-09T14:06:07.399+01:00 INFO org.apache.spark.CacheManager Partition 
rdd_38_9 not found, computing it Executor task launch worker-18 
2015-02-09T14:06:07.399+01:00 INFO 
org.apache.spark.storage.ShuffleBlockFetcherIterator Getting 10 non-empty 
blocks out of 10 blocks Executor task launch worker-18 
2015-02-09T14:06:07.400+01:00 INFO 
org.apache.spark.storage.ShuffleBlockFetcherIterator Started 0 remote fetches 
in 0 ms Executor task launch worker-18 
2015-02-09T14:06:07.567+01:00 INFO org.apache.spark.storage.MemoryStore 
ensureFreeSpace(944848) called with curMem=302757848, maxMem... Executor task 
launch worker-18 
2015-02-09T14:06:07.568+01:00 INFO org.apache.spark.storage.MemoryStore Block 
rdd_38_9 stored as values in memory (estimated size 92... Executor task launch 
worker-18 
2015-02-09T14:06:07.569+01:00 INFO org.apache.spark.storage.BlockManagerMaster 
Updated info of block rdd_38_9 Executor task launch worker-18 
2015-02-09T14:06:07.573+01:00 INFO 
org.apache.spark.storage.ShuffleBlockFetcherIterator Getting 34 non-empty 
blocks out of 50 blocks Executor task launch worker-18 
2015-02-09T14:06:07.573+01:00 INFO 
org.apache.spark.storage.ShuffleBlockFetcherIterator Started 0 remote fetches 
in 1 ms Executor task launch worker-18 
2015-02-09T14:06:38.931+01:00 INFO org.apache.spark.CacheManager Partition 
rdd_41_9 not found, computing it Executor task launch worker-18 
2015-02-09T14:06:38.931+01:00 INFO 
org.apache.spark.storage.ShuffleBlockFetcherIterator Getting 3 non-empty blocks 
out of 10 blocks Executor task launch worker-18 
2015-02-09T14:06:38.931+01:00 INFO 
org.apache.spark.storage.ShuffleBlockFetcherIterator Started 0 remote fetches 
in 0 ms Executor task launch worker-18 
2015-02-09T14:06:38.945+01:00 INFO org.apache.spark.storage.MemoryStore 
ensureFreeSpace(0) called with curMem=307529127, maxMem=9261... Executor task 
launch worker-18 
2015-02-09T14:06:38.945+01:00 INFO org.apache.spark.storage.MemoryStore Block 
rdd_41_9 stored as bytes in memory (estimated size 0.0... Executor task launch 
worker-18 
2015-02-09T14:06:38.946+01:00 INFO org.apache.spark.storage.BlockManagerMaster 
Updated info of block 

[jira] [Resolved] (SPARK-5285) Removed GroupExpression in catalyst

2015-02-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5285.
-
Resolution: Won't Fix

  Removed GroupExpression in catalyst
 

 Key: SPARK-5285
 URL: https://issues.apache.org/jira/browse/SPARK-5285
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: wangfei

  Removed GroupExpression in catalyst



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5518) Error messages for plans with invalid AttributeReferences

2015-02-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5518.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

 Error messages for plans with invalid AttributeReferences
 -

 Key: SPARK-5518
 URL: https://issues.apache.org/jira/browse/SPARK-5518
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker
 Fix For: 1.3.0


 It is now possible for users to put invalid attribute references into query 
 plans.  We should check for this case at the end of analysis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5518) Error messages for plans with invalid AttributeReferences

2015-02-13 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319718#comment-14319718
 ] 

Michael Armbrust commented on SPARK-5518:
-

https://github.com/apache/spark/pull/4558

 Error messages for plans with invalid AttributeReferences
 -

 Key: SPARK-5518
 URL: https://issues.apache.org/jira/browse/SPARK-5518
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker
 Fix For: 1.3.0


 It is now possible for users to put invalid attribute references into query 
 plans.  We should check for this case at the end of analysis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5265) Submitting applications on Standalone cluster controlled by Zookeeper forces to know active master

2015-02-13 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-5265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319965#comment-14319965
 ] 

Wojciech Pituła edited comment on SPARK-5265 at 2/13/15 11:24 AM:
--

We have the same issue. Such a master URL works fine with --deploy-mode client 
but breaks with --deploy-mode cluster.


was (Author: krever):
We have the same issue. Such master url works fine with --deploy-mode client 
but breaks with --deploy-mode cluster.

 Submitting applications on Standalone cluster controlled by Zookeeper forces 
 to know active master
 --

 Key: SPARK-5265
 URL: https://issues.apache.org/jira/browse/SPARK-5265
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Reporter: Roque Vassal'lo
  Labels: cluster, spark-submit, standalone, zookeeper

 Hi, this is my first JIRA here, so I hope it is clear enough.
 I'm using Spark 1.2.0 and trying to submit an application on a Spark 
 Standalone cluster in cluster deploy mode with supervise.
 Standalone cluster is running in high availability mode, using Zookeeper to 
 provide leader election between three available Masters (named master1, 
 master2 and master3).
 As described in Spark's documentation, to register a Worker with the Standalone 
 cluster, I provide the complete cluster info as the spark:// URL.
 I mean, spark://master1:7077,master2:7077,master3:7077
 and that URL is parsed so that three attempts are launched, the first to 
 master1:7077, the second to master2:7077 and the third to master3:7077.
 This works great!
 But if I try to do the same while submitting applications, it fails.
 I mean, if I provide the complete cluster info as the --master option to the 
 spark-submit script, it throws an exception because it tries to connect as if 
 it were a single node.
 Example:
 spark-submit --class org.apache.spark.examples.SparkPi --master 
 spark://master1:7077,master2:7077,master3:7077 --deploy-mode cluster 
 --supervise examples.jar 100
 This is the output I got:
 Using Spark's default log4j profile: 
 org/apache/spark/log4j-defaults.properties
 15/01/14 17:02:11 INFO SecurityManager: Changing view acls to: mytest
 15/01/14 17:02:11 INFO SecurityManager: Changing modify acls to: mytest
 15/01/14 17:02:11 INFO SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(mytest); users 
 with modify permissions: Set(mytest)
 15/01/14 17:02:11 INFO Slf4jLogger: Slf4jLogger started
 15/01/14 17:02:11 INFO Utils: Successfully started service 'driverClient' on 
 port 53930.
 15/01/14 17:02:11 ERROR OneForOneStrategy: Invalid master URL: 
 spark://master1:7077,master2:7077,master3:7077
 akka.actor.ActorInitializationException: exception during creation
   at akka.actor.ActorInitializationException$.apply(Actor.scala:164)
   at akka.actor.ActorCell.create(ActorCell.scala:596)
   at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
   at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
   at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 Caused by: org.apache.spark.SparkException: Invalid master URL: 
 spark://master1:7077,master2:7077,master3:7077
   at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830)
   at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42)
   at akka.actor.Actor$class.aroundPreStart(Actor.scala:470)
   at org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35)
   at akka.actor.ActorCell.create(ActorCell.scala:580)
   ... 9 more
 Shouldn't it be parsed the same way as on Worker registration?
 That would not force the client to know which Master of the Standalone cluster 
 is currently active.
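
A minimal sketch of the parsing the reporter asks for (an illustration of the
idea only, not Spark's actual Worker-registration code):

{code}
// Hedged sketch: split a multi-master URL into one candidate URL per master
// and try each in turn, as Worker registration is described as doing above.
String masterUrl = "spark://master1:7077,master2:7077,master3:7077";
for (String hostPort : masterUrl.substring("spark://".length()).split(",")) {
    String candidate = "spark://" + hostPort;
    System.out.println("attempting " + candidate); // connect until one answers
}
{code}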



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5795) api.java.JavaPairDStream.saveAsNewAPIHadoopFiles may not friendly to java

2015-02-13 Thread Littlestar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Littlestar updated SPARK-5795:
--
Attachment: TestStreamCompile.java

My test case (attached) is against Java 1.7 and Spark 1.3 trunk.
Thanks.

 api.java.JavaPairDStream.saveAsNewAPIHadoopFiles may not friendly to java
 -

 Key: SPARK-5795
 URL: https://issues.apache.org/jira/browse/SPARK-5795
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.1
Reporter: Littlestar
Priority: Minor
 Attachments: TestStreamCompile.java


 import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
 The following code can't compile in Java:
 JavaPairDStream<Integer, Integer> rs = ...
 rs.saveAsNewAPIHadoopFiles("prefix", "txt", Integer.class, Integer.class, 
 TextOutputFormat.class, jobConf);
 but similar code in JavaPairRDD works OK:
 JavaPairRDD<String, String> counts = ...
 counts.saveAsNewAPIHadoopFile("out", Text.class, Text.class, 
 TextOutputFormat.class, jobConf);
 
 Maybe the current definition
   def saveAsNewAPIHadoopFiles(
       prefix: String,
       suffix: String,
       keyClass: Class[_],
       valueClass: Class[_],
       outputFormatClass: Class[_ <: NewOutputFormat[_, _]],
       conf: Configuration = new Configuration) {
     dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass, 
       outputFormatClass, conf)
   }
 should be changed to
   def saveAsNewAPIHadoopFiles[F <: NewOutputFormat[_, _]](
       prefix: String,
       suffix: String,
       keyClass: Class[_],
       valueClass: Class[_],
       outputFormatClass: Class[F],
       conf: Configuration = new Configuration) {
     dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass, 
       outputFormatClass, conf)
   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4766) ML Estimator Params should subclass Transformer Params

2015-02-13 Thread Peter Rudenko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320038#comment-14320038
 ] 

Peter Rudenko commented on SPARK-4766:
--

This is a very important feature that could yield a pretty big speedup. Let me 
explain why. I have a pipeline with 4 transformers and 1 estimator model 
(LogisticRegression), with 3 folds for cross-validation and 3 hyperparameter 
values in the grid search:

{code}
val paramGrid = new ParamGridBuilder()
  .addGrid(model.regParam, Array(0.1, 0.01, 0.001))
  .build()

crossval.setEstimatorParamMaps(paramGrid)
crossval.setNumFolds(3)
{code}

The transformers don't have any parameters in the grid search. Right now, for 
every possible combination of hyperparameter and cross-validation fold, the 
data is transformed again (with the same transformers), creating a new RDD with 
a new ID but the same contents, so I cannot cache it. What I came up with is to 
use 2 pipelines: 
# Transformer pipeline - transforms the whole data once 
# Model pipeline with just the model in it.

I modified [Pipeline|https://issues.apache.org/jira/browse/SPARK-5796] and the 
LogisticRegression class (commented out instances.unpersist() because the same 
instances would be used for each hyperparameter). This reduced the running time 
of the LogisticRegression pipeline significantly.

But it would be cool to do this in Pipeline itself: if there are no parameters 
for the Transformer stages, construct the data once and pass the same data to 
the estimator for each hyperparameter. Then for 3 folds it would read and cache 
the data 3 times ((1 to 3).combinations(2)) and would not depend on the number 
of hyperparameters for the estimator (right now it does 3 folds * 3 
model parameters = 9 transformations).
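
A toy sketch of the shape of that optimization (plain Java, not the spark.ml
API; the in-line transformer and fitModel() are hypothetical stand-ins):

{code}
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class TwoStageGridSearch {
    public static void main(String[] args) {
        List<double[]> raw = Arrays.asList(new double[]{1, 2}, new double[]{3, 4});
        // Stage 1: run the parameter-free transformers exactly once and keep
        // ("cache") the result, instead of once per fold * hyperparameter.
        List<double[]> transformed = raw.stream()
            .map(v -> new double[]{v[0] * 2, v[1] * 2}) // stand-in transformer
            .collect(Collectors.toList());
        // Stage 2: only the estimator's hyperparameters vary, so every
        // regParam value reuses the cached transformed data.
        for (double regParam : new double[]{0.1, 0.01, 0.001}) {
            fitModel(transformed, regParam);
        }
    }

    static void fitModel(List<double[]> data, double regParam) { // hypothetical trainer
        System.out.printf("fit on %d cached rows, regParam=%s%n", data.size(), regParam);
    }
}
{code}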


 ML Estimator Params should subclass Transformer Params
 --

 Key: SPARK-4766
 URL: https://issues.apache.org/jira/browse/SPARK-4766
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 Currently, in spark.ml, both Transformers and Estimators extend the same 
 Params classes.  There should be one Params class for the Transformer and one 
 for the Estimator, where the Estimator params class extends the Transformer 
 one.
 E.g., it is weird to be able to do:
 {code}
 val model: LogisticRegressionModel = ...
 model.getMaxIter()
 {code}
 (This is the only case where this happens currently, but it is worth setting 
 a precedent.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4267) Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later

2015-02-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4267:
-
Target Version/s:   (was: 1.3.0)
   Fix Version/s: 1.2.2

 Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later
 --

 Key: SPARK-4267
 URL: https://issues.apache.org/jira/browse/SPARK-4267
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Tsuyoshi OZAWA
Assignee: Sean Owen
Priority: Blocker
 Fix For: 1.3.0, 1.2.2


 Currently we're trying Spark on YARN included in Hadoop 2.5.1. Hadoop 2.5 
 uses protobuf 2.5.0, so I compiled with protobuf 2.5.0 like this:
 {code}
  ./make-distribution.sh --name spark-1.1.1 --tgz -Pyarn 
 -Dhadoop.version=2.5.1 -Dprotobuf.version=2.5.0
 {code}
 Then Spark on YARN fails to launch jobs with NPE.
 {code}
 $ bin/spark-shell --master yarn-client
 scala> sc.textFile("hdfs:///user/ozawa/wordcountInput20G").flatMap(line => 
 line.split(" ")).map(word => (word, 1)).persist().reduceByKey((a, b) => a 
 + b, 16).saveAsTextFile("hdfs:///user/ozawa/sparkWordcountOutNew2");
 java.lang.NullPointerException
 at org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1284)
 at org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1291)
 at org.apache.spark.SparkContext.textFile$default$2(SparkContext.scala:480)
 at $iwC$$iwC$$iwC$$iwC.<init>(<console>:13)
 at $iwC$$iwC$$iwC.<init>(<console>:18)
 at $iwC$$iwC.<init>(<console>:20)
 at $iwC.<init>(<console>:22)
 at <init>(<console>:24)
 at .<init>(<console>:28)
 at .<clinit>(<console>)
 at .<init>(<console>:7)
 at .<clinit>(<console>)
 at $print(<console>)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
 at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
 at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)

[jira] [Commented] (SPARK-5795) api.java.JavaPairDStream.saveAsNewAPIHadoopFiles may not friendly to java

2015-02-13 Thread Littlestar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320109#comment-14320109
 ] 

Littlestar commented on SPARK-5795:
---

Is it the same problem as SPARK-5297? Thanks.


 api.java.JavaPairDStream.saveAsNewAPIHadoopFiles may not friendly to java
 -

 Key: SPARK-5795
 URL: https://issues.apache.org/jira/browse/SPARK-5795
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.1
Reporter: Littlestar
Priority: Minor
 Attachments: TestStreamCompile.java


 import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
 The following code can't compile in Java:
 JavaPairDStream<Integer, Integer> rs = ...
 rs.saveAsNewAPIHadoopFiles("prefix", "txt", Integer.class, Integer.class, 
 TextOutputFormat.class, jobConf);
 but similar code in JavaPairRDD works OK:
 JavaPairRDD<String, String> counts = ...
 counts.saveAsNewAPIHadoopFile("out", Text.class, Text.class, 
 TextOutputFormat.class, jobConf);
 
 Maybe the current definition
   def saveAsNewAPIHadoopFiles(
       prefix: String,
       suffix: String,
       keyClass: Class[_],
       valueClass: Class[_],
       outputFormatClass: Class[_ <: NewOutputFormat[_, _]],
       conf: Configuration = new Configuration) {
     dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass, 
       outputFormatClass, conf)
   }
 should be changed to
   def saveAsNewAPIHadoopFiles[F <: NewOutputFormat[_, _]](
       prefix: String,
       suffix: String,
       keyClass: Class[_],
       valueClass: Class[_],
       outputFormatClass: Class[F],
       conf: Configuration = new Configuration) {
     dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass, 
       outputFormatClass, conf)
   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5252) Streaming StatefulNetworkWordCount example hangs

2015-02-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5252:
-
Component/s: PySpark
 Examples

Looks like you have an environment problem:

{code}
java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.
{code}

Can you resolve this and then see if you have this problem?
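
(For reference, a hedged sketch of one common remedy for that message: Hadoop's
shell utilities look for either the HADOOP_HOME environment variable or the
hadoop.home.dir system property; the path below is a placeholder, not from this
report.)

{code}
// Hedged sketch: satisfy Hadoop's home-directory check before starting Spark.
System.setProperty("hadoop.home.dir", "/opt/hadoop"); // hypothetical install path
{code}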

 Streaming StatefulNetworkWordCount example hangs
 

 Key: SPARK-5252
 URL: https://issues.apache.org/jira/browse/SPARK-5252
 Project: Spark
  Issue Type: Bug
  Components: Examples, PySpark, Streaming
Affects Versions: 1.2.0
 Environment: Ubuntu Linux
Reporter: Lutz Buech
 Attachments: debug.txt


 Running the stateful network word count example in Python (on one local node):
 https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/stateful_network_wordcount.py
 At the beginning, when no data is streamed, empty status outputs are 
 generated, only decorated by the current Time, e.g.:
 ---
 Time: 2015-01-14 17:58:20
 ---
 ---
 Time: 2015-01-14 17:58:21
 ---
 As soon as I stream some data via netcat, no new status updates will show. 
 Instead, there is one line saying
 [Stage <number>: (2 + 0) / 3]
 where <number> is some integer, e.g. 132. There is no further output 
 on stdout.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5756) Analyzer should not throw scala.NotImplementedError for illegitimate sql

2015-02-13 Thread wangfei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangfei resolved SPARK-5756.

Resolution: Fixed

 Analyzer should not throw scala.NotImplementedError for illegitimate sql
 

 Key: SPARK-5756
 URL: https://issues.apache.org/jira/browse/SPARK-5756
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: wangfei

 ```SELECT CAST(x AS STRING) FROM src``` yields a NotImplementedError:
   CliDriver: scala.NotImplementedError: an implementation is missing
 at scala.Predef$.$qmark$qmark$qmark(Predef.scala:252)
 at 
 org.apache.spark.sql.catalyst.expressions.PrettyAttribute.dataType(namedExpressions.scala:221)
 at 
 org.apache.spark.sql.catalyst.expressions.Cast.resolved$lzycompute(Cast.scala:30)
 at 
 org.apache.spark.sql.catalyst.expressions.Cast.resolved(Cast.scala:30)
 at 
 org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:68)
 at 
 org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:68)
 at 
 scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:80)
 at scala.collection.immutable.List.exists(List.scala:84)
 at 
 org.apache.spark.sql.catalyst.expressions.Expression.childrenResolved(Expression.scala:68)
 at 
 org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:56)
 at 
 org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:56)
 at 
 org.apache.spark.sql.catalyst.expressions.NamedExpression.typeSuffix(namedExpressions.scala:62)
 at 
 org.apache.spark.sql.catalyst.expressions.Alias.toString(namedExpressions.scala:124)
 at 
 org.apache.spark.sql.catalyst.expressions.Expression.prettyString(Expression.scala:78)
 at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1$$anonfun$7.apply(Analyzer.scala:83)
 at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1$$anonfun$7.apply(Analyzer.scala:83)
 at scala.collection.immutable.Stream.map(Stream.scala:376)
 at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:83)
 at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:81)
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:204)
 at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:81)
 at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:79)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5795) api.java.JavaPairDStream.saveAsNewAPIHadoopFiles may not friendly to java

2015-02-13 Thread Littlestar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320083#comment-14320083
 ] 

Littlestar commented on SPARK-5795:
---

error info...
The method saveAsNewAPIHadoopFiles(String, String, Class<?>, Class<?>, Class<? 
extends OutputFormat<?,?>>) in the type JavaPairDStream<Integer,Integer> is not 
applicable for the arguments (String, String, Class<Integer>, Class<Integer>, 
Class<TextOutputFormat>)




 api.java.JavaPairDStream.saveAsNewAPIHadoopFiles may not friendly to java
 -

 Key: SPARK-5795
 URL: https://issues.apache.org/jira/browse/SPARK-5795
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.1
Reporter: Littlestar
Priority: Minor

 import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
 The following code can't compile in Java:
 JavaPairDStream<Integer, Integer> rs = ...
 rs.saveAsNewAPIHadoopFiles("prefix", "txt", Integer.class, Integer.class, 
 TextOutputFormat.class, jobConf);
 but similar code in JavaPairRDD works OK:
 JavaPairRDD<String, String> counts = ...
 counts.saveAsNewAPIHadoopFile("out", Text.class, Text.class, 
 TextOutputFormat.class, jobConf);
 
 Maybe the current definition
   def saveAsNewAPIHadoopFiles(
       prefix: String,
       suffix: String,
       keyClass: Class[_],
       valueClass: Class[_],
       outputFormatClass: Class[_ <: NewOutputFormat[_, _]],
       conf: Configuration = new Configuration) {
     dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass, 
       outputFormatClass, conf)
   }
 should be changed to
   def saveAsNewAPIHadoopFiles[F <: NewOutputFormat[_, _]](
       prefix: String,
       suffix: String,
       keyClass: Class[_],
       valueClass: Class[_],
       outputFormatClass: Class[F],
       conf: Configuration = new Configuration) {
     dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass, 
       outputFormatClass, conf)
   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5799) Compute aggregation function on specified numeric columns

2015-02-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320352#comment-14320352
 ] 

Apache Spark commented on SPARK-5799:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/4592

 Compute aggregation function on specified numeric columns
 -

 Key: SPARK-5799
 URL: https://issues.apache.org/jira/browse/SPARK-5799
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh
Priority: Minor

 Compute aggregation function on specified numeric columns. For example:
 val df = Seq(("a", 1, 0, "b"), ("b", 2, 4, "c"), ("a", 2, 3, 
 "d")).toDataFrame("key", "value1", "value2", "rest")
 df.groupBy("key").min("value2")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5799) Compute aggregation function on specified numeric columns

2015-02-13 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-5799:
--

 Summary: Compute aggregation function on specified numeric columns
 Key: SPARK-5799
 URL: https://issues.apache.org/jira/browse/SPARK-5799
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh
Priority: Minor


Compute aggregation function on specified numeric columns. For example:

val df = Seq(("a", 1, 0, "b"), ("b", 2, 4, "c"), ("a", 2, 3, 
"d")).toDataFrame("key", "value1", "value2", "rest")

df.groupBy("key").min("value2")






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5798) Spark shell issue

2015-02-13 Thread DeepakVohra (JIRA)
DeepakVohra created SPARK-5798:
--

 Summary: Spark shell issue
 Key: SPARK-5798
 URL: https://issues.apache.org/jira/browse/SPARK-5798
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 1.2.0
 Environment: Spark 1.2
Scala 2.10.4
Reporter: DeepakVohra


The Spark shell terminates when Spark code is run, indicating an issue with the 
Spark shell.

The error is coming from the spark-shell script:
 
  /apachespark/spark-1.2.0-bin-cdh4/bin/spark-shell: line 48
 
  "$FWDIR"/bin/spark-submit --class org.apache.spark.repl.Main
  "${SUBMISSION_OPTS[@]}" spark-shell "${APPLICATION_OPTS[@]}"




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5782) Python Worker / Pyspark Daemon Memory Issue

2015-02-13 Thread Mark Khaitman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320309#comment-14320309
 ] 

Mark Khaitman commented on SPARK-5782:
--

Would it make sense to instead have _next_limit return the MIN of the two 
values, as opposed to the MAX?
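
To make the ratchet concrete, a hedged arithmetic sketch (not shuffle.py
itself; the numbers mirror the 512 vs 511 example from the description):

{code}
// With max(), once usage nears the limit the 1.05 multiplier wins forever and
// the configured cap never binds; with min(), the cap stays the ceiling.
long limitMb = 512;
double usedMb = 511;
for (int step = 0; step < 5; step++) {
    double nextWithMax = Math.max(limitMb, usedMb * 1.05); // current behaviour
    double nextWithMin = Math.min(limitMb, usedMb * 1.05); // proposed behaviour
    System.out.printf("step %d: max=%.0f MB, min=%.0f MB%n", step, nextWithMax, nextWithMin);
    usedMb = nextWithMax; // the worker grows to whatever it is allowed
}
{code}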

 Python Worker / Pyspark Daemon Memory Issue
 ---

 Key: SPARK-5782
 URL: https://issues.apache.org/jira/browse/SPARK-5782
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Shuffle
Affects Versions: 1.3.0, 1.2.1, 1.2.2
 Environment: CentOS 7, Spark Standalone
Reporter: Mark Khaitman

 I'm including the Shuffle component on this, as a brief scan through the code 
 (which I'm not 100% familiar with just yet) shows a large amount of memory 
 handling in it:
 It appears that any type of join between two RDDs spawns twice as many 
 pyspark.daemon workers compared to the default 1 task - 1 core configuration 
 in our environment. This can become problematic in cases where you build 
 up a tree of RDD joins, since the pyspark.daemons do not cease to exist until 
 the top-level join is completed (or so it seems)... This can lead to memory 
 exhaustion by a single framework, even though it is set to have a 512MB python 
 worker memory limit and a few gigs of executor memory.
 Another related issue is that the individual python workers are not 
 supposed to exceed 512MB by much; otherwise they're supposed to 
 spill to disk.
 Some of our python workers are somehow reaching 2GB each (which is then 
 multiplied by the number of cores per executor * the number of joins 
 occurring in some cases), causing the Out-of-Memory killer to step up to its 
 unfortunate job! :(
 I think with the _next_limit method in shuffle.py, if the current memory 
 usage is close to the memory limit, then the 1.05 multiplier can endlessly 
 cause more memory to be consumed by the single python worker, since the max 
 of (512 vs 511 * 1.05) would end up blowing up towards the latter of the 
 two... Shouldn't the memory limit be the absolute cap in this case?
 I've only just started looking into the code, and would definitely love to 
 contribute to Spark, though I figured it might be quicker to resolve if 
 someone already owns the code!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5782) Python Worker / Pyspark Daemon Memory Issue

2015-02-13 Thread Mark Khaitman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Khaitman updated SPARK-5782:
-
Priority: Critical  (was: Major)

 Python Worker / Pyspark Daemon Memory Issue
 ---

 Key: SPARK-5782
 URL: https://issues.apache.org/jira/browse/SPARK-5782
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Shuffle
Affects Versions: 1.3.0, 1.2.1, 1.2.2
 Environment: CentOS 7, Spark Standalone
Reporter: Mark Khaitman
Priority: Critical

 I'm including the Shuffle component on this, as a brief scan through the code 
 (which I'm not 100% familiar with just yet) shows a large amount of memory 
 handling in it:
 It appears that any type of join between two RDDs spawns twice as many 
 pyspark.daemon workers compared to the default 1 task - 1 core configuration 
 in our environment. This can become problematic in cases where you build 
 up a tree of RDD joins, since the pyspark.daemons do not cease to exist until 
 the top-level join is completed (or so it seems)... This can lead to memory 
 exhaustion by a single framework, even though it is set to have a 512MB python 
 worker memory limit and a few gigs of executor memory.
 Another related issue is that the individual python workers are not 
 supposed to exceed 512MB by much; otherwise they're supposed to 
 spill to disk.
 Some of our python workers are somehow reaching 2GB each (which is then 
 multiplied by the number of cores per executor * the number of joins 
 occurring in some cases), causing the Out-of-Memory killer to step up to its 
 unfortunate job! :(
 I think with the _next_limit method in shuffle.py, if the current memory 
 usage is close to the memory limit, then the 1.05 multiplier can endlessly 
 cause more memory to be consumed by the single python worker, since the max 
 of (512 vs 511 * 1.05) would end up blowing up towards the latter of the 
 two... Shouldn't the memory limit be the absolute cap in this case?
 I've only just started looking into the code, and would definitely love to 
 contribute to Spark, though I figured it might be quicker to resolve if 
 someone already owns the code!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5726) Hadamard Vector Product Transformer

2015-02-13 Thread Octavian Geagla (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320514#comment-14320514
 ] 

Octavian Geagla commented on SPARK-5726:


Ok, I've made the change on the PR.  Thanks, Sean!

 Hadamard Vector Product Transformer
 ---

 Key: SPARK-5726
 URL: https://issues.apache.org/jira/browse/SPARK-5726
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Reporter: Octavian Geagla
Assignee: Octavian Geagla

 I originally posted my idea here: 
 http://apache-spark-developers-list.1001551.n3.nabble.com/Any-interest-in-weighting-VectorTransformer-which-does-component-wise-scaling-td10265.html
 A draft of this feature is implemented, documented, and tested already.  Code 
 is on a branch on my fork here: 
 https://github.com/ogeagla/spark/compare/spark-mllib-weighting
 I'm curious if there is any interest in this feature, in which case I'd 
 appreciate some feedback.  One thing that might be useful is an example/test 
 case using the transformer within the ML pipeline, since there are not any 
 examples which use Vectors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5345) Fix unstable test case in FsHistoryProviderSuite

2015-02-13 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-5345.
---
Resolution: Fixed

It looks like this has been fixed by SPARK-5600, so I'm going to resolve this 
for now.  Let's re-open if the test becomes flaky again.

 Fix unstable test case in FsHistoryProviderSuite
 

 Key: SPARK-5345
 URL: https://issues.apache.org/jira/browse/SPARK-5345
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Web UI
Affects Versions: 1.3.0
Reporter: Kousuke Saruta
  Labels: flaky-test

 In FsHistoryProviderSuite, the test "Parse new and old application logs" 
 sometimes fails and sometimes succeeds. It's unstable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5735) Replace uses of EasyMock with Mockito

2015-02-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-5735.

   Resolution: Fixed
Fix Version/s: 1.3.0

 Replace uses of EasyMock with Mockito
 -

 Key: SPARK-5735
 URL: https://issues.apache.org/jira/browse/SPARK-5735
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Reporter: Patrick Wendell
Assignee: Josh Rosen
 Fix For: 1.3.0


 There are a few reasons we should drop EasyMock. First, we should have a 
 single mocking framework in our tests in general to keep things consistent. 
 Second, EasyMock has caused us some dependency pain in our tests due to 
 objenesis. We aren't totally sure but suspect such conflicts might be causing 
 non deterministic test failures.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5802) Cache scaled data in GLM

2015-02-13 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-5802:


 Summary: Cache scaled data in GLM
 Key: SPARK-5802
 URL: https://issues.apache.org/jira/browse/SPARK-5802
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


If we modify the input data (to append bias or to scale features), we should 
cache the output to avoid recomputing transformed vectors each time.
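
As a rough sketch of the intent (using MLlib's StandardScaler for the scaling 
step; the helper name is illustrative, not the actual patch):

{code}
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Scale the features once, cache the transformed RDD, and let every
// optimizer iteration reuse the cached vectors instead of re-scaling them.
def scaleAndCache(data: RDD[LabeledPoint]): RDD[LabeledPoint] = {
  val scaler = new StandardScaler(withMean = false, withStd = true)
    .fit(data.map(_.features))
  val scaled = data.map(lp => LabeledPoint(lp.label, scaler.transform(lp.features)))
  scaled.cache()
  scaled
}
{code}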



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5770) Use addJar() to upload a new jar file to executor, it can't be added to classloader

2015-02-13 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320504#comment-14320504
 ] 

Marcelo Vanzin commented on SPARK-5770:
---

bq. but the classloader still loads the old one.

Could you clarify what that means? Due to the way class loading works, if you 
reference a class that has already been loaded, you won't get the new one, but 
the one already loaded. Which is one reason why this "addJar() can overwrite 
existing jars" functionality is a little sketchy.

 Use addJar() to upload a new jar file to executor, it can't be added to 
 classloader
 ---

 Key: SPARK-5770
 URL: https://issues.apache.org/jira/browse/SPARK-5770
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: meiyoula

 First use addJar() to upload a jar to the executor, then change the jar 
 content and upload it again. We can see that the local jar file has been 
 updated, but the classloader still loads the old one. The executor log has 
 no error or exception pointing to it.
 I used spark-shell to test this, with spark.files.overwrite set to true.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5785) Pyspark does not support narrow dependencies

2015-02-13 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-5785:

Description: 
joins (& cogroups etc.) are always considered to have wide dependencies in 
pyspark; they are never narrow.  This can cause unnecessary shuffles.  E.g., 
this simple job should shuffle rddA & rddB once each, but it will also do a 
third shuffle of the unioned data:

{code}
rddA = sc.parallelize(range(100)).map(lambda x: (x,x)).partitionBy(64)

rddB = sc.parallelize(range(100)).map(lambda x: (x,x)).partitionBy(64)

joined = rddA.join(rddB)
joined.count()

>>> rddA._partitionFunc == rddB._partitionFunc
True
{code}


(Or the docs should somewhere explain that this feature is missing from 
pyspark.)

  was:
joins (& cogroups etc.) are always considered to have wide dependencies in 
pyspark; they are never narrow.  This can cause unnecessary shuffles.  E.g., 
this simple job should shuffle rddA & rddB once each, but it will also do a 
third shuffle of the unioned data:

{code}
rddA = sc.parallelize(range(100)).map(lambda x: (x,x)).partitionBy(64)

rddB = sc.parallelize(range(100)).map(lambda x: (x,x)).partitionBy(64)

joined = rddA.join(rddB)
joined.count()

>>> rddA._partitionFunc == rddB._partitionFunc
True
{code}


(Or the docs should somewhere explain that this feature is missing from spark.)


 Pyspark does not support narrow dependencies
 

 Key: SPARK-5785
 URL: https://issues.apache.org/jira/browse/SPARK-5785
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Imran Rashid

 joins (& cogroups etc.) are always considered to have wide dependencies in 
 pyspark; they are never narrow.  This can cause unnecessary shuffles.  E.g., 
 this simple job should shuffle rddA & rddB once each, but it will also do a 
 third shuffle of the unioned data:
 {code}
 rddA = sc.parallelize(range(100)).map(lambda x: (x,x)).partitionBy(64)
 rddB = sc.parallelize(range(100)).map(lambda x: (x,x)).partitionBy(64)
 joined = rddA.join(rddB)
 joined.count()
  >>> rddA._partitionFunc == rddB._partitionFunc
 True
 {code}
 (Or the docs should somewhere explain that this feature is missing from 
 pyspark.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5801) Shuffle creates too many nested directories

2015-02-13 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-5801:
--
Component/s: Shuffle

 Shuffle creates too many nested directories
 ---

 Key: SPARK-5801
 URL: https://issues.apache.org/jira/browse/SPARK-5801
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 1.2.1
Reporter: Kay Ousterhout

 When running Spark on EC2, there are 4 nested shuffle directories before the 
 hashed directory names, for example:
 /mnt/spark/spark-5824d912-25af-4187-bc6a-29ae42cd78e5/spark-675133f0-b2c8-44a1-8775-5e394674609b/spark-69c1ea15-4e7f-454a-9f57-19763c7bdd17/spark-b036335c-60fa-48ab-a346-f1b420af2027/0c
 My understanding is that this should look like:
 /mnt/spark/spark-5824d912-25af-4187-bc6a-29ae42cd78e5/0c
 This happened when I was using the sort-based shuffle (all default 
 configurations for Spark on EC2).
 This is not a correctness problem (the shuffle still works fine).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4903) RDD remains cached after DROP TABLE

2015-02-13 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320439#comment-14320439
 ] 

Yin Huai commented on SPARK-4903:
-

I believe that it has been resolved in 1.3 ([see 
this|https://github.com/apache/spark/blob/v1.3.0-snapshot1/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala#L61]).
 I tried the following snippet in "build/sbt -Phive sparkShell" and verified 
that the cached RDD was unpersisted after I dropped the table. 

{code}
sqlContext.jsonRDD(sc.parallelize("""{"a":1}""" :: Nil)).registerTempTable("test")
sqlContext.sql("create table jt as select a from test")
sqlContext.sql("cache table jt").collect
sqlContext.sql("select * from jt").collect
sqlContext.sql("drop table jt").collect
{code}

 RDD remains cached after DROP TABLE
 -

 Key: SPARK-4903
 URL: https://issues.apache.org/jira/browse/SPARK-4903
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
 Environment: Spark master @ Dec 17 
 (3cd516191baadf8496ccdae499771020e89acd7e)
Reporter: Evert Lammerts
Priority: Critical

 In beeline, when I run:
 {code:sql}
 CREATE TABLE test AS select col from table;
 CACHE TABLE test
 DROP TABLE test
 {code}
 The table is removed, but the RDD is still cached. Running UNCACHE is not 
 possible anymore (the table is not found in the metastore).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5732) Add an option to print the spark version in spark script

2015-02-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-5732.

   Resolution: Fixed
Fix Version/s: 1.3.0
 Assignee: uncleGen

 Add an option to print the spark version in spark script
 

 Key: SPARK-5732
 URL: https://issues.apache.org/jira/browse/SPARK-5732
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 1.0.0
Reporter: uncleGen
Assignee: uncleGen
Priority: Minor
 Fix For: 1.3.0


 Naturally, we may need to add an option to print the spark version in the 
 spark script. It is a pretty common option in many scripting tools.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-5782) Python Worker / Pyspark Daemon Memory Issue

2015-02-13 Thread Mark Khaitman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Khaitman updated SPARK-5782:
-
Comment: was deleted

(was: Would it make sense to instead make the _next_limit return the MIN of the 
2 values as opposed to the MAX?)

 Python Worker / Pyspark Daemon Memory Issue
 ---

 Key: SPARK-5782
 URL: https://issues.apache.org/jira/browse/SPARK-5782
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Shuffle
Affects Versions: 1.3.0, 1.2.1, 1.2.2
 Environment: CentOS 7, Spark Standalone
Reporter: Mark Khaitman

 I'm including the Shuffle component on this, as a brief scan through the code 
 (which I'm not 100% familiar with just yet) shows a large amount of memory 
 handling in it:
 It appears that any type of join between two RDDs spawns twice as many 
 pyspark.daemon workers as the default 1 task - 1 core configuration 
 in our environment. This can become problematic when you build 
 up a tree of RDD joins, since the pyspark.daemons do not cease to exist 
 until the top-level join is completed (or so it seems)... This can lead to 
 memory exhaustion by a single framework, even though it is set to have a 
 512MB python worker memory limit and a few gigs of executor memory.
 A related issue is that the individual python workers are not supposed to 
 exceed 512MB by much in the first place; past that point they're supposed 
 to spill to disk.
 Some of our python workers are somehow reaching 2GB each (which, multiplied 
 by the number of cores per executor and the number of joins occurring in 
 some cases, adds up quickly), causing the Out-of-Memory killer to step up 
 to its unfortunate job! :(
 I think with the _next_limit method in shuffle.py, if the current memory 
 usage is close to the memory limit, then the 1.05 multiplier can endlessly 
 cause more memory to be consumed by the single python worker, since 
 max(512, 511 * 1.05) resolves to the latter and keeps growing on every 
 call... Shouldn't the memory limit be the absolute cap in this case?
 I've only just started looking into the code, and would definitely love to 
 contribute towards Spark, though I figured it might be quicker to resolve if 
 someone already owns the code!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5529) Executor is still hold while BlockManager has been removed

2015-02-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5529:
-
Component/s: YARN

 Executor is still hold while BlockManager has been removed
 --

 Key: SPARK-5529
 URL: https://issues.apache.org/jira/browse/SPARK-5529
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Hong Shen
 Attachments: SPARK-5529.patch


 When I run a spark job, one executor hangs; after 120s its blockManager is 
 removed by the driver, but it takes another half hour before the executor 
 itself is removed by the driver. Here is the log:
 {code}
 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms 
 exceeds 120000ms
 
 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 
 10.215.143.14: remote Akka client disassociated
 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote 
 system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is 
 now gated for [5000] ms. Reason is: [Disassociated].
 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 
 0.0
 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 
 10.215.143.14): ExecutorLostFailure (executor 1 lost)
 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove 
 non-existent executor 1
 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0)
 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 
 from BlockManagerMaster.
 15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in 
 removeExecutor
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5529) BlockManager heartbeat expiration does not kill executor

2015-02-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5529:
-
Assignee: Hong Shen

 BlockManager heartbeat expiration does not kill executor
 

 Key: SPARK-5529
 URL: https://issues.apache.org/jira/browse/SPARK-5529
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Hong Shen
Assignee: Hong Shen
 Attachments: SPARK-5529.patch


 When I run a spark job, one executor hangs; after 120s its blockManager is 
 removed by the driver, but it takes another half hour before the executor 
 itself is removed by the driver. Here is the log:
 {code}
 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms 
 exceeds 120000ms
 
 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 
 10.215.143.14: remote Akka client disassociated
 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote 
 system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is 
 now gated for [5000] ms. Reason is: [Disassociated].
 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 
 0.0
 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 
 10.215.143.14): ExecutorLostFailure (executor 1 lost)
 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove 
 non-existent executor 1
 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0)
 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 
 from BlockManagerMaster.
 15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in 
 removeExecutor
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5296) Predicate Pushdown (BaseRelation) to have an interface that will accept OR filters

2015-02-13 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320499#comment-14320499
 ] 

Michael Armbrust commented on SPARK-5296:
-

Oh, good point... We should pass down nested ANDs
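
Something along these lines, sketched against the data source Filter API 
(flattenAnds is a hypothetical helper, not an existing method):

{code}
import org.apache.spark.sql.sources.{And, Filter}

// Recursively flatten a nested AND tree so that a AND (b AND c)
// reaches the relation as Seq(a, b, c).
def flattenAnds(filter: Filter): Seq[Filter] = filter match {
  case And(left, right) => flattenAnds(left) ++ flattenAnds(right)
  case other            => Seq(other)
}
{code}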

 Predicate Pushdown (BaseRelation) to have an interface that will accept OR 
 filters
 --

 Key: SPARK-5296
 URL: https://issues.apache.org/jira/browse/SPARK-5296
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Corey J. Nolet
Assignee: Cheng Lian
Priority: Critical

 Currently, the BaseRelation API allows a FilteredRelation to handle an 
 Array[Filter] which represents filter expressions that are applied as an AND 
 operator.
 We should support OR operations in a BaseRelation as well. I'm not sure what 
 this would look like in terms of API changes, but it almost seems like a 
 FilteredUnionedScan BaseRelation (the name stinks but you get the idea) would 
 be useful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5770) Use addJar() to upload a new jar file to executor, it can't be added to classloader

2015-02-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320510#comment-14320510
 ] 

Sean Owen commented on SPARK-5770:
--

Yeah, I think that's the point: overwriting an existing JAR won't cause any 
classes to be reloaded. So, should it be an error, or a warning?

 Use addJar() to upload a new jar file to executor, it can't be added to 
 classloader
 ---

 Key: SPARK-5770
 URL: https://issues.apache.org/jira/browse/SPARK-5770
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: meiyoula

 First use addJar() to upload a jar to the executor, then change the jar 
 content and upload it again. We can see that the local jar file has been 
 updated, but the classloader still loads the old one. The executor log has 
 no error or exception pointing to it.
 I used spark-shell to test this, with spark.files.overwrite set to true.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5626) Spurious test failures due to NullPointerException in EasyMock test code

2015-02-13 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-5626.
---
Resolution: Fixed

 Spurious test failures due to NullPointerException in EasyMock test code
 

 Key: SPARK-5626
 URL: https://issues.apache.org/jira/browse/SPARK-5626
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.3.0
Reporter: Josh Rosen
  Labels: flaky-test
 Attachments: consoleText.txt


 I've seen a few cases where a test failure will trigger a cascade of spurious 
 failures when instantiating test suites that use EasyMock.  Here's a sample 
 symptom:
 {code}
 [info] CacheManagerSuite:
 [info] Exception encountered when attempting to run a suite with class name: 
 org.apache.spark.CacheManagerSuite *** ABORTED *** (137 milliseconds)
 [info]   java.lang.NullPointerException:
 [info]   at 
 org.objenesis.strategy.StdInstantiatorStrategy.newInstantiatorOf(StdInstantiatorStrategy.java:52)
 [info]   at 
 org.objenesis.ObjenesisBase.getInstantiatorOf(ObjenesisBase.java:90)
 [info]   at org.objenesis.ObjenesisBase.newInstance(ObjenesisBase.java:73)
 [info]   at org.objenesis.ObjenesisHelper.newInstance(ObjenesisHelper.java:43)
 [info]   at 
 org.easymock.internal.ObjenesisClassInstantiator.newInstance(ObjenesisClassInstantiator.java:26)
 [info]   at 
 org.easymock.internal.ClassProxyFactory.createProxy(ClassProxyFactory.java:219)
 [info]   at 
 org.easymock.internal.MocksControl.createMock(MocksControl.java:59)
 [info]   at org.easymock.EasyMock.createMock(EasyMock.java:103)
 [info]   at 
 org.scalatest.mock.EasyMockSugar$class.mock(EasyMockSugar.scala:267)
 [info]   at 
 org.apache.spark.CacheManagerSuite.mock(CacheManagerSuite.scala:28)
 [info]   at 
 org.apache.spark.CacheManagerSuite$$anonfun$1.apply$mcV$sp(CacheManagerSuite.scala:40)
 [info]   at 
 org.apache.spark.CacheManagerSuite$$anonfun$1.apply(CacheManagerSuite.scala:38)
 [info]   at 
 org.apache.spark.CacheManagerSuite$$anonfun$1.apply(CacheManagerSuite.scala:38)
 [info]   at 
 org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:195)
 [info]   at 
 org.apache.spark.CacheManagerSuite.runTest(CacheManagerSuite.scala:28)
 [info]   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
 [info]   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
 [info]   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
 [info]   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
 [info]   at scala.collection.immutable.List.foreach(List.scala:318)
 [info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
 [info]   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
 [info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
 [info]   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
 [info]   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
 [info]   at org.scalatest.Suite$class.run(Suite.scala:1424)
 [info]   at 
 org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
 [info]   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
 [info]   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
 [info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
 [info]   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
 [info]   at 
 org.apache.spark.CacheManagerSuite.org$scalatest$BeforeAndAfter$$super$run(CacheManagerSuite.scala:28)
 [info]   at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)
 [info]   at org.apache.spark.CacheManagerSuite.run(CacheManagerSuite.scala:28)
 [info]   at 
 org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
 [info]   at 
 org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
 [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:294)
 [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:284)
 [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
 [info]   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 [info]   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 [info]   at java.lang.Thread.run(Thread.java:745)
 {code}
 This is from 
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26852/consoleFull.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5803) Use ArrayBuilder instead of ArrayBuffer for primitive types

2015-02-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320528#comment-14320528
 ] 

Apache Spark commented on SPARK-5803:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/4594

 Use ArrayBuilder instead of ArrayBuffer for primitive types
 ---

 Key: SPARK-5803
 URL: https://issues.apache.org/jira/browse/SPARK-5803
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 ArrayBuffer is not specialized and hence it boxes primitive-typed values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5798) Spark shell issue

2015-02-13 Thread DeepakVohra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320533#comment-14320533
 ] 

DeepakVohra commented on SPARK-5798:


Thanks Sean for testing. 

Not all Spark/Scala code generates an error in Spark Shell. 

For example, run all pre-requisite import, var, and method code and 
subsequently run the following code to test:
model(sc, rawUserArtistData, rawArtistData, rawArtistAlias)

from:
https://github.com/sryza/aas/blob/master/ch03-recommender/src/main/scala/com/cloudera/datascience/recommender/RunRecommender.scala

Data files are local to Spark/Scala and not in HDFS. 

Environment is different: Oracle Linux 6.5, but that shouldn't be a factor. 

If the preceding test also does not generate an error, I would agree it is 
some other factor and not a bug. 

 Spark shell issue
 -

 Key: SPARK-5798
 URL: https://issues.apache.org/jira/browse/SPARK-5798
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 1.2.0
 Environment: Spark 1.2
 Scala 2.10.4
Reporter: DeepakVohra

 The Spark shell terminates when Spark code is run indicating an issue with 
 Spark shell.
 The error is coming from the spark shell file
  
   /apachespark/spark-1.2.0-bin-cdh4/bin/spark-shell: line 48
  
   "$FWDIR"/bin/spark-submit --class org.apache.spark.repl.Main
   "${SUBMISSION_OPTS[@]}" spark-shell "${APPLICATION_OPTS[@]}"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5503) Example code for Power Iteration Clustering

2015-02-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-5503.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

 Example code for Power Iteration Clustering
 ---

 Key: SPARK-5503
 URL: https://issues.apache.org/jira/browse/SPARK-5503
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, Examples, MLlib
Reporter: Xiangrui Meng
Assignee: Stephen Boesch
 Fix For: 1.3.0


 There are two places we need to put examples:
 1. In the user guide, we should have a small example (as in the unit test).
 2. Under examples/, we can have something fancy but still need to keep it 
 minimal.
 3. The user guide contains some out-of-date links, which needs to be updated 
 as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5801) Shuffle creates too many nested directories

2015-02-13 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-5801:
-

 Summary: Shuffle creates too many nested directories
 Key: SPARK-5801
 URL: https://issues.apache.org/jira/browse/SPARK-5801
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.1
Reporter: Kay Ousterhout


When running Spark on EC2, there are 4 nested shuffle directories before the 
hashed directory names, for example:

/mnt/spark/spark-5824d912-25af-4187-bc6a-29ae42cd78e5/spark-675133f0-b2c8-44a1-8775-5e394674609b/spark-69c1ea15-4e7f-454a-9f57-19763c7bdd17/spark-b036335c-60fa-48ab-a346-f1b420af2027/0c

My understanding is that this should look like:

/mnt/spark/spark-5824d912-25af-4187-bc6a-29ae42cd78e5/0c

This happened when I was using the sort-based shuffle (all default 
configurations for Spark on EC2).

This is not a correctness problem (the shuffle still works fine).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5529) BlockManager heartbeat expiration does not kill executor

2015-02-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5529:
-
Summary: BlockManager heartbeat expiration does not kill executor  (was: 
Executor is still hold while BlockManager has been removed)

 BlockManager heartbeat expiration does not kill executor
 

 Key: SPARK-5529
 URL: https://issues.apache.org/jira/browse/SPARK-5529
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Hong Shen
 Attachments: SPARK-5529.patch


 When I run a spark job, one executor hangs; after 120s its blockManager is 
 removed by the driver, but it takes another half hour before the executor 
 itself is removed by the driver. Here is the log:
 {code}
 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms 
 exceeds 120000ms
 
 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 
 10.215.143.14: remote Akka client disassociated
 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote 
 system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is 
 now gated for [5000] ms. Reason is: [Disassociated].
 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 
 0.0
 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 
 10.215.143.14): ExecutorLostFailure (executor 1 lost)
 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove 
 non-existent executor 1
 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0)
 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 
 from BlockManagerMaster.
 15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in 
 removeExecutor
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5626) Spurious test failures due to NullPointerException in EasyMock test code

2015-02-13 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320519#comment-14320519
 ] 

Josh Rosen commented on SPARK-5626:
---

This should hopefully be fixed now that I've merged SPARK-5735 to remove 
EasyMock.  I'm going to resolve this issue for now, but let's re-open it if we 
observe this flakiness again.

 Spurious test failures due to NullPointerException in EasyMock test code
 

 Key: SPARK-5626
 URL: https://issues.apache.org/jira/browse/SPARK-5626
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.3.0
Reporter: Josh Rosen
  Labels: flaky-test
 Attachments: consoleText.txt


 I've seen a few cases where a test failure will trigger a cascade of spurious 
 failures when instantiating test suites that use EasyMock.  Here's a sample 
 symptom:
 {code}
 [info] CacheManagerSuite:
 [info] Exception encountered when attempting to run a suite with class name: 
 org.apache.spark.CacheManagerSuite *** ABORTED *** (137 milliseconds)
 [info]   java.lang.NullPointerException:
 [info]   at 
 org.objenesis.strategy.StdInstantiatorStrategy.newInstantiatorOf(StdInstantiatorStrategy.java:52)
 [info]   at 
 org.objenesis.ObjenesisBase.getInstantiatorOf(ObjenesisBase.java:90)
 [info]   at org.objenesis.ObjenesisBase.newInstance(ObjenesisBase.java:73)
 [info]   at org.objenesis.ObjenesisHelper.newInstance(ObjenesisHelper.java:43)
 [info]   at 
 org.easymock.internal.ObjenesisClassInstantiator.newInstance(ObjenesisClassInstantiator.java:26)
 [info]   at 
 org.easymock.internal.ClassProxyFactory.createProxy(ClassProxyFactory.java:219)
 [info]   at 
 org.easymock.internal.MocksControl.createMock(MocksControl.java:59)
 [info]   at org.easymock.EasyMock.createMock(EasyMock.java:103)
 [info]   at 
 org.scalatest.mock.EasyMockSugar$class.mock(EasyMockSugar.scala:267)
 [info]   at 
 org.apache.spark.CacheManagerSuite.mock(CacheManagerSuite.scala:28)
 [info]   at 
 org.apache.spark.CacheManagerSuite$$anonfun$1.apply$mcV$sp(CacheManagerSuite.scala:40)
 [info]   at 
 org.apache.spark.CacheManagerSuite$$anonfun$1.apply(CacheManagerSuite.scala:38)
 [info]   at 
 org.apache.spark.CacheManagerSuite$$anonfun$1.apply(CacheManagerSuite.scala:38)
 [info]   at 
 org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:195)
 [info]   at 
 org.apache.spark.CacheManagerSuite.runTest(CacheManagerSuite.scala:28)
 [info]   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
 [info]   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
 [info]   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
 [info]   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
 [info]   at scala.collection.immutable.List.foreach(List.scala:318)
 [info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
 [info]   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
 [info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
 [info]   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
 [info]   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
 [info]   at org.scalatest.Suite$class.run(Suite.scala:1424)
 [info]   at 
 org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
 [info]   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
 [info]   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
 [info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
 [info]   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
 [info]   at 
 org.apache.spark.CacheManagerSuite.org$scalatest$BeforeAndAfter$$super$run(CacheManagerSuite.scala:28)
 [info]   at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)
 [info]   at org.apache.spark.CacheManagerSuite.run(CacheManagerSuite.scala:28)
 [info]   at 
 org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
 [info]   at 
 org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
 [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:294)
 [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:284)
 [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
 [info]   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 [info]   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 [info]   at java.lang.Thread.run(Thread.java:745)
 {code}
 This is from 
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26852/consoleFull.



--
This message was sent by Atlassian JIRA

[jira] [Created] (SPARK-5803) Use ArrayBuilder instead of ArrayBuffer for primitive types

2015-02-13 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-5803:


 Summary: Use ArrayBuilder instead of ArrayBuffer for primitive 
types
 Key: SPARK-5803
 URL: https://issues.apache.org/jira/browse/SPARK-5803
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


ArrayBuffer is not specialized and hence it boxes primitive-typed values.
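
A quick illustration of the difference (sketch, not project code):

{code}
import scala.collection.mutable.{ArrayBuffer, ArrayBuilder}

val buf = ArrayBuffer[Double]()           // each append boxes the Double
buf += 1.0

val builder = ArrayBuilder.make[Double]() // specialized: writes primitives
builder += 1.0                            // directly into an Array[Double]
val arr: Array[Double] = builder.result()
{code}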



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5805) Fix the type error in the final example given in MLlib - Clustering documentation

2015-02-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320680#comment-14320680
 ] 

Apache Spark commented on SPARK-5805:
-

User 'emres' has created a pull request for this issue:
https://github.com/apache/spark/pull/4596

 Fix the type error in the final example given in MLlib - Clustering 
 documentation
 -

 Key: SPARK-5805
 URL: https://issues.apache.org/jira/browse/SPARK-5805
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Affects Versions: 1.2.0, 1.2.1
Reporter: Emre Sevinç
Priority: Minor
  Labels: documentation, easyfix, newbie
   Original Estimate: 1h
  Remaining Estimate: 1h

 The final example in [MLlib - 
 Clustering|http://spark.apache.org/docs/1.2.0/mllib-clustering.html] 
 documentation has a code line that leads to a type error. 
 The problematic line reads as:
 {code}
 model.predictOnValues(testData).print()
 {code}
 but it should be
 {code}
 model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5805) Fix the type error in the final example given in MLlib - Clustering documentation

2015-02-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5805:
-
Assignee: Emre Sevinç

 Fix the type error in the final example given in MLlib - Clustering 
 documentation
 -

 Key: SPARK-5805
 URL: https://issues.apache.org/jira/browse/SPARK-5805
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Affects Versions: 1.2.0, 1.2.1
Reporter: Emre Sevinç
Assignee: Emre Sevinç
Priority: Minor
  Labels: documentation, easyfix, newbie
   Original Estimate: 1h
  Remaining Estimate: 1h

 The final example in [MLlib - 
 Clustering|http://spark.apache.org/docs/1.2.0/mllib-clustering.html] 
 documentation has a code line that leads to a type error. 
 The problematic line reads as:
 {code}
 model.predictOnValues(testData).print()
 {code}
 but it should be
 {code}
 model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5746) INSERT OVERWRITE throws FileNotFoundException when the source and destination point to the same table.

2015-02-13 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320562#comment-14320562
 ] 

Yin Huai commented on SPARK-5746:
-

For now, we will throw an error when we find this case.

 INSERT OVERWRITE throws FileNotFoundException when the source and destination 
 point to the same table.
 --

 Key: SPARK-5746
 URL: https://issues.apache.org/jira/browse/SPARK-5746
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker

 With the newly introduced write support of data source API, {{JSONRelation}} 
 and {{ParquetRelation2}} both suffer this bug.
 The root cause is that we removed the source table before insertion 
 ([here|https://github.com/apache/spark/blob/1ac099e3e00ddb01af8e6e3a84c70f8363f04b5c/sql/core/src/main/scala/org/apache/spark/sql/json/JSONRelation.scala#L112-L121]).
 The correct solution is to first insert into a temporary folder, and then 
 overwrite the source table.
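
A hedged sketch of that idea (not the actual patch; writeTo stands in for the 
relation's insert logic):

{code}
import org.apache.hadoop.fs.{FileSystem, Path}

// Materialize the full result in a staging directory while the source data
// is still intact, then swap the staging directory into place.
def overwriteViaTemp(fs: FileSystem, dest: Path)(writeTo: Path => Unit): Unit = {
  val staging = new Path(dest.getParent, "." + dest.getName + "-staging")
  writeTo(staging)       // may still read from dest at this point
  fs.delete(dest, true)  // safe now: the result no longer depends on dest
  fs.rename(staging, dest)
}
{code}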



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5806) Organize sections in mllib-clustering.md

2015-02-13 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-5806:


 Summary: Organize sections in mllib-clustering.md
 Key: SPARK-5806
 URL: https://issues.apache.org/jira/browse/SPARK-5806
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, MLlib
Affects Versions: 1.3.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5804) Explicitly manage cache in Crossvalidation k-fold loop

2015-02-13 Thread Peter Rudenko (JIRA)
Peter Rudenko created SPARK-5804:


 Summary: Explicitly manage cache in Crossvalidation k-fold loop
 Key: SPARK-5804
 URL: https://issues.apache.org/jira/browse/SPARK-5804
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.3.0
Reporter: Peter Rudenko
Priority: Minor


On a big dataset, explicitly unpersisting the train and validation folds allows 
more data to be loaded into memory in the next loop iteration. On my 
environment (single node, 8 GB worker RAM, 2 GB dataset file, 3 folds for 
cross validation), this saved more than 5 minutes.
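
The loop shape, as a minimal sketch; train here is a hypothetical stand-in 
for the estimator-plus-evaluator pair:

{code}
import org.apache.spark.rdd.RDD

def crossValidate[T](folds: Seq[(RDD[T], RDD[T])],
                     train: RDD[T] => RDD[T] => Double): Seq[Double] =
  folds.map { case (training, validation) =>
    training.cache(); validation.cache()
    val metric = train(training)(validation)
    training.unpersist()   // release this fold's data so the next
    validation.unpersist() // fold's data can fit in memory
    metric
  }
{code}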



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5227) InputOutputMetricsSuite input metrics when reading text file with multiple splits test fails in branch-1.2 SBT Jenkins build w/hadoop1.0 and hadoop2.0 profiles

2015-02-13 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320653#comment-14320653
 ] 

Josh Rosen commented on SPARK-5227:
---

I think this might be caused by HADOOP-8490: the test code might be getting a 
cached FileSystem instance that was created by an earlier test run, causing the 
configuration from the earlier test to be re-used here.  We could try to 
completely disable this caching, but this could have a large negative 
performance impact on Hadoop library code which assumes that FileSystem 
creation is cheap.  I wonder if there's a way that we can clear this cache in 
between our test runs, which would at least address the test-flakiness issues.
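
Two possible ways to do that, sketched as assumptions rather than tested 
fixes:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

// 1) Drop every cached FileSystem instance between test runs:
FileSystem.closeAll()

// 2) Or disable caching only for the scheme under test, avoiding the
//    global cost of uncached FileSystem creation elsewhere:
val conf = new Configuration()
conf.setBoolean("fs.file.impl.disable.cache", true)
{code}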

 InputOutputMetricsSuite input metrics when reading text file with multiple 
 splits test fails in branch-1.2 SBT Jenkins build w/hadoop1.0 and hadoop2.0 
 profiles
 -

 Key: SPARK-5227
 URL: https://issues.apache.org/jira/browse/SPARK-5227
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.1
Reporter: Josh Rosen
Priority: Blocker
  Labels: flaky-test

 The InputOutputMetricsSuite "input metrics when reading text file with 
 multiple splits" test has been failing consistently in our new {{branch-1.2}} 
 Jenkins SBT build: 
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.2-SBT/14/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=centos/testReport/junit/org.apache.spark.metrics/InputOutputMetricsSuite/input_metrics_when_reading_text_file_with_multiple_splits/
 Here's the error message
 {code}
 ArrayBuffer(32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 
 ...)  [the same value repeated for several hundred more entries; the quoted 
 failure message is truncated at this point in the digest]

[jira] [Commented] (SPARK-5770) Use addJar() to upload a new jar file to executor, it can't be added to classloader

2015-02-13 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320552#comment-14320552
 ] 

Marcelo Vanzin commented on SPARK-5770:
---

It might be possible to fix the behavior, although even then the results might 
be sketchy.

Basically, when overwriting jars, you'd have to replace the executor's class 
loader. That means you need to keep track of the jars added to the class 
loader, and when adding a new jar, you place it in front of the others and use 
Thread.currentThread().setContextClassLoader() to replace the class loader.

But that's after like 5 seconds of thinking, so there may be a lot of corner 
cases in doing that. I think the best approach would be to say that overwriting 
jars is not allowed, even if that doesn't cover all cases. You could still add 
a different jar that tries to override already-loaded classes, and that will 
have the same confusing effect of the old classes still being used.
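
In sketch form (illustrative only, with the caveat noted in the comments):

{code}
import java.net.{URL, URLClassLoader}

// Build a new loader with the fresh jar ahead of the previously added ones
// and swap it onto the thread. Note that URLClassLoader delegates to its
// parent first, so classes the old loader already defined still win -- which
// is exactly the confusing behavior described above.
def swapInJar(newJar: URL, previousJars: Seq[URL]): Unit = {
  val loader = new URLClassLoader(
    (newJar +: previousJars).toArray,
    Thread.currentThread().getContextClassLoader)
  Thread.currentThread().setContextClassLoader(loader)
}
{code}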

 Use addJar() to upload a new jar file to executor, it can't be added to 
 classloader
 ---

 Key: SPARK-5770
 URL: https://issues.apache.org/jira/browse/SPARK-5770
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: meiyoula

 First use addJar() to upload a jar to the executor, then change the jar 
 content and upload it again. We can see that the local jar file has been 
 updated, but the classloader still loads the old one. The executor log has 
 no error or exception pointing to it.
 I used spark-shell to test this, with spark.files.overwrite set to true.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5804) Explicitly manage cache in Crossvalidation k-fold loop

2015-02-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320607#comment-14320607
 ] 

Apache Spark commented on SPARK-5804:
-

User 'petro-rudenko' has created a pull request for this issue:
https://github.com/apache/spark/pull/4595

 Explicitly manage cache in Crossvalidation k-fold loop
 --

 Key: SPARK-5804
 URL: https://issues.apache.org/jira/browse/SPARK-5804
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.3.0
Reporter: Peter Rudenko
Priority: Minor

 On a big dataset, explicitly unpersisting the train and validation folds 
 allows more data to be loaded into memory in the next loop iteration. On my 
 environment (single node, 8 GB worker RAM, 2 GB dataset file, 3 folds for 
 cross validation), this saved more than 5 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2015-02-13 Thread Chris Love (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320874#comment-14320874
 ] 

Chris Love commented on SPARK-3821:
---

I notice that the Packer-built AMI comes with Java 7; how would you recommend 
handling Java 8?  Should both be installed?  

Also, which AWS Linux were the new AMIs built off of?  Will this be in a 1.2.x 
branch or just 1.3?

Thanks

Chris

 Develop an automated way of creating Spark images (AMI, Docker, and others)
 ---

 Key: SPARK-3821
 URL: https://issues.apache.org/jira/browse/SPARK-3821
 Project: Spark
  Issue Type: Improvement
  Components: Build, EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
 Attachments: packer-proposal.html


 Right now the creation of Spark AMIs or Docker containers is done manually. 
 With tools like [Packer|http://www.packer.io/], we should be able to automate 
 this work, and do so in such a way that multiple types of machine images can 
 be created from a single template.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5779) Python broadcast does not work with Kryo serializer

2015-02-13 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320897#comment-14320897
 ] 

Davies Liu commented on SPARK-5779:
---

Yes, I will close it.

 Python broadcast does not work with Kryo serializer
 ---

 Key: SPARK-5779
 URL: https://issues.apache.org/jira/browse/SPARK-5779
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.3.0, 1.2.1
Reporter: Davies Liu
Priority: Critical

 The PythonBroadcast cannot be serialized by Kryo, which is introduced in 1.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5730) Group methods in the generated doc for spark.ml algorithms.

2015-02-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320908#comment-14320908
 ] 

Apache Spark commented on SPARK-5730:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/4600

 Group methods in the generated doc for spark.ml algorithms.
 ---

 Key: SPARK-5730
 URL: https://issues.apache.org/jira/browse/SPARK-5730
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Affects Versions: 1.3.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 In spark.ml, we have params and their setters/getters. It is nice to group 
 them in the generated docs. Params should be in the top, while 
 setters/getters should be at the bottom.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5812) Potential flaky test JavaAPISuite.glom

2015-02-13 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-5812:


 Summary: Potential flaky test JavaAPISuite.glom
 Key: SPARK-5812
 URL: https://issues.apache.org/jira/browse/SPARK-5812
 Project: Spark
  Issue Type: Bug
  Components: Java API, Spark Core
Affects Versions: 1.3.0
Reporter: Tathagata Das


https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27455/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5805) Fix the type error in the final example given in MLlib - Clustering documentation

2015-02-13 Thread JIRA
Emre Sevinç created SPARK-5805:
--

 Summary: Fix the type error in the final example given in MLlib - 
Clustering documentation
 Key: SPARK-5805
 URL: https://issues.apache.org/jira/browse/SPARK-5805
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Affects Versions: 1.2.1, 1.2.0
Reporter: Emre Sevinç
Priority: Minor


The final example in [MLlib - 
Clustering|http://spark.apache.org/docs/1.2.0/mllib-clustering.html] 
documentation has a code line that leads to a type error. 

The problematic line reads as:

{code}
model.predictOnValues(testData).print()
{code}

but it should be

{code}
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5806) Organize sections in mllib-clustering.md

2015-02-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5806:
-
Description: We separate code examples from algorithm descriptions. It 
would be better if we put the example code close to each algorithm description.

 Organize sections in mllib-clustering.md
 

 Key: SPARK-5806
 URL: https://issues.apache.org/jira/browse/SPARK-5806
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, MLlib
Affects Versions: 1.3.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 We separate code examples from algorithm descriptions. It would be better if 
 we put the example code close to each algorithm description.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5731) Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset

2015-02-13 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5731:
---
Priority: Blocker  (was: Major)

 Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic 
 stream receiving with multiple topics and smallest starting offset
 

 Key: SPARK-5731
 URL: https://issues.apache.org/jira/browse/SPARK-5731
 Project: Spark
  Issue Type: Bug
  Components: Streaming, Tests
Affects Versions: 1.3.0
Reporter: Patrick Wendell
Assignee: Tathagata Das
Priority: Blocker
  Labels: flaky-test

 {code}
 sbt.ForkMain$ForkError: The code passed to eventually never returned 
 normally. Attempted 110 times over 20.070287525 seconds. Last failure 
 message: 300 did not equal 48 didn't get all messages.
   at 
 org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
   at 
 org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
   at 
 org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49)
   at 
 org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307)
   at 
 org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply$mcV$sp(DirectKafkaStreamSuite.scala:110)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$runTest(DirectKafkaStreamSuite.scala:38)
   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.runTest(DirectKafkaStreamSuite.scala:38)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
   at org.scalatest.Suite$class.run(Suite.scala:1424)
   at 
 org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$run(DirectKafkaStreamSuite.scala:38)
   at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfterAll$$super$run(DirectKafkaStreamSuite.scala:38)
   at 
 org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
   at 
 

[jira] [Closed] (SPARK-5779) Python broadcast does not work with Kryo serializer

2015-02-13 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu closed SPARK-5779.
-
Resolution: Duplicate

 Python broadcast does not work with Kryo serializer
 ---

 Key: SPARK-5779
 URL: https://issues.apache.org/jira/browse/SPARK-5779
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.3.0, 1.2.1
Reporter: Davies Liu
Priority: Critical

 The PythonBroadcast cannot be serialized by Kryo, which is introduced in 1.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5227) InputOutputMetricsSuite input metrics when reading text file with multiple splits test fails in branch-1.2 SBT Jenkins build w/hadoop1.0 and hadoop2.0 profiles

2015-02-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320903#comment-14320903
 ] 

Apache Spark commented on SPARK-5227:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/4599

 InputOutputMetricsSuite input metrics when reading text file with multiple 
 splits test fails in branch-1.2 SBT Jenkins build w/hadoop1.0 and hadoop2.0 
 profiles
 -

 Key: SPARK-5227
 URL: https://issues.apache.org/jira/browse/SPARK-5227
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.1
Reporter: Josh Rosen
Priority: Blocker
  Labels: flaky-test

 The InputOutputMetricsSuite  input metrics when reading text file with 
 multiple splits test has been failing consistently in our new {{branch-1.2}} 
 Jenkins SBT build: 
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.2-SBT/14/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=centos/testReport/junit/org.apache.spark.metrics/InputOutputMetricsSuite/input_metrics_when_reading_text_file_with_multiple_splits/
 Here's the error message
 {code}
 ArrayBuffer(32, 32, 32, 32, ... [the value 32 repeated several hundred times; 
 the quoted output is truncated in the original message]

[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2015-02-13 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320905#comment-14320905
 ] 

Nicholas Chammas commented on SPARK-3821:
-

If you want Java 8 alongside 7, you can install both to separate paths. For 
spark-ec2's purposes, we only need 7.

The AMIs used as the base are [defined in the Packer 
template|https://github.com/nchammas/spark-ec2/blob/0f313de64ad9542d1a0f0d6f27131ca4bc01d8c3/image-build/spark-packer-template.json#L5-L6].
 The generated AMIs do not include Spark itself--just its dependencies, plus 
related tools for spark-ec2.

 Develop an automated way of creating Spark images (AMI, Docker, and others)
 ---

 Key: SPARK-3821
 URL: https://issues.apache.org/jira/browse/SPARK-3821
 Project: Spark
  Issue Type: Improvement
  Components: Build, EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
 Attachments: packer-proposal.html


 Right now the creation of Spark AMIs or Docker containers is done manually. 
 With tools like [Packer|http://www.packer.io/], we should be able to automate 
 this work, and do so in such a way that multiple types of machine images can 
 be created from a single template.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5679) Flaky tests in InputOutputMetricsSuite: input metrics with interleaved reads and input metrics with mixed read method

2015-02-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320904#comment-14320904
 ] 

Apache Spark commented on SPARK-5679:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/4599

 Flaky tests in InputOutputMetricsSuite: input metrics with interleaved reads 
 and input metrics with mixed read method 
 --

 Key: SPARK-5679
 URL: https://issues.apache.org/jira/browse/SPARK-5679
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Tests
Affects Versions: 1.3.0
Reporter: Patrick Wendell
Assignee: Kostas Sakellis
  Labels: flaky-test

 Please audit these and see if there are any assumptions with respect to File 
 IO that might not hold in all cases. I'm happy to help if you can't find 
 anything.
 These both failed in the same run:
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.3-SBT/38/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=centos/#showFailuresLink
 {code}
 org.apache.spark.metrics.InputOutputMetricsSuite.input metrics with mixed 
 read method
 Failing for the past 13 builds (Since Failed#26 )
 Took 48 sec.
 Error Message
 2030 did not equal 6496
 Stacktrace
 sbt.ForkMain$ForkError: 2030 did not equal 6496
   at 
 org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
   at 
 org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
   at 
 org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
   at 
 org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply$mcV$sp(InputOutputMetricsSuite.scala:135)
   at 
 org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply(InputOutputMetricsSuite.scala:113)
   at 
 org.apache.spark.metrics.InputOutputMetricsSuite$$anonfun$9.apply(InputOutputMetricsSuite.scala:113)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at 
 org.apache.spark.metrics.InputOutputMetricsSuite.org$scalatest$BeforeAndAfter$$super$runTest(InputOutputMetricsSuite.scala:46)
   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
   at 
 org.apache.spark.metrics.InputOutputMetricsSuite.runTest(InputOutputMetricsSuite.scala:46)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
   at org.scalatest.Suite$class.run(Suite.scala:1424)
   at 
 org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
   at 
 org.apache.spark.metrics.InputOutputMetricsSuite.org$scalatest$BeforeAndAfterAll$$super$run(InputOutputMetricsSuite.scala:46)
   at 
 org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
   at 
 

[jira] [Updated] (SPARK-5779) Python broadcast does not work with Kryo serializer

2015-02-13 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5779:
--
Affects Version/s: (was: 1.2.1)
   (was: 1.3.0)
   1.2.0

 Python broadcast does not work with Kryo serializer
 ---

 Key: SPARK-5779
 URL: https://issues.apache.org/jira/browse/SPARK-5779
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
Reporter: Davies Liu
Priority: Critical

 The PythonBroadcast cannot be serialized by Kryo, which is introduced in 1.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5812) Potential flaky test JavaAPISuite.glom

2015-02-13 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-5812:
-
Labels: flaky-test  (was: )

 Potential flaky test JavaAPISuite.glom
 --

 Key: SPARK-5812
 URL: https://issues.apache.org/jira/browse/SPARK-5812
 Project: Spark
  Issue Type: Bug
  Components: Java API, Spark Core
Affects Versions: 1.3.0
Reporter: Tathagata Das
  Labels: flaky-test

 https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27455/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k

2015-02-13 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320935#comment-14320935
 ] 

Xiangrui Meng commented on SPARK-5016:
--

I think we should compute the inverse in parallel. In 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L166,
 instead of collecting the sums to the driver, we use aggregateByKey to keep 
the per-cluster sums on the reducers. Then on each reducer we update the 
Gaussians, and finally collect the updated Gaussians to the driver.
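A minimal sketch of this suggestion, assuming the E-step has already produced 
per-point (cluster index, partial sums) pairs; ClusterSums, pointContribs, and 
the Breeze-based update are hypothetical names, not the actual GaussianMixture 
code:

{code}
import breeze.linalg.{inv, DenseMatrix => BDM, DenseVector => BDV}
import org.apache.spark.rdd.RDD

// Hypothetical per-cluster partial sums contributed by each point.
case class ClusterSums(weight: Double, meanSum: BDV[Double], covSum: BDM[Double])

def updateGaussians(
    pointContribs: RDD[(Int, ClusterSums)],
    numFeatures: Int): Array[(Int, (BDV[Double], BDM[Double]))] = {
  val zero = ClusterSums(0.0,
    BDV.zeros[Double](numFeatures),
    BDM.zeros[Double](numFeatures, numFeatures))
  pointContribs
    // Keep the sums on the reducers instead of collecting them to the driver.
    .aggregateByKey(zero)(
      (acc, p) => ClusterSums(acc.weight + p.weight,
        acc.meanSum + p.meanSum, acc.covSum + p.covSum),
      (a, b) => ClusterSums(a.weight + b.weight,
        a.meanSum + b.meanSum, a.covSum + b.covSum))
    // Each reducer updates its own cluster's Gaussian, so the O(d^3)
    // matrix inverse is spread across the cluster, not done on the driver.
    .mapValues { s =>
      val mu = s.meanSum / s.weight
      val sigma = (s.covSum / s.weight) - (mu * mu.t)
      (mu, inv(sigma))
    }
    // Only the k updated Gaussians come back to the driver.
    .collect()
}
{code}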

 GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
 ---

 Key: SPARK-5016
 URL: https://issues.apache.org/jira/browse/SPARK-5016
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 If numFeatures or k are large, GMM EM should distribute the matrix inverse 
 computation for Gaussian initialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5806) Organize sections in mllib-clustering.md

2015-02-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-5806.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4598
[https://github.com/apache/spark/pull/4598]

 Organize sections in mllib-clustering.md
 

 Key: SPARK-5806
 URL: https://issues.apache.org/jira/browse/SPARK-5806
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, MLlib
Affects Versions: 1.3.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.3.0


 Currently we separate code examples from algorithm descriptions. It would be 
 better to put the example code close to each algorithm's description.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5731) Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset

2015-02-13 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320739#comment-14320739
 ] 

Patrick Wendell commented on SPARK-5731:


[~c...@koeninger.org] [~tdas] FYI we've disabled this test because it's caused 
a huge productivity loss to ongoing development with frequent failures. Please 
try to get this test into good shape ASAP - otherwise this code will be 
untested.
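(A hedged aside: if the root cause turns out to be timing rather than 
correctness, the eventually patience in the failing assertion could be widened; 
allReceived and totalSent are names assumed from the failure message below, and 
this is an illustration, not the actual fix.)

{code}
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.{Seconds, Span}

// Retry the assertion for up to 60 seconds, polling every second,
// instead of relying on the suite's default patience.
eventually(timeout(Span(60, Seconds)), interval(Span(1, Seconds))) {
  assert(allReceived.size == totalSent, "didn't get all messages")
}
{code}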

 Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic 
 stream receiving with multiple topics and smallest starting offset
 

 Key: SPARK-5731
 URL: https://issues.apache.org/jira/browse/SPARK-5731
 Project: Spark
  Issue Type: Bug
  Components: Streaming, Tests
Affects Versions: 1.3.0
Reporter: Patrick Wendell
Assignee: Tathagata Das
Priority: Blocker
  Labels: flaky-test

 {code}
 sbt.ForkMain$ForkError: The code passed to eventually never returned 
 normally. Attempted 110 times over 20.070287525 seconds. Last failure 
 message: 300 did not equal 48 didn't get all messages.
   at 
 org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
   at 
 org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
   at 
 org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49)
   at 
 org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307)
   at 
 org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply$mcV$sp(DirectKafkaStreamSuite.scala:110)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$runTest(DirectKafkaStreamSuite.scala:38)
   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.runTest(DirectKafkaStreamSuite.scala:38)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
   at org.scalatest.Suite$class.run(Suite.scala:1424)
   at 
 org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$run(DirectKafkaStreamSuite.scala:38)
   at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)
   at 
 

[jira] [Updated] (SPARK-5731) Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset

2015-02-13 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5731:
---
Labels: flaky-test  (was: )

 Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic 
 stream receiving with multiple topics and smallest starting offset
 

 Key: SPARK-5731
 URL: https://issues.apache.org/jira/browse/SPARK-5731
 Project: Spark
  Issue Type: Bug
  Components: Streaming, Tests
Affects Versions: 1.3.0
Reporter: Patrick Wendell
Assignee: Tathagata Das
Priority: Blocker
  Labels: flaky-test

 {code}
 sbt.ForkMain$ForkError: The code passed to eventually never returned 
 normally. Attempted 110 times over 20.070287525 seconds. Last failure 
 message: 300 did not equal 48 didn't get all messages.
   at 
 org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
   at 
 org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
   at 
 org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49)
   at 
 org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307)
   at 
 org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply$mcV$sp(DirectKafkaStreamSuite.scala:110)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$runTest(DirectKafkaStreamSuite.scala:38)
   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.runTest(DirectKafkaStreamSuite.scala:38)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
   at org.scalatest.Suite$class.run(Suite.scala:1424)
   at 
 org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$run(DirectKafkaStreamSuite.scala:38)
   at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfterAll$$super$run(DirectKafkaStreamSuite.scala:38)
   at 
 org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
   at 
 

[jira] [Updated] (SPARK-5807) Parallel grid search

2015-02-13 Thread Peter Rudenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Rudenko updated SPARK-5807:
-
Description: 
Right now in CrossValidator, for each fold combination and ParamGrid 
hyperparameter pair, it searches for the best parameter sequentially. Assuming 
there are enough workers & memory on a cluster to cache all training/validation 
folds, it's possible to parallelize execution. Here's a draft I came up with:

{code}
import scala.collection.immutable.{ Vector => ScalaVec }

val metrics = ScalaVec.fill(numModels)(0.0) // Scala vector is thread safe
val splits = MLUtils.kFold(dataset, map(numFolds), 0).zipWithIndex

def processFold(input: ((RDD[sql.Row], RDD[sql.Row]), Int)) = input match {
  case ((training, validation), splitIndex) => {
    val trainingDataset = sqlCtx.applySchema(training, schema).cache()
    val validationDataset = sqlCtx.applySchema(validation, schema).cache()
    // multi-model training
    logDebug(s"Train split $splitIndex with multiple sets of parameters.")
    val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
    var i = 0
    trainingDataset.unpersist()
    while (i < numModels) {
      val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)), map)
      logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
      metrics(i) += metric
      i += 1
    }
    validationDataset.unpersist()
  }
}

if (parallel) {
  splits.par.foreach(processFold)
} else {
  splits.foreach(processFold)
}
{code}

Assuming there are 3 folds, it would redundantly cache all the combinations 
(which takes quite a lot of memory), so maybe it's possible to cache each fold 
separately.

  was:
Right now in CrossValidator, for each fold combination and ParamGrid 
hyperparameter pair, it searches for the best parameter sequentially. Assuming 
there are enough workers & memory on a cluster to cache all training/validation 
folds, it's possible to parallelize execution. Here's a draft I came up with:

{code}
import scala.collection.immutable.{ Vector => ScalaVec }

val metrics = ScalaVec.fill(numModels)(0.0) // Scala vector is thread safe
val splits = MLUtils.kFold(dataset, map(numFolds), 0).zipWithIndex

def processFold(input: ((RDD[sql.Row], RDD[sql.Row]), Int)) = input match {
  case ((training, validation), splitIndex) => {
    val trainingDataset = sqlCtx.applySchema(training, schema).cache()
    val validationDataset = sqlCtx.applySchema(validation, schema).cache()
    // multi-model training
    logDebug(s"Train split $splitIndex with multiple sets of parameters.")
    val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
    var i = 0
    trainingDataset.unpersist()
    while (i < numModels) {
      val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)), map)
      logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
      metrics(i) += metric
      i += 1
    }
    validationDataset.unpersist()
  }
}

if (parallel) {
  splits.par.foreach(processFold)
} else {
  splits.foreach(processFold)
}
{code}

Assuming there are 3 folds, it would redundantly cache all the combinations 
(which takes quite a lot of memory), so maybe it's possible to cache each fold 
separately.


 Parallel grid search 
 -

 Key: SPARK-5807
 URL: https://issues.apache.org/jira/browse/SPARK-5807
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 1.3.0
Reporter: Peter Rudenko
Priority: Minor

 Right now in CrossValidator, for each fold combination and ParamGrid 
 hyperparameter pair, it searches for the best parameter sequentially. Assuming 
 there are enough workers & memory on a cluster to cache all training/validation 
 folds, it's possible to parallelize execution. Here's a draft I came up with:
 {code}
 import scala.collection.immutable.{ Vector => ScalaVec }
 
 val metrics = ScalaVec.fill(numModels)(0.0) // Scala vector is thread safe
 val splits = MLUtils.kFold(dataset, map(numFolds), 0).zipWithIndex
 def processFold(input: ((RDD[sql.Row], RDD[sql.Row]), Int)) = input match {
   case ((training, validation), splitIndex) => {
     val trainingDataset = sqlCtx.applySchema(training, schema).cache()
     val validationDataset = sqlCtx.applySchema(validation, schema).cache()
     // multi-model training
     logDebug(s"Train split $splitIndex with multiple sets of parameters.")
     val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
     var i = 0
     trainingDataset.unpersist()
     while (i < numModels) {
       val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)), map)
       logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
       metrics(i) += metric
       i += 1
     }
     validationDataset.unpersist()
   }
 }
 if (parallel) 

[jira] [Created] (SPARK-5807) Parallel grid search

2015-02-13 Thread Peter Rudenko (JIRA)
Peter Rudenko created SPARK-5807:


 Summary: Parallel grid search 
 Key: SPARK-5807
 URL: https://issues.apache.org/jira/browse/SPARK-5807
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 1.3.0
Reporter: Peter Rudenko
Priority: Minor


Right now in CrossValidator, for each fold combination and ParamGrid 
hyperparameter pair, it searches for the best parameter sequentially. Assuming 
there are enough workers & memory on a cluster to cache all training/validation 
folds, it's possible to parallelize execution. Here's a draft I came up with:

{code}
import scala.collection.immutable.{ Vector => ScalaVec }

val metrics = ScalaVec.fill(numModels)(0.0) // Scala vector is thread safe
val splits = MLUtils.kFold(dataset, map(numFolds), 0).zipWithIndex

def processFold(input: ((RDD[sql.Row], RDD[sql.Row]), Int)) = input match {
  case ((training, validation), splitIndex) => {
    val trainingDataset = sqlCtx.applySchema(training, schema).cache()
    val validationDataset = sqlCtx.applySchema(validation, schema).cache()
    // multi-model training
    logDebug(s"Train split $splitIndex with multiple sets of parameters.")
    val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
    var i = 0
    trainingDataset.unpersist()
    while (i < numModels) {
      val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)), map)
      logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
      metrics(i) += metric
      i += 1
    }
    validationDataset.unpersist()
  }
}

if (parallel) {
  splits.par.foreach(processFold)
} else {
  splits.foreach(processFold)
}
{code}

Assuming there are 3 folds, it would redundantly cache all the combinations 
(which takes quite a lot of memory), so maybe it's possible to cache each fold 
separately.
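
One note on the splits.par line in the draft above, as a hedged sketch (the 
pool size 4 and the parSplits name are illustrative): Scala's parallel 
collections run on a shared pool sized by CPU count, so the fold-level fan-out 
can be bounded explicitly with a dedicated task support.

{code}
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool

// Bound how many folds train concurrently instead of using the shared
// default parallel-collections pool.
val parSplits = splits.par
parSplits.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(4))
parSplits.foreach(processFold)
{code}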



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5731) Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset

2015-02-13 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320754#comment-14320754
 ] 

Tathagata Das commented on SPARK-5731:
--

This is very weird. The stream is receiving more messages than it is supposed 
to. Let me try recreating it.

 Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic 
 stream receiving with multiple topics and smallest starting offset
 

 Key: SPARK-5731
 URL: https://issues.apache.org/jira/browse/SPARK-5731
 Project: Spark
  Issue Type: Bug
  Components: Streaming, Tests
Affects Versions: 1.3.0
Reporter: Patrick Wendell
Assignee: Tathagata Das
Priority: Blocker
  Labels: flaky-test

 {code}
 sbt.ForkMain$ForkError: The code passed to eventually never returned 
 normally. Attempted 110 times over 20.070287525 seconds. Last failure 
 message: 300 did not equal 48 didn't get all messages.
   at 
 org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
   at 
 org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
   at 
 org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49)
   at 
 org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307)
   at 
 org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply$mcV$sp(DirectKafkaStreamSuite.scala:110)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$runTest(DirectKafkaStreamSuite.scala:38)
   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.runTest(DirectKafkaStreamSuite.scala:38)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
   at org.scalatest.Suite$class.run(Suite.scala:1424)
   at 
 org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$run(DirectKafkaStreamSuite.scala:38)
   at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfterAll$$super$run(DirectKafkaStreamSuite.scala:38)
   at 
 

[jira] [Updated] (SPARK-5730) Group methods in the generated doc for spark.ml algorithms.

2015-02-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5730:
-
Assignee: Xiangrui Meng

 Group methods in the generated doc for spark.ml algorithms.
 ---

 Key: SPARK-5730
 URL: https://issues.apache.org/jira/browse/SPARK-5730
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Affects Versions: 1.3.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 In spark.ml, we have params and their setters/getters. It would be nice to 
 group them in the generated docs: params should be at the top, while 
 setters/getters should be at the bottom.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5810) Maven Coordinate Inclusion failing in pySpark

2015-02-13 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-5810:
--

 Summary: Maven Coordinate Inclusion failing in pySpark
 Key: SPARK-5810
 URL: https://issues.apache.org/jira/browse/SPARK-5810
 Project: Spark
  Issue Type: Bug
  Components: Deploy, PySpark
Affects Versions: 1.3.0
Reporter: Burak Yavuz
Priority: Blocker
 Fix For: 1.3.0


When including maven coordinates to download dependencies in pyspark, pyspark 
returns a GatewayError because it cannot read the proper port to communicate 
with the JVM. This happens because pyspark relies on STDIN to read the port 
number, and in the meantime Ivy prints out a whole lot of logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5731) Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset

2015-02-13 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320749#comment-14320749
 ] 

Tathagata Das commented on SPARK-5731:
--

Let me take a pass at it.

 Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic 
 stream receiving with multiple topics and smallest starting offset
 

 Key: SPARK-5731
 URL: https://issues.apache.org/jira/browse/SPARK-5731
 Project: Spark
  Issue Type: Bug
  Components: Streaming, Tests
Affects Versions: 1.3.0
Reporter: Patrick Wendell
Assignee: Tathagata Das
Priority: Blocker
  Labels: flaky-test

 {code}
 sbt.ForkMain$ForkError: The code passed to eventually never returned 
 normally. Attempted 110 times over 20.070287525 seconds. Last failure 
 message: 300 did not equal 48 didn't get all messages.
   at 
 org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
   at 
 org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
   at 
 org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49)
   at 
 org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307)
   at 
 org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply$mcV$sp(DirectKafkaStreamSuite.scala:110)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$runTest(DirectKafkaStreamSuite.scala:38)
   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.runTest(DirectKafkaStreamSuite.scala:38)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
   at org.scalatest.Suite$class.run(Suite.scala:1424)
   at 
 org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$run(DirectKafkaStreamSuite.scala:38)
   at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)
   at 
 org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfterAll$$super$run(DirectKafkaStreamSuite.scala:38)
   at 
 org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
   at 
 

[jira] [Commented] (SPARK-5798) Spark shell issue

2015-02-13 Thread DeepakVohra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320763#comment-14320763
 ] 

DeepakVohra commented on SPARK-5798:


Re-tested on local OS Oracle Linux 6.5 and did not get the Spark shell issue. 
The earlier test, which generated the Spark shell error, was on Amazon EC2.  
Issue may be closed.

 Spark shell issue
 -

 Key: SPARK-5798
 URL: https://issues.apache.org/jira/browse/SPARK-5798
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 1.2.0
 Environment: Spark 1.2
 Scala 2.10.4
Reporter: DeepakVohra

 The Spark shell terminates when Spark code is run, indicating an issue with 
 the Spark shell.
 The error is coming from the spark-shell script:
  
   /apachespark/spark-1.2.0-bin-cdh4/bin/spark-shell: line 48
  
   "$FWDIR"/bin/spark-submit --class org.apache.spark.repl.Main
   "${SUBMISSION_OPTS[@]}" spark-shell "${APPLICATION_OPTS[@]}"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5807) Parallel grid search

2015-02-13 Thread Peter Rudenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Rudenko updated SPARK-5807:
-
Description: 
Right now in CrossValidator, for each fold combination and ParamGrid 
hyperparameter pair, it searches for the best parameter sequentially. Assuming 
there are enough workers & memory on a cluster to cache all training/validation 
folds, it's possible to parallelize execution. Here's a draft I came up with:

{code}
val metrics = new ArrayBuffer[Double](numModels) with 
mutable.SynchronizedBuffer[Double]
val splits = MLUtils.kFold(dataset, map(numFolds), 0).zipWithIndex

def processFold(input: ((RDD[sql.Row], RDD[sql.Row]), Int)) = input match {
  case ((training, validation), splitIndex) => {
    val trainingDataset = sqlCtx.applySchema(training, schema).cache()
    val validationDataset = sqlCtx.applySchema(validation, schema).cache()
    // multi-model training
    logDebug(s"Train split $splitIndex with multiple sets of parameters.")
    val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
    var i = 0
    trainingDataset.unpersist()
    while (i < numModels) {
      val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)), map)
      logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
      metrics(i) += metric
      i += 1
    }
    validationDataset.unpersist()
  }
}

if (parallel) {
  splits.par.foreach(processFold)
} else {
  splits.foreach(processFold)
}
{code}

Assuming there are 3 folds, it would redundantly cache all the combinations 
(which takes quite a lot of memory), so maybe it's possible to cache each fold 
separately.
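
A hedged alternative for the shared metrics buffer (assuming Guava on the 
classpath, which Spark already depends on): an AtomicDoubleArray gives 
thread-safe accumulation without the SynchronizedBuffer mixin.

{code}
import com.google.common.util.concurrent.AtomicDoubleArray

// Fixed-size, thread-safe accumulator for the per-model metrics.
val metrics = new AtomicDoubleArray(numModels)
// Inside the loop over models, accumulate with:
//   metrics.addAndGet(i, metric)
{code}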

  was:
Right now in CrossValidator, for each fold combination and ParamGrid 
hyperparameter pair, it searches for the best parameter sequentially. Assuming 
there are enough workers & memory on a cluster to cache all training/validation 
folds, it's possible to parallelize execution. Here's a draft I came up with:

{code}
import scala.collection.immutable.{ Vector => ScalaVec }

val metrics = ScalaVec.fill(numModels)(0.0) // Scala vector is thread safe
val splits = MLUtils.kFold(dataset, map(numFolds), 0).zipWithIndex

def processFold(input: ((RDD[sql.Row], RDD[sql.Row]), Int)) = input match {
  case ((training, validation), splitIndex) => {
    val trainingDataset = sqlCtx.applySchema(training, schema).cache()
    val validationDataset = sqlCtx.applySchema(validation, schema).cache()
    // multi-model training
    logDebug(s"Train split $splitIndex with multiple sets of parameters.")
    val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
    var i = 0
    trainingDataset.unpersist()
    while (i < numModels) {
      val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)), map)
      logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
      metrics(i) += metric
      i += 1
    }
    validationDataset.unpersist()
  }
}

if (parallel) {
  splits.par.foreach(processFold)
} else {
  splits.foreach(processFold)
}
{code}

Assuming there are 3 folds, it would redundantly cache all the combinations 
(which takes quite a lot of memory), so maybe it's possible to cache each fold 
separately.


 Parallel grid search 
 -

 Key: SPARK-5807
 URL: https://issues.apache.org/jira/browse/SPARK-5807
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 1.3.0
Reporter: Peter Rudenko
Priority: Minor

 Right now in CrossValidator, for each fold combination and ParamGrid 
 hyperparameter pair, it searches for the best parameter sequentially. Assuming 
 there are enough workers & memory on a cluster to cache all training/validation 
 folds, it's possible to parallelize execution. Here's a draft I came up with:
 {code}
 val metrics = new ArrayBuffer[Double](numModels) with 
 mutable.SynchronizedBuffer[Double]
 val splits = MLUtils.kFold(dataset, map(numFolds), 0).zipWithIndex
 def processFold(input: ((RDD[sql.Row], RDD[sql.Row]), Int)) = input match {
   case ((training, validation), splitIndex) => {
     val trainingDataset = sqlCtx.applySchema(training, schema).cache()
     val validationDataset = sqlCtx.applySchema(validation, schema).cache()
     // multi-model training
     logDebug(s"Train split $splitIndex with multiple sets of parameters.")
     val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
     var i = 0
     trainingDataset.unpersist()
     while (i < numModels) {
       val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)), map)
       logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
       metrics(i) += metric
       i += 1
     }
     validationDataset.unpersist()
   }
 }
 if (parallel) {
   splits.par.foreach(processFold)
 } else {
   splits.foreach(processFold)
 }
 {code}
 Assuming there are 3 folds, it would redundantly cache all the combinations 

[jira] [Commented] (SPARK-5731) Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic stream receiving with multiple topics and smallest starting offset

2015-02-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320796#comment-14320796
 ] 

Apache Spark commented on SPARK-5731:
-

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/4597

 Flaky Test: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.basic 
 stream receiving with multiple topics and smallest starting offset
 

 Key: SPARK-5731
 URL: https://issues.apache.org/jira/browse/SPARK-5731
 Project: Spark
  Issue Type: Bug
  Components: Streaming, Tests
Affects Versions: 1.3.0
Reporter: Patrick Wendell
Assignee: Tathagata Das
Priority: Blocker
  Labels: flaky-test

 {code}
 sbt.ForkMain$ForkError: The code passed to eventually never returned 
 normally. Attempted 110 times over 20.070287525 seconds. Last failure 
 message: 300 did not equal 48 didn't get all messages.
   at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
   at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
   at org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49)
   at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307)
   at org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49)
   at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply$mcV$sp(DirectKafkaStreamSuite.scala:110)
   at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70)
   at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$2.apply(DirectKafkaStreamSuite.scala:70)
   at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$runTest(DirectKafkaStreamSuite.scala:38)
   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
   at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.runTest(DirectKafkaStreamSuite.scala:38)
   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
   at org.scalatest.Suite$class.run(Suite.scala:1424)
   at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
   at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
   at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$run(DirectKafkaStreamSuite.scala:38)
   at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)
   at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfterAll$$super$run(DirectKafkaStreamSuite.scala:38)
   at ...
 

[jira] [Commented] (SPARK-5779) Python broadcast does not work with Kryo serializer

2015-02-13 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320803#comment-14320803
 ] 

Josh Rosen commented on SPARK-5779:
---

I thought we fixed this in SPARK-4882: 
https://github.com/apache/spark/pull/3831.  Have you observed a new occurrence 
of this issue?

 Python broadcast does not work with Kryo serializer
 ---

 Key: SPARK-5779
 URL: https://issues.apache.org/jira/browse/SPARK-5779
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.3.0, 1.2.1
Reporter: Davies Liu
Priority: Critical

 PythonBroadcast, which was introduced in 1.2, cannot be serialized by Kryo.
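 A generic workaround sketch (not the actual Spark fix): Kryo can be told to fall back to Java serialization for a class that relies on writeObject/readObject hooks by registering Kryo's JavaSerializer for it in a custom KryoRegistrator. The class below is a hypothetical stand-in:
 {code}
 import com.esotericsoftware.kryo.Kryo
 import com.esotericsoftware.kryo.serializers.JavaSerializer
 import org.apache.spark.serializer.KryoRegistrator

 // Hypothetical placeholder for a class that only supports Java serialization.
 class JavaOnlySerializable extends java.io.Serializable

 class FallbackRegistrator extends KryoRegistrator {
   override def registerClasses(kryo: Kryo): Unit = {
     // Delegate this class to Java serialization instead of Kryo's
     // field-by-field handling, which ignores custom writeObject hooks.
     kryo.register(classOf[JavaOnlySerializable], new JavaSerializer())
   }
 }

 // Enabled via spark.serializer=org.apache.spark.serializer.KryoSerializer
 // and spark.kryo.registrator=FallbackRegistrator.
 {code}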



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4865) Include temporary tables in SHOW TABLES

2015-02-13 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320804#comment-14320804
 ] 

Yin Huai commented on SPARK-4865:
-

I will start to work on it based on SPARK-3299.

 Include temporary tables in SHOW TABLES
 ---

 Key: SPARK-4865
 URL: https://issues.apache.org/jira/browse/SPARK-4865
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Misha Chernetsov
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4865) Include temporary tables in SHOW TABLES

2015-02-13 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-4865:

Priority: Blocker  (was: Critical)

 Include temporary tables in SHOW TABLES
 ---

 Key: SPARK-4865
 URL: https://issues.apache.org/jira/browse/SPARK-4865
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Misha Chernetsov
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5809) OutOfMemoryError in logDebug in RandomForest.scala

2015-02-13 Thread Devesh Parekh (JIRA)
Devesh Parekh created SPARK-5809:


 Summary: OutOfMemoryError in logDebug in RandomForest.scala
 Key: SPARK-5809
 URL: https://issues.apache.org/jira/browse/SPARK-5809
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Devesh Parekh


When training a GBM on sparse vectors produced by HashingTF, I get the 
following OutOfMemoryError, where RandomForest is building a debug string to 
log.

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3326)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
    at java.lang.StringBuilder.append(StringBuilder.java:136)
    at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:197)
    at scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:327)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    at scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320)
    at scala.collection.AbstractTraversable.addString(Traversable.scala:105)
    at scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
    at scala.collection.AbstractTraversable.mkString(Traversable.scala:105)
    at scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:288)
    at scala.collection.AbstractTraversable.mkString(Traversable.scala:105)
    at org.apache.spark.mllib.tree.RandomForest$$anonfun$run$9.apply(RandomForest.scala:152)
    at org.apache.spark.mllib.tree.RandomForest$$anonfun$run$9.apply(RandomForest.scala:152)
    at org.apache.spark.Logging$class.logDebug(Logging.scala:63)
    at org.apache.spark.mllib.tree.RandomForest.logDebug(RandomForest.scala:67)
    at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:150)
    at org.apache.spark.mllib.tree.DecisionTree.run(DecisionTree.scala:64)
    at org.apache.spark.mllib.tree.GradientBoostedTrees$.org$apache$spark$mllib$tree$GradientBoostedTrees$$boost(GradientBoostedTrees.scala:150)
    at org.apache.spark.mllib.tree.GradientBoostedTrees.run(GradientBoostedTrees.scala:63)
    at org.apache.spark.mllib.tree.GradientBoostedTrees$.train(GradientBoostedTrees.scala:96)

A workaround until this is fixed is to modify log4j.properties in the conf 
directory to filter out debug logs in RandomForest. For example:
log4j.logger.org.apache.spark.mllib.tree.RandomForest=WARN
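This workaround helps because Spark's logDebug takes its message by name (Logging.scala:63 in the trace above), so when the debug level is disabled the huge mkString is never evaluated. A minimal sketch of that guard pattern, assuming a plain log4j logger:
{code}
import org.apache.log4j.Logger

object GuardedLogging {
  private val log = Logger.getLogger("example")

  // msg is by-name: the expensive string is only built if debug logging
  // is actually enabled for this logger.
  def logDebug(msg: => String): Unit = {
    if (log.isDebugEnabled) log.debug(msg)
  }

  def main(args: Array[String]): Unit = {
    val bins = Array.fill(1000000)(1.0)
    // With log4j.logger.example=WARN, the mkString below never runs and no
    // giant StringBuilder is allocated.
    logDebug(s"bins: ${bins.mkString(", ")}")
  }
}
{code}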



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5789) Throw a better error message if JsonRDD.parseJson encounters unrecoverable parsing errors.

2015-02-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5789.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4582
[https://github.com/apache/spark/pull/4582]

 Throw a better error message if JsonRDD.parseJson encounters unrecoverable 
 parsing errors.
 --

 Key: SPARK-5789
 URL: https://issues.apache.org/jira/browse/SPARK-5789
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai
 Fix For: 1.3.0


 For example
 {code}
 sqlContext.jsonRDD(sc.parallelize("a:1}" :: Nil))
 {code}
 will throw
 {code}
 scala.MatchError: a (of class java.lang.String)
   at 
 org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$2.apply(JsonRDD.scala:302)
   at 
 org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$2.apply(JsonRDD.scala:300)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:879)
   at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:878)
   at 
 org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1516)
   at 
 org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1516)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 15/02/12 15:08:55 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 
 4.0 (TID 26) in 10 ms on localhost (7/8)
 15/02/12 15:08:55 WARN scheduler.TaskSetManager: Lost task 7.0 in stage 4.0 
 (TID 33, localhost): scala.MatchError: a (of class java.lang.String)
   at 
 org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$2.apply(JsonRDD.scala:302)
   at 
 org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$2.apply(JsonRDD.scala:300)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:879)
   at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:878)
   at 
 org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1516)
   at 
 org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1516)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {code}
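 A hedged sketch of the shape of the improvement (the actual change is in pull request 4582 and is not reproduced here): match the unexpected value explicitly and raise a descriptive error instead of letting a bare scala.MatchError escape.
 {code}
 import scala.collection.JavaConverters._

 // Hypothetical helper: convert one parsed record, failing loudly on
 // anything that is not a JSON object.
 def toMap(record: String, parsed: Any): Map[String, Any] = parsed match {
   case m: java.util.Map[_, _] =>
     m.asInstanceOf[java.util.Map[String, Any]].asScala.toMap
   case _ =>
     throw new RuntimeException(
       s"Failed to parse record $record. Please make sure that each line " +
       "of the file (or each string in the RDD) is a valid JSON object.")
 }
 {code}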



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5811) Documentation for --packages and --repositories on Spark Shell

2015-02-13 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-5811:
--

 Summary: Documentation for --packages and --repositories on Spark 
Shell
 Key: SPARK-5811
 URL: https://issues.apache.org/jira/browse/SPARK-5811
 Project: Spark
  Issue Type: Documentation
  Components: Deploy, Spark Shell
Affects Versions: 1.3.0
Reporter: Burak Yavuz
Priority: Critical
 Fix For: 1.3.0


Documentation for the new support for dependency management with Maven 
coordinates via --packages and --repositories.
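A hedged usage sketch for the docs (the coordinates and repository URL below are illustrative only):
{code}
# Load a package by its Maven coordinates (groupId:artifactId:version) and
# resolve it against an extra repository in addition to Maven Central.
bin/spark-shell --packages com.example:my-library_2.10:0.1.0 \
  --repositories https://repo.example.com/releases
{code}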



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5363) Spark 1.2 freeze without error notification

2015-02-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320986#comment-14320986
 ] 

Apache Spark commented on SPARK-5363:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/4601

 Spark 1.2 freeze without error notification
 ---

 Key: SPARK-5363
 URL: https://issues.apache.org/jira/browse/SPARK-5363
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
Reporter: Tassilo Klein
Assignee: Davies Liu
Priority: Critical

 After a number of calls to a map().collect() statement, Spark freezes without 
 reporting any error. Within the map, a large broadcast variable is used.
 The freezing can be avoided by setting 'spark.python.worker.reuse = false' 
 (Spark 1.2) or by using an earlier version, however at the price of lower speed. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2015-02-13 Thread Florian Verhein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320995#comment-14320995
 ] 

Florian Verhein commented on SPARK-3821:


RE: Java, that reminds me... We should probably be using the Oracle JDK rather 
than OpenJDK. But I think this should be a separate issue, so I just created 
SPARK-5813.

 Develop an automated way of creating Spark images (AMI, Docker, and others)
 ---

 Key: SPARK-3821
 URL: https://issues.apache.org/jira/browse/SPARK-3821
 Project: Spark
  Issue Type: Improvement
  Components: Build, EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
 Attachments: packer-proposal.html


 Right now the creation of Spark AMIs or Docker containers is done manually. 
 With tools like [Packer|http://www.packer.io/], we should be able to automate 
 this work, and do so in such a way that multiple types of machine images can 
 be created from a single template.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5730) Group methods in the generated doc for spark.ml algorithms.

2015-02-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-5730.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4600
[https://github.com/apache/spark/pull/4600]

 Group methods in the generated doc for spark.ml algorithms.
 ---

 Key: SPARK-5730
 URL: https://issues.apache.org/jira/browse/SPARK-5730
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Affects Versions: 1.3.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.3.0


 In spark.ml, we have params and their setters/getters. It is nice to group 
 them in the generated docs. Params should be at the top, while 
 setters/getters should be at the bottom.
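 For reference, a minimal sketch of how Scaladoc grouping tags can express this, assuming the standard @groupname/@groupprio/@group tags (the class below is illustrative, not actual spark.ml code):
 {code}
 /**
  * Illustrative estimator with grouped docs: params first, setters below.
  *
  * @groupname param Parameters
  * @groupprio param 10
  * @groupname setParam Parameter setters
  * @groupprio setParam 20
  */
 class ExampleEstimator {
   /** Regularization parameter. @group param */
   val regParam: Double = 0.0

   /** Sets the regularization parameter. @group setParam */
   def setRegParam(value: Double): this.type = this
 }
 {code}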



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k

2015-02-13 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321099#comment-14321099
 ] 

Travis Galoppo commented on SPARK-5016:
---

Hmm, I'm having trouble conceptualizing how to use aggregateByKey here; the 
breezeData RDD is not keyed. We could have a keyed RDD of expectation sums 
(with a little rework), but each entry in the breezeData RDD would need to be 
operated on by each reducer, which seems awkward... or am I way off?
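For concreteness, a hedged sketch of the keyed-RDD shape under discussion (all names are illustrative, not from the actual GaussianMixtureEM code). It also makes the awkwardness visible: every point is emitted once per component before the reduce.
{code}
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Emit one (componentIndex, contribution) pair per point per component,
// then combine per component; responsibility(x, j) stands in for the
// E-step's gamma_ij.
def expectationSums(
    data: RDD[Array[Double]],
    responsibility: (Array[Double], Int) => Double,
    k: Int): RDD[(Int, (Double, Array[Double]))] = {
  data.flatMap { x =>
    (0 until k).map { j =>
      val r = responsibility(x, j)
      (j, (r, x.map(_ * r)))  // (weight sum, weighted point sum)
    }
  }.reduceByKey { case ((w1, s1), (w2, s2)) =>
    (w1 + w2, s1.zip(s2).map { case (a, b) => a + b })
  }
}
{code}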


 GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
 ---

 Key: SPARK-5016
 URL: https://issues.apache.org/jira/browse/SPARK-5016
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 If numFeatures or k are large, GMM EM should distribute the matrix inverse 
 computation for Gaussian initialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5803) Use ArrayBuilder instead of ArrayBuffer for primitive types

2015-02-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-5803.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

 Use ArrayBuilder instead of ArrayBuffer for primitive types
 ---

 Key: SPARK-5803
 URL: https://issues.apache.org/jira/browse/SPARK-5803
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.3.0


 ArrayBuffer is not specialized and hence it boxes primitive-typed values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5363) Spark 1.2 freeze without error notification

2015-02-13 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320987#comment-14320987
 ] 

Davies Liu commented on SPARK-5363:
---

[~TJKlein] Could you try the patch in https://github.com/apache/spark/pull/4601 
and see whether it fixes your problem?

 Spark 1.2 freeze without error notification
 ---

 Key: SPARK-5363
 URL: https://issues.apache.org/jira/browse/SPARK-5363
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
Reporter: Tassilo Klein
Assignee: Davies Liu
Priority: Critical

 After a number of calls to a map().collect() statement, Spark freezes without 
 reporting any error. Within the map, a large broadcast variable is used.
 The freezing can be avoided by setting 'spark.python.worker.reuse = false' 
 (Spark 1.2) or by using an earlier version, however at the price of lower speed. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5813) Spark-ec2: Switch to OracleJDK

2015-02-13 Thread Florian Verhein (JIRA)
Florian Verhein created SPARK-5813:
--

 Summary: Spark-ec2: Switch to OracleJDK
 Key: SPARK-5813
 URL: https://issues.apache.org/jira/browse/SPARK-5813
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Florian Verhein
Priority: Minor


We are currently using OpenJDK; however, it is generally recommended to use the 
Oracle JDK, especially for Hadoop deployments.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5814) Remove JBLAS from runtime dependencies

2015-02-13 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-5814:


 Summary: Remove JBLAS from runtime dependencies
 Key: SPARK-5814
 URL: https://issues.apache.org/jira/browse/SPARK-5814
 Project: Spark
  Issue Type: Dependency upgrade
  Components: GraphX, MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


We are using mixed breeze/netlib-java and jblas code in MLlib. They take 
different approaches to utilizing native libraries, and we should keep only one 
of them. netlib-java has a clear separation between the Java implementation and 
the native JNI libraries, while JBLAS packs statically linked binaries, which 
causes license issues (SPARK-5669). So we want to remove JBLAS from the Spark 
runtime.

One issue with this approach is that JBLAS' DoubleMatrix is exposed (by 
mistake) in SVDPlusPlus of GraphX. We should deprecate it and replace 
`DoubleMatrix` with `Array[Double]`.
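A hedged sketch of the deprecation pattern this implies (object and method names are illustrative, not GraphX's actual API):
{code}
import org.jblas.DoubleMatrix

object SvdPlusPlusShim {
  // Kept only for binary compatibility; the identity bodies are placeholders.
  @deprecated("Exposes a JBLAS type; use runWithArrays instead.", "1.3.0")
  def run(v: DoubleMatrix): DoubleMatrix = v

  // Replacement that keeps third-party types out of the public surface.
  def runWithArrays(v: Array[Double]): Array[Double] = v
}
{code}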



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5815) Deprecate SVDPlusPlus APIs that expose DoubleMatrix from JBLAS

2015-02-13 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-5815:


 Summary: Deprecate SVDPlusPlus APIs that expose DoubleMatrix from 
JBLAS
 Key: SPARK-5815
 URL: https://issues.apache.org/jira/browse/SPARK-5815
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.3.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


It is generally bad to expose types defined in a third-party package in Spark 
public APIs. We should deprecate those methods in SVDPlusPlus and replace them 
in the next release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5124) Standardize internal RPC interface

2015-02-13 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-5124:

Attachment: Pluggable RPC - draft 2.pdf

Compared to the first version, this doc adds an ActionScheduler interface and 
changes the fault tolerance to:

Any error thrown by `onStart`, `receive`, or `onStop` will be sent to 
`onError`. If `onError` throws an error, it will be ignored.
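A minimal sketch of that contract (the endpoint method names follow the doc; the dispatcher helper is an assumption):
{code}
trait RpcEndpoint {
  def onStart(): Unit = {}
  def receive(message: Any): Unit
  def onStop(): Unit = {}
  def onError(cause: Throwable): Unit = {}
}

// Dispatcher-side wrapper: any error from onStart/receive/onStop is routed
// to onError; an error thrown by onError itself is ignored.
def invokeSafely(endpoint: RpcEndpoint)(body: => Unit): Unit = {
  try body
  catch {
    case t: Throwable =>
      try endpoint.onError(t) catch { case _: Throwable => () }
  }
}
{code}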

 Standardize internal RPC interface
 --

 Key: SPARK-5124
 URL: https://issues.apache.org/jira/browse/SPARK-5124
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Shixiong Zhu
 Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf


 In Spark we use Akka as the RPC layer. It would be great if we can 
 standardize the internal RPC interface to facilitate testing. This will also 
 provide the foundation to try other RPC implementations in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


