[jira] [Commented] (SPARK-3261) KMeans clusterer can return duplicate cluster centers

2015-02-24 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336169#comment-14336169
 ] 

Derrick Burns commented on SPARK-3261:
--

One solution is to run KMeansParallel or KMeansRandom after each Lloyd's round 
to "replenish" empty clusters.

I have implemented the former in 
https://github.com/derrickburns/generalized-kmeans-clustering.

Performance is reasonable. 

Inspection reveals that the slow part of the KMeansParallel computation is 
computing the sum of the weights of the points assigned to each cluster.  

However, that cost can be reduced by sampling the points and summing only the 
contributions of the sampled points. For large data sets, this approach is 
appropriate.  
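For illustration, here is a minimal, hypothetical sketch of that sampling idea 
(it is not the code in the repository above): estimate each cluster's total 
point weight from a uniform sample and scale by the sampling fraction. The 
{{Point}} case class and the helper names are placeholders.

{code}
// Hypothetical sketch: estimate per-cluster weight sums from a sample of the
// points instead of the full data set, scaling by the sampling fraction.
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

case class Point(features: Array[Double], weight: Double)

def estimateClusterWeights(
    points: RDD[Point],
    centers: Array[Array[Double]],
    fraction: Double,
    seed: Long = 42L): Map[Int, Double] = {

  // index of the nearest center by squared Euclidean distance
  def closestCenter(p: Point): Int =
    centers.indices.minBy { i =>
      val c = centers(i)
      var d = 0.0
      var j = 0
      while (j < c.length) { val diff = p.features(j) - c(j); d += diff * diff; j += 1 }
      d
    }

  points
    .sample(withReplacement = false, fraction, seed)
    .map(p => (closestCenter(p), p.weight / fraction)) // scale up to estimate the full sum
    .reduceByKey(_ + _)
    .collectAsMap()
    .toMap
}
{code}

Clusters whose estimated weight is zero (or below some threshold) would then be 
the candidates for replenishment.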

> KMeans clusterer can return duplicate cluster centers
> -
>
> Key: SPARK-3261
> URL: https://issues.apache.org/jira/browse/SPARK-3261
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.2
>Reporter: Derrick Burns
>Assignee: Derrick Burns
>  Labels: clustering
>
> This is a bad design choice.  I think that it is preferable to produce no 
> duplicate cluster centers. So instead of forcing the number of clusters to be 
> K, return at most K clusters.






[jira] [Commented] (SPARK-5775) GenericRow cannot be cast to SpecificMutableRow when nested data and partitioned table

2015-02-24 Thread Anselme Vignon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336152#comment-14336152
 ] 

Anselme Vignon commented on SPARK-5775:
---

[~marmbrus][~lian cheng] Hi,

I'm quite new to the process of debugging Spark, but the pull request I updated 
5 days ago (referenced above) seems to solve this issue.

Or did I miss something? 

Cheers (and thanks for the awesome work),

Anselme

> GenericRow cannot be cast to SpecificMutableRow when nested data and 
> partitioned table
> --
>
> Key: SPARK-5775
> URL: https://issues.apache.org/jira/browse/SPARK-5775
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: Ayoub Benali
>Assignee: Cheng Lian
>Priority: Blocker
>  Labels: hivecontext, nested, parquet, partition
>
> Using the "LOAD" sql command in Hive context to load parquet files into a 
> partitioned table causes exceptions during query time. 
> The bug requires the table to have a column of *type Array of struct* and to 
> be *partitioned*. 
> The example below shows how to reproduce the bug; you can see that if the 
> table is not partitioned, the query works fine. 
> {noformat}
> scala> val data1 = """{"data_array":[{"field1":1,"field2":2}]}"""
> scala> val data2 = """{"data_array":[{"field1":3,"field2":4}]}"""
> scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil)
> scala> val schemaRDD = hiveContext.jsonRDD(jsonRDD)
> scala> schemaRDD.printSchema
> root
>  |-- data_array: array (nullable = true)
>  ||-- element: struct (containsNull = false)
>  |||-- field1: integer (nullable = true)
>  |||-- field2: integer (nullable = true)
> scala> hiveContext.sql("create external table if not exists 
> partitioned_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>) 
> Partitioned by (date STRING) STORED AS PARQUET Location 
> 'hdfs:///partitioned_table'")
> scala> hiveContext.sql("create external table if not exists 
> none_partitioned_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>) 
> STORED AS PARQUET Location 'hdfs:///none_partitioned_table'")
> scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_1")
> scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_2")
> scala> hiveContext.sql("LOAD DATA INPATH 
> 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_1' INTO TABLE 
> partitioned_table PARTITION(date='2015-02-12')")
> scala> hiveContext.sql("LOAD DATA INPATH 
> 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_2' INTO TABLE 
> none_partitioned_table")
> scala> hiveContext.sql("select data.field1 from none_partitioned_table 
> LATERAL VIEW explode(data_array) nestedStuff AS data").collect
> res23: Array[org.apache.spark.sql.Row] = Array([1], [3])
> scala> hiveContext.sql("select data.field1 from partitioned_table LATERAL 
> VIEW explode(data_array) nestedStuff AS data").collect
> 15/02/12 16:21:03 INFO ParseDriver: Parsing command: select data.field1 from 
> partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data
> 15/02/12 16:21:03 INFO ParseDriver: Parse Completed
> 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(260661) called with 
> curMem=0, maxMem=280248975
> 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18 stored as values in 
> memory (estimated size 254.6 KB, free 267.0 MB)
> 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(28615) called with 
> curMem=260661, maxMem=280248975
> 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18_piece0 stored as bytes 
> in memory (estimated size 27.9 KB, free 267.0 MB)
> 15/02/12 16:21:03 INFO BlockManagerInfo: Added broadcast_18_piece0 in memory 
> on *:51990 (size: 27.9 KB, free: 267.2 MB)
> 15/02/12 16:21:03 INFO BlockManagerMaster: Updated info of block 
> broadcast_18_piece0
> 15/02/12 16:21:03 INFO SparkContext: Created broadcast 18 from NewHadoopRDD 
> at ParquetTableOperations.scala:119
> 15/02/12 16:21:03 INFO FileInputFormat: Total input paths to process : 3
> 15/02/12 16:21:03 INFO ParquetInputFormat: Total input paths to process : 3
> 15/02/12 16:21:03 INFO FilteringParquetRowInputFormat: Using Task Side 
> Metadata Split Strategy
> 15/02/12 16:21:03 INFO SparkContext: Starting job: collect at 
> SparkPlan.scala:84
> 15/02/12 16:21:03 INFO DAGScheduler: Got job 12 (collect at 
> SparkPlan.scala:84) with 3 output partitions (allowLocal=false)
> 15/02/12 16:21:03 INFO DAGScheduler: Final stage: Stage 13(collect at 
> SparkPlan.scala:84)
> 15/02/12 16:21:03 INFO DAGScheduler: Parents of final stage: List()
> 15/02/12 16:21:03 INFO DAGScheduler: Missing parents: List()
> 15/02/12 16:21:03 INFO DAGScheduler: Submitting Stage 13 (MappedRDD[111] at 
> map at SparkPlan.scala:84), which has no missing parents
> 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(7632) called with 
> 

[jira] [Commented] (SPARK-5969) The pyspark.rdd.sortByKey always fills only two partitions when ascending=False.

2015-02-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336128#comment-14336128
 ] 

Apache Spark commented on SPARK-5969:
-

User 'foxik' has created a pull request for this issue:
https://github.com/apache/spark/pull/4761

> The pyspark.rdd.sortByKey always fills only two partitions when 
> ascending=False.
> 
>
> Key: SPARK-5969
> URL: https://issues.apache.org/jira/browse/SPARK-5969
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.2.1
> Environment: Linux, 64bit
>Reporter: Milan Straka
>
> The pyspark.rdd.sortByKey always fills only two partitions when 
> ascending=False -- the first one and the last one.
> Simple example sorting numbers 0..999 into 10 partitions in descending order:
> {code}
> sc.parallelize(range(1000)).keyBy(lambda x:x).sortByKey(ascending=False, 
> numPartitions=10).glom().map(len).collect()
> {code}
> returns the following partition sizes:
> {code}
> [469, 0, 0, 0, 0, 0, 0, 0, 0, 531]
> {code}
> When ascending=True, all works as expected:
> {code}
> sc.parallelize(range(1000)).keyBy(lambda x:x).sortByKey(ascending=True, 
> numPartitions=10).glom().map(len).collect()
> Out: [116, 96, 100, 87, 132, 101, 101, 95, 87, 85]
> {code}
> The problem is caused by the following line 565 in rdd.py:
> {code}
> samples = sorted(samples, reverse=(not ascending), key=keyfunc)
> {code}
> That sorts the samples descending if ascending=False. Nevertheless, the samples 
> should always be in ascending order, because they are (after being sub-sampled 
> into the variable {{bounds}}) used in a bisect_left call:
> {code}
> def rangePartitioner(k):
> p = bisect.bisect_left(bounds, keyfunc(k))
> if ascending:
> return p
> else:
> return numPartitions - 1 - p
> {code}
> As you can see, rangePartitioner already handles ascending=False by itself, so 
> the fix for the whole problem is really trivial: just change line rdd.py:565 to 
> {{samples = sorted(samples, key=keyfunc)}}, i.e. drop the 
> {{reverse=(not ascending)}} argument.






[jira] [Commented] (SPARK-5999) Remove duplicate Literal matching block

2015-02-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336120#comment-14336120
 ] 

Apache Spark commented on SPARK-5999:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/4760

> Remove duplicate Literal matching block
> ---
>
> Key: SPARK-5999
> URL: https://issues.apache.org/jira/browse/SPARK-5999
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>







[jira] [Created] (SPARK-5999) Remove duplicate Literal matching block

2015-02-24 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-5999:
--

 Summary: Remove duplicate Literal matching block
 Key: SPARK-5999
 URL: https://issues.apache.org/jira/browse/SPARK-5999
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh
Priority: Minor









[jira] [Commented] (SPARK-5970) Temporary directories are not removed (but their content is)

2015-02-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336113#comment-14336113
 ] 

Apache Spark commented on SPARK-5970:
-

User 'foxik' has created a pull request for this issue:
https://github.com/apache/spark/pull/4759

> Temporary directories are not removed (but their content is)
> 
>
> Key: SPARK-5970
> URL: https://issues.apache.org/jira/browse/SPARK-5970
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
> Environment: Linux, 64bit
> spark-1.2.1-bin-hadoop2.4.tgz
>Reporter: Milan Straka
>
> How to reproduce: 
> - extract spark-1.2.1-bin-hadoop2.4.tgz
> - without any further configuration, run bin/pyspark
> - run sc.stop() and close python shell
> Expected results:
> - no temporary directories are left in /tmp
> Actual results:
> - four empty temporary directories are created in /tmp, for example after 
> {{ls -d /tmp/spark*}}:{code}
> /tmp/spark-1577b13d-4b9a-4e35-bac2-6e84e5605f53
> /tmp/spark-96084e69-77fd-42fb-ab10-e1fc74296fe3
> /tmp/spark-ab2ea237-d875-485e-b16c-5b0ac31bd753
> /tmp/spark-ddeb0363-4760-48a4-a189-81321898b146
> {code}
> The issue is caused by changes in {{util/Utils.scala}}. Consider the 
> {{createDirectory}}:
> {code}  /**
>* Create a directory inside the given parent directory. The directory is 
> guaranteed to be
>* newly created, and is not marked for automatic deletion.
>*/
>   def createDirectory(root: String, namePrefix: String = "spark"): File = ...
> {code}
> The {{createDirectory}} is used in two places. The first is in 
> {{createTempDir}}, where it is marked for automatic deletion:
> {code}
>   def createTempDir(
>   root: String = System.getProperty("java.io.tmpdir"),
>   namePrefix: String = "spark"): File = {
> val dir = createDirectory(root, namePrefix)
> registerShutdownDeleteDir(dir)
> dir
>   }
> {code}
> Nevertheless, it is also used in {{getOrCreateLocalRootDirs}}, where it is _not_ 
> marked for automatic deletion:
> {code}
>   private[spark] def getOrCreateLocalRootDirs(conf: SparkConf): Array[String] 
> = {
> if (isRunningInYarnContainer(conf)) {
>   // If we are in yarn mode, systems can have different disk layouts so 
> we must set it
>   // to what Yarn on this system said was available. Note this assumes 
> that Yarn has
>   // created the directories already, and that they are secured so that 
> only the
>   // user has access to them.
>   getYarnLocalDirs(conf).split(",")
> } else {
>   // In non-Yarn mode (or for the driver in yarn-client mode), we cannot 
> trust the user
>   // configuration to point to a secure directory. So create a 
> subdirectory with restricted
>   // permissions under each listed directory.
>   Option(conf.getenv("SPARK_LOCAL_DIRS"))
> .getOrElse(conf.get("spark.local.dir", 
> System.getProperty("java.io.tmpdir")))
> .split(",")
> .flatMap { root =>
>   try {
> val rootDir = new File(root)
> if (rootDir.exists || rootDir.mkdirs()) {
>   Some(createDirectory(root).getAbsolutePath())
> } else {
>   logError(s"Failed to create dir in $root. Ignoring this 
> directory.")
>   None
> }
>   } catch {
> case e: IOException =>
> logError(s"Failed to create local root dir in $root. Ignoring 
> this directory.")
> None
>   }
> }
> .toArray
> }
>   }
> {code}
> Therefore I think the
> {code}
> Some(createDirectory(root).getAbsolutePath())
> {code}
> should be replaced by something like (I am not an experienced Scala 
> programmer):
> {code}
> val dir = createDirectory(root)
> registerShutdownDeleteDir(dir)
> Some(dir.getAbsolutePath())
> {code}






[jira] [Commented] (SPARK-5970) Temporary directories are not removed (but their content is)

2015-02-24 Thread Milan Straka (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336070#comment-14336070
 ] 

Milan Straka commented on SPARK-5970:
-

I found a _Contributing to Spark_ guide, will do so.

> Temporary directories are not removed (but their content is)
> 
>
> Key: SPARK-5970
> URL: https://issues.apache.org/jira/browse/SPARK-5970
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
> Environment: Linux, 64bit
> spark-1.2.1-bin-hadoop2.4.tgz
>Reporter: Milan Straka
>
> How to reproduce: 
> - extract spark-1.2.1-bin-hadoop2.4.tgz
> - without any further configuration, run bin/pyspark
> - run sc.stop() and close python shell
> Expected results:
> - no temporary directories are left in /tmp
> Actual results:
> - four empty temporary directories are created in /tmp, for example after 
> {{ls -d /tmp/spark*}}:{code}
> /tmp/spark-1577b13d-4b9a-4e35-bac2-6e84e5605f53
> /tmp/spark-96084e69-77fd-42fb-ab10-e1fc74296fe3
> /tmp/spark-ab2ea237-d875-485e-b16c-5b0ac31bd753
> /tmp/spark-ddeb0363-4760-48a4-a189-81321898b146
> {code}
> The issue is caused by changes in {{util/Utils.scala}}. Consider the 
> {{createDirectory}}:
> {code}  /**
>* Create a directory inside the given parent directory. The directory is 
> guaranteed to be
>* newly created, and is not marked for automatic deletion.
>*/
>   def createDirectory(root: String, namePrefix: String = "spark"): File = ...
> {code}
> The {{createDirectory}} is used in two places. The first is in 
> {{createTempDir}}, where it is marked for automatic deletion:
> {code}
>   def createTempDir(
>   root: String = System.getProperty("java.io.tmpdir"),
>   namePrefix: String = "spark"): File = {
> val dir = createDirectory(root, namePrefix)
> registerShutdownDeleteDir(dir)
> dir
>   }
> {code}
> Nevertheless, it is also used in {{getOrCreateLocalRootDirs}}, where it is _not_ 
> marked for automatic deletion:
> {code}
>   private[spark] def getOrCreateLocalRootDirs(conf: SparkConf): Array[String] 
> = {
> if (isRunningInYarnContainer(conf)) {
>   // If we are in yarn mode, systems can have different disk layouts so 
> we must set it
>   // to what Yarn on this system said was available. Note this assumes 
> that Yarn has
>   // created the directories already, and that they are secured so that 
> only the
>   // user has access to them.
>   getYarnLocalDirs(conf).split(",")
> } else {
>   // In non-Yarn mode (or for the driver in yarn-client mode), we cannot 
> trust the user
>   // configuration to point to a secure directory. So create a 
> subdirectory with restricted
>   // permissions under each listed directory.
>   Option(conf.getenv("SPARK_LOCAL_DIRS"))
> .getOrElse(conf.get("spark.local.dir", 
> System.getProperty("java.io.tmpdir")))
> .split(",")
> .flatMap { root =>
>   try {
> val rootDir = new File(root)
> if (rootDir.exists || rootDir.mkdirs()) {
>   Some(createDirectory(root).getAbsolutePath())
> } else {
>   logError(s"Failed to create dir in $root. Ignoring this 
> directory.")
>   None
> }
>   } catch {
> case e: IOException =>
> logError(s"Failed to create local root dir in $root. Ignoring 
> this directory.")
> None
>   }
> }
> .toArray
> }
>   }
> {code}
> Therefore I think the
> {code}
> Some(createDirectory(root).getAbsolutePath())
> {code}
> should be replaced by something like (I am not an experienced Scala 
> programmer):
> {code}
> val dir = createDirectory(root)
> registerShutdownDeleteDir(dir)
> Some(dir.getAbsolutePath())
> {code}






[jira] [Commented] (SPARK-1823) ExternalAppendOnlyMap can still OOM if one key is very large

2015-02-24 Thread Saurabh Santhosh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336057#comment-14336057
 ] 

Saurabh Santhosh commented on SPARK-1823:
-

What is the status of this issue? Is there any other ticket for this?

> ExternalAppendOnlyMap can still OOM if one key is very large
> 
>
> Key: SPARK-1823
> URL: https://issues.apache.org/jira/browse/SPARK-1823
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Andrew Or
>
> If the values for one key do not collectively fit into memory, then the map 
> will still OOM when you merge the spilled contents back in.
> This is a problem especially for PySpark, since we hash the keys (Python 
> objects) before a shuffle, and there are only so many integers out there in 
> the world, so there could potentially be many collisions.
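To make the collision concern above concrete, here is a small, purely 
illustrative sketch. The deliberately tiny hash space is an assumption made for 
visibility only, not what PySpark actually uses: once distinct keys are reduced 
to a hashed key, many of them can land in the same bucket, and that single 
bucket's values are what the map must then hold together.

{code}
// Illustrative only: an artificially small hash space (hypothetical) makes the
// collision effect visible -- one "hashed key" accumulates values from many
// distinct original keys.
object HashCollisionDemo {
  def main(args: Array[String]): Unit = {
    val keys = (1 to 100000).map(i => s"key-$i")
    val hashSpace = 1000  // artificially small, for illustration
    val buckets = keys.groupBy(k => ((k.hashCode % hashSpace) + hashSpace) % hashSpace)
    val largest = buckets.values.map(_.size).max
    println(s"distinct keys: ${keys.size}, buckets: ${buckets.size}, largest bucket: $largest")
  }
}
{code}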






[jira] [Commented] (SPARK-4705) Driver retries in cluster mode always fail if event logging is enabled

2015-02-24 Thread Twinkle Sachdeva (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336034#comment-14336034
 ] 

Twinkle Sachdeva commented on SPARK-4705:
-

Hi [~vanzin]

Working on it.

Thanks,
Twinkle

> Driver retries in cluster mode always fail if event logging is enabled
> --
>
> Key: SPARK-4705
> URL: https://issues.apache.org/jira/browse/SPARK-4705
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.2.0
>Reporter: Marcelo Vanzin
> Attachments: Screen Shot 2015-02-10 at 6.27.49 pm.png, Updated UI - 
> II.png
>
>
> yarn-cluster mode will retry running the driver in certain failure modes. If 
> event logging is enabled, this will most probably fail, because:
> {noformat}
> Exception in thread "Driver" java.io.IOException: Log directory 
> hdfs://vanzin-krb-1.vpc.cloudera.com:8020/user/spark/applicationHistory/application_1417554558066_0003
>  already exists!
> at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:129)
> at org.apache.spark.util.FileLogger.start(FileLogger.scala:115)
> at 
> org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74)
> at org.apache.spark.SparkContext.(SparkContext.scala:353)
> {noformat}
> The event log path should be "more unique", or perhaps retries of the same app 
> should clean up the old logs first.
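One possible shape of a "more unique" path, sketched here purely as an 
illustration (this is not Spark's actual naming scheme), is to fold the 
application attempt number into the event log directory name so that a retried 
driver never reuses the previous attempt's directory:

{code}
// Hypothetical sketch: derive a per-attempt event log directory name.
def eventLogDir(baseDir: String, appId: String, attempt: Int): String =
  s"$baseDir/${appId}_attempt-$attempt"

// e.g. eventLogDir("hdfs:///user/spark/applicationHistory",
//                  "application_1417554558066_0003", 2)
// => "hdfs:///user/spark/applicationHistory/application_1417554558066_0003_attempt-2"
{code}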






[jira] [Commented] (SPARK-5837) HTTP 500 if try to access Spark UI in yarn-cluster or yarn-client mode

2015-02-24 Thread Mukesh Jha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336027#comment-14336027
 ] 

Mukesh Jha commented on SPARK-5837:
---

My Hadoop version is Hadoop 2.5.0-cdh5.3.0

> HTTP 500 if try to access Spark UI in yarn-cluster or yarn-client mode
> --
>
> Key: SPARK-5837
> URL: https://issues.apache.org/jira/browse/SPARK-5837
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.2.0, 1.2.1
>Reporter: Marco Capuccini
>
> Both Spark 1.2.0 and Spark 1.2.1 return this error when I try to access the 
> Spark UI if I run over yarn (version 2.4.0):
> HTTP ERROR 500
> Problem accessing /proxy/application_1423564210894_0017/. Reason:
> Connection refused
> Caused by:
> java.net.ConnectException: Connection refused
>   at java.net.PlainSocketImpl.socketConnect(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>   at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>   at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>   at java.net.Socket.connect(Socket.java:579)
>   at java.net.Socket.connect(Socket.java:528)
>   at java.net.Socket.(Socket.java:425)
>   at java.net.Socket.(Socket.java:280)
>   at 
> org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
>   at 
> org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122)
>   at 
> org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
>   at 
> org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
>   at 
> org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
>   at 
> org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
>   at 
> org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:346)
>   at 
> org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.proxyLink(WebAppProxyServlet.java:187)
>   at 
> org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:344)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>   at 
> org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>   at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:79)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
>   at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
>   at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>   at 
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
>   at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at 
> org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at 
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1192)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at 
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>   at 
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>   at 
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>   at 
> org.mortbay.je

[jira] [Commented] (SPARK-5837) HTTP 500 if try to access Spark UI in yarn-cluster or yarn-client mode

2015-02-24 Thread Mukesh Jha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336026#comment-14336026
 ] 

Mukesh Jha commented on SPARK-5837:
---

[~srowen] From the driver logs [3] I can see that the SparkUI started on the 
specified port. My YARN app tracking URL [1] points to that port, which is 
in turn redirected to the proxy URL [2], and that gives me 
java.net.BindException: Cannot assign requested address.
If there were a port conflict, the SparkUI start itself would have failed, but 
that is not the case.

[1] YARN application: application_1424814313649_0006, 
 name spark-realtime-MessageStoreWriter, type SPARK, user ciuser, 
 queue root.ciuser, state RUNNING, final status UNDEFINED, progress 10%, 
 tracking URL http://host21.cloud.com:44648
[2] ProxyURL: 
 http://host28.cloud.com:8088/proxy/application_1424814313649_0006/
[3] LOGS:
15/02/25 04:25:02 INFO util.Utils: Successfully started service 'SparkUI' on 
port 44648.
15/02/25 04:25:02 INFO ui.SparkUI: Started SparkUI at 
http://host21.cloud.com:44648
15/02/25 04:25:02 INFO cluster.YarnClusterScheduler: Created 
YarnClusterScheduler
15/02/25 04:25:02 INFO netty.NettyBlockTransferService: Server created on 41518

> HTTP 500 if try to access Spark UI in yarn-cluster or yarn-client mode
> --
>
> Key: SPARK-5837
> URL: https://issues.apache.org/jira/browse/SPARK-5837
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.2.0, 1.2.1
>Reporter: Marco Capuccini
>
> Both Spark 1.2.0 and Spark 1.2.1 return this error when I try to access the 
> Spark UI if I run over yarn (version 2.4.0):
> HTTP ERROR 500
> Problem accessing /proxy/application_1423564210894_0017/. Reason:
> Connection refused
> Caused by:
> java.net.ConnectException: Connection refused
>   at java.net.PlainSocketImpl.socketConnect(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>   at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>   at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>   at java.net.Socket.connect(Socket.java:579)
>   at java.net.Socket.connect(Socket.java:528)
>   at java.net.Socket.(Socket.java:425)
>   at java.net.Socket.(Socket.java:280)
>   at 
> org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
>   at 
> org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122)
>   at 
> org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
>   at 
> org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
>   at 
> org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
>   at 
> org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
>   at 
> org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:346)
>   at 
> org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.proxyLink(WebAppProxyServlet.java:187)
>   at 
> org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:344)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>   at 
> org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>   at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:79)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
>   at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
>   at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>   at 
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
>   at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
>   at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>   at 
> org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFi

[jira] [Resolved] (SPARK-5994) Python DataFrame documentation fixes

2015-02-24 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5994.
-
Resolution: Fixed

Issue resolved by pull request 4756
[https://github.com/apache/spark/pull/4756]

> Python DataFrame documentation fixes
> 
>
> Key: SPARK-5994
> URL: https://issues.apache.org/jira/browse/SPARK-5994
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.3.0
>
>
> - select empty should NOT be the same as select(*); make sure selectExpr 
>   behaves the same
> - join param documentation
> - link to source doesn't work in the Jekyll-generated file
> - cross reference of columns (i.e. enabling linking)
> - show(): move df example before df.show()
> - move tests in SQLContext out of docstring, otherwise the doc is too long
> - Column.desc and .asc don't have any documentation
> - in documentation, sort functions.* (and add the functions defined in the dict 
>   automatically)
> - if possible, add index for all functions and classes
> - GroupedData doesn't have any description in the table of contents for the 
>   pyspark.sql module






[jira] [Commented] (SPARK-5751) Flaky test: o.a.s.sql.hive.thriftserver.HiveThriftServer2Suite sometimes times out

2015-02-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335987#comment-14335987
 ] 

Apache Spark commented on SPARK-5751:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/4758

> Flaky test: o.a.s.sql.hive.thriftserver.HiveThriftServer2Suite sometimes 
> times out
> --
>
> Key: SPARK-5751
> URL: https://issues.apache.org/jira/browse/SPARK-5751
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Critical
>  Labels: flaky-test
> Fix For: 1.3.0
>
>
> The "Test JDBC query execution" test case times out occasionally, all other 
> test cases are just fine. The failure output only contains service startup 
> command line without any log output. Guess somehow the test case misses the 
> log file path.






[jira] [Created] (SPARK-5998) Make Spark Streaming checkpoint version compatible

2015-02-24 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-5998:
--

 Summary: Make Spark Streaming checkpoint version compatible
 Key: SPARK-5998
 URL: https://issues.apache.org/jira/browse/SPARK-5998
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Saisai Shao


Currently Spark Streaming's checkpointing (serialization) mechanism is not 
version compatible: if the code is updated or otherwise changed, the old 
checkpoint files cannot be read again. To support long-running streaming 
applications and their upgradability, making the checkpointing mechanism 
version compatible would be meaningful.

I'm creating this JIRA as a stub for the issue; I'm currently working on it, and 
a design doc will be posted later. 
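As a purely illustrative sketch of one way to get version compatibility (an 
assumption about the approach, not Spark's implementation or the forthcoming 
design), the serialized checkpoint data could carry a small format-version 
header so that newer code can pick a compatible reader:

{code}
// Hypothetical sketch: tag serialized checkpoint bytes with a format version.
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

val CheckpointFormatVersion = 2  // hypothetical current version

def writeVersioned(payload: Array[Byte]): Array[Byte] = {
  val buf = new ByteArrayOutputStream()
  val out = new DataOutputStream(buf)
  out.writeInt(CheckpointFormatVersion)  // header written before the payload
  out.write(payload)
  out.flush()
  buf.toByteArray
}

def readVersioned(bytes: Array[Byte]): (Int, Array[Byte]) = {
  val in = new DataInputStream(new ByteArrayInputStream(bytes))
  val version = in.readInt()             // dispatch on this to an older decoder if needed
  val payload = new Array[Byte](bytes.length - 4)
  in.readFully(payload)
  (version, payload)
}
{code}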






[jira] [Resolved] (SPARK-5286) Fail to drop an invalid table when using the data source API

2015-02-24 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5286.
-
Resolution: Fixed

Issue resolved by pull request 4755
[https://github.com/apache/spark/pull/4755]

> Fail to drop an invalid table when using the data source API
> 
>
> Key: SPARK-5286
> URL: https://issues.apache.org/jira/browse/SPARK-5286
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
> Fix For: 1.3.0
>
>
> Example
> {code}
> CREATE TABLE jsonTable
> USING org.apache.spark.sql.json.DefaultSource
> OPTIONS (
>   path 'it is not a path at all!'
> )
> DROP TABLE jsonTable
> {code}
> We will get 
> {code}
> [info]   com.google.common.util.concurrent.UncheckedExecutionException: 
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
> file:/Users/yhuai/Projects/Spark/spark/sql/hive/it is not a path at all!
> [info]   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4882)
> [info]   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4898)
> [info]   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:147)
> [info]   at 
> org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:241)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
> [info]   at scala.Option.getOrElse(Option.scala:120)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137)
> [info]   at 
> org.apache.spark.sql.hive.HiveContext$$anon$2.lookupRelation(HiveContext.scala:241)
> [info]   at org.apache.spark.sql.SQLContext.table(SQLContext.scala:332)
> [info]   at 
> org.apache.spark.sql.hive.execution.DropTable.run(commands.scala:57)
> [info]   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:53)
> [info]   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:53)
> [info]   at 
> org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:61)
> [info]   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:472)
> [info]   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:472)
> [info]   at 
> org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
> [info]   at org.apache.spark.sql.SchemaRDD.(SchemaRDD.scala:108)
> [info]   at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:73)
> [info]   at 
> org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply$mcV$sp(MetastoreDataSourcesSuite.scala:258)
> [info]   at 
> org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply(MetastoreDataSourcesSuite.scala:246)
> [info]   at 
> org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply(MetastoreDataSourcesSuite.scala:246)
> [info]   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
> [info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
> [info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
> [info]   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
> [info]   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
> [info]   at 
> org.apache.spark.sql.hive.MetastoreDataSourcesSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(MetastoreDataSourcesSuite.scala:36)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
> [info]   at 
> org.apache.spark.sql.hive.MetastoreDataSourcesSuite.runTest(MetastoreDataSourcesSuite.scala:36)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> [info]   at 
> org.

[jira] [Commented] (SPARK-5996) DataFrame.collect() doesn't recognize UDTs

2015-02-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335947#comment-14335947
 ] 

Apache Spark commented on SPARK-5996:
-

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/4757

> DataFrame.collect() doesn't recognize UDTs
> --
>
> Key: SPARK-5996
> URL: https://issues.apache.org/jira/browse/SPARK-5996
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, SQL
>Affects Versions: 1.3.0
>Reporter: Xiangrui Meng
>Assignee: Michael Armbrust
>Priority: Blocker
>
> {code}
> import org.apache.spark.mllib.linalg._
> case class Test(data: Vector)
> val df = sqlContext.createDataFrame(Seq(Test(Vectors.dense(1.0, 2.0
> df.collect()[0].getAs[Vector](0)
> {code}
> throws an exception. `collect()` returns `Row` instead of `Vector`.






[jira] [Updated] (SPARK-5775) GenericRow cannot be cast to SpecificMutableRow when nested data and partitioned table

2015-02-24 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-5775:

Assignee: Cheng Lian  (was: Michael Armbrust)

> GenericRow cannot be cast to SpecificMutableRow when nested data and 
> partitioned table
> --
>
> Key: SPARK-5775
> URL: https://issues.apache.org/jira/browse/SPARK-5775
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: Ayoub Benali
>Assignee: Cheng Lian
>Priority: Blocker
>  Labels: hivecontext, nested, parquet, partition
>
> Using the "LOAD" sql command in Hive context to load parquet files into a 
> partitioned table causes exceptions during query time. 
> The bug requires the table to have a column of *type Array of struct* and to 
> be *partitioned*. 
> The example below shows how to reproduce the bug; you can see that if the 
> table is not partitioned, the query works fine. 
> {noformat}
> scala> val data1 = """{"data_array":[{"field1":1,"field2":2}]}"""
> scala> val data2 = """{"data_array":[{"field1":3,"field2":4}]}"""
> scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil)
> scala> val schemaRDD = hiveContext.jsonRDD(jsonRDD)
> scala> schemaRDD.printSchema
> root
>  |-- data_array: array (nullable = true)
>  ||-- element: struct (containsNull = false)
>  |||-- field1: integer (nullable = true)
>  |||-- field2: integer (nullable = true)
> scala> hiveContext.sql("create external table if not exists 
> partitioned_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>) 
> Partitioned by (date STRING) STORED AS PARQUET Location 
> 'hdfs:///partitioned_table'")
> scala> hiveContext.sql("create external table if not exists 
> none_partitioned_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>) 
> STORED AS PARQUET Location 'hdfs:///none_partitioned_table'")
> scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_1")
> scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_2")
> scala> hiveContext.sql("LOAD DATA INPATH 
> 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_1' INTO TABLE 
> partitioned_table PARTITION(date='2015-02-12')")
> scala> hiveContext.sql("LOAD DATA INPATH 
> 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_2' INTO TABLE 
> none_partitioned_table")
> scala> hiveContext.sql("select data.field1 from none_partitioned_table 
> LATERAL VIEW explode(data_array) nestedStuff AS data").collect
> res23: Array[org.apache.spark.sql.Row] = Array([1], [3])
> scala> hiveContext.sql("select data.field1 from partitioned_table LATERAL 
> VIEW explode(data_array) nestedStuff AS data").collect
> 15/02/12 16:21:03 INFO ParseDriver: Parsing command: select data.field1 from 
> partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data
> 15/02/12 16:21:03 INFO ParseDriver: Parse Completed
> 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(260661) called with 
> curMem=0, maxMem=280248975
> 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18 stored as values in 
> memory (estimated size 254.6 KB, free 267.0 MB)
> 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(28615) called with 
> curMem=260661, maxMem=280248975
> 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18_piece0 stored as bytes 
> in memory (estimated size 27.9 KB, free 267.0 MB)
> 15/02/12 16:21:03 INFO BlockManagerInfo: Added broadcast_18_piece0 in memory 
> on *:51990 (size: 27.9 KB, free: 267.2 MB)
> 15/02/12 16:21:03 INFO BlockManagerMaster: Updated info of block 
> broadcast_18_piece0
> 15/02/12 16:21:03 INFO SparkContext: Created broadcast 18 from NewHadoopRDD 
> at ParquetTableOperations.scala:119
> 15/02/12 16:21:03 INFO FileInputFormat: Total input paths to process : 3
> 15/02/12 16:21:03 INFO ParquetInputFormat: Total input paths to process : 3
> 15/02/12 16:21:03 INFO FilteringParquetRowInputFormat: Using Task Side 
> Metadata Split Strategy
> 15/02/12 16:21:03 INFO SparkContext: Starting job: collect at 
> SparkPlan.scala:84
> 15/02/12 16:21:03 INFO DAGScheduler: Got job 12 (collect at 
> SparkPlan.scala:84) with 3 output partitions (allowLocal=false)
> 15/02/12 16:21:03 INFO DAGScheduler: Final stage: Stage 13(collect at 
> SparkPlan.scala:84)
> 15/02/12 16:21:03 INFO DAGScheduler: Parents of final stage: List()
> 15/02/12 16:21:03 INFO DAGScheduler: Missing parents: List()
> 15/02/12 16:21:03 INFO DAGScheduler: Submitting Stage 13 (MappedRDD[111] at 
> map at SparkPlan.scala:84), which has no missing parents
> 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(7632) called with 
> curMem=289276, maxMem=280248975
> 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_19 stored as values in 
> memory (estimated size 7.5 KB, free 267.0 MB)
> 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(4230) called with 
> curMem=296908, maxMem=28

[jira] [Updated] (SPARK-5993) Published Kafka-assembly JAR was empty in 1.3.0-RC1

2015-02-24 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-5993:
-
Fix Version/s: 1.3.0

> Published Kafka-assembly JAR was empty in 1.3.0-RC1
> ---
>
> Key: SPARK-5993
> URL: https://issues.apache.org/jira/browse/SPARK-5993
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Blocker
> Fix For: 1.3.0
>
>
> This is because the Maven build generated two JARs:
> 1. an empty JAR file (since kafka-assembly has no code of its own)
> 2. an assembly JAR file containing everything, in a different location than 1
> The Maven publishing plugin uploaded 1 and not 2. 
> Instead, if 2 is not configured to be generated in a different location, there 
> is only one JAR containing everything, and that is what gets published.






[jira] [Updated] (SPARK-5775) GenericRow cannot be cast to SpecificMutableRow when nested data and partitioned table

2015-02-24 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-5775:

Priority: Blocker  (was: Major)

> GenericRow cannot be cast to SpecificMutableRow when nested data and 
> partitioned table
> --
>
> Key: SPARK-5775
> URL: https://issues.apache.org/jira/browse/SPARK-5775
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: Ayoub Benali
>Assignee: Michael Armbrust
>Priority: Blocker
>  Labels: hivecontext, nested, parquet, partition
>
> Using the "LOAD" sql command in Hive context to load parquet files into a 
> partitioned table causes exceptions during query time. 
> The bug requires the table to have a column of *type Array of struct* and to 
> be *partitioned*. 
> The example below shows how to reproduce the bug; you can see that if the 
> table is not partitioned, the query works fine. 
> {noformat}
> scala> val data1 = """{"data_array":[{"field1":1,"field2":2}]}"""
> scala> val data2 = """{"data_array":[{"field1":3,"field2":4}]}"""
> scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil)
> scala> val schemaRDD = hiveContext.jsonRDD(jsonRDD)
> scala> schemaRDD.printSchema
> root
>  |-- data_array: array (nullable = true)
>  ||-- element: struct (containsNull = false)
>  |||-- field1: integer (nullable = true)
>  |||-- field2: integer (nullable = true)
> scala> hiveContext.sql("create external table if not exists 
> partitioned_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>) 
> Partitioned by (date STRING) STORED AS PARQUET Location 
> 'hdfs:///partitioned_table'")
> scala> hiveContext.sql("create external table if not exists 
> none_partitioned_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>) 
> STORED AS PARQUET Location 'hdfs:///none_partitioned_table'")
> scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_1")
> scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_2")
> scala> hiveContext.sql("LOAD DATA INPATH 
> 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_1' INTO TABLE 
> partitioned_table PARTITION(date='2015-02-12')")
> scala> hiveContext.sql("LOAD DATA INPATH 
> 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_2' INTO TABLE 
> none_partitioned_table")
> scala> hiveContext.sql("select data.field1 from none_partitioned_table 
> LATERAL VIEW explode(data_array) nestedStuff AS data").collect
> res23: Array[org.apache.spark.sql.Row] = Array([1], [3])
> scala> hiveContext.sql("select data.field1 from partitioned_table LATERAL 
> VIEW explode(data_array) nestedStuff AS data").collect
> 15/02/12 16:21:03 INFO ParseDriver: Parsing command: select data.field1 from 
> partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data
> 15/02/12 16:21:03 INFO ParseDriver: Parse Completed
> 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(260661) called with 
> curMem=0, maxMem=280248975
> 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18 stored as values in 
> memory (estimated size 254.6 KB, free 267.0 MB)
> 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(28615) called with 
> curMem=260661, maxMem=280248975
> 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18_piece0 stored as bytes 
> in memory (estimated size 27.9 KB, free 267.0 MB)
> 15/02/12 16:21:03 INFO BlockManagerInfo: Added broadcast_18_piece0 in memory 
> on *:51990 (size: 27.9 KB, free: 267.2 MB)
> 15/02/12 16:21:03 INFO BlockManagerMaster: Updated info of block 
> broadcast_18_piece0
> 15/02/12 16:21:03 INFO SparkContext: Created broadcast 18 from NewHadoopRDD 
> at ParquetTableOperations.scala:119
> 15/02/12 16:21:03 INFO FileInputFormat: Total input paths to process : 3
> 15/02/12 16:21:03 INFO ParquetInputFormat: Total input paths to process : 3
> 15/02/12 16:21:03 INFO FilteringParquetRowInputFormat: Using Task Side 
> Metadata Split Strategy
> 15/02/12 16:21:03 INFO SparkContext: Starting job: collect at 
> SparkPlan.scala:84
> 15/02/12 16:21:03 INFO DAGScheduler: Got job 12 (collect at 
> SparkPlan.scala:84) with 3 output partitions (allowLocal=false)
> 15/02/12 16:21:03 INFO DAGScheduler: Final stage: Stage 13(collect at 
> SparkPlan.scala:84)
> 15/02/12 16:21:03 INFO DAGScheduler: Parents of final stage: List()
> 15/02/12 16:21:03 INFO DAGScheduler: Missing parents: List()
> 15/02/12 16:21:03 INFO DAGScheduler: Submitting Stage 13 (MappedRDD[111] at 
> map at SparkPlan.scala:84), which has no missing parents
> 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(7632) called with 
> curMem=289276, maxMem=280248975
> 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_19 stored as values in 
> memory (estimated size 7.5 KB, free 267.0 MB)
> 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(4230) called with 
> curMem=296908, maxMem=280248975

[jira] [Assigned] (SPARK-5775) GenericRow cannot be cast to SpecificMutableRow when nested data and partitioned table

2015-02-24 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-5775:
---

Assignee: Michael Armbrust

> GenericRow cannot be cast to SpecificMutableRow when nested data and 
> partitioned table
> --
>
> Key: SPARK-5775
> URL: https://issues.apache.org/jira/browse/SPARK-5775
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: Ayoub Benali
>Assignee: Michael Armbrust
>  Labels: hivecontext, nested, parquet, partition
>
> Using the "LOAD" sql command in Hive context to load parquet files into a 
> partitioned table causes exceptions during query time. 
> The bug requires the table to have a column of *type Array of struct* and to 
> be *partitioned*. 
> The example below shows how to reproduce the bug; you can see that if the 
> table is not partitioned, the query works fine. 
> {noformat}
> scala> val data1 = """{"data_array":[{"field1":1,"field2":2}]}"""
> scala> val data2 = """{"data_array":[{"field1":3,"field2":4}]}"""
> scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil)
> scala> val schemaRDD = hiveContext.jsonRDD(jsonRDD)
> scala> schemaRDD.printSchema
> root
>  |-- data_array: array (nullable = true)
>  ||-- element: struct (containsNull = false)
>  |||-- field1: integer (nullable = true)
>  |||-- field2: integer (nullable = true)
> scala> hiveContext.sql("create external table if not exists 
> partitioned_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>) 
> Partitioned by (date STRING) STORED AS PARQUET Location 
> 'hdfs:///partitioned_table'")
> scala> hiveContext.sql("create external table if not exists 
> none_partitioned_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>) 
> STORED AS PARQUET Location 'hdfs:///none_partitioned_table'")
> scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_1")
> scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_2")
> scala> hiveContext.sql("LOAD DATA INPATH 
> 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_1' INTO TABLE 
> partitioned_table PARTITION(date='2015-02-12')")
> scala> hiveContext.sql("LOAD DATA INPATH 
> 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_2' INTO TABLE 
> none_partitioned_table")
> scala> hiveContext.sql("select data.field1 from none_partitioned_table 
> LATERAL VIEW explode(data_array) nestedStuff AS data").collect
> res23: Array[org.apache.spark.sql.Row] = Array([1], [3])
> scala> hiveContext.sql("select data.field1 from partitioned_table LATERAL 
> VIEW explode(data_array) nestedStuff AS data").collect
> 15/02/12 16:21:03 INFO ParseDriver: Parsing command: select data.field1 from 
> partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data
> 15/02/12 16:21:03 INFO ParseDriver: Parse Completed
> 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(260661) called with 
> curMem=0, maxMem=280248975
> 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18 stored as values in 
> memory (estimated size 254.6 KB, free 267.0 MB)
> 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(28615) called with 
> curMem=260661, maxMem=280248975
> 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18_piece0 stored as bytes 
> in memory (estimated size 27.9 KB, free 267.0 MB)
> 15/02/12 16:21:03 INFO BlockManagerInfo: Added broadcast_18_piece0 in memory 
> on *:51990 (size: 27.9 KB, free: 267.2 MB)
> 15/02/12 16:21:03 INFO BlockManagerMaster: Updated info of block 
> broadcast_18_piece0
> 15/02/12 16:21:03 INFO SparkContext: Created broadcast 18 from NewHadoopRDD 
> at ParquetTableOperations.scala:119
> 15/02/12 16:21:03 INFO FileInputFormat: Total input paths to process : 3
> 15/02/12 16:21:03 INFO ParquetInputFormat: Total input paths to process : 3
> 15/02/12 16:21:03 INFO FilteringParquetRowInputFormat: Using Task Side 
> Metadata Split Strategy
> 15/02/12 16:21:03 INFO SparkContext: Starting job: collect at 
> SparkPlan.scala:84
> 15/02/12 16:21:03 INFO DAGScheduler: Got job 12 (collect at 
> SparkPlan.scala:84) with 3 output partitions (allowLocal=false)
> 15/02/12 16:21:03 INFO DAGScheduler: Final stage: Stage 13(collect at 
> SparkPlan.scala:84)
> 15/02/12 16:21:03 INFO DAGScheduler: Parents of final stage: List()
> 15/02/12 16:21:03 INFO DAGScheduler: Missing parents: List()
> 15/02/12 16:21:03 INFO DAGScheduler: Submitting Stage 13 (MappedRDD[111] at 
> map at SparkPlan.scala:84), which has no missing parents
> 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(7632) called with 
> curMem=289276, maxMem=280248975
> 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_19 stored as values in 
> memory (estimated size 7.5 KB, free 267.0 MB)
> 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(4230) called with 
> curMem=296908, maxMem=280248975
> 15/02/12 16:21:03 INFO Memo

[jira] [Updated] (SPARK-5775) GenericRow cannot be cast to SpecificMutableRow when nested data and partitioned table

2015-02-24 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-5775:

Target Version/s: 1.3.0

> GenericRow cannot be cast to SpecificMutableRow when nested data and 
> partitioned table
> --
>
> Key: SPARK-5775
> URL: https://issues.apache.org/jira/browse/SPARK-5775
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: Ayoub Benali
>Assignee: Michael Armbrust
>  Labels: hivecontext, nested, parquet, partition
>
> Using the "LOAD" sql command in Hive context to load parquet files into a 
> partitioned table causes exceptions during query time. 
> The bug requires the table to have a column of *type Array of struct* and to 
> be *partitioned*. 
> The example bellow shows how to reproduce the bug and you can see that if the 
> table is not partitioned the query works fine. 
> {noformat}
> scala> val data1 = """{"data_array":[{"field1":1,"field2":2}]}"""
> scala> val data2 = """{"data_array":[{"field1":3,"field2":4}]}"""
> scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil)
> scala> val schemaRDD = hiveContext.jsonRDD(jsonRDD)
> scala> schemaRDD.printSchema
> root
>  |-- data_array: array (nullable = true)
>  ||-- element: struct (containsNull = false)
>  |||-- field1: integer (nullable = true)
>  |||-- field2: integer (nullable = true)
> scala> hiveContext.sql("create external table if not exists 
> partitioned_table(data_array ARRAY >) 
> Partitioned by (date STRING) STORED AS PARQUET Location 
> 'hdfs:///partitioned_table'")
> scala> hiveContext.sql("create external table if not exists 
> none_partitioned_table(data_array ARRAY >) 
> STORED AS PARQUET Location 'hdfs:///none_partitioned_table'")
> scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_1")
> scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_2")
> scala> hiveContext.sql("LOAD DATA INPATH 
> 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_1' INTO TABLE 
> partitioned_table PARTITION(date='2015-02-12')")
> scala> hiveContext.sql("LOAD DATA INPATH 
> 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_2' INTO TABLE 
> none_partitioned_table")
> scala> hiveContext.sql("select data.field1 from none_partitioned_table 
> LATERAL VIEW explode(data_array) nestedStuff AS data").collect
> res23: Array[org.apache.spark.sql.Row] = Array([1], [3])
> scala> hiveContext.sql("select data.field1 from partitioned_table LATERAL 
> VIEW explode(data_array) nestedStuff AS data").collect
> 15/02/12 16:21:03 INFO ParseDriver: Parsing command: select data.field1 from 
> partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data
> 15/02/12 16:21:03 INFO ParseDriver: Parse Completed
> 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(260661) called with 
> curMem=0, maxMem=280248975
> 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18 stored as values in 
> memory (estimated size 254.6 KB, free 267.0 MB)
> 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(28615) called with 
> curMem=260661, maxMem=280248975
> 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18_piece0 stored as bytes 
> in memory (estimated size 27.9 KB, free 267.0 MB)
> 15/02/12 16:21:03 INFO BlockManagerInfo: Added broadcast_18_piece0 in memory 
> on *:51990 (size: 27.9 KB, free: 267.2 MB)
> 15/02/12 16:21:03 INFO BlockManagerMaster: Updated info of block 
> broadcast_18_piece0
> 15/02/12 16:21:03 INFO SparkContext: Created broadcast 18 from NewHadoopRDD 
> at ParquetTableOperations.scala:119
> 15/02/12 16:21:03 INFO FileInputFormat: Total input paths to process : 3
> 15/02/12 16:21:03 INFO ParquetInputFormat: Total input paths to process : 3
> 15/02/12 16:21:03 INFO FilteringParquetRowInputFormat: Using Task Side 
> Metadata Split Strategy
> 15/02/12 16:21:03 INFO SparkContext: Starting job: collect at 
> SparkPlan.scala:84
> 15/02/12 16:21:03 INFO DAGScheduler: Got job 12 (collect at 
> SparkPlan.scala:84) with 3 output partitions (allowLocal=false)
> 15/02/12 16:21:03 INFO DAGScheduler: Final stage: Stage 13(collect at 
> SparkPlan.scala:84)
> 15/02/12 16:21:03 INFO DAGScheduler: Parents of final stage: List()
> 15/02/12 16:21:03 INFO DAGScheduler: Missing parents: List()
> 15/02/12 16:21:03 INFO DAGScheduler: Submitting Stage 13 (MappedRDD[111] at 
> map at SparkPlan.scala:84), which has no missing parents
> 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(7632) called with 
> curMem=289276, maxMem=280248975
> 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_19 stored as values in 
> memory (estimated size 7.5 KB, free 267.0 MB)
> 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(4230) called with 
> curMem=296908, maxMem=280248975
> 15/02/12 16:21:03 INFO MemoryStore: B

[jira] [Resolved] (SPARK-5985) sortBy -> orderBy in Python

2015-02-24 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5985.
-
Resolution: Fixed

Issue resolved by pull request 4752
[https://github.com/apache/spark/pull/4752]

> sortBy -> orderBy in Python
> ---
>
> Key: SPARK-5985
> URL: https://issues.apache.org/jira/browse/SPARK-5985
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.3.0
>
>
> Seems like a mistake that we included sortBy but not orderBy in Python 
> DataFrame.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5960) Allow AWS credentials to be passed to KinesisUtils.createStream()

2015-02-24 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335846#comment-14335846
 ] 

Chris Fregly commented on SPARK-5960:
-

pushing this up to 1.3.1

> Allow AWS credentials to be passed to KinesisUtils.createStream()
> -
>
> Key: SPARK-5960
> URL: https://issues.apache.org/jira/browse/SPARK-5960
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.1.0
>Reporter: Chris Fregly
>Assignee: Chris Fregly
>
> While IAM roles are preferable, we're seeing a lot of cases where we need to 
> pass AWS credentials when creating the KinesisReceiver.
> Notes:
> * Make sure we don't log the credentials anywhere
> * Maintain compatibility with existing KinesisReceiver-based code.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5960) Allow AWS credentials to be passed to KinesisUtils.createStream()

2015-02-24 Thread Chris Fregly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Fregly updated SPARK-5960:

Target Version/s: 1.3.1  (was: 1.4.0)

> Allow AWS credentials to be passed to KinesisUtils.createStream()
> -
>
> Key: SPARK-5960
> URL: https://issues.apache.org/jira/browse/SPARK-5960
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.1.0
>Reporter: Chris Fregly
>Assignee: Chris Fregly
>
> While IAM roles are preferable, we're seeing a lot of cases where we need to 
> pass AWS credentials when creating the KinesisReceiver.
> Notes:
> * Make sure we don't log the credentials anywhere
> * Maintain compatibility with existing KinesisReceiver-based code.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5994) Python DataFrame documentation fixes

2015-02-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335845#comment-14335845
 ] 

Apache Spark commented on SPARK-5994:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/4756

> Python DataFrame documentation fixes
> 
>
> Key: SPARK-5994
> URL: https://issues.apache.org/jira/browse/SPARK-5994
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.3.0
>
>
> select empty should NOT be the same as select(*). make sure selectExpr is 
> behaving the same.
> join param documentation
> link to source doesn't work in jekyll generated file
> cross reference of columns (i.e. enabling linking)
> show(): move df example before df.show()
> move tests in SQLContext out of docstring otherwise doc is too long
> Column.desc and .asc doesn't have any documentation
> in documentation, sort functions.* (and add the functions defined in the dict 
> automatically)
> if possible, add index for all functions and classes
> GroupedData doesn't have any description in the table of contents for 
> pyspark.sql module



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5997) Increase partition count without performing a shuffle

2015-02-24 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-5997:
--
Description: 
When decreasing partition count with rdd.repartition() or rdd.coalesce(), the 
user has the ability to choose whether or not to perform a shuffle.  However 
when increasing partition count there is no option of whether to perform a 
shuffle or not -- a shuffle always occurs.

This Jira is to create a {{rdd.repartition(largeNum, shuffle=false)}} call that 
performs a repartition to a higher partition count without a shuffle.

The motivating use case is to decrease the size of an individual partition 
enough that the .toLocalIterator has significantly reduced memory pressure on 
the driver, as it loads a partition at a time into the driver.

  was:
When decreasing partition count with rdd.repartition() or rdd.coalesce(), the 
user has the ability to choose whether or not to perform a shuffle.  However 
when increasing partition count there is no option of whether to perform a 
shuffle or not.

This Jira is to create a {{rdd.repartition(largeNum, shuffle=false)}} call that 
performs a repartition to a higher partition count without a shuffle.

The motivating use case is to decrease the size of an individual partition 
enough that the .toLocalIterator has significantly reduced memory pressure on 
the driver, as it loads a partition at a time into the driver.


> Increase partition count without performing a shuffle
> -
>
> Key: SPARK-5997
> URL: https://issues.apache.org/jira/browse/SPARK-5997
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Andrew Ash
>
> When decreasing partition count with rdd.repartition() or rdd.coalesce(), the 
> user has the ability to choose whether or not to perform a shuffle.  However 
> when increasing partition count there is no option of whether to perform a 
> shuffle or not -- a shuffle always occurs.
> This Jira is to create a {{rdd.repartition(largeNum, shuffle=false)}} call 
> that performs a repartition to a higher partition count without a shuffle.
> The motivating use case is to decrease the size of an individual partition 
> enough that the .toLocalIterator has significantly reduced memory pressure on 
> the driver, as it loads a partition at a time into the driver.
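
To make the proposal concrete, here is a short Scala sketch of the options available 
today versus the call this JIRA asks for; only the commented-out final line is 
hypothetical.

{code}
// Existing RDD API vs. the behavior proposed in this JIRA.
val rdd = sc.parallelize(1 to 1000000, 10)   // 10 partitions

rdd.repartition(100)                 // increases the count, but always shuffles
rdd.coalesce(100, shuffle = true)    // same: increases the count via a shuffle
rdd.coalesce(100, shuffle = false)   // no shuffle, but cannot go above the current 10

// Proposed (hypothetical, does not exist today): split partitions locally into
// more, smaller partitions without moving data between executors.
// rdd.repartition(100, shuffle = false)
{code}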



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5286) Fail to drop an invalid table when using the data source API

2015-02-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335835#comment-14335835
 ] 

Apache Spark commented on SPARK-5286:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/4755

> Fail to drop an invalid table when using the data source API
> 
>
> Key: SPARK-5286
> URL: https://issues.apache.org/jira/browse/SPARK-5286
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
> Fix For: 1.3.0
>
>
> Example
> {code}
> CREATE TABLE jsonTable
> USING org.apache.spark.sql.json.DefaultSource
> OPTIONS (
>   path 'it is not a path at all!'
> )
> DROP TABLE jsonTable
> {code}
> We will get 
> {code}
> [info]   com.google.common.util.concurrent.UncheckedExecutionException: 
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
> file:/Users/yhuai/Projects/Spark/spark/sql/hive/it is not a path at all!
> [info]   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4882)
> [info]   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4898)
> [info]   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:147)
> [info]   at 
> org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:241)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
> [info]   at scala.Option.getOrElse(Option.scala:120)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137)
> [info]   at 
> org.apache.spark.sql.hive.HiveContext$$anon$2.lookupRelation(HiveContext.scala:241)
> [info]   at org.apache.spark.sql.SQLContext.table(SQLContext.scala:332)
> [info]   at 
> org.apache.spark.sql.hive.execution.DropTable.run(commands.scala:57)
> [info]   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:53)
> [info]   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:53)
> [info]   at 
> org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:61)
> [info]   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:472)
> [info]   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:472)
> [info]   at 
> org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
> [info]   at org.apache.spark.sql.SchemaRDD.(SchemaRDD.scala:108)
> [info]   at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:73)
> [info]   at 
> org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply$mcV$sp(MetastoreDataSourcesSuite.scala:258)
> [info]   at 
> org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply(MetastoreDataSourcesSuite.scala:246)
> [info]   at 
> org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply(MetastoreDataSourcesSuite.scala:246)
> [info]   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
> [info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
> [info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
> [info]   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
> [info]   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
> [info]   at 
> org.apache.spark.sql.hive.MetastoreDataSourcesSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(MetastoreDataSourcesSuite.scala:36)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
> [info]   at 
> org.apache.spark.sql.hive.MetastoreDataSourcesSuite.runTest(MetastoreDataSourcesSuite.scala:36)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(Fu

[jira] [Updated] (SPARK-5286) Fail to drop an invalid table when using the data source API

2015-02-24 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-5286:

Priority: Blocker  (was: Critical)

> Fail to drop an invalid table when using the data source API
> 
>
> Key: SPARK-5286
> URL: https://issues.apache.org/jira/browse/SPARK-5286
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
> Fix For: 1.3.0
>
>
> Example
> {code}
> CREATE TABLE jsonTable
> USING org.apache.spark.sql.json.DefaultSource
> OPTIONS (
>   path 'it is not a path at all!'
> )
> DROP TABLE jsonTable
> {code}
> We will get 
> {code}
> [info]   com.google.common.util.concurrent.UncheckedExecutionException: 
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
> file:/Users/yhuai/Projects/Spark/spark/sql/hive/it is not a path at all!
> [info]   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4882)
> [info]   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4898)
> [info]   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:147)
> [info]   at 
> org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:241)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
> [info]   at scala.Option.getOrElse(Option.scala:120)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137)
> [info]   at 
> org.apache.spark.sql.hive.HiveContext$$anon$2.lookupRelation(HiveContext.scala:241)
> [info]   at org.apache.spark.sql.SQLContext.table(SQLContext.scala:332)
> [info]   at 
> org.apache.spark.sql.hive.execution.DropTable.run(commands.scala:57)
> [info]   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:53)
> [info]   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:53)
> [info]   at 
> org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:61)
> [info]   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:472)
> [info]   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:472)
> [info]   at 
> org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
> [info]   at org.apache.spark.sql.SchemaRDD.(SchemaRDD.scala:108)
> [info]   at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:73)
> [info]   at 
> org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply$mcV$sp(MetastoreDataSourcesSuite.scala:258)
> [info]   at 
> org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply(MetastoreDataSourcesSuite.scala:246)
> [info]   at 
> org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply(MetastoreDataSourcesSuite.scala:246)
> [info]   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
> [info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
> [info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
> [info]   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
> [info]   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
> [info]   at 
> org.apache.spark.sql.hive.MetastoreDataSourcesSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(MetastoreDataSourcesSuite.scala:36)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
> [info]   at 
> org.apache.spark.sql.hive.MetastoreDataSourcesSuite.runTest(MetastoreDataSourcesSuite.scala:36)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> [info]   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
> [inf

[jira] [Created] (SPARK-5997) Increase partition count without performing a shuffle

2015-02-24 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-5997:
-

 Summary: Increase partition count without performing a shuffle
 Key: SPARK-5997
 URL: https://issues.apache.org/jira/browse/SPARK-5997
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Andrew Ash


When decreasing partition count with rdd.repartition() or rdd.coalesce(), the 
user has the ability to choose whether or not to perform a shuffle.  However 
when increasing partition count there is no option of whether to perform a 
shuffle or not.

This Jira is to create a {{rdd.repartition(largeNum, shuffle=false)}} call that 
performs a repartition to a higher partition count without a shuffle.

The motivating use case is to decrease the size of an individual partition 
enough that the .toLocalIterator has significantly reduced memory pressure on 
the driver, as it loads a partition at a time into the driver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-5286) Fail to drop an invalid table when using the data source API

2015-02-24 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai reopened SPARK-5286:
-

I am reopening this issue because we need to catch all Throwables instead of just 
Exceptions.
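
For readers following along, a minimal generic Scala sketch of the distinction 
(illustrative names only, not the actual DropTable code):

{code}
// Illustrative only -- not the actual DropTable implementation.
// `case e: Exception` misses Errors such as AssertionError or NoClassDefFoundError,
// so a lookup failure surfacing as one of those would still abort the DROP TABLE.
def dropTableIgnoringLookupFailures(tableName: String)(lookup: String => Unit): Unit = {
  try {
    lookup(tableName)      // may throw anything if the table's stored path is invalid
  } catch {
    case e: Throwable =>   // broader than `case e: Exception`
      println(s"Ignoring error while looking up $tableName: ${e.getMessage}")
  }
  // ... proceed to drop the metastore entry ...
}
{code}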

> Fail to drop an invalid table when using the data source API
> 
>
> Key: SPARK-5286
> URL: https://issues.apache.org/jira/browse/SPARK-5286
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
> Fix For: 1.3.0
>
>
> Example
> {code}
> CREATE TABLE jsonTable
> USING org.apache.spark.sql.json.DefaultSource
> OPTIONS (
>   path 'it is not a path at all!'
> )
> DROP TABLE jsonTable
> {code}
> We will get 
> {code}
> [info]   com.google.common.util.concurrent.UncheckedExecutionException: 
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
> file:/Users/yhuai/Projects/Spark/spark/sql/hive/it is not a path at all!
> [info]   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4882)
> [info]   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4898)
> [info]   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:147)
> [info]   at 
> org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:241)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
> [info]   at scala.Option.getOrElse(Option.scala:120)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137)
> [info]   at 
> org.apache.spark.sql.hive.HiveContext$$anon$2.lookupRelation(HiveContext.scala:241)
> [info]   at org.apache.spark.sql.SQLContext.table(SQLContext.scala:332)
> [info]   at 
> org.apache.spark.sql.hive.execution.DropTable.run(commands.scala:57)
> [info]   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:53)
> [info]   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:53)
> [info]   at 
> org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:61)
> [info]   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:472)
> [info]   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:472)
> [info]   at 
> org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
> [info]   at org.apache.spark.sql.SchemaRDD.(SchemaRDD.scala:108)
> [info]   at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:73)
> [info]   at 
> org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply$mcV$sp(MetastoreDataSourcesSuite.scala:258)
> [info]   at 
> org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply(MetastoreDataSourcesSuite.scala:246)
> [info]   at 
> org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply(MetastoreDataSourcesSuite.scala:246)
> [info]   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
> [info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
> [info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
> [info]   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
> [info]   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
> [info]   at 
> org.apache.spark.sql.hive.MetastoreDataSourcesSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(MetastoreDataSourcesSuite.scala:36)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
> [info]   at 
> org.apache.spark.sql.hive.MetastoreDataSourcesSuite.runTest(MetastoreDataSourcesSuite.scala:36)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> [info]   at 
> org.scalatest.SuperEngine$$an

[jira] [Commented] (SPARK-5978) Spark examples cannot compile with Hadoop 2

2015-02-24 Thread Michael Nazario (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335827#comment-14335827
 ] 

Michael Nazario commented on SPARK-5978:


If I get the chance, I'll look into it. I'm not planning to put any time into 
this right now, since I no longer depend on the correctness of the example code. 
However, it would definitely be nice for Spark users to be able to run this 
example code on a Hadoop 2 build like CDH4.

> Spark examples cannot compile with Hadoop 2
> ---
>
> Key: SPARK-5978
> URL: https://issues.apache.org/jira/browse/SPARK-5978
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, PySpark
>Affects Versions: 1.2.0, 1.2.1
>Reporter: Michael Nazario
>  Labels: hadoop-version
>
> This is a regression from Spark 1.1.1.
> The Spark examples include an example of an Avro converter for PySpark. 
> While debugging a problem, I discovered that even though you can build Spark 
> 1.2.0 with Hadoop 2, an HBase dependency elsewhere in the examples code still 
> depends on Hadoop 1.
> An easy fix would be to separate the examples into Hadoop-specific versions. 
> Another way would be to fix the HBase dependencies so that they don't rely on 
> Hadoop 1-specific code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5979) `--packages` should not exclude spark streaming assembly jars for kafka and flume

2015-02-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335823#comment-14335823
 ] 

Apache Spark commented on SPARK-5979:
-

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/4754

> `--packages` should not exclude spark streaming assembly jars for kafka and 
> flume 
> --
>
> Key: SPARK-5979
> URL: https://issues.apache.org/jira/browse/SPARK-5979
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Submit
>Affects Versions: 1.3.0
>Reporter: Burak Yavuz
>Priority: Blocker
>
> Currently `--packages` has an exclude rule for all dependencies with the 
> groupId `org.apache.spark`, assuming that these are packaged inside the 
> spark-assembly JAR. This is not the case, and more fine-grained filtering is 
> required.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5816) Add huge backward compatibility warning in DriverWrapper

2015-02-24 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-5816.

  Resolution: Fixed
   Fix Version/s: 1.3.0
Target Version/s: 1.3.0  (was: 1.3.0, 1.4.0)

> Add huge backward compatibility warning in DriverWrapper
> 
>
> Key: SPARK-5816
> URL: https://issues.apache.org/jira/browse/SPARK-5816
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.3.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
> Fix For: 1.3.0
>
>
> As of Spark 1.3, we provide backward and forward compatibility in standalone 
> cluster mode through the REST submission gateway. However, it still goes 
> through the legacy o.a.s.deploy.DriverWrapper, and the semantics of the 
> command line arguments there must not change. For instance, this was broken 
> in commit 20a6013106b56a1a1cc3e8cda092330ffbe77cc3.
> There is currently no warning about this in the class, so we should add 
> one before it's too late.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5996) DataFrame.collect() doesn't recognize UDTs

2015-02-24 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-5996:


 Summary: DataFrame.collect() doesn't recognize UDTs
 Key: SPARK-5996
 URL: https://issues.apache.org/jira/browse/SPARK-5996
 Project: Spark
  Issue Type: Bug
  Components: MLlib, SQL
Affects Versions: 1.3.0
Reporter: Xiangrui Meng
Assignee: Michael Armbrust
Priority: Blocker


{code}
import org.apache.spark.mllib.linalg._
case class Test(data: Vector)
val df = sqlContext.createDataFrame(Seq(Test(Vectors.dense(1.0, 2.0))))
df.collect()(0).getAs[Vector](0)
{code}

throws an exception. `collect()` returns `Row` instead of `Vector`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5845) Time to cleanup spilled shuffle files not included in shuffle write time

2015-02-24 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335757#comment-14335757
 ] 

Kay Ousterhout commented on SPARK-5845:
---

I'd go with (b) -- it's fine (and good, I think!) to include the time to clean up 
the partition writers, since this also involves interacting with the disk.  
Thanks!

> Time to cleanup spilled shuffle files not included in shuffle write time
> 
>
> Key: SPARK-5845
> URL: https://issues.apache.org/jira/browse/SPARK-5845
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.3.0, 1.2.1
>Reporter: Kay Ousterhout
>Assignee: Ilya Ganelin
>Priority: Minor
>
> When the disk is contended, I've observed cases when it takes as long as 7 
> seconds to clean up all of the intermediate spill files for a shuffle (when 
> using the sort based shuffle, but bypassing merging because there are <=200 
> shuffle partitions).  This is even when the shuffle data is non-huge (152MB 
> written from one of the tasks where I observed this).  This is effectively 
> part of the shuffle write time (because it's a necessary side effect of 
> writing data to disk) so should be added to the shuffle write time to 
> facilitate debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5995) Make ML Prediction Developer APIs public

2015-02-24 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-5995:


 Summary: Make ML Prediction Developer APIs public
 Key: SPARK-5995
 URL: https://issues.apache.org/jira/browse/SPARK-5995
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley


Previously, some Developer APIs were added to spark.ml for classification and 
regression to make it easier to add new algorithms and models: [SPARK-4789]  
There are ongoing discussions about the best design of the API.  This JIRA is 
to continue that discussion and try to finalize those Developer APIs so that 
they can be made public.

Please see [this design doc from SPARK-4789 | 
https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]
 for details on the original API design.

Some issues under debate:
* Should there be strongly typed APIs for fit()?
* Should the strongly typed API for transform() be public (vs. protected)?
* What transformation methods should the API make developers implement for 
classification?  (See details below.)
* Should there be a way to transform a single Row (instead of only DataFrames)?

More on "What transformation methods should the API make developers implement 
for classification?":
* Goals:
** Optimize transform: Make it fast, and make it output only the desired 
columns.
** Easy development
** Support Classifier, Regressor, and ProbabilisticClassifier
* (currently) Developers implement predictX methods for each output column X.  
They may override transform() to optimize speed.
** Pros: predictX is easy to understand.
** Cons: An optimized transform() is annoying to write.
* Developers implement more basic transformation methods, such as features2raw, 
raw2pred, raw2prob (see the sketch after this list).
** Pros: Abstract classes may implement optimized transform().
** Cons: Different types of predictors require different methods:
*** Predictor and Regressor: features2pred
*** Classifier: features2raw, raw2pred
*** ProbabilisticClassifier: raw2prob
* Developers implement a single predict() method which takes parameters for 
what columns to output (returning tuple or some type with None for missing 
values).  Abstract classes take the outputs they want and put them into columns.
** Pros: Developers only write 1 method and can optimize it as much as they 
want.  It could be more optimized than the previous 2 options; e.g., if 
LogisticRegressionModel only wants the prediction, then it never has to 
construct intermediate results such as the vector of raw predictions.
** Cons: predict() will have a different signature for different abstractions, 
based on the possible output columns.
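
A rough Scala sketch of the second option above, using the method names from this 
discussion purely as illustration (this is not the actual spark.ml API):

{code}
// Sketch of the "basic transformation methods" option. The names features2raw,
// raw2pred, and raw2prob come from this JIRA's discussion, not the real spark.ml API.
trait Vec   // stand-in for a feature / raw-score / probability vector type

abstract class ClassifierModelSketch {
  /** Raw (margin-like) scores for one feature vector. */
  protected def features2raw(features: Vec): Vec
  /** Raw scores to a predicted label index, e.g. argmax. */
  protected def raw2pred(raw: Vec): Double
  /** The abstract class can build an optimized transform() on top of these. */
  def predict(features: Vec): Double = raw2pred(features2raw(features))
}

abstract class ProbabilisticClassifierModelSketch extends ClassifierModelSketch {
  /** Raw scores to class probabilities. */
  protected def raw2prob(raw: Vec): Vec
  def predictProbabilities(features: Vec): Vec = raw2prob(features2raw(features))
}
{code}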




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5994) Python DataFrame documentation fixes

2015-02-24 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335750#comment-14335750
 ] 

Reynold Xin commented on SPARK-5994:


Added one more: "GroupedData doesn't have any description in the table of 
contents for pyspark.sql module"

> Python DataFrame documentation fixes
> 
>
> Key: SPARK-5994
> URL: https://issues.apache.org/jira/browse/SPARK-5994
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.3.0
>
>
> select empty should NOT be the same as select(*). make sure selectExpr is 
> behaving the same.
> join param documentation
> link to source doesn't work in jekyll generated file
> cross reference of columns (i.e. enabling linking)
> show(): move df example before df.show()
> move tests in SQLContext out of docstring otherwise doc is too long
> Column.desc and .asc doesn't have any documentation
> in documentation, sort functions.* (and add the functions defined in the dict 
> automatically)
> if possible, add index for all functions and classes
> GroupedData doesn't have any description in the table of contents for 
> pyspark.sql module



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5994) Python DataFrame documentation fixes

2015-02-24 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-5994:
---
Description: 
select empty should NOT be the same as select(*). make sure selectExpr is 
behaving the same.

join param documentation

link to source doesn't work in jekyll generated file

cross reference of columns (i.e. enabling linking)

show(): move df example before df.show()

move tests in SQLContext out of docstring otherwise doc is too long

Column.desc and .asc doesn't have any documentation

in documentation, sort functions.* (and add the functions defined in the dict 
automatically)

if possible, add index for all functions and classes

GroupedData doesn't have any description in the table of contents for 
pyspark.sql module

  was:
select empty should NOT be the same as select(*). make sure selectExpr is 
behaving the same.

join param documentation

link to source doesn't work in jekyll generated file

cross reference of columns (i.e. enabling linking)

show(): move df example before df.show()

move tests in SQLContext out of docstring otherwise doc is too long

Column.desc and .asc doesn't have any documentation

in documentation, sort functions.* (and add the functions defined in the dict 
automatically)

if possible, add index for all functions and classes



> Python DataFrame documentation fixes
> 
>
> Key: SPARK-5994
> URL: https://issues.apache.org/jira/browse/SPARK-5994
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.3.0
>
>
> select empty should NOT be the same as select(*). make sure selectExpr is 
> behaving the same.
> join param documentation
> link to source doesn't work in jekyll generated file
> cross reference of columns (i.e. enabling linking)
> show(): move df example before df.show()
> move tests in SQLContext out of docstring otherwise doc is too long
> Column.desc and .asc doesn't have any documentation
> in documentation, sort functions.* (and add the functions defined in the dict 
> automatically)
> if possible, add index for all functions and classes
> GroupedData doesn't have any description in the table of contents for 
> pyspark.sql module



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5845) Time to cleanup spilled shuffle files not included in shuffle write time

2015-02-24 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335751#comment-14335751
 ] 

Ilya Ganelin commented on SPARK-5845:
-

My mistake - missed your comment about the spill files in the detailed 
description. Given that we're interested in cleaning up the spill files which 
appear to be cleaned up in ExternalSorter.stop() (please correct me if I'm 
wrong), I would like to either 

a) Pass the context to the stop() method - this is possible since the 
SortShuffleWriter has visibility of the TaskContext (which in turn stores the 
metrics we're interested in).

b) (My preference since it won't break the existing interface) Surround 
sorter.stop() on line 91 of SortShuffleWriter.scala with a timer. The only 
downside to this second approach is that it will also include the cleanup of 
the partition writers. I'm not sure whether that should be included in this 
time computation. 
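
A minimal sketch of option (b): time the cleanup call and report the elapsed 
nanoseconds to whatever records shuffle write time (the metrics call in the usage 
comment is an assumption, not the exact Spark internals):

{code}
// Generic helper: time a cleanup action and hand the elapsed nanoseconds to a callback.
def timedCleanup(cleanup: () => Unit)(recordNanos: Long => Unit): Unit = {
  val start = System.nanoTime()
  try {
    cleanup()                                // e.g. sorter.stop(), which deletes spill files
  } finally {
    recordNanos(System.nanoTime() - start)   // e.g. add to the task's shuffle write time
  }
}

// Inside SortShuffleWriter this would look roughly like (names are assumptions):
// timedCleanup(() => sorter.stop())(nanos => writeMetrics.incShuffleWriteTime(nanos))
{code}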

> Time to cleanup spilled shuffle files not included in shuffle write time
> 
>
> Key: SPARK-5845
> URL: https://issues.apache.org/jira/browse/SPARK-5845
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.3.0, 1.2.1
>Reporter: Kay Ousterhout
>Assignee: Ilya Ganelin
>Priority: Minor
>
> When the disk is contended, I've observed cases when it takes as long as 7 
> seconds to clean up all of the intermediate spill files for a shuffle (when 
> using the sort based shuffle, but bypassing merging because there are <=200 
> shuffle partitions).  This is even when the shuffle data is non-huge (152MB 
> written from one of the tasks where I observed this).  This is effectively 
> part of the shuffle write time (because it's a necessary side effect of 
> writing data to disk) so should be added to the shuffle write time to 
> facilitate debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5994) Python DataFrame documentation fixes

2015-02-24 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-5994:
--

 Summary: Python DataFrame documentation fixes
 Key: SPARK-5994
 URL: https://issues.apache.org/jira/browse/SPARK-5994
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Davies Liu
Priority: Blocker


select empty should NOT be the same as select(*). make sure selectExpr is 
behaving the same.

join param documentation

link to source doesn't work in jekyll generated file

cross reference of columns (i.e. enabling linking)

show(): move df example before df.show()

move tests in SQLContext out of docstring otherwise doc is too long

Column.desc and .asc doesn't have any documentation

in documentation, sort functions.* (and add the functions defined in the dict 
automatically)

if possible, add index for all functions and classes




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3956) Python API for Distributed Matrix

2015-02-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-3956:
-
Assignee: (was: Davies Liu)

> Python API for Distributed Matrix
> -
>
> Key: SPARK-3956
> URL: https://issues.apache.org/jira/browse/SPARK-3956
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Davies Liu
>
> Python API for distributed matrix



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5993) Published Kafka-assembly JAR was empty in 1.3.0-RC1

2015-02-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335736#comment-14335736
 ] 

Apache Spark commented on SPARK-5993:
-

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/4753

> Published Kafka-assembly JAR was empty in 1.3.0-RC1
> ---
>
> Key: SPARK-5993
> URL: https://issues.apache.org/jira/browse/SPARK-5993
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Blocker
>
> This is because the Maven build generated two JARs:
> 1. an empty JAR file (since kafka-assembly has no code of its own)
> 2. an assembly JAR file containing everything, written to a different location than 1
> The Maven publishing plugin uploaded 1 and not 2.
> Instead, if 2 is not configured to go to a different location, there is only one 
> JAR containing everything, and that is what gets published.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5993) Published Kafka-assembly JAR was empty in 1.3.0-RC1

2015-02-24 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-5993:


 Summary: Published Kafka-assembly JAR was empty in 1.3.0-RC1
 Key: SPARK-5993
 URL: https://issues.apache.org/jira/browse/SPARK-5993
 Project: Spark
  Issue Type: Bug
  Components: Build, Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Blocker


This is because the Maven build generated two JARs:
1. an empty JAR file (since kafka-assembly has no code of its own)
2. an assembly JAR file containing everything, written to a different location than 1

The Maven publishing plugin uploaded 1 and not 2.

Instead, if 2 is not configured to go to a different location, there is only one 
JAR containing everything, and that is what gets published.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-02-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5992:
-
Description: Locality Sensitive Hashing (LSH) would be very useful for ML.  
It would be great to discuss some possible algorithms here, choose an API, and 
make a PR for an initial algorithm.  (was: Locality Sensitive Hashing (LSH) 
would be very useful for )

> Locality Sensitive Hashing (LSH) for MLlib
> --
>
> Key: SPARK-5992
> URL: https://issues.apache.org/jira/browse/SPARK-5992
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>
> Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
> great to discuss some possible algorithms here, choose an API, and make a PR 
> for an initial algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-02-24 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-5992:


 Summary: Locality Sensitive Hashing (LSH) for MLlib
 Key: SPARK-5992
 URL: https://issues.apache.org/jira/browse/SPARK-5992
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Joseph K. Bradley


Locality Sensitive Hashing (LSH) would be very useful for 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5751) Flaky test: o.a.s.sql.hive.thriftserver.HiveThriftServer2Suite sometimes times out

2015-02-24 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-5751.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4720
[https://github.com/apache/spark/pull/4720]

> Flaky test: o.a.s.sql.hive.thriftserver.HiveThriftServer2Suite sometimes 
> times out
> --
>
> Key: SPARK-5751
> URL: https://issues.apache.org/jira/browse/SPARK-5751
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Critical
>  Labels: flaky-test
> Fix For: 1.3.0
>
>
> The "Test JDBC query execution" test case times out occasionally; all other 
> test cases are fine. The failure output only contains the service startup 
> command line without any log output. The test case probably somehow misses the 
> log file path.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3956) Python API for Distributed Matrix

2015-02-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-3956:
-
Target Version/s: 1.4.0  (was: 1.2.0)

> Python API for Distributed Matrix
> -
>
> Key: SPARK-3956
> URL: https://issues.apache.org/jira/browse/SPARK-3956
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Minor
>
> Python API for distributed matrix



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3956) Python API for Distributed Matrix

2015-02-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-3956:
-
Component/s: MLlib

> Python API for Distributed Matrix
> -
>
> Key: SPARK-3956
> URL: https://issues.apache.org/jira/browse/SPARK-3956
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Minor
>
> Python API for distributed matrix



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3956) Python API for Distributed Matrix

2015-02-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-3956:
-
Priority: Major  (was: Minor)

> Python API for Distributed Matrix
> -
>
> Key: SPARK-3956
> URL: https://issues.apache.org/jira/browse/SPARK-3956
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Python API for distributed matrix



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3851) Support for reading parquet files with different but compatible schema

2015-02-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3851:
---
Fix Version/s: 1.3.0

> Support for reading parquet files with different but compatible schema
> --
>
> Key: SPARK-3851
> URL: https://issues.apache.org/jira/browse/SPARK-3851
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Cheng Lian
>Priority: Blocker
> Fix For: 1.3.0
>
>
> Right now it is required that all of the parquet files have the same schema.  
> It would be nice to support some safe subset of cases where the schemas of the 
> files are different.  For example:
>  - Adding and removing nullable columns.
>  - Widening types (a column that is of both Int and Long type)
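
A small Scala sketch of the scenario, reusing createDataFrame and saveAsParquetFile 
as seen elsewhere in this digest; the combined read described at the end is the 
behavior this issue asks for, not something guaranteed to work today:

{code}
// Two datasets whose schemas differ only by an added nullable column -- one of the
// "compatible" differences listed above.
case class V1(id: Int)
case class V2(id: Int, note: Option[String])

val v1 = sqlContext.createDataFrame(Seq(V1(1), V1(2)))
val v2 = sqlContext.createDataFrame(Seq(V2(3, Some("x")), V2(4, None)))

v1.saveAsParquetFile("hdfs:///data/day1")
v2.saveAsParquetFile("hdfs:///data/day2")

// Desired behavior: point a single Parquet read at both directories and get one
// table with `note` null for the day1 rows. Today all files under a read must
// share one schema, which is the restriction this issue asks to relax.
{code}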



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5991) Python API for ML model import/export

2015-02-24 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-5991:


 Summary: Python API for ML model import/export
 Key: SPARK-5991
 URL: https://issues.apache.org/jira/browse/SPARK-5991
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Critical


Many ML models support save/load in Scala and Java.  The Python API needs this. 
 It should mostly be a simple matter of calling the JVM methods for save/load, 
except for models which are stored in Python (e.g., linear models).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5990) Model import/export for IsotonicRegression

2015-02-24 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-5990:


 Summary: Model import/export for IsotonicRegression
 Key: SPARK-5990
 URL: https://issues.apache.org/jira/browse/SPARK-5990
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley


Add save/load for IsotonicRegressionModel



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5985) sortBy -> orderBy in Python

2015-02-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335693#comment-14335693
 ] 

Apache Spark commented on SPARK-5985:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/4752

> sortBy -> orderBy in Python
> ---
>
> Key: SPARK-5985
> URL: https://issues.apache.org/jira/browse/SPARK-5985
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.3.0
>
>
> Seems like a mistake that we included sortBy but not orderBy in Python 
> DataFrame.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5989) Model import/export for LDAModel

2015-02-24 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-5989:


 Summary: Model import/export for LDAModel
 Key: SPARK-5989
 URL: https://issues.apache.org/jira/browse/SPARK-5989
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley


Add save/load for LDAModel and its local and distributed variants.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5988) Model import/export for PowerIterationClusteringModel

2015-02-24 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-5988:


 Summary: Model import/export for PowerIterationClusteringModel
 Key: SPARK-5988
 URL: https://issues.apache.org/jira/browse/SPARK-5988
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley


Add save/load for PowerIterationClusteringModel



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5987) Model import/export for GaussianMixtureModel

2015-02-24 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-5987:


 Summary: Model import/export for GaussianMixtureModel
 Key: SPARK-5987
 URL: https://issues.apache.org/jira/browse/SPARK-5987
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley


Support save/load for GaussianMixtureModel



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5986) Model import/export for KMeansModel

2015-02-24 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-5986:


 Summary: Model import/export for KMeansModel
 Key: SPARK-5986
 URL: https://issues.apache.org/jira/browse/SPARK-5986
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley


Support save/load for KMeansModel
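Assuming KMeansModel ends up following the save/load pattern already established by other MLlib models (a Saveable model with a Loader companion), the intended usage would look roughly like the sketch below; the path is arbitrary and the API is not available in 1.3.0.

{code}
// Sketch of the intended usage; not available in 1.3.0, path is arbitrary.
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

val data = sc.parallelize(Seq(Vectors.dense(0.0, 0.0), Vectors.dense(9.0, 9.0)))
val model = KMeans.train(data, 2, 10)                      // k = 2, maxIterations = 10

model.save(sc, "/tmp/kmeans-model")                        // proposed Saveable.save
val sameModel = KMeansModel.load(sc, "/tmp/kmeans-model")  // proposed Loader.load
{code}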



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5985) sortBy -> orderBy in Python

2015-02-24 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-5985:
--

 Summary: sortBy -> orderBy in Python
 Key: SPARK-5985
 URL: https://issues.apache.org/jira/browse/SPARK-5985
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


Seems like a mistake that we included sortBy but not orderBy in Python 
DataFrame.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5845) Time to cleanup spilled shuffle files not included in shuffle write time

2015-02-24 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335676#comment-14335676
 ] 

Kay Ousterhout commented on SPARK-5845:
---

[~pwendell] that's exactly what I meant -- I just updated the JIRA description.

> Time to cleanup spilled shuffle files not included in shuffle write time
> 
>
> Key: SPARK-5845
> URL: https://issues.apache.org/jira/browse/SPARK-5845
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.3.0, 1.2.1
>Reporter: Kay Ousterhout
>Assignee: Ilya Ganelin
>Priority: Minor
>
> When the disk is contended, I've observed cases when it takes as long as 7 
> seconds to clean up all of the intermediate spill files for a shuffle (when 
> using the sort based shuffle, but bypassing merging because there are <=200 
> shuffle partitions).  This is even when the shuffle data is non-huge (152MB 
> written from one of the tasks where I observed this).  This is effectively 
> part of the shuffle write time (because it's a necessary side effect of 
> writing data to disk) so should be added to the shuffle write time to 
> facilitate debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5845) Time to cleanup spilled shuffle files not included in shuffle write time

2015-02-24 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-5845:
--
Summary: Time to cleanup spilled shuffle files not included in shuffle 
write time  (was: Time to cleanup intermediate shuffle files not included in 
shuffle write time)

> Time to cleanup spilled shuffle files not included in shuffle write time
> 
>
> Key: SPARK-5845
> URL: https://issues.apache.org/jira/browse/SPARK-5845
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.3.0, 1.2.1
>Reporter: Kay Ousterhout
>Assignee: Ilya Ganelin
>Priority: Minor
>
> When the disk is contended, I've observed cases when it takes as long as 7 
> seconds to clean up all of the intermediate spill files for a shuffle (when 
> using the sort based shuffle, but bypassing merging because there are <=200 
> shuffle partitions).  This is even when the shuffle data is non-huge (152MB 
> written from one of the tasks where I observed this).  This is effectively 
> part of the shuffle write time (because it's a necessary side effect of 
> writing data to disk) so should be added to the shuffle write time to 
> facilitate debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5962) [MLLIB] Python support for Power Iteration Clustering

2015-02-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5962:
-
Assignee: Stephen Boesch

> [MLLIB] Python support for Power Iteration Clustering
> -
>
> Key: SPARK-5962
> URL: https://issues.apache.org/jira/browse/SPARK-5962
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Stephen Boesch
>Assignee: Stephen Boesch
>  Labels: python
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Add python support for the Power Iteration Clustering feature.  Here is a 
> fragment of the python API as we plan to implement it:
>   /**
>* Java stub for Python mllib PowerIterationClustering.run()
>*/
>   def trainPowerIterationClusteringModel(
>   data: JavaRDD[(java.lang.Long, java.lang.Long, java.lang.Double)],
>   k: Int,
>   maxIterations: Int,
>   runs: Int,
>   initializationMode: String,
>   seed: java.lang.Long): PowerIterationClusteringModel = {
> val picAlg = new PowerIterationClustering()
>   .setK(k)
>   .setMaxIterations(maxIterations)
> try {
>   picAlg.run(data.rdd.persist(StorageLevel.MEMORY_AND_DISK))
> } finally {
>   data.rdd.unpersist(blocking = false)
> }
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5963) [MLLIB] Python support for Power Iteration Clustering

2015-02-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-5963.

Resolution: Duplicate
  Assignee: (was: Stephen Boesch)

> [MLLIB] Python support for Power Iteration Clustering
> -
>
> Key: SPARK-5963
> URL: https://issues.apache.org/jira/browse/SPARK-5963
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Stephen Boesch
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Add python support for the Power Iteration Clustering feature.  Here is a 
> fragment of the python API as we plan to implement it:
>   /**
>* Java stub for Python mllib PowerIterationClustering.run()
>*/
>   def trainPowerIterationClusteringModel(
>   data: JavaRDD[(java.lang.Long, java.lang.Long, java.lang.Double)],
>   k: Int,
>   maxIterations: Int,
>   runs: Int,
>   initializationMode: String,
>   seed: java.lang.Long): PowerIterationClusteringModel = {
> val picAlg = new PowerIterationClustering()
>   .setK(k)
>   .setMaxIterations(maxIterations)
> try {
>   picAlg.run(data.rdd.persist(StorageLevel.MEMORY_AND_DISK))
> } finally {
>   data.rdd.unpersist(blocking = false)
> }
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5904) DataFrame methods with varargs do not work in Java

2015-02-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335674#comment-14335674
 ] 

Apache Spark commented on SPARK-5904:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/4751

> DataFrame methods with varargs do not work in Java
> --
>
> Key: SPARK-5904
> URL: https://issues.apache.org/jira/browse/SPARK-5904
> Project: Spark
>  Issue Type: Sub-task
>  Components: Java API, SQL
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Reynold Xin
>Priority: Blocker
>  Labels: DataFrame
> Fix For: 1.3.0
>
>
> DataFrame methods with varargs fail when called from Java due to a bug in 
> Scala.
> This can be produced by, e.g., modifying the end of the example 
> ml.JavaSimpleParamsExample in the master branch:
> {code}
> DataFrame results = model2.transform(test);
> results.printSchema(); // works
> results.collect(); // works
> results.filter("label > 0.0").count(); // works
> for (Row r: results.select("features", "label", "myProbability", 
> "prediction").collect()) { // fails on select
>   System.out.println("(" + r.get(0) + ", " + r.get(1) + ") -> prob=" + 
> r.get(2)
>   + ", prediction=" + r.get(3));
> }
> {code}
> I have also tried groupBy and found that failed too.
> The error looks like this:
> {code}
> Exception in thread "main" java.lang.AbstractMethodError: 
> org.apache.spark.sql.DataFrameImpl.groupBy(Ljava/lang/String;[Ljava/lang/String;)Lorg/apache/spark/sql/GroupedData;
>   at 
> org.apache.spark.examples.ml.JavaSimpleParamsExample.main(JavaSimpleParamsExample.java:108)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
> The error appears to be from this Scala bug with using varargs in an abstract 
> method:
> [https://issues.scala-lang.org/browse/SI-9013]
> My current plan is to move the implementations of the methods with varargs 
> from DataFrameImpl to DataFrame.
> However, this may cause issues with IncomputableColumn---feedback??
> Thanks to [~joshrosen] for figuring the bug and fix out!
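For context, here is a hedged, self-contained sketch of the underlying Scala issue and of the shape of the workaround; the class names are illustrative, not Spark's.

{code}
// Illustrative only.  With SI-9013, a varargs method declared abstract in the parent
// and implemented only in a subclass can resolve badly when invoked from Java,
// surfacing as AbstractMethodError at runtime.
abstract class AbstractFrame {
  def select(col: String, cols: String*): AbstractFrame
}

class FrameImpl extends AbstractFrame {
  override def select(col: String, cols: String*): AbstractFrame = this
}

// Workaround shape: keep the varargs overload concrete on the parent and have it
// forward to a Seq-based method that implementations override instead.
abstract class SafeFrame {
  protected def selectSeq(cols: Seq[String]): SafeFrame
  def select(col: String, cols: String*): SafeFrame = selectSeq(col +: cols)
}

class SafeFrameImpl extends SafeFrame {
  override protected def selectSeq(cols: Seq[String]): SafeFrame = this
}
{code}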



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5845) Time to cleanup intermediate shuffle files not included in shuffle write time

2015-02-24 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335667#comment-14335667
 ] 

Patrick Wendell commented on SPARK-5845:


[~kayousterhout] did you mean the time required to delete spill files used 
during aggregation? For the shuffle files themselves, they are deleted 
asynchronously, as [~ilganeli] has mentioned.

> Time to cleanup intermediate shuffle files not included in shuffle write time
> -
>
> Key: SPARK-5845
> URL: https://issues.apache.org/jira/browse/SPARK-5845
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.3.0, 1.2.1
>Reporter: Kay Ousterhout
>Assignee: Ilya Ganelin
>Priority: Minor
>
> When the disk is contended, I've observed cases when it takes as long as 7 
> seconds to clean up all of the intermediate spill files for a shuffle (when 
> using the sort based shuffle, but bypassing merging because there are <=200 
> shuffle partitions).  This is even when the shuffle data is non-huge (152MB 
> written from one of the tasks where I observed this).  This is effectively 
> part of the shuffle write time (because it's a necessary side effect of 
> writing data to disk) so should be added to the shuffle write time to 
> facilitate debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5845) Time to cleanup intermediate shuffle files not included in shuffle write time

2015-02-24 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335646#comment-14335646
 ] 

Ilya Ganelin edited comment on SPARK-5845 at 2/24/15 11:19 PM:
---

If I understand correctly, the file cleanup happens in 
IndexShuffleBlockManager:::removeDataByMap(), which is called from either the 
SortShuffleManager or the SortShuffleWriter. The problem is that these classes 
do not have any knowledge of the currently collected metrics. Furthermore, the 
disk cleanup is, unless configured in the SparkConf, triggered asynchronously 
via the RemoveShuffle message so there doesn't appear to be a straightforward 
way to provide a set of metrics to be updated. 

Do you have any suggestions for getting around this? Please let me know, thank 
you. 


was (Author: ilganeli):
If I understand correctly, the file cleanup happens in 
IndexShuffleBlockManager:::removeDataByMap(), which is called from either the 
SortShuffleManager or the SortShuffleWriter. The problem is that these classes 
do not have any knowledge of the currently collected metrics. Furthermore, the 
disk cleanup is, unless configured in the SparkConf, triggered asynchronously 
via the RemoveShuffle() message so there doesn't appear to be a straightforward 
way to provide a set of metrics to be updated. 

Do you have any suggestions for getting around this? Please let me know, thank 
you. 

> Time to cleanup intermediate shuffle files not included in shuffle write time
> -
>
> Key: SPARK-5845
> URL: https://issues.apache.org/jira/browse/SPARK-5845
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.3.0, 1.2.1
>Reporter: Kay Ousterhout
>Assignee: Ilya Ganelin
>Priority: Minor
>
> When the disk is contended, I've observed cases when it takes as long as 7 
> seconds to clean up all of the intermediate spill files for a shuffle (when 
> using the sort based shuffle, but bypassing merging because there are <=200 
> shuffle partitions).  This is even when the shuffle data is non-huge (152MB 
> written from one of the tasks where I observed this).  This is effectively 
> part of the shuffle write time (because it's a necessary side effect of 
> writing data to disk) so should be added to the shuffle write time to 
> facilitate debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5845) Time to cleanup intermediate shuffle files not included in shuffle write time

2015-02-24 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335646#comment-14335646
 ] 

Ilya Ganelin commented on SPARK-5845:
-

If I understand correctly, the file cleanup happens in 
IndexShuffleBlockManager:::removeDataByMap(), which is called from either the 
SortShuffleManager or the SortShuffleWriter. The problem is that these classes 
do not have any knowledge of the currently collected metrics. Furthermore, the 
disk cleanup is, unless configured in the SparkConf, triggered asynchronously 
via the RemoveShuffle() message so there doesn't appear to be a straightforward 
way to provide a set of metrics to be updated. 

Do you have any suggestions for getting around this? Please let me know, thank 
you. 

> Time to cleanup intermediate shuffle files not included in shuffle write time
> -
>
> Key: SPARK-5845
> URL: https://issues.apache.org/jira/browse/SPARK-5845
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.3.0, 1.2.1
>Reporter: Kay Ousterhout
>Assignee: Ilya Ganelin
>Priority: Minor
>
> When the disk is contended, I've observed cases when it takes as long as 7 
> seconds to clean up all of the intermediate spill files for a shuffle (when 
> using the sort based shuffle, but bypassing merging because there are <=200 
> shuffle partitions).  This is even when the shuffle data is non-huge (152MB 
> written from one of the tasks where I observed this).  This is effectively 
> part of the shuffle write time (because it's a necessary side effect of 
> writing data to disk) so should be added to the shuffle write time to 
> facilitate debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5436) Validate GradientBoostedTrees during training

2015-02-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-5436.
--
  Resolution: Fixed
   Fix Version/s: 1.4.0
Target Version/s: 1.4.0

> Validate GradientBoostedTrees during training
> -
>
> Key: SPARK-5436
> URL: https://issues.apache.org/jira/browse/SPARK-5436
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Manoj Kumar
> Fix For: 1.4.0
>
>
> For Gradient Boosting, it would be valuable to compute test error on a 
> separate validation set during training.  That way, training could stop early 
> based on the test error (or some other metric specified by the user).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4123) Show new dependencies added in pull requests

2015-02-24 Thread Brennon York (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335637#comment-14335637
 ] 

Brennon York commented on SPARK-4123:
-

Gents, need a bit of input on this one. Looks like, to get the expected output 
that [~pwendell] showed, I would need to run a {{mvn clean ... install}} before 
I could do any sub-project traversal ([SO 
reference|https://stackoverflow.com/questions/12223754/how-to-maven-sub-projects-with-inter-dependencies]
 with similar issue). I've seen this issue in 
[SPARK-3355|https://github.com/apache/spark/pull/4734] as well and, in that 
case, just avoided it. I've retested on my machine by running the above lines 
without installation, then with installation, then clearing org/apache/spark 
from my ~/.m2 directory and without again. As expected, it failed, worked, and 
then failed again. So... my question is: how does Jenkins store state with respect to 
the local repository? I'm wondering whether there is another way to grab this 
information, because if we use Maven with the above command and need to install 
Spark into a local repository, I can imagine build failures everywhere, especially if 
we clear or update that directory.

Thoughts?

cc [~shaneknapp] for possible Jenkins advice
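One hedged option for keeping the shared ~/.m2 on the Jenkins workers out of the picture is to install into a throwaway local repository for the duration of the check; the commands below are a sketch, not something the build currently does.

{code}
# Sketch: isolate the install needed for sub-project traversal in a temporary repo,
# so the check never dirties the shared ~/.m2 on the Jenkins workers.
REPO=$(mktemp -d)
mvn -Dmaven.repo.local="$REPO" -Phive -Phadoop-2.4 -DskipTests -q install
mvn -Dmaven.repo.local="$REPO" -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly \
  | grep -v INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > my-classpath
rm -rf "$REPO"
{code}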

> Show new dependencies added in pull requests
> 
>
> Key: SPARK-4123
> URL: https://issues.apache.org/jira/browse/SPARK-4123
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Patrick Wendell
>Priority: Critical
>
> We should inspect the classpath of Spark's assembly jar for every pull 
> request. This only takes a few seconds in Maven and it will help weed out 
> dependency changes from the master branch. Ideally we'd post any dependency 
> changes in the pull request message.
> {code}
> $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly  | grep -v 
> INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > my-classpath
> $ git checkout apache/master
> $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly  | grep -v 
> INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > master-classpath
> $ diff my-classpath master-classpath
> < chill-java-0.3.6.jar
> < chill_2.10-0.3.6.jar
> ---
> > chill-java-0.5.0.jar
> > chill_2.10-0.5.0.jar
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5984) TimSort broken

2015-02-24 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-5984:
--

 Summary: TimSort broken
 Key: SPARK-5984
 URL: https://issues.apache.org/jira/browse/SPARK-5984
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.1, 1.2.0, 1.1.1, 1.1.0, 1.3.0
Reporter: Reynold Xin
Assignee: Aaron Davidson
Priority: Minor


See 
http://envisage-project.eu/proving-android-java-and-python-sorting-algorithm-is-broken-and-how-to-fix-it/

Our TimSort is based on Android's TimSort, which is broken in some corner cases. 
Marking this as minor because the problem exists in almost all TimSort implementations 
out there, including Android, OpenJDK, and Python, and it hasn't manifested itself 
in practice yet.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5973) zip two rdd with AutoBatchedSerializer will fail

2015-02-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-5973.
--
   Resolution: Fixed
Fix Version/s: 1.2.2
   1.3.0

> zip two rdd with AutoBatchedSerializer will fail
> 
>
> Key: SPARK-5973
> URL: https://issues.apache.org/jira/browse/SPARK-5973
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.3.0, 1.2.1
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.3.0, 1.2.2
>
>
> zip two rdd with AutoBatchedSerializer will fail, this bug was introduced by 
> SPARK-4841
> {code}
> >> a.zip(b).count()
> 15/02/24 12:11:56 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/Users/davies/work/spark/python/pyspark/worker.py", line 101, in main
> process()
>   File "/Users/davies/work/spark/python/pyspark/worker.py", line 96, in 
> process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/Users/davies/work/spark/python/pyspark/rdd.py", line 2249, in 
> pipeline_func
> return func(split, prev_func(split, iterator))
>   File "/Users/davies/work/spark/python/pyspark/rdd.py", line 2249, in 
> pipeline_func
> return func(split, prev_func(split, iterator))
>   File "/Users/davies/work/spark/python/pyspark/rdd.py", line 270, in func
> return f(iterator)
>   File "/Users/davies/work/spark/python/pyspark/rdd.py", line 933, in 
> return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
>   File "/Users/davies/work/spark/python/pyspark/rdd.py", line 933, in 
> 
> return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
>   File "/Users/davies/work/spark/python/pyspark/serializers.py", line 306, in 
> load_stream
> " in pair: (%d, %d)" % (len(keys), len(vals)))
> ValueError: Can not deserialize RDD with different number of items in pair: 
> (123, 64)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5973) zip two rdd with AutoBatchedSerializer will fail

2015-02-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5973:
-
Assignee: Davies Liu

> zip two rdd with AutoBatchedSerializer will fail
> 
>
> Key: SPARK-5973
> URL: https://issues.apache.org/jira/browse/SPARK-5973
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.3.0, 1.2.1
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
>
> zip two rdd with AutoBatchedSerializer will fail, this bug was introduced by 
> SPARK-4841
> {code}
> >> a.zip(b).count()
> 15/02/24 12:11:56 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/Users/davies/work/spark/python/pyspark/worker.py", line 101, in main
> process()
>   File "/Users/davies/work/spark/python/pyspark/worker.py", line 96, in 
> process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/Users/davies/work/spark/python/pyspark/rdd.py", line 2249, in 
> pipeline_func
> return func(split, prev_func(split, iterator))
>   File "/Users/davies/work/spark/python/pyspark/rdd.py", line 2249, in 
> pipeline_func
> return func(split, prev_func(split, iterator))
>   File "/Users/davies/work/spark/python/pyspark/rdd.py", line 270, in func
> return f(iterator)
>   File "/Users/davies/work/spark/python/pyspark/rdd.py", line 933, in 
> return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
>   File "/Users/davies/work/spark/python/pyspark/rdd.py", line 933, in 
> 
> return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
>   File "/Users/davies/work/spark/python/pyspark/serializers.py", line 306, in 
> load_stream
> " in pair: (%d, %d)" % (len(keys), len(vals)))
> ValueError: Can not deserialize RDD with different number of items in pair: 
> (123, 64)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5974) Add save/load to examples in ML guide

2015-02-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335609#comment-14335609
 ] 

Apache Spark commented on SPARK-5974:
-

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/4750

> Add save/load to examples in ML guide
> -
>
> Key: SPARK-5974
> URL: https://issues.apache.org/jira/browse/SPARK-5974
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> We should add save/load (model import/export) to the Scala and Java code 
> examples in the ML guide.  This is not yet supported in Python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5980) Add GradientBoostedTrees Python examples to ML guide

2015-02-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335610#comment-14335610
 ] 

Apache Spark commented on SPARK-5980:
-

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/4750

> Add GradientBoostedTrees Python examples to ML guide
> 
>
> Key: SPARK-5980
> URL: https://issues.apache.org/jira/browse/SPARK-5980
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> GBT now has a Python API and should have examples in the ML guide



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2344) Add Fuzzy C-Means algorithm to MLlib

2015-02-24 Thread Alex (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335577#comment-14335577
 ] 

Alex commented on SPARK-2344:
-

You're right, of course - I did not mean to leave it there (only for now).

Sorry that I did not answer you earlier; I have a test on Thursday and I am
spending all my time studying. After Thursday I will take a look at your
code.



Alex




> Add Fuzzy C-Means algorithm to MLlib
> 
>
> Key: SPARK-2344
> URL: https://issues.apache.org/jira/browse/SPARK-2344
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Alex
>Priority: Minor
>  Labels: clustering
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> I would like to add a FCM (Fuzzy C-Means) algorithm to MLlib.
> FCM is very similar to K-Means, which is already implemented; they differ only in 
> the degree of relationship each point has with each cluster (in FCM the relationship 
> is in the range [0..1], whereas in K-Means it is 0/1).
> As part of the implementation I would like to:
> - create a base class for K-Means and FCM
> - implement the relationship for each algorithm differently (in its class)
> I'd like this to be assigned to me.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5982) Remove Local Read Time

2015-02-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335576#comment-14335576
 ] 

Apache Spark commented on SPARK-5982:
-

User 'kayousterhout' has created a pull request for this issue:
https://github.com/apache/spark/pull/4749

> Remove Local Read Time
> --
>
> Key: SPARK-5982
> URL: https://issues.apache.org/jira/browse/SPARK-5982
> Project: Spark
>  Issue Type: Bug
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Blocker
>
> LocalReadTime was added to TaskMetrics for the 1.3.0 release.  However, this 
> time is actually only a small subset of the local read time, because local 
> shuffle files are memory mapped, so most of the read time occurs later, as 
> data is read from the memory mapped files and the data actually gets read 
> from disk.  We should remove this before the 1.3.0 release, so we never 
> expose this incomplete info.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5312) Use sbt to detect new or changed public classes in PRs

2015-02-24 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335573#comment-14335573
 ] 

Nicholas Chammas commented on SPARK-5312:
-

It's something to consider I guess. Spark provides strong guarantees about API 
stability and the like. 

Making it easy for reviewers to catch changes to public classes is supposed to 
help with that. What we have is perhaps good for now, and perhaps for the 
foreseeable future.

So maybe we should resolve this issue for now and just keep it in mind in the 
future.

cc [~pwendell]

> Use sbt to detect new or changed public classes in PRs
> --
>
> Key: SPARK-5312
> URL: https://issues.apache.org/jira/browse/SPARK-5312
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Nicholas Chammas
>Priority: Minor
>
> We currently use an [unwieldy grep/sed 
> contraption|https://github.com/apache/spark/blob/19556454881e05a6c2470d406d50f004b88088a2/dev/run-tests-jenkins#L152-L174]
>  to detect new public classes in PRs.
> Apparently, sbt lets you get a list of public classes [much more 
> directly|http://www.scala-sbt.org/0.13/docs/Howto-Inspect-the-Build.html] via 
> {{show compile:discoveredMainClasses}}. We should use that instead.
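A hedged sketch of how the sbt output could replace the grep/sed step; the invocation and the output parsing are assumptions and have not been wired into Jenkins.

{code}
# Sketch only (sbt invocation and output format assumed): list classes for the PR
# branch and for master, then diff the two lists.
build/sbt "show compile:discoveredMainClasses" \
  | sed -n 's/.*List(\(.*\)).*/\1/p' | tr ',' '\n' | sed 's/^ *//' | sort > my-classes
git checkout apache/master
build/sbt "show compile:discoveredMainClasses" \
  | sed -n 's/.*List(\(.*\)).*/\1/p' | tr ',' '\n' | sed 's/^ *//' | sort > master-classes
diff my-classes master-classes
{code}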



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5983) Don't respond to HTTP TRACE in HTTP-based UIs

2015-02-24 Thread Sean Owen (JIRA)
Sean Owen created SPARK-5983:


 Summary: Don't respond to HTTP TRACE in HTTP-based UIs
 Key: SPARK-5983
 URL: https://issues.apache.org/jira/browse/SPARK-5983
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Sean Owen
Priority: Minor


This was flagged a while ago during a routine security scan: the HTTP-based 
Spark services respond to an HTTP TRACE command. This is basically an HTTP verb 
that has no practical use, and has a pretty theoretical chance of being an 
exploit vector. It is flagged as a security issue by one common tool, however.

Spark's HTTP services are based on Jetty, which by default does not enable 
TRACE (like Tomcat). However, the services do reply to TRACE requests. I think 
it is because the use of Jetty is pretty 'raw' and does not enable much of the 
default additional configuration you might get by using Jetty as a standalone 
server.

I know that it is at least possible to stop the reply to TRACE with a few extra 
lines of code, so I think it is worth shutting off TRACE requests. Although the 
security risk is quite theoretical, it should be easy to fix and bring the 
Spark services into line with the common default of HTTP servers today.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5312) Use sbt to detect new or changed public classes in PRs

2015-02-24 Thread Brennon York (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335565#comment-14335565
 ] 

Brennon York commented on SPARK-5312:
-

New thoughts... So I checked out the link, which seems like it would definitely do 
what we want, but at the cost of much more complexity (unless I'm 
missing something). It looks like the easiest way to get the tool to work would 
be to add a separate Scala program that would need to be executed (in the 
{{run-tests-jenkins}} script) to produce the public classes. Pros? It should get 
us exactly the classes we want; we could then {{diff}} that against the master 
branch and be on our way. Cons? An entire additional Scala program (again, unless 
I'm missing a way to integrate it) that would need to be maintained and managed. 
I'm not sure how important this is given the amount of code it would add to the 
project to maintain. Thoughts?

> Use sbt to detect new or changed public classes in PRs
> --
>
> Key: SPARK-5312
> URL: https://issues.apache.org/jira/browse/SPARK-5312
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Nicholas Chammas
>Priority: Minor
>
> We currently use an [unwieldy grep/sed 
> contraption|https://github.com/apache/spark/blob/19556454881e05a6c2470d406d50f004b88088a2/dev/run-tests-jenkins#L152-L174]
>  to detect new public classes in PRs.
> Apparently, sbt lets you get a list of public classes [much more 
> directly|http://www.scala-sbt.org/0.13/docs/Howto-Inspect-the-Build.html] via 
> {{show compile:discoveredMainClasses}}. We should use that instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5981) pyspark ML models should support predict/transform on vector within map

2015-02-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5981:
-
Description: 
Currently, most Python models only have limited support for single-vector 
prediction.
E.g., one can call {code}model.predict(myFeatureVector){code} for a single 
instance, but that fails within a map for Python ML models and transformers 
which use JavaModelWrapper:
{code}
data.map(lambda features: model.predict(features))
{code}
This fails because JavaModelWrapper.call uses the SparkContext (within the 
transformation).  (It works for linear models, which do prediction within 
Python.)

Supporting prediction within a map would require storing the model and doing 
prediction/transformation within Python.


  was:
Many Python ML models and transformers use JavaModelWrapper to call methods in 
the JVM, such as predict() and transform().  It is common to write:
{code}
data.map(lambda features: model.predict(features))
{code}
This fails because JavaModelWrapper.call uses the SparkContext (within the 
transformation).

Note: It is possible to do a workaround using batch predict if 
models/transformers support it:
{code}
model.predict(data)
{code}

However, this is still a major problem.


> pyspark ML models should support predict/transform on vector within map
> ---
>
> Key: SPARK-5981
> URL: https://issues.apache.org/jira/browse/SPARK-5981
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> Currently, most Python models only have limited support for single-vector 
> prediction.
> E.g., one can call {code}model.predict(myFeatureVector){code} for a single 
> instance, but that fails within a map for Python ML models and transformers 
> which use JavaModelWrapper:
> {code}
> data.map(lambda features: model.predict(features))
> {code}
> This fails because JavaModelWrapper.call uses the SparkContext (within the 
> transformation).  (It works for linear models, which do prediction within 
> Python.)
> Supporting prediction within a map would require storing the model and doing 
> prediction/transformation within Python.
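A hedged sketch of the "store the model in Python" direction mentioned above: pull the parameters to the driver once and do the per-vector work in plain Python/NumPy, so the closure never touches the SparkContext. KMeans is used purely as an illustration; this is not an existing API.

{code}
# Sketch only; not an existing API.  Works inside transformations because the
# closure carries plain Python objects (a list of NumPy arrays), not a
# JavaModelWrapper.
import numpy as np

class LocalKMeansModel(object):
    def __init__(self, centers):
        self.centers = [np.asarray(c, dtype=float) for c in centers]

    def predict(self, x):
        x = np.asarray(x, dtype=float)
        return int(np.argmin([np.sum((x - c) ** 2) for c in self.centers]))

# centers = model.clusterCenters        # fetched on the driver once
# local = LocalKMeansModel(centers)
# data.map(local.predict)               # safe inside a map
{code}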



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5981) pyspark ML models fail during predict/transform on vector within map

2015-02-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5981:
-
Target Version/s: 1.4.0  (was: 1.3.0)

> pyspark ML models fail during predict/transform on vector within map
> 
>
> Key: SPARK-5981
> URL: https://issues.apache.org/jira/browse/SPARK-5981
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> Many Python ML models and transformers use JavaModelWrapper to call methods 
> in the JVM, such as predict() and transform().  It is common to write:
> {code}
> data.map(lambda features: model.predict(features))
> {code}
> This fails because JavaModelWrapper.call uses the SparkContext (within the 
> transformation).
> Note: It is possible to do a workaround using batch predict if 
> models/transformers support it:
> {code}
> model.predict(data)
> {code}
> However, this is still a major problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5981) pyspark ML models fail during predict/transform on vector within map

2015-02-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5981:
-
Issue Type: Improvement  (was: Bug)

> pyspark ML models fail during predict/transform on vector within map
> 
>
> Key: SPARK-5981
> URL: https://issues.apache.org/jira/browse/SPARK-5981
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> Many Python ML models and transformers use JavaModelWrapper to call methods 
> in the JVM, such as predict() and transform().  It is common to write:
> {code}
> data.map(lambda features: model.predict(features))
> {code}
> This fails because JavaModelWrapper.call uses the SparkContext (within the 
> transformation).
> Note: It is possible to do a workaround using batch predict if 
> models/transformers support it:
> {code}
> model.predict(data)
> {code}
> However, this is still a major problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5981) pyspark ML models fail during predict/transform on vector within map

2015-02-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5981:
-
Priority: Major  (was: Critical)

> pyspark ML models fail during predict/transform on vector within map
> 
>
> Key: SPARK-5981
> URL: https://issues.apache.org/jira/browse/SPARK-5981
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> Many Python ML models and transformers use JavaModelWrapper to call methods 
> in the JVM, such as predict() and transform().  It is common to write:
> {code}
> data.map(lambda features: model.predict(features))
> {code}
> This fails because JavaModelWrapper.call uses the SparkContext (within the 
> transformation).
> Note: It is possible to do a workaround using batch predict if 
> models/transformers support it:
> {code}
> model.predict(data)
> {code}
> However, this is still a major problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5981) pyspark ML models should support predict/transform on vector within map

2015-02-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5981:
-
Summary: pyspark ML models should support predict/transform on vector 
within map  (was: pyspark ML models fail during predict/transform on vector 
within map)

> pyspark ML models should support predict/transform on vector within map
> ---
>
> Key: SPARK-5981
> URL: https://issues.apache.org/jira/browse/SPARK-5981
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> Many Python ML models and transformers use JavaModelWrapper to call methods 
> in the JVM, such as predict() and transform().  It is common to write:
> {code}
> data.map(lambda features: model.predict(features))
> {code}
> This fails because JavaModelWrapper.call uses the SparkContext (within the 
> transformation).
> Note: It is possible to do a workaround using batch predict if 
> models/transformers support it:
> {code}
> model.predict(data)
> {code}
> However, this is still a major problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5982) Remove Local Read Time

2015-02-24 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-5982:
-

 Summary: Remove Local Read Time
 Key: SPARK-5982
 URL: https://issues.apache.org/jira/browse/SPARK-5982
 Project: Spark
  Issue Type: Bug
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout
Priority: Blocker


LocalReadTime was added to TaskMetrics for the 1.3.0 release.  However, this 
time is actually only a small subset of the local read time, because local 
shuffle files are memory mapped, so most of the read time occurs later, as data 
is read from the memory mapped files and the data actually gets read from disk. 
 We should remove this before the 1.3.0 release, so we never expose this 
incomplete info.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases

2015-02-24 Thread Corey J. Nolet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335518#comment-14335518
 ] 

Corey J. Nolet edited comment on SPARK-5140 at 2/24/15 9:50 PM:


I have a framework (similar to cascading) where users wire together their RDD 
transformations in reusable components and the framework wires them up together 
based on dependencies that they define. There is a notion of sinks- or data 
outputs and they are also encapsulated in reusable componentry. A sink depends 
on sources- The sinks are actually what get executed. I wanted to be able to 
execute them in parallel and just have spark be smart enough to figure out what 
needs to block (whenever there's a cached rdd) in order to make that possible. 
We're comparing the same data transformations to MapReduce that was written in 
JAQL and the single-threaded execution of the sinks is causing absolutely 
wretched run times.

I'm not saying an internal requirement from my framework should necessarily be 
levied against the spark features- however, this change (seeming like it would 
only affect the driver side scheduling) would allow the spark context to be 
truly thread-safe. I've tried running jobs that over-utilize the resources in 
parallel until one of them fully caches and it really just slows things down 
and uses too many resources- I can't see why anyone submitting rdds in 
different threads that depend on cached RDDs wouldn't want those threads to 
block for the parents but maybe I'm not thinking abstract enough. 

For now, I'm going to propagate down my data source tree breadth first and 
no-op those sources that return CachedRDDs so that their children can be 
scheduled in different threads. 


was (Author: sonixbp):
I have a framework (similar to cascading) where users wire together their RDD 
transformations in reusable components and the framework wires them up together 
based on dependencies that they define. There is a notion of sinks- or data 
outputs and they are also encapsulated in reusable componentry. A sink depends 
on sources- The sinks are actually what get executed. I wanted to be able to 
execute them in parallel and just have spark be smart enough to figure out what 
needs to block (whenever there's a cached rdd) in order to make that possible. 
We're comparing the same data transformations to MapReduce that was written in 
JAQL and the single-threaded execution of the sinks is causing absolutely 
wretched run times.

I'm not saying an internal requirement from my framework should necessarily be 
levied against the spark features- however, this would change (seeming like it 
would only affect the driver side scheduling) would allow the spark context to 
be truly thread-safe. I've tried running jobs that over-utilize the resources 
in parallel until one of them fully caches and it really just slows things down 
and uses too many resources- I can't see why anyone submitting rdds in 
different threads that depend on cached RDDs wouldn't want those threads to 
block for the parents but maybe I'm not thinking abstract enough. 

For now, I'm going to propagate down my data source tree breadth first and 
no-op those sources that return CachedRDDs so that their children can be 
scheduled in different threads. 

> Two RDDs which are scheduled concurrently should be able to wait on parent in 
> all cases
> ---
>
> Key: SPARK-5140
> URL: https://issues.apache.org/jira/browse/SPARK-5140
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Corey J. Nolet
>  Labels: features
>
> Not sure if this would change too much of the internals to be included in the 
> 1.2.1 but it would be very helpful if it could be.
> This ticket is from a discussion between myself and [~ilikerps]. Here's the 
> result of some testing that [~ilikerps] did:
> bq. I did some testing as well, and it turns out the "wait for other guy to 
> finish caching" logic is on a per-task basis, and it only works on tasks that 
> happen to be executing on the same machine. 
> bq. Once a partition is cached, we will schedule tasks that touch that 
> partition on that executor. The problem here, though, is that the cache is in 
> progress, and so the tasks are still scheduled randomly (or with whatever 
> locality the data source has), so tasks which end up on different machines 
> will not see that the cache is already in progress.
> {code}
> Here was my test, by the way:
> import scala.concurrent.ExecutionContext.Implicits.global
> import scala.concurrent._
> import scala.concurrent.duration._
> val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(1); i 
> }).cache()
> val futures = (0 until 4).map { _ => Future { rdd.count } }
> Await.result(Future.sequence(futures), 120.second)
> {code}

[jira] [Created] (SPARK-5981) pyspark ML models fail during predict/transform on vector within map

2015-02-24 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-5981:


 Summary: pyspark ML models fail during predict/transform on vector 
within map
 Key: SPARK-5981
 URL: https://issues.apache.org/jira/browse/SPARK-5981
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Critical


Many Python ML models and transformers use JavaModelWrapper to call methods in 
the JVM, such as predict() and transform().  It is common to write:
{code}
data.map(lambda features: model.predict(features))
{code}
This fails because JavaModelWrapper.call uses the SparkContext (within the 
transformation).

Note: It is possible to do a workaround using batch predict if 
models/transformers support it:
{code}
model.predict(data)
{code}

However, this is still a major problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases

2015-02-24 Thread Corey J. Nolet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335518#comment-14335518
 ] 

Corey J. Nolet commented on SPARK-5140:
---

I have a framework (similar to cascading) where users wire together their RDD 
transformations in reusable components and the framework wires them up together 
based on dependencies that they define. There is a notion of sinks- or data 
outputs and they are also encapsulated in reusable componentry. A sink depends 
on sources- The sinks are actually what get executed. I wanted to be able to 
execute them in parallel and just have spark be smart enough to figure out what 
needs to block (whenever there's a cached rdd) in order to make that possible. 
We're comparing the same data transformations to MapReduce that was written in 
JAQL and the single-threaded execution of the sinks is causing absolutely 
wretched run times.

I'm not saying an internal requirement from my framework should necessarily be 
levied against the spark features- however, this change (seeming like it 
would only affect the driver side scheduling) would allow the spark context to 
be truly thread-safe. I've tried running jobs that over-utilize the resources 
in parallel until one of them fully caches and it really just slows things down 
and uses too many resources- I can't see why anyone submitting rdds in 
different threads that depend on cached RDDs wouldn't want those threads to 
block for the parents but maybe I'm not thinking abstract enough. 

For now, I'm going to propagate down my data source tree breadth first and 
no-op those sources that return CachedRDDs so that their children can be 
scheduled in different threads. 

> Two RDDs which are scheduled concurrently should be able to wait on parent in 
> all cases
> ---
>
> Key: SPARK-5140
> URL: https://issues.apache.org/jira/browse/SPARK-5140
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Corey J. Nolet
>  Labels: features
>
> Not sure if this would change too much of the internals to be included in the 
> 1.2.1 but it would be very helpful if it could be.
> This ticket is from a discussion between myself and [~ilikerps]. Here's the 
> result of some testing that [~ilikerps] did:
> bq. I did some testing as well, and it turns out the "wait for other guy to 
> finish caching" logic is on a per-task basis, and it only works on tasks that 
> happen to be executing on the same machine. 
> bq. Once a partition is cached, we will schedule tasks that touch that 
> partition on that executor. The problem here, though, is that the cache is in 
> progress, and so the tasks are still scheduled randomly (or with whatever 
> locality the data source has), so tasks which end up on different machines 
> will not see that the cache is already in progress.
> {code}
> Here was my test, by the way:
> import scala.concurrent.ExecutionContext.Implicits.global
> import scala.concurrent._
> import scala.concurrent.duration._
> val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(1); i 
> }).cache()
> val futures = (0 until 4).map { _ => Future { rdd.count } }
> Await.result(Future.sequence(futures), 120.second)
> {code}
> bq. Note that I run the future 4 times in parallel. I found that the first 
> run has all tasks take 10 seconds. The second has about 50% of its tasks take 
> 10 seconds, and the rest just wait for the first stage to finish. The last 
> two runs have no tasks that take 10 seconds; all wait for the first two 
> stages to finish.
> What we want is the ability to fire off a job and have the DAG figure out 
> that two RDDs depend on the same parent so that when the children are 
> scheduled concurrently, the first one to start will activate the parent and 
> both will wait on the parent. When the parent is done, they will both be able 
> to finish their work concurrently. We are trying to use this pattern by 
> having the parent cache results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5973) zip two rdd with AutoBatchedSerializer will fail

2015-02-24 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-5973:
--
Description: 
zip two rdd with AutoBatchedSerializer will fail, this bug was introduced by 
SPARK-4841

{code}
>>> a.zip(b).count()
15/02/24 12:11:56 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/davies/work/spark/python/pyspark/worker.py", line 101, in main
    process()
  File "/Users/davies/work/spark/python/pyspark/worker.py", line 96, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/davies/work/spark/python/pyspark/rdd.py", line 2249, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/Users/davies/work/spark/python/pyspark/rdd.py", line 2249, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/Users/davies/work/spark/python/pyspark/rdd.py", line 270, in func
    return f(iterator)
  File "/Users/davies/work/spark/python/pyspark/rdd.py", line 933, in <lambda>
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/Users/davies/work/spark/python/pyspark/rdd.py", line 933, in <genexpr>
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/Users/davies/work/spark/python/pyspark/serializers.py", line 306, in load_stream
    " in pair: (%d, %d)" % (len(keys), len(vals)))
ValueError: Can not deserialize RDD with different number of items in pair: (123, 64)
{code}

  was:zip two rdd with AutoBatchedSerializer will fail, this bug was introduced 
by SPARK-4841


> zip two rdd with AutoBatchedSerializer will fail
> 
>
> Key: SPARK-5973
> URL: https://issues.apache.org/jira/browse/SPARK-5973
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.3.0, 1.2.1
>Reporter: Davies Liu
>Priority: Blocker
>
> Zipping two RDDs with AutoBatchedSerializer will fail; this bug was introduced 
> by SPARK-4841.
> {code}
> >>> a.zip(b).count()
> 15/02/24 12:11:56 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
> org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>   File "/Users/davies/work/spark/python/pyspark/worker.py", line 101, in main
>     process()
>   File "/Users/davies/work/spark/python/pyspark/worker.py", line 96, in process
>     serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/Users/davies/work/spark/python/pyspark/rdd.py", line 2249, in pipeline_func
>     return func(split, prev_func(split, iterator))
>   File "/Users/davies/work/spark/python/pyspark/rdd.py", line 2249, in pipeline_func
>     return func(split, prev_func(split, iterator))
>   File "/Users/davies/work/spark/python/pyspark/rdd.py", line 270, in func
>     return f(iterator)
>   File "/Users/davies/work/spark/python/pyspark/rdd.py", line 933, in <lambda>
>     return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
>   File "/Users/davies/work/spark/python/pyspark/rdd.py", line 933, in <genexpr>
>     return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
>   File "/Users/davies/work/spark/python/pyspark/serializers.py", line 306, in load_stream
>     " in pair: (%d, %d)" % (len(keys), len(vals)))
> ValueError: Can not deserialize RDD with different number of items in pair: (123, 64)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5971) Add Mesos support to spark-ec2

2015-02-24 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-5971:

Description: 
Right now, spark-ec2 can only launch Spark clusters that use the standalone 
manager.

Adding support for Mesos would be useful mostly for automated performance 
testing of Spark on Mesos.

  was:
Right now, spark-ec2 can only launching Spark clusters that use the standalone 
manager.

Adding support to launch Spark-on-Mesos clusters would be useful mostly for 
automated performance testing of Spark on Mesos.


> Add Mesos support to spark-ec2
> --
>
> Key: SPARK-5971
> URL: https://issues.apache.org/jira/browse/SPARK-5971
> Project: Spark
>  Issue Type: New Feature
>  Components: EC2
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Right now, spark-ec2 can only launch Spark clusters that use the standalone 
> manager.
> Adding support for Mesos would be useful mostly for automated performance 
> testing of Spark on Mesos.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5952) Failure to lock metastore client in tableExists()

2015-02-24 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5952.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4746
[https://github.com/apache/spark/pull/4746]

> Failure to lock metastore client in tableExists()
> -
>
> Key: SPARK-5952
> URL: https://issues.apache.org/jira/browse/SPARK-5952
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 1.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5979) `--packages` should not exclude spark streaming assembly jars for kafka and flume

2015-02-24 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-5979:
--

 Summary: `--packages` should not exclude spark streaming assembly 
jars for kafka and flume 
 Key: SPARK-5979
 URL: https://issues.apache.org/jira/browse/SPARK-5979
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Spark Submit
Affects Versions: 1.3.0
Reporter: Burak Yavuz
Priority: Blocker


Currently `--packages` has an exclude rule for all dependencies with the groupId 
`org.apache.spark`, assuming that these are packaged inside the spark-assembly 
jar. This is not the case for the spark-streaming-kafka and spark-streaming-flume 
assembly jars, so more fine-grained filtering is required.
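
For illustration only (the artifact list below is assumed, not the actual 
SparkSubmit code), the finer-grained rule could look like:

{code}
// Hypothetical sketch: exclude only the org.apache.spark artifacts that really are
// packaged in the spark-assembly jar, instead of the whole groupId.
val packagedInAssembly = Set(
  "spark-core_2.10", "spark-sql_2.10", "spark-streaming_2.10",
  "spark-mllib_2.10", "spark-graphx_2.10")

def shouldExclude(groupId: String, artifactId: String): Boolean =
  groupId == "org.apache.spark" && packagedInAssembly.contains(artifactId)

// With a rule like this, spark-streaming-kafka_2.10 and spark-streaming-flume_2.10
// would no longer be excluded and could be pulled in via `--packages`.
{code}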



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5980) Add GradientBoostedTrees Python examples to ML guide

2015-02-24 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-5980:


 Summary: Add GradientBoostedTrees Python examples to ML guide
 Key: SPARK-5980
 URL: https://issues.apache.org/jira/browse/SPARK-5980
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Minor


GradientBoostedTrees now has a Python API and should have examples in the ML guide.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2335) k-Nearest Neighbor classification and regression for MLLib

2015-02-24 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335482#comment-14335482
 ] 

Xiangrui Meng commented on SPARK-2335:
--

[~Rusty] For exact k-NN, the problem is computational complexity, so I would 
prefer starting from the approximate algorithm directly. Yu's suggestion looks 
good to me. Let's discuss approximate k-NN further in SPARK-2336.

> k-Nearest Neighbor classification and regression for MLLib
> --
>
> Key: SPARK-2335
> URL: https://issues.apache.org/jira/browse/SPARK-2335
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Brian Gawalt
>Priority: Minor
>  Labels: clustering, features
>
> The k-Nearest Neighbor model for classification and regression problems is a 
> simple and intuitive approach, offering a straightforward path to creating 
> non-linear decision/estimation contours. Its downsides -- high variance 
> (sensitivity to the known training data set) and computational intensity for 
> estimating new point labels -- both play to Spark's big-data strengths: lots 
> of data mitigates the variance concern; lots of workers mitigate computational 
> latency. 
> We should include kNN models as options in MLLib.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib

2015-02-24 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335476#comment-14335476
 ] 

Xiangrui Meng commented on SPARK-2336:
--

[~Rusty] Could you provide a summary of your plan? For example,

1. Which paper are you going to follow? Any modification to the algorithm 
proposed in the paper?
2. What's the complexity of the algorithm and the expected scalability?
3. What is the trade-off between the approximation error and the cost? How do you 
want to expose it to users? (A brute-force exact baseline, sketched below, could 
serve as the reference point.)
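
For reference on point 3, a brute-force exact baseline (just a sketch, not tied to 
any particular paper) that an approximate method would be compared against on both 
error and cost:

{code}
import org.apache.spark.rdd.RDD

// Exact k-NN by brute force: each partition contributes candidates, takeOrdered
// keeps the global top k by squared distance to the query point.
def exactKNN(points: RDD[Array[Double]], query: Array[Double], k: Int): Array[(Double, Array[Double])] = {
  def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  points.map(p => (sqDist(p, query), p))
        .takeOrdered(k)(Ordering.by[(Double, Array[Double]), Double](_._1))
}
{code}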

> Approximate k-NN Models for MLLib
> -
>
> Key: SPARK-2336
> URL: https://issues.apache.org/jira/browse/SPARK-2336
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Brian Gawalt
>Priority: Minor
>  Labels: clustering, features
>
> After tackling the general k-Nearest Neighbor model as per 
> https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to 
> also offer approximate k-Nearest Neighbor. A promising approach would involve 
> building a kd-tree variant within each partition, a la
> http://www.autonlab.org/autonweb/14714.html?branch=1&language=2
> This could offer a simple non-linear ML model that can label new data with 
> much lower latency than the plain-vanilla kNN versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5973) zip two rdd with AutoBatchedSerializer will fail

2015-02-24 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335473#comment-14335473
 ] 

Josh Rosen commented on SPARK-5973:
---

Just for searchability's sake, what error message did this fail with before?

> zip two rdd with AutoBatchedSerializer will fail
> 
>
> Key: SPARK-5973
> URL: https://issues.apache.org/jira/browse/SPARK-5973
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.3.0, 1.2.1
>Reporter: Davies Liu
>Priority: Blocker
>
> Zipping two RDDs with AutoBatchedSerializer will fail; this bug was introduced 
> by SPARK-4841.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5976) Factors returned by ALS do not have partitioners associated.

2015-02-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335464#comment-14335464
 ] 

Apache Spark commented on SPARK-5976:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/4748

> Factors returned by ALS do not have partitioners associated.
> 
>
> Key: SPARK-5976
> URL: https://issues.apache.org/jira/browse/SPARK-5976
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.3.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> The model trained by ALS requires partitioning information to do quick lookups 
> of a user/item factor when making recommendations for individual requests. In 
> the new implementation, the factor RDDs returned by ALS do not carry 
> partitioners, which causes a performance regression.
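
For illustration (the {{model.userFeatures}} usage below is assumed), this is the 
kind of lookup pattern that needs a partitioner on the factor RDDs:

{code}
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// With a known partitioner, lookup() only touches the partition that owns the key;
// without one, it has to scan every partition.
def indexFactors(factors: RDD[(Int, Array[Double])], numPartitions: Int): RDD[(Int, Array[Double])] =
  factors.partitionBy(new HashPartitioner(numPartitions)).cache()

// Assumed usage:
//   val userIndex = indexFactors(model.userFeatures, 64)
//   userIndex.lookup(userId)  // single-partition lookup per request
{code}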



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5978) Spark examples cannot compile with Hadoop 2

2015-02-24 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335461#comment-14335461
 ] 

Sean Owen commented on SPARK-5978:
--

Pretty similar to https://issues.apache.org/jira/browse/SPARK-3039; I imagine a 
similar profile-based fix is in order. Care to take a shot at that?

> Spark examples cannot compile with Hadoop 2
> ---
>
> Key: SPARK-5978
> URL: https://issues.apache.org/jira/browse/SPARK-5978
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, PySpark
>Affects Versions: 1.2.0, 1.2.1
>Reporter: Michael Nazario
>  Labels: hadoop-version
>
> This is a regression from Spark 1.1.1.
> The Spark examples include an Avro converter example for PySpark. When I was 
> trying to debug a problem, I discovered that even though you can build Spark 
> 1.2.0 with Hadoop 2, an HBase dependency elsewhere in the examples code still 
> depends on Hadoop 1.
> An easy fix would be to separate the examples into Hadoop-specific versions. 
> Another way would be to fix the HBase dependencies so that they don't rely on 
> Hadoop 1-specific code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5978) Spark examples cannot compile with Hadoop 2

2015-02-24 Thread Michael Nazario (JIRA)
Michael Nazario created SPARK-5978:
--

 Summary: Spark examples cannot compile with Hadoop 2
 Key: SPARK-5978
 URL: https://issues.apache.org/jira/browse/SPARK-5978
 Project: Spark
  Issue Type: Bug
  Components: Examples, PySpark
Affects Versions: 1.2.1, 1.2.0
Reporter: Michael Nazario


This is a regression from Spark 1.1.1.

The Spark examples include an Avro converter example for PySpark. When I was 
trying to debug a problem, I discovered that even though you can build Spark 
1.2.0 with Hadoop 2, an HBase dependency elsewhere in the examples code still 
depends on Hadoop 1.

An easy fix would be to separate the examples into Hadoop-specific versions. 
Another way would be to fix the HBase dependencies so that they don't rely on 
Hadoop 1-specific code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3665) Java API for GraphX

2015-02-24 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3665:
-
Target Version/s: 1.4.0  (was: 1.3.0)

> Java API for GraphX
> ---
>
> Key: SPARK-3665
> URL: https://issues.apache.org/jira/browse/SPARK-3665
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX, Java API
>Affects Versions: 1.0.0
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>
> The Java API will wrap the Scala API in a similar manner as JavaRDD. 
> Components will include:
> # JavaGraph
> #- removes optional param from persist, subgraph, mapReduceTriplets, 
> Graph.fromEdgeTuples, Graph.fromEdges, Graph.apply
> #- removes implicit {{=:=}} param from mapVertices, outerJoinVertices
> #- merges multiple parameters lists
> #- incorporates GraphOps
> # JavaVertexRDD
> # JavaEdgeRDD
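
For concreteness, a rough sketch (not the proposed design, and ignoring the Java 
function-interface plumbing) of how such a wrapper can hide ClassTags, implicits, 
and optional parameters behind fixed-arity methods:

{code}
import scala.reflect.ClassTag
import org.apache.spark.graphx.{Graph, VertexId}

// Sketch of a Java-friendly facade in the spirit of JavaRDD: no implicit parameters
// or default arguments leak into the public methods.
class JavaGraph[VD, ED](val graph: Graph[VD, ED])(implicit vdTag: ClassTag[VD], edTag: ClassTag[ED]) {
  def mapVertices[VD2: ClassTag](f: (VertexId, VD) => VD2): JavaGraph[VD2, ED] =
    new JavaGraph(graph.mapVertices((id, attr) => f(id, attr)))

  def numVertices: Long = graph.numVertices  // GraphOps incorporated directly
}
{code}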



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3665) Java API for GraphX

2015-02-24 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3665:
-
Affects Version/s: 1.0.0

> Java API for GraphX
> ---
>
> Key: SPARK-3665
> URL: https://issues.apache.org/jira/browse/SPARK-3665
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX, Java API
>Affects Versions: 1.0.0
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>
> The Java API will wrap the Scala API in a similar manner as JavaRDD. 
> Components will include:
> # JavaGraph
> #- removes optional param from persist, subgraph, mapReduceTriplets, 
> Graph.fromEdgeTuples, Graph.fromEdges, Graph.apply
> #- removes implicit {{=:=}} param from mapVertices, outerJoinVertices
> #- merges multiple parameters lists
> #- incorporates GraphOps
> # JavaVertexRDD
> # JavaEdgeRDD



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


