[jira] [Commented] (SPARK-11381) Replace example code in mllib-linear-methods.md using include_example

2015-12-11 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052513#comment-15052513
 ] 

Xusen Yin commented on SPARK-11381:
---

[~somi...@us.ibm.com] This JIRA is blocked by 
https://issues.apache.org/jira/browse/SPARK-11399. You can take it after that 
one is merged.

> Replace example code in mllib-linear-methods.md using include_example
> -
>
> Key: SPARK-11381
> URL: https://issues.apache.org/jira/browse/SPARK-11381
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>  Labels: starter
>
> This is similar to SPARK-11289 but for the example code in 
> mllib-linear-methods.md.






[jira] [Commented] (SPARK-6363) make scala 2.11 default language

2015-12-11 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052631#comment-15052631
 ] 

Ismael Juma commented on SPARK-6363:


It's also worth pointing out that Scala 2.10 has not been maintained since 
March 2015.

> make scala 2.11 default language
> 
>
> Key: SPARK-6363
> URL: https://issues.apache.org/jira/browse/SPARK-6363
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: antonkulaga
>Priority: Minor
>  Labels: scala
>
> Most libraries have already moved to 2.11 and many are starting to drop 2.10 
> support. So it would be better if the Spark binaries were built with Scala 
> 2.11 by default.






[jira] [Commented] (SPARK-10625) Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds unserializable objects into connection properties

2015-12-11 Thread Chandra Sekhar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052617#comment-15052617
 ] 

Chandra Sekhar commented on SPARK-10625:


Can I test this now? Which version do I have to download to get this 
change?

> Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds 
> unserializable objects into connection properties
> --
>
> Key: SPARK-10625
> URL: https://issues.apache.org/jira/browse/SPARK-10625
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
> Environment: Ubuntu 14.04
>Reporter: Peng Cheng
>  Labels: jdbc, spark, sparksql
>
> Some JDBC drivers (e.g. SAP HANA) try to optimize connection pooling by 
> adding new objects into the connection properties, which are then reused by 
> Spark and deployed to workers. When some of these new objects are not 
> serializable, this triggers an org.apache.spark.SparkException: Task not 
> serializable. The following test code snippet demonstrates the problem by 
> using a modified H2 driver:
>   test("INSERT to JDBC Datasource with UnserializableH2Driver") {
> object UnserializableH2Driver extends org.h2.Driver {
>   override def connect(url: String, info: Properties): Connection = {
> val result = super.connect(url, info)
> info.put("unserializableDriver", this)
> result
>   }
>   override def getParentLogger: Logger = ???
> }
> import scala.collection.JavaConversions._
> val oldDrivers = 
> DriverManager.getDrivers.filter(_.acceptsURL("jdbc:h2:")).toSeq
> oldDrivers.foreach{
>   DriverManager.deregisterDriver
> }
> DriverManager.registerDriver(UnserializableH2Driver)
> sql("INSERT INTO TABLE PEOPLE1 SELECT * FROM PEOPLE")
> assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", properties).count)
> assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", 
> properties).collect()(0).length)
> DriverManager.deregisterDriver(UnserializableH2Driver)
> oldDrivers.foreach{
>   DriverManager.registerDriver
> }
>   }






[jira] [Commented] (SPARK-11381) Replace example code in mllib-linear-methods.md using include_example

2015-12-11 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052512#comment-15052512
 ] 

Xusen Yin commented on SPARK-11381:
---

[~somi...@us.ibm.com] This JIRA is blocked by 
https://issues.apache.org/jira/browse/SPARK-11399. You can take it after that 
one is merged.

> Replace example code in mllib-linear-methods.md using include_example
> -
>
> Key: SPARK-11381
> URL: https://issues.apache.org/jira/browse/SPARK-11381
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>  Labels: starter
>
> This is similar to SPARK-11289 but for the example code in 
> mllib-linear-methods.md.






[jira] [Updated] (SPARK-6363) make scala 2.11 default language

2015-12-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6363:
-
Target Version/s: 2.0.0

> make scala 2.11 default language
> 
>
> Key: SPARK-6363
> URL: https://issues.apache.org/jira/browse/SPARK-6363
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: antonkulaga
>Priority: Minor
>  Labels: scala
>
> Most libraries have already moved to 2.11 and many are starting to drop 2.10 
> support. So it would be better if the Spark binaries were built with Scala 
> 2.11 by default.






[jira] [Commented] (SPARK-9858) Introduce an ExchangeCoordinator to estimate the number of post-shuffle partitions.

2015-12-11 Thread Adam Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052603#comment-15052603
 ] 

Adam Roberts commented on SPARK-9858:
-

Modifying the UnsafeRowSerializer to always write/read in LE fixes the problem, 
thereby enabling Tungsten features to be fully exploited regardless of 
endianness (not yet sure why only the aggregate functions are impacted; I 
thought we'd have plenty of test failures). We can use 
LittleEndianDataInput/OutputStream to achieve this; they are part of the same 
package as ByteStreams. I will ensure the regular SparkSqlSerializer is OK too.

We're hitting a similar problem with the DatasetAggregatorSuite (instead of 1 
we get 9, instead of 2 we get 10, etc.); I expect the root cause to be the same.

I'll get to work on the pull request, cheers 
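For context, a minimal self-contained sketch of the endian-fixed writing/reading 
described above, using Guava's LittleEndianDataInput/OutputStream. This is only an 
illustration of the idea, not the actual UnsafeRowSerializer change:

{code}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import com.google.common.io.{LittleEndianDataInputStream, LittleEndianDataOutputStream}

object LittleEndianRoundTrip extends App {
  val buffer = new ByteArrayOutputStream()
  val out = new LittleEndianDataOutputStream(buffer)
  out.writeInt(42)            // e.g. a row length, always written little-endian
  out.writeLong(123456789L)   // e.g. a row payload word
  out.flush()

  val in = new LittleEndianDataInputStream(new ByteArrayInputStream(buffer.toByteArray))
  // Reads back the same values on both big- and little-endian JVMs,
  // because the byte order is fixed by the stream rather than the platform.
  assert(in.readInt() == 42)
  assert(in.readLong() == 123456789L)
}
{code}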

> Introduce an ExchangeCoordinator to estimate the number of post-shuffle 
> partitions.
> ---
>
> Key: SPARK-9858
> URL: https://issues.apache.org/jira/browse/SPARK-9858
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 1.6.0
>
>







[jira] [Commented] (SPARK-12264) Could DataType provide a TypeTag?

2015-12-11 Thread Andras Nemeth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052612#comment-15052612
 ] 

Andras Nemeth commented on SPARK-12264:
---

I guess my concrete proposal was a bit hidden in the last sentence of the 
original description: let's add a typeTag or scalaTypeTag method to DataType.

It's not that creating a mapping on the user side is terribly hard - although 
there are more complex types like maps and arrays which can be composed 
arbitrarily as far as I can tell, so you do have to do a bit of work to get it 
right.

It's more that this user-implemented mapping is very fragile (e.g. I can 
definitely see more system types being added in the future) and gets duplicated 
across multiple clients.

Getting it at runtime from a concrete row is not great for many reasons:
- It only gives a ClassTag, not a TypeTag.
- You may easily end up with a class that is too concrete - e.g. maybe in the 
first row, the first element is a one-element set, represented by a 
collection.immutable.HashSet$HashSet1. But that's not going to be a good class 
for all elements in the first column.
- It's not nice that you have to look at the actual data to understand what it 
is. What's the point of schemas then?
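To make the fragility concrete, here is a hand-coded, user-side mapping of the 
kind discussed above. This is a hypothetical sketch (the object and method names 
are made up, not a Spark API):

{code}
import scala.reflect.runtime.universe._
import org.apache.spark.sql.types._

// Every new DataType (and every nesting of ArrayType/MapType) needs another case,
// which is exactly why a user-side mapping is fragile and easy to get wrong.
object DataTypeTags {
  def typeTagOf(dt: DataType): TypeTag[_] = dt match {
    case IntegerType      => typeTag[Int]
    case LongType         => typeTag[Long]
    case DoubleType       => typeTag[Double]
    case BooleanType      => typeTag[Boolean]
    case StringType       => typeTag[String]
    // Composite types would need recursive TypeTag construction to be precise;
    // approximating them here is exactly the kind of compromise objected to above.
    case ArrayType(_, _)  => typeTag[Seq[Any]]
    case MapType(_, _, _) => typeTag[Map[Any, Any]]
    case other            => sys.error(s"No TypeTag mapping for $other")
  }
}
{code}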

> Could DataType provide a TypeTag?
> -
>
> Key: SPARK-12264
> URL: https://issues.apache.org/jira/browse/SPARK-12264
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Andras Nemeth
>Priority: Minor
>
> We are writing code that's dealing with generic DataFrames as inputs and 
> further processes their contents with normal RDD operations (not SQL). We 
> need some mechanism that tells us exactly what Scala types we will find 
> inside a Row of a given DataFrame.
> The schema of the DataFrame contains this information in an abstract sense. 
> But we need to map it to TypeTags, as that's what the rest of the system uses 
> to identify what RDD contains what type of data - quite the natural choice in 
> Scala.
> As far as I can tell, there is no good way to do this today. For now we have 
> a hand coded mapping, but that feels very fragile as spark evolves. Is there 
> a better way I'm missing? And if not, could we create one? Adding a typeTag 
> or scalaTypeTag method to DataType, or at least to AtomicType  seems easy 
> enough.






[jira] [Commented] (SPARK-6918) Secure HBase with Kerberos does not work over YARN

2015-12-11 Thread Pierre Beauvois (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052737#comment-15052737
 ] 

Pierre Beauvois commented on SPARK-6918:


I found nothing about org.apache.spark.deploy.yarn.Client in the client code. 
The use case is 100% reproducible and the error happens for any HBase table.

I opened a new ticket for Spark 1.5.2 
(https://issues.apache.org/jira/browse/SPARK-12279).


> Secure HBase with Kerberos does not work over YARN
> --
>
> Key: SPARK-6918
> URL: https://issues.apache.org/jira/browse/SPARK-6918
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Affects Versions: 1.2.1, 1.3.0, 1.3.1
>Reporter: Dean Chen
>Assignee: Dean Chen
> Fix For: 1.4.0
>
>
> Attempts to access HBase from Spark executors will fail at the auth to the 
> metastore with: _GSSException: No valid credentials provided (Mechanism 
> level: Failed to find any Kerberos tgt)_
> This is because the HBase Kerberos auth token is not sent to the executor. We 
> will need something similar to obtainTokensForNamenodes (used for HDFS) in 
> yarn/Client.scala. Storm also needed something similar: 
> https://github.com/apache/storm/pull/226
> I've created a patch for this that required an HBase dependency in the YARN 
> module, which we've been using successfully at eBay, but am working on a 
> version that does not require the HBase dependency by calling the class 
> loader. Should be ready in a few days.
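For illustration, a rough sketch of the approach the description outlines: obtain 
an HBase delegation token on the launcher side and ship it with the job 
credentials, mirroring obtainTokensForNamenodes. The HBase calls shown are 
assumptions based on HBase 0.98/1.x APIs, not the actual Spark patch:

{code}
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.security.token.TokenUtil
import org.apache.hadoop.security.{Credentials, UserGroupInformation}

object HBaseTokenSketch {
  // Fetch an HBase delegation token for the current (kerberized) user and add it
  // to the credentials that YARN ships to the executors.
  def obtainHBaseToken(credentials: Credentials): Unit = {
    if (UserGroupInformation.isSecurityEnabled) {
      val hbaseConf = HBaseConfiguration.create()
      val token = TokenUtil.obtainToken(hbaseConf) // delegation token from the HBase master
      credentials.addToken(token.getService, token)
    }
  }
}
{code}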






[jira] [Commented] (SPARK-12218) Boolean logic in sql does not work "not (A and B)" is not the same as "(not A) or (not B)"

2015-12-11 Thread Irakli Machabeli (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052759#comment-15052759
 ] 

Irakli Machabeli commented on SPARK-12218:
--

The bug itself is really dangerous: it would be OK if it simply crashed, threw 
an exception, etc., but it silently produces wrong results.
Imagine coding in Java and having to worry whether the compiler correctly 
interprets && and || in an if statement; that would be a disaster.
For me this is not critical, I'm still in try-out mode and can always upgrade 
to 1.6, but for someone who uses Spark 1.5 for a real job, that's really bad.

> Boolean logic in sql does not work  "not (A and B)" is not the same as  "(not 
> A) or (not B)"
> 
>
> Key: SPARK-12218
> URL: https://issues.apache.org/jira/browse/SPARK-12218
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Irakli Machabeli
>Priority: Blocker
>
> Two identical queries produce different results
> In [2]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and not( 
> PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', 
> 'PreviouslyChargedOff'))").count()
> Out[2]: 18
> In [3]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and ( 
> not(PaymentsReceived=0) or not (ExplicitRoll in ('PreviouslyPaidOff', 
> 'PreviouslyChargedOff')))").count()
> Out[3]: 28
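For reference, the two predicates above differ only by De Morgan's law, so they 
must return the same count. A tiny self-contained sketch (with made-up rows 
standing in for PaymentsReceived and ExplicitRoll) of the equivalence the report 
relies on:

{code}
object DeMorganCheck extends App {
  val rollSet = Set("PreviouslyPaidOff", "PreviouslyChargedOff")
  // (PaymentsReceived, ExplicitRoll) — hypothetical rows for one LoanID
  val rows = Seq((0.0, "PreviouslyPaidOff"), (10.0, "Current"), (0.0, "Current"))

  // not (A and B)
  val notAnd = rows.count { case (paid, roll) => !(paid == 0.0 && rollSet(roll)) }
  // (not A) or (not B)
  val orNot  = rows.count { case (paid, roll) => !(paid == 0.0) || !rollSet(roll) }

  assert(notAnd == orNot) // must hold; the JIRA reports Spark SQL returning 18 vs 28
}
{code}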






[jira] [Commented] (SPARK-12218) Boolean logic in sql does not work "not (A and B)" is not the same as "(not A) or (not B)"

2015-12-11 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052763#comment-15052763
 ] 

Xiao Li commented on SPARK-12218:
-

Agree! I will do a search to find out what happened in the push-down and which 
PR fixed the logical plan changes. Will keep you posted. Thanks!

> Boolean logic in sql does not work  "not (A and B)" is not the same as  "(not 
> A) or (not B)"
> 
>
> Key: SPARK-12218
> URL: https://issues.apache.org/jira/browse/SPARK-12218
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Irakli Machabeli
>Priority: Blocker
>
> Two identical queries produce different results
> In [2]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and not( 
> PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', 
> 'PreviouslyChargedOff'))").count()
> Out[2]: 18
> In [3]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and ( 
> not(PaymentsReceived=0) or not (ExplicitRoll in ('PreviouslyPaidOff', 
> 'PreviouslyChargedOff')))").count()
> Out[3]: 28






[jira] [Commented] (SPARK-11136) Warm-start support for ML estimator

2015-12-11 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052718#comment-15052718
 ] 

Xusen Yin commented on SPARK-11136:
---

I added a [design 
doc|https://docs.google.com/document/d/1LSRQDXOepVsOsCRT_PFwuiS9qmbgCzEskVPKXdqHoX0/edit?usp=sharing]
 here so that we can discuss the different implementations more easily.

> Warm-start support for ML estimator
> ---
>
> Key: SPARK-11136
> URL: https://issues.apache.org/jira/browse/SPARK-11136
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xusen Yin
>Priority: Minor
>
> The current implementation of Estimator does not support warm-start fitting, 
> i.e. estimator.fit(data, params, partialModel). But first we need to add 
> warm-start support to all ML estimators. This is an umbrella JIRA to add 
> support for warm-start estimators.
> Treat the model as a special parameter, passing it through ParamMap, e.g. val 
> partialModel: Param[Option[M]] = new Param(...). If a model is provided, we 
> use it to warm-start; otherwise we start the training process from the 
> beginning.
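As a rough illustration of the "model as a special parameter" idea above, here is 
a hypothetical sketch; the trait name, param name, and defaults are made up and 
not part of the Spark API:

{code}
import org.apache.spark.ml.Model
import org.apache.spark.ml.param.{Param, Params}

// Hypothetical mix-in an estimator could use to accept an optional partial model.
trait HasInitialModel[M <: Model[M]] extends Params {
  // None means "train from scratch"; Some(model) means warm-start from it.
  final val initialModel: Param[Option[M]] =
    new Param[Option[M]](this, "initialModel", "model used to warm-start fitting")

  setDefault(initialModel, None)

  final def getInitialModel: Option[M] = $(initialModel)
}
{code}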






[jira] [Commented] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop

2015-12-11 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052852#comment-15052852
 ] 

Steve Loughran commented on SPARK-2356:
---

I've stuck up binaries compatible with Hadoop 2.6 & 2.7, to make installing 
things easier

* https://github.com/steveloughran/winutils

Note also Hadoop 2.8 includes HADOOP-10775, "fail with meaningful messages if 
winutils can't be found"



> Exception: Could not locate executable null\bin\winutils.exe in the Hadoop 
> ---
>
> Key: SPARK-2356
> URL: https://issues.apache.org/jira/browse/SPARK-2356
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 1.0.0
>Reporter: Kostiantyn Kudriavtsev
>Priority: Critical
>
> I'm trying to run some transformations on Spark; they work fine on a cluster 
> (YARN, Linux machines). However, when I try to run them on a local machine 
> (Windows 7) under a unit test, I get errors (I don't use Hadoop; I read files 
> from the local filesystem):
> {code}
> 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the 
> hadoop binary path
> java.io.IOException: Could not locate executable null\bin\winutils.exe in the 
> Hadoop binaries.
>   at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
>   at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
>   at org.apache.hadoop.util.Shell.(Shell.java:326)
>   at org.apache.hadoop.util.StringUtils.(StringUtils.java:76)
>   at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
>   at org.apache.hadoop.security.Groups.(Groups.java:77)
>   at 
> org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
>   at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
>   at 
> org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:36)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala:109)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala)
>   at org.apache.spark.SparkContext.(SparkContext.scala:228)
>   at org.apache.spark.SparkContext.(SparkContext.scala:97)
> {code}
> This happens because the Hadoop config is initialized each time a Spark 
> context is created, regardless of whether Hadoop is required or not.
> I propose to add a special flag to indicate whether the Hadoop config is 
> required (or to start this configuration manually).






[jira] [Commented] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver

2015-12-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052755#comment-15052755
 ] 

Sean Owen commented on SPARK-11193:
---

[~phibit] are you able to test the change in 
https://github.com/apache/spark/pull/10203 by any chance -- does it look OK to 
you?

> Spark 1.5+ Kinesis Streaming - ClassCastException when starting 
> KinesisReceiver
> ---
>
> Key: SPARK-11193
> URL: https://issues.apache.org/jira/browse/SPARK-11193
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Phil Kallos
> Attachments: screen.png
>
>
> After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis 
> Spark Streaming application, and am being consistently greeted with this 
> exception:
> java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast 
> to scala.collection.mutable.SynchronizedMap
>   at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Worth noting that I am able to reproduce this issue locally, and also on 
> Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0).
> Also, I am not able to run the included kinesis-asl example.
> Built locally using:
> git checkout v1.5.1
> mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package
> Example run command:
> bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector 
> https://kinesis.us-east-1.amazonaws.com






[jira] [Updated] (SPARK-12275) No plan for BroadcastHint in some condition

2015-12-11 Thread yucai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-12275:
--
Description: 
*Summary*
No plan for BroadcastHint is generated in some condition.

*Test Case*
{code}
val df1 = Seq((1, "1"), (2, "2")).toDF("key", "value")
val parquetTempFile =
  "%s/SPARK-_%d.parquet".format(System.getProperty("java.io.tmpdir"), 
scala.util.Random.nextInt)
df1.write.parquet(parquetTempFile)
val pf1 = sqlContext.read.parquet(parquetTempFile)
#1. df1.join(broadcast(pf1)).count()
#2. broadcast(pf1).count()
{code}

*Result*
It will trigger assertion in QueryPlanner.scala, like below:
{code}
scala> df1.join(broadcast(pf1)).count()
java.lang.AssertionError: assertion failed: No plan for BroadcastHint
+- Relation[key#6,value#7] 
ParquetRelation[hdfs://10.1.0.20:8020/tmp/SPARK-_1817830406.parquet]

at scala.Predef$.assert(Predef.scala:179)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at 
org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:336)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
{code}

  was:
*Summary*
No plan for BroadcastHint is generated in some condition.

*Test Case*
{code}
val df1 = Seq((1, "1"), (2, "2")).toDF("key", "value")
val parquetTempFile =
  "%s/SPARK-_%d.parquet".format(System.getProperty("java.io.tmpdir"), 
scala.util.Random.nextInt)
df1.write.parquet(parquetTempFile)
val pf1 = sqlContext.read.parquet(parquetTempFile)
df1.join(broadcast(pf1)).count()
{code}

*Result*
It will trigger assertion in QueryPlanner.scala, like below:
{code}
scala> df1.join(broadcast(pf1)).count()
java.lang.AssertionError: assertion failed: No plan for BroadcastHint
+- Relation[key#6,value#7] 
ParquetRelation[hdfs://10.1.0.20:8020/tmp/SPARK-_1817830406.parquet]

at scala.Predef$.assert(Predef.scala:179)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at 
org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:336)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
{code}


> No plan for BroadcastHint in some condition
> ---
>
> Key: SPARK-12275
> URL: https://issues.apache.org/jira/browse/SPARK-12275
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: yucai
>
> *Summary*
> No plan for BroadcastHint is generated in some condition.
> *Test Case*
> {code}
> val df1 = Seq((1, "1"), (2, "2")).toDF("key", "value")
> val parquetTempFile =
>   "%s/SPARK-_%d.parquet".format(System.getProperty("java.io.tmpdir"), 
> scala.util.Random.nextInt)
> df1.write.parquet(parquetTempFile)
> val pf1 = sqlContext.read.parquet(parquetTempFile)
> #1. df1.join(broadcast(pf1)).count()
> #2. broadcast(pf1).count()
> {code}
> *Result*
> It will trigger assertion in QueryPlanner.scala, like below:
> {code}
> scala> df1.join(broadcast(pf1)).count()
> java.lang.AssertionError: assertion failed: No plan for BroadcastHint
> +- Relation[key#6,value#7] 
> ParquetRelation[hdfs://10.1.0.20:8020/tmp/SPARK-_1817830406.parquet]
>   at scala.Predef$.assert(Predef.scala:179)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:336)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at 
> 

[jira] [Created] (SPARK-12280) "--packages" command doesn't work in "spark-submit"

2015-12-11 Thread Anton Loss (JIRA)
Anton Loss created SPARK-12280:
--

 Summary: "--packages" command doesn't work in "spark-submit"
 Key: SPARK-12280
 URL: https://issues.apache.org/jira/browse/SPARK-12280
 Project: Spark
  Issue Type: Bug
Reporter: Anton Loss
Priority: Minor


when running "spark-shell", then "--packages" option works as expected, but 
with "spark-submit" it produces following stacktrace
15/12/11 17:05:48 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
15/12/11 17:05:51 WARN Client: Resource 
file:/home/anton/data-tools-1.0-SNAPSHOT-jar-with-dependencies.jar added 
multiple times to distributed cache.
Exception in thread "main" java.io.FileNotFoundException: Requested file 
maprfs:///home/mapr/.ivy2/jars/com.databricks_spark-csv_2.11-1.3.0.jar does not 
exist.
at 
com.mapr.fs.MapRFileSystem.getMapRFileStatus(MapRFileSystem.java:1332)
at com.mapr.fs.MapRFileSystem.getFileStatus(MapRFileSystem.java:942)
at com.mapr.fs.MFS.getFileStatus(MFS.java:151)
at 
org.apache.hadoop.fs.AbstractFileSystem.resolvePath(AbstractFileSystem.java:467)
at org.apache.hadoop.fs.FileContext$25.next(FileContext.java:2193)
at org.apache.hadoop.fs.FileContext$25.next(FileContext.java:2189)
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
at org.apache.hadoop.fs.FileContext.resolve(FileContext.java:2189)
at org.apache.hadoop.fs.FileContext.resolvePath(FileContext.java:601)
at 
org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:242)
at 
org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$1.apply(Client.scala:366)
at 
org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$1.apply(Client.scala:360)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at 
org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6.apply(Client.scala:360)
at 
org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6.apply(Client.scala:358)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:358)
at 
org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:561)
at 
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:842)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:881)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


It seems it's looking in the wrong place, as the jar is clearly present here:
file:///home/mapr/.ivy2/jars/com.databricks_spark-csv_2.11-1.3.0.jar






[jira] [Commented] (SPARK-12275) No plan for BroadcastHint in some condition

2015-12-11 Thread yucai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052916#comment-15052916
 ] 

yucai commented on SPARK-12275:
---

*Root Cause*
When BasicOperators' "case BroadcastHint(child)" is hit (in 
SparkStrategies.scala), it recursively invokes BasicOperators.apply on "child":
{code}
case BroadcastHint(child) => apply(child)
{code}
If BasicOperators cannot plan that child, it falls through to the default case 
and we end up with "No plan for BroadcastHint":
{code}
case _ => Nil
{code}

In my example above, broadcast(pf1) hits "case BroadcastHint(child)", so the 
child is pf1, whose type is "Relation[key#91,value#92] ParquetRelation".
Then, when this child is used to invoke BasicOperators.apply again, it 
unfortunately does not match anything in BasicOperators.apply, so apply returns 
Nil and we get "No plan for BroadcastHint".

*Solution*
Use planLater to invoke the other execution strategies that are available.
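A minimal sketch of what that could look like in BasicOperators; this illustrates 
the proposal above, not necessarily the merged patch:

{code}
// Instead of recursing into the same strategy (which only knows its own cases),
// hand the hinted child back to the full planner so any registered strategy
// (e.g. the data source strategy that handles ParquetRelation) can plan it.
case BroadcastHint(child) => planLater(child) :: Nil
{code}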

> No plan for BroadcastHint in some condition
> ---
>
> Key: SPARK-12275
> URL: https://issues.apache.org/jira/browse/SPARK-12275
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: yucai
>
> *Summary*
> No plan for BroadcastHint is generated in some condition.
> *Test Case*
> {code}
> val df1 = Seq((1, "1"), (2, "2")).toDF("key", "value")
> val parquetTempFile =
>   "%s/SPARK-_%d.parquet".format(System.getProperty("java.io.tmpdir"), 
> scala.util.Random.nextInt)
> df1.write.parquet(parquetTempFile)
> val pf1 = sqlContext.read.parquet(parquetTempFile)
> df1.join(broadcast(pf1)).count()
> {code}
> *Result*
> It will trigger assertion in QueryPlanner.scala, like below:
> {code}
> scala> df1.join(broadcast(pf1)).count()
> java.lang.AssertionError: assertion failed: No plan for BroadcastHint
> +- Relation[key#6,value#7] 
> ParquetRelation[hdfs://10.1.0.20:8020/tmp/SPARK-_1817830406.parquet]
>   at scala.Predef$.assert(Predef.scala:179)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:336)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
> {code}






[jira] [Commented] (SPARK-9690) Add random seed Param to PySpark CrossValidator

2015-12-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053588#comment-15053588
 ] 

Apache Spark commented on SPARK-9690:
-

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/10268

> Add random seed Param to PySpark CrossValidator
> ---
>
> Key: SPARK-9690
> URL: https://issues.apache.org/jira/browse/SPARK-9690
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 1.4.1
>Reporter: Martin Menestret
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The fold in the ML CrossValidator depends on a rand whose seed is set to 0, 
> which leads sql.functions.rand to call sc._jvm.functions.rand() with no seed.
> In order to be able to unit test a cross validation, it would be a good idea 
> to be able to set this seed so that the output of the cross validation (with 
> a featureSubsetStrategy set to "all") would always be the same.






[jira] [Assigned] (SPARK-12276) Prevent RejectedExecutionException by checking if ThreadPoolExecutor is shutdown and its capacity

2015-12-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12276:


Assignee: Apache Spark

> Prevent RejectedExecutionException by checking if ThreadPoolExecutor is 
> shutdown and its capacity
> -
>
> Key: SPARK-12276
> URL: https://issues.apache.org/jira/browse/SPARK-12276
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>Priority: Minor
>
> We noticed that it is possible to hit a RejectedExecutionException when 
> submitting a task in AppClient. The error looks like the following. We should 
> add some checks to prevent it.
> java.util.concurrent.RejectedExecutionException: Task 
> java.util.concurrent.FutureTask@2077082c rejected from 
> java.util.concurrent.ThreadPoolExecutor@66b9915a[Running, pool size = 1, 
> active threads = 0, queued tasks = 0, completed tasks = 1]
> at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
> at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
> at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
> at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110)
> at 
> org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:96)






[jira] [Assigned] (SPARK-12276) Prevent RejectedExecutionException by checking if ThreadPoolExecutor is shutdown and its capacity

2015-12-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12276:


Assignee: (was: Apache Spark)

> Prevent RejectedExecutionException by checking if ThreadPoolExecutor is 
> shutdown and its capacity
> -
>
> Key: SPARK-12276
> URL: https://issues.apache.org/jira/browse/SPARK-12276
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>
> We noticed that it is possible to hit a RejectedExecutionException when 
> submitting a task in AppClient. The error looks like the following. We should 
> add some checks to prevent it.
> java.util.concurrent.RejectedExecutionException: Task 
> java.util.concurrent.FutureTask@2077082c rejected from 
> java.util.concurrent.ThreadPoolExecutor@66b9915a[Running, pool size = 1, 
> active threads = 0, queued tasks = 0, completed tasks = 1]
> at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
> at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
> at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
> at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110)
> at 
> org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:96)






[jira] [Updated] (SPARK-9694) Add random seed Param to Scala CrossValidator

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9694:
-
Assignee: Yanbo Liang

> Add random seed Param to Scala CrossValidator
> -
>
> Key: SPARK-9694
> URL: https://issues.apache.org/jira/browse/SPARK-9694
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Minor
>







[jira] [Updated] (SPARK-9694) Add random seed Param to Scala CrossValidator

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9694:
-
Shepherd: Joseph K. Bradley
Target Version/s: 2.0.0  (was: )

> Add random seed Param to Scala CrossValidator
> -
>
> Key: SPARK-9694
> URL: https://issues.apache.org/jira/browse/SPARK-9694
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Minor
>







[jira] [Created] (SPARK-12284) Output UnsafeRow from window function

2015-12-11 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12284:
--

 Summary: Output UnsafeRow from window function
 Key: SPARK-12284
 URL: https://issues.apache.org/jira/browse/SPARK-12284
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu









[jira] [Updated] (SPARK-11529) Add section in user guide for StreamingLogisticRegressionWithSGD

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11529:
--
Target Version/s:   (was: 1.6.0)

> Add section in user guide for StreamingLogisticRegressionWithSGD
> 
>
> Key: SPARK-11529
> URL: https://issues.apache.org/jira/browse/SPARK-11529
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Reporter: Joseph K. Bradley
>
> [~freeman-lab] Would you be able to do this for 1.6?  Or if there are others 
> who can, could you please ping them?  Thanks!






[jira] [Created] (SPARK-12286) Support UnsafeRow in all SparkPlan (if possible)

2015-12-11 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12286:
--

 Summary: Support UnsafeRow in all SparkPlan (if possible)
 Key: SPARK-12286
 URL: https://issues.apache.org/jira/browse/SPARK-12286
 Project: Spark
  Issue Type: Epic
  Components: SQL
Reporter: Davies Liu


There are still some SparkPlans that do not support UnsafeRow (or do not 
support it well).






[jira] [Created] (SPARK-12289) Support UnsafeRow in TakeOrderedAndProject/Limit

2015-12-11 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12289:
--

 Summary: Support UnsafeRow in TakeOrderedAndProject/Limit
 Key: SPARK-12289
 URL: https://issues.apache.org/jira/browse/SPARK-12289
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu









[jira] [Commented] (SPARK-11885) UDAF may nondeterministically generate wrong results

2015-12-11 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053616#comment-15053616
 ] 

Yin Huai commented on SPARK-11885:
--

[~davies] btw, which exprId was generated at executor side?

> UDAF may nondeterministically generate wrong results
> 
>
> Key: SPARK-11885
> URL: https://issues.apache.org/jira/browse/SPARK-11885
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Yin Huai
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.5.3
>
>
> I could not reproduce it in the 1.6 branch (it can be easily reproduced in 
> 1.5), so I think it is an issue in the 1.5 branch.
> Try the following in Spark 1.5 (with a cluster) and you can see the problem.
> {code}
> import java.math.BigDecimal
> import org.apache.spark.sql.expressions.MutableAggregationBuffer
> import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{StructType, StructField, DataType, 
> DoubleType, LongType}
> class GeometricMean extends UserDefinedAggregateFunction {
>   def inputSchema: StructType =
> StructType(StructField("value", DoubleType) :: Nil)
>   def bufferSchema: StructType = StructType(
> StructField("count", LongType) ::
>   StructField("product", DoubleType) :: Nil
>   )
>   def dataType: DataType = DoubleType
>   def deterministic: Boolean = true
>   def initialize(buffer: MutableAggregationBuffer): Unit = {
> buffer(0) = 0L
> buffer(1) = 1.0
>   }
>   def update(buffer: MutableAggregationBuffer,input: Row): Unit = {
> buffer(0) = buffer.getAs[Long](0) + 1
> buffer(1) = buffer.getAs[Double](1) * input.getAs[Double](0)
>   }
>   def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
> buffer1(0) = buffer1.getAs[Long](0) + buffer2.getAs[Long](0)
> buffer1(1) = buffer1.getAs[Double](1) * buffer2.getAs[Double](1)
>   }
>   def evaluate(buffer: Row): Any = {
> math.pow(buffer.getDouble(1), 1.0d / buffer.getLong(0))
>   }
> }
> sqlContext.udf.register("gm", new GeometricMean)
> val df = Seq(
>   (1, "italy", "emilia", 42, BigDecimal.valueOf(100, 0), "john"),
>   (2, "italy", "toscana", 42, BigDecimal.valueOf(505, 1), "jim"),
>   (3, "italy", "puglia", 42, BigDecimal.valueOf(70, 0), "jenn"),
>   (4, "italy", "emilia", 42, BigDecimal.valueOf(75 ,0), "jack"),
>   (5, "uk", "london", 42, BigDecimal.valueOf(200 ,0), "carl"),
>   (6, "italy", "emilia", 42, BigDecimal.valueOf(42, 0), "john")).
>   toDF("receipt_id", "store_country", "store_region", "store_id", "amount", 
> "seller_name")
> df.registerTempTable("receipts")
>   
> val q = sql("""
> select   store_country,
>  store_region,
>  avg(amount),
>  sum(amount),
>  gm(amount)
> from receipts
> whereamount > 50
>  and store_country = 'italy'
> group by store_country, store_region
> """)
> q.show
> {code}






[jira] [Closed] (SPARK-12047) Unhelpful error messages generated by JavaDoc while doing sbt unidoc

2015-12-11 Thread Neelesh Srinivas Salian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neelesh Srinivas Salian closed SPARK-12047.
---
Resolution: Duplicate

> Unhelpful error messages generated by JavaDoc while doing sbt unidoc
> 
>
> Key: SPARK-12047
> URL: https://issues.apache.org/jira/browse/SPARK-12047
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.6.0
>Reporter: Cheng Lian
>
> I'm not quite familiar with the internal mechanism of the SBT Unidoc plugin, 
> but it seems that it tries to convert Scala files into Java files and then 
> runs {{javadoc}} over the generated files to produce JavaDoc pages.
> During this process, {{javadoc}} keeps producing unhelpful error messages 
> like:
> {noformat}
> [error] 
> /Users/lian/local/src/spark/branch-1.6/mllib/target/java/org/apache/spark/ml/PredictionModel.java:16:
>  error: unknown tag: group
> [error]   /** @group setParam */
> [error]   ^
> [error] 
> /Users/lian/local/src/spark/branch-1.6/graphx/target/java/org/apache/spark/graphx/lib/PageRank.java:83:
>  error: unknown tag: tparam
> [error]* @tparam ED the original edge attribute (not used)
> [error]  ^
> [error] 
> /Users/lian/local/src/spark/branch-1.6/core/target/java/org/apache/spark/ContextCleaner.java:76:
>  error: BlockManagerMaster is not public in org.apache.spark.storage; cannot 
> be accessed from outside package
> [error]   private  org.apache.spark.storage.BlockManagerMaster 
> blockManagerMaster () { throw new RuntimeException(); }
> [error]^
> [error] 
> /Users/lian/local/src/spark/branch-1.6/mllib/target/java/org/apache/spark/mllib/linalg/distributed/BlockMatrix.java:72:
>  error: reference not found
> [error]* if it is being added to a {@link DenseMatrix}. If two dense 
> matrices are added, the output will
> [error]   ^
> {noformat}
> The {{scaladoc}} tool also produces tons of warning messages like:
> {noformat}
> [warn] 
> /Users/lian/local/src/spark/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/Column.scala:1117:
>  Could not find any member to link for "StructField".
> [warn]   /**
> [warn]   ^
> {noformat}
> (This one is probably because of 
> [SI-3695|https://issues.scala-lang.org/browse/SI-3695] and 
> [SI-8734|https://issues.scala-lang.org/browse/SI-8734]).
> The problem is that they obscure the real problems and make API doc auditing 
> difficult.






[jira] [Created] (SPARK-12290) Change the default value in SparkPlan

2015-12-11 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12290:
--

 Summary: Change the default value in SparkPlan
 Key: SPARK-12290
 URL: https://issues.apache.org/jira/browse/SPARK-12290
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu


supportUnsafeRows = true
supportSafeRows = false  //
outputUnsafeRows = true






[jira] [Updated] (SPARK-12272) Gradient boosted trees: too slow at the first finding best splits

2015-12-11 Thread Wenmin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenmin Wu updated SPARK-12272:
--
Attachment: screenshot-1.png

> Gradient boosted trees: too slow at the first finding best splits
> -
>
> Key: SPARK-12272
> URL: https://issues.apache.org/jira/browse/SPARK-12272
> Project: Spark
>  Issue Type: Request
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Wenmin Wu
> Attachments: screenshot-1.png, training-log1.png
>
>







[jira] [Commented] (SPARK-6270) Standalone Master hangs when streaming job completes and event logging is enabled

2015-12-11 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053810#comment-15053810
 ] 

Josh Rosen commented on SPARK-6270:
---

While I think that we should have this discussion about UI reconstruction of 
long-running applications, I think this is orthogonal to the right solution for 
this issue (SPARK-6270). The root problem here, related to the master / cluster 
manager dying, seems to be caused by a design flaw: why is the master 
responsible for serving historical UIs? The standalone history server process 
should have that responsibility, since UI serving might need a lot of memory.

I think the right fix here is to just remove the Master's embedded history 
server; I just don't think it makes sense to assign history server 
responsibilities to the master when it's designed to be a very 
low-resource-use, high-stability, high-resiliency service.

> Standalone Master hangs when streaming job completes and event logging is 
> enabled
> -
>
> Key: SPARK-6270
> URL: https://issues.apache.org/jira/browse/SPARK-6270
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Streaming
>Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.5.1
>Reporter: Tathagata Das
>Priority: Critical
>
> If the event logging is enabled, the Spark Standalone Master tries to 
> recreate the web UI of a completed Spark application from its event logs. 
> However if this event log is huge (e.g. for a Spark Streaming application), 
> then the master hangs in its attempt to read and recreate the web ui. This 
> hang causes the whole standalone cluster to be unusable. 
> Workaround is to disable the event logging.






[jira] [Commented] (SPARK-10931) PySpark ML Models should contain Param values

2015-12-11 Thread Evan Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053808#comment-15053808
 ] 

Evan Chen commented on SPARK-10931:
---

Hey Joseph,

Thanks for the suggestion. 
I was wondering which model abstraction and getattr method you are referring to.
I modified every model on the Python side to reflect how it is done on the 
Scala side. 
Let me know what you think.

> PySpark ML Models should contain Param values
> -
>
> Key: SPARK-10931
> URL: https://issues.apache.org/jira/browse/SPARK-10931
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> PySpark spark.ml Models are generally wrappers around Java objects and do not 
> even contain Param values.  This JIRA is for copying the Param values from 
> the Estimator to the model.
> This can likely be solved by modifying Estimator.fit to copy Param values, 
> but should also include proper unit tests.






[jira] [Updated] (SPARK-12272) Gradient boosted trees: too slow at the first finding best splits

2015-12-11 Thread Wenmin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenmin Wu updated SPARK-12272:
--
Attachment: training-log3.png

> Gradient boosted trees: too slow at the first finding best splits
> -
>
> Key: SPARK-12272
> URL: https://issues.apache.org/jira/browse/SPARK-12272
> Project: Spark
>  Issue Type: Request
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Wenmin Wu
> Attachments: training-log1.png, training-log2.pnd.png, 
> training-log3.png
>
>







[jira] [Created] (SPARK-12282) Document spark.jars

2015-12-11 Thread Justin Bailey (JIRA)
Justin Bailey created SPARK-12282:
-

 Summary: Document spark.jars
 Key: SPARK-12282
 URL: https://issues.apache.org/jira/browse/SPARK-12282
 Project: Spark
  Issue Type: Documentation
Reporter: Justin Bailey


The spark.jars property (as implemented in SparkSubmit.scala,  
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L516)
 is not documented anywhere, and should be.








[jira] [Updated] (SPARK-12280) "--packages" command doesn't work in "spark-submit"

2015-12-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12280:
--
Component/s: Spark Submit

> "--packages" command doesn't work in "spark-submit"
> ---
>
> Key: SPARK-12280
> URL: https://issues.apache.org/jira/browse/SPARK-12280
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Reporter: Anton Loss
>Priority: Minor
>
> when running "spark-shell", then "--packages" option works as expected, but 
> with "spark-submit" it produces following stacktrace
> 15/12/11 17:05:48 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 15/12/11 17:05:51 WARN Client: Resource 
> file:/home/anton/data-tools-1.0-SNAPSHOT-jar-with-dependencies.jar added 
> multiple times to distributed cache.
> Exception in thread "main" java.io.FileNotFoundException: Requested file 
> maprfs:///home/mapr/.ivy2/jars/com.databricks_spark-csv_2.11-1.3.0.jar does 
> not exist.
>   at 
> com.mapr.fs.MapRFileSystem.getMapRFileStatus(MapRFileSystem.java:1332)
>   at com.mapr.fs.MapRFileSystem.getFileStatus(MapRFileSystem.java:942)
>   at com.mapr.fs.MFS.getFileStatus(MFS.java:151)
>   at 
> org.apache.hadoop.fs.AbstractFileSystem.resolvePath(AbstractFileSystem.java:467)
>   at org.apache.hadoop.fs.FileContext$25.next(FileContext.java:2193)
>   at org.apache.hadoop.fs.FileContext$25.next(FileContext.java:2189)
>   at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>   at org.apache.hadoop.fs.FileContext.resolve(FileContext.java:2189)
>   at org.apache.hadoop.fs.FileContext.resolvePath(FileContext.java:601)
>   at 
> org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:242)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$1.apply(Client.scala:366)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$1.apply(Client.scala:360)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6.apply(Client.scala:360)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6.apply(Client.scala:358)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:358)
>   at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:561)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115)
>   at org.apache.spark.deploy.yarn.Client.run(Client.scala:842)
>   at org.apache.spark.deploy.yarn.Client$.main(Client.scala:881)
>   at org.apache.spark.deploy.yarn.Client.main(Client.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> It seems it is looking in the wrong place, as the jar is clearly present here:
> file:///home/mapr/.ivy2/jars/com.databricks_spark-csv_2.11-1.3.0.jar



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11497) PySpark RowMatrix Constructor Has Type Erasure Issue

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11497:
--
Target Version/s: 1.6.1, 2.0.0

> PySpark RowMatrix Constructor Has Type Erasure Issue
> 
>
> Key: SPARK-11497
> URL: https://issues.apache.org/jira/browse/SPARK-11497
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.5.0, 1.5.1, 1.6.0
>Reporter: Mike Dusenberry
>Assignee: Mike Dusenberry
>Priority: Minor
>
> Implementing tallSkinnyQR in SPARK-9656 uncovered a bug with our PySpark 
> RowMatrix constructor. As discussed on the dev list 
> [here|http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html],
>  there appears to be an issue with type erasure with RDDs coming from Java, 
> and by extension from PySpark. Although we are attempting to construct a 
> RowMatrix from an RDD[Vector] in 
> [PythonMLlibAPI|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115],
>  the Vector type is erased, resulting in an RDD[Object]. Thus, when calling 
> Scala's tallSkinnyQR from PySpark, we get a Java ClassCastException in which 
> an Object cannot be cast to a Spark Vector. As noted in the aforementioned 
> dev list thread, this issue was also encountered with DecisionTrees, and the 
> fix involved an explicit retag of the RDD with a Vector type. Thus, this PR 
> will apply that fix to the createRowMatrix helper function in PythonMLlibAPI. 
> IndexedRowMatrix and CoordinateMatrix do not appear to have this issue likely 
> due to their related helper functions in PythonMLlibAPI creating the RDDs 
> explicitly from DataFrames with pattern matching, thus preserving the types. 
> The following reproduces this issue on the latest Git head, 1.5.1, and 1.5.0:
> {code}
> from pyspark.mllib.linalg.distributed import RowMatrix
> rows = sc.parallelize([[3, -6], [4, -8], [0, 1]])
> mat = RowMatrix(rows)
> mat._java_matrix_wrapper.call("tallSkinnyQR", True)
> {code}
> Should result in the following exception:
> {code}
> java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to 
> [Lorg.apache.spark.mllib.linalg.Vector;
> {code}
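
For context, a hedged sketch of the retag-based fix described above; the helper name and signature are illustrative and not the actual PythonMLlibAPI code:

{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

// Retagging restores the Vector element type erased when the RDD crosses the
// Py4J boundary, so the later cast inside tallSkinnyQR succeeds. RDD.retag is
// package-private to org.apache.spark, so a fix like this can only live inside
// Spark itself (e.g. in PythonMLlibAPI).
def createRowMatrix(rows: RDD[Vector], numRows: Long, numCols: Int): RowMatrix =
  new RowMatrix(rows.retag(classOf[Vector]), numRows, numCols)
{code}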



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12283) Use UnsafeRow as the buffer in SortBasedAggregation to avoid Unsafe/Safe conversion

2015-12-11 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12283:
--

 Summary: Use UnsafeRow as the buffer in SortBasedAggregation to 
avoid Unsafe/Safe conversion
 Key: SPARK-12283
 URL: https://issues.apache.org/jira/browse/SPARK-12283
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu


SortBasedAggregation uses GenericMutableRow as the aggregation buffer and also 
requires that the input not be UnsafeRow, because we can't compare/evaluate 
UnsafeRow and GenericInternalRow at the same time. TungstenSort outputs 
UnsafeRow, so multiple Safe/Unsafe projections will be inserted between them.

If we can make sure that all the mutations happen in ascending order, an 
UnsafeRow buffer could be used to update variable-length objects (String, Binary, 
Struct, etc.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12217) Document invalid handling for StringIndexer

2015-12-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12217:


Assignee: Apache Spark

> Document invalid handling for StringIndexer
> ---
>
> Key: SPARK-12217
> URL: https://issues.apache.org/jira/browse/SPARK-12217
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Benjamin Fradet
>Assignee: Apache Spark
>Priority: Minor
>
> Documentation is needed regarding the handling of invalid labels in 
> StringIndexer
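
A minimal sketch of what the requested documentation would need to cover, assuming the handleInvalid Param added to StringIndexer in 1.6; the DataFrames and column names below are placeholders:

{code}
import org.apache.spark.ml.feature.StringIndexer

// "error" (the default) fails on labels unseen during fit; "skip" drops those rows.
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setHandleInvalid("skip")

val indexed = indexer.fit(training).transform(testWithUnseenLabels)
{code}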



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12217) Document invalid handling for StringIndexer

2015-12-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12217:


Assignee: (was: Apache Spark)

> Document invalid handling for StringIndexer
> ---
>
> Key: SPARK-12217
> URL: https://issues.apache.org/jira/browse/SPARK-12217
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Benjamin Fradet
>Priority: Minor
>
> Documentation is needed regarding the handling of invalid labels in 
> StringIndexer



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12217) Document invalid handling for StringIndexer

2015-12-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12217:


Assignee: Apache Spark

> Document invalid handling for StringIndexer
> ---
>
> Key: SPARK-12217
> URL: https://issues.apache.org/jira/browse/SPARK-12217
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Benjamin Fradet
>Assignee: Apache Spark
>Priority: Minor
>
> Documentation is needed regarding the handling of invalid labels in 
> StringIndexer



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12286) Support UnsafeRow in all SparkPlan (if possible)

2015-12-11 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-12286:
--

Assignee: Davies Liu

> Support UnsafeRow in all SparkPlan (if possible)
> 
>
> Key: SPARK-12286
> URL: https://issues.apache.org/jira/browse/SPARK-12286
> Project: Spark
>  Issue Type: Epic
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> There are still some SparkPlan nodes that do not support UnsafeRow (or do not 
> support it well).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12287) Support UnsafeRow in MapPartitions/MapGroups/CoGroup

2015-12-11 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-12287:
---
Issue Type: Improvement  (was: Epic)

> Support UnsafeRow in MapPartitions/MapGroups/CoGroup
> 
>
> Key: SPARK-12287
> URL: https://issues.apache.org/jira/browse/SPARK-12287
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11965) Update user guide for RFormula feature interactions

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11965:
--
Assignee: Yanbo Liang

> Update user guide for RFormula feature interactions
> ---
>
> Key: SPARK-11965
> URL: https://issues.apache.org/jira/browse/SPARK-11965
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>
> Update the user guide for RFormula to cover feature interactions
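
As a starting point, a hedged sketch of the kind of example the updated guide could show; the formula string, DataFrame, and column names are assumptions:

{code}
import org.apache.spark.ml.feature.RFormula

// "a:b" denotes the interaction of columns a and b.
val formula = new RFormula()
  .setFormula("label ~ a + b + a:b")
  .setFeaturesCol("features")
  .setLabelCol("label")

val output = formula.fit(df).transform(df)
{code}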



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12288) Support UnsafeRow in Coalesce/Except/Intersect

2015-12-11 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12288:
--

 Summary: Support UnsafeRow in Coalesce/Except/Intersect
 Key: SPARK-12288
 URL: https://issues.apache.org/jira/browse/SPARK-12288
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11885) UDAF may nondeterministically generate wrong results

2015-12-11 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053613#comment-15053613
 ] 

Yin Huai commented on SPARK-11885:
--

Thanks [~davies]! [~milad.bourh...@gmail.com] Can you try our latest branch 1.5 
and see if it is fixed for your case?

> UDAF may nondeterministically generate wrong results
> 
>
> Key: SPARK-11885
> URL: https://issues.apache.org/jira/browse/SPARK-11885
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Yin Huai
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.5.3
>
>
> I could not reproduce it in 1.6 branch (it can be easily reproduced in 1.5). 
> I think it is an issue in 1.5 branch.
> Try the following in spark 1.5 (with a cluster) and you can see the problem.
> {code}
> import java.math.BigDecimal
> import org.apache.spark.sql.expressions.MutableAggregationBuffer
> import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{StructType, StructField, DataType, 
> DoubleType, LongType}
> class GeometricMean extends UserDefinedAggregateFunction {
>   def inputSchema: StructType =
> StructType(StructField("value", DoubleType) :: Nil)
>   def bufferSchema: StructType = StructType(
> StructField("count", LongType) ::
>   StructField("product", DoubleType) :: Nil
>   )
>   def dataType: DataType = DoubleType
>   def deterministic: Boolean = true
>   def initialize(buffer: MutableAggregationBuffer): Unit = {
> buffer(0) = 0L
> buffer(1) = 1.0
>   }
>   def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
> buffer(0) = buffer.getAs[Long](0) + 1
> buffer(1) = buffer.getAs[Double](1) * input.getAs[Double](0)
>   }
>   def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
> buffer1(0) = buffer1.getAs[Long](0) + buffer2.getAs[Long](0)
> buffer1(1) = buffer1.getAs[Double](1) * buffer2.getAs[Double](1)
>   }
>   def evaluate(buffer: Row): Any = {
> math.pow(buffer.getDouble(1), 1.0d / buffer.getLong(0))
>   }
> }
> sqlContext.udf.register("gm", new GeometricMean)
> val df = Seq(
>   (1, "italy", "emilia", 42, BigDecimal.valueOf(100, 0), "john"),
>   (2, "italy", "toscana", 42, BigDecimal.valueOf(505, 1), "jim"),
>   (3, "italy", "puglia", 42, BigDecimal.valueOf(70, 0), "jenn"),
>   (4, "italy", "emilia", 42, BigDecimal.valueOf(75 ,0), "jack"),
>   (5, "uk", "london", 42, BigDecimal.valueOf(200 ,0), "carl"),
>   (6, "italy", "emilia", 42, BigDecimal.valueOf(42, 0), "john")).
>   toDF("receipt_id", "store_country", "store_region", "store_id", "amount", 
> "seller_name")
> df.registerTempTable("receipts")
>   
> val q = sql("""
> select   store_country,
>  store_region,
>  avg(amount),
>  sum(amount),
>  gm(amount)
> from receipts
> where    amount > 50
>  and store_country = 'italy'
> group by store_country, store_region
> """)
> q.show
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12047) Unhelpful error messages generated by JavaDoc while doing sbt unidoc

2015-12-11 Thread Neelesh Srinivas Salian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053614#comment-15053614
 ] 

Neelesh Srinivas Salian commented on SPARK-12047:
-

Closing these since they are duplicated by the above-mentioned JIRAs.
Thank you.

> Unhelpful error messages generated by JavaDoc while doing sbt unidoc
> 
>
> Key: SPARK-12047
> URL: https://issues.apache.org/jira/browse/SPARK-12047
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.6.0
>Reporter: Cheng Lian
>
> I'm not quite familiar with the internal mechanism of the SBT Unidoc plugin, 
> but it seems that it tries to convert Scala files into Java files and then 
> run {{javadoc}} over the generated files to produce JavaDoc pages.
> During this process, {{javadoc}} keeps producing unhelpful error messages 
> like:
> {noformat}
> [error] 
> /Users/lian/local/src/spark/branch-1.6/mllib/target/java/org/apache/spark/ml/PredictionModel.java:16:
>  error: unknown tag: group
> [error]   /** @group setParam */
> [error]   ^
> [error] 
> /Users/lian/local/src/spark/branch-1.6/graphx/target/java/org/apache/spark/graphx/lib/PageRank.java:83:
>  error: unknown tag: tparam
> [error]* @tparam ED the original edge attribute (not used)
> [error]  ^
> [error] 
> /Users/lian/local/src/spark/branch-1.6/core/target/java/org/apache/spark/ContextCleaner.java:76:
>  error: BlockManagerMaster is not public in org.apache.spark.storage; cannot 
> be accessed from outside package
> [error]   private  org.apache.spark.storage.BlockManagerMaster 
> blockManagerMaster () { throw new RuntimeException(); }
> [error]^
> [error] 
> /Users/lian/local/src/spark/branch-1.6/mllib/target/java/org/apache/spark/mllib/linalg/distributed/BlockMatrix.java:72:
>  error: reference not found
> [error]* if it is being added to a {@link DenseMatrix}. If two dense 
> matrices are added, the output will
> [error]   ^
> {noformat}
> The {{scaladoc}} tool also produces tons of warning messages like:
> {noformat}
> [warn] 
> /Users/lian/local/src/spark/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/Column.scala:1117:
>  Could not find any member to link for "StructField".
> [warn]   /**
> [warn]   ^
> {noformat}
> (This one is probably because of 
> [SI-3695|https://issues.scala-lang.org/browse/SI-3695] and 
> [SI-8734|https://issues.scala-lang.org/browse/SI-8734]).
> The problem is that they cover up the real problems and make API doc auditing 
> difficult.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12273) Spark Streaming Web UI does not list Receivers in order

2015-12-11 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-12273.
--
   Resolution: Fixed
 Assignee: (was: Apache Spark)
Fix Version/s: 2.0.0

> Spark Streaming Web UI does not list Receivers in order
> ---
>
> Key: SPARK-12273
> URL: https://issues.apache.org/jira/browse/SPARK-12273
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming, Web UI
>Affects Versions: 1.5.2
>Reporter: Liwei Lin
>Priority: Minor
> Fix For: 2.0.0
>
> Attachments: Spark-12273.png
>
>
> Currently the Streaming web UI does NOT list Receivers in order, though it 
> would be more convenient for users if Receivers were listed in order.
> !Spark-12273.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12281) Fixed potential exceptions when exiting a local cluster.

2015-12-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053648#comment-15053648
 ] 

Apache Spark commented on SPARK-12281:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/10269

> Fixed potential exceptions when exiting a local cluster.
> 
>
> Key: SPARK-12281
> URL: https://issues.apache.org/jira/browse/SPARK-12281
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> Fixed the following potential exceptions when exiting a local cluster.
> {code}
> java.lang.AssertionError: assertion failed: executor 4 state transfer from 
> RUNNING to RUNNING is illegal
>   at scala.Predef$.assert(Predef.scala:179)
>   at 
> org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:260)
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> {code}
> java.lang.IllegalStateException: Shutdown hooks cannot be modified during 
> shutdown.
>   at 
> org.apache.spark.util.SparkShutdownHookManager.add(ShutdownHookManager.scala:246)
>   at 
> org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:191)
>   at 
> org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:180)
>   at 
> org.apache.spark.deploy.worker.ExecutorRunner.start(ExecutorRunner.scala:73)
>   at 
> org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:474)
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12285) MLlib user guide: umbrella for missing sections

2015-12-11 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-12285:
-

 Summary: MLlib user guide: umbrella for missing sections
 Key: SPARK-12285
 URL: https://issues.apache.org/jira/browse/SPARK-12285
 Project: Spark
  Issue Type: Umbrella
  Components: Documentation, ML, MLlib
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley


This is an umbrella for updating the MLlib user/programming guide for new APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11606) ML 1.6 QA: Update user guide for new APIs

2015-12-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053734#comment-15053734
 ] 

Joseph K. Bradley commented on SPARK-11606:
---

I'm going to split off the remaining guide sections into a new umbrella JIRA so 
that I can close this one.

> ML 1.6 QA: Update user guide for new APIs
> -
>
> Key: SPARK-11606
> URL: https://issues.apache.org/jira/browse/SPARK-11606
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> Check the user guide vs. a list of new APIs (classes, methods, data members) 
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
> * Create a JIRA for that feature, and assign it to the author of the feature
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").
> Note: Now that we have algorithms in spark.ml which are not in spark.mllib, 
> we should make subsections for the spark.ml API as needed. We can follow the 
> structure of the spark.mllib user guide.
> * The spark.ml user guide can provide: (a) code examples and (b) info on 
> algorithms which do not exist in spark.mllib.
> * We should not duplicate info in the spark.ml guides. Since spark.mllib is 
> still the primary API, we should provide links to the corresponding 
> algorithms in the spark.mllib user guide for more info.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11606) ML 1.6 QA: Update user guide for new APIs

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-11606.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

> ML 1.6 QA: Update user guide for new APIs
> -
>
> Key: SPARK-11606
> URL: https://issues.apache.org/jira/browse/SPARK-11606
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
> Fix For: 1.6.0
>
>
> Check the user guide vs. a list of new APIs (classes, methods, data members) 
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
> * Create a JIRA for that feature, and assign it to the author of the feature
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").
> Note: Now that we have algorithms in spark.ml which are not in spark.mllib, 
> we should make subsections for the spark.ml API as needed. We can follow the 
> structure of the spark.mllib user guide.
> * The spark.ml user guide can provide: (a) code examples and (b) info on 
> algorithms which do not exist in spark.mllib.
> * We should not duplicate info in the spark.ml guides. Since spark.mllib is 
> still the primary API, we should provide links to the corresponding 
> algorithms in the spark.mllib user guide for more info.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12272) Gradient boosted trees: too slow at the first finding best splits

2015-12-11 Thread Wenmin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenmin Wu updated SPARK-12272:
--
Attachment: (was: screenshot-1.png)

> Gradient boosted trees: too slow at the first finding best splits
> -
>
> Key: SPARK-12272
> URL: https://issues.apache.org/jira/browse/SPARK-12272
> Project: Spark
>  Issue Type: Request
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Wenmin Wu
> Attachments: training-log1.png
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12272) Gradient boosted trees: too slow at the first finding best splits

2015-12-11 Thread Wenmin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenmin Wu updated SPARK-12272:
--
Attachment: training-log2.pnd.png

> Gradient boosted trees: too slow at the first finding best splits
> -
>
> Key: SPARK-12272
> URL: https://issues.apache.org/jira/browse/SPARK-12272
> Project: Spark
>  Issue Type: Request
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Wenmin Wu
> Attachments: training-log1.png, training-log2.pnd.png
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.

2015-12-11 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-12297:
-

 Summary: Add work-around for Parquet/Hive int96 timestamp bug.
 Key: SPARK-12297
 URL: https://issues.apache.org/jira/browse/SPARK-12297
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Reporter: Ryan Blue


Hive has a bug where timestamps in Parquet data are incorrectly adjusted from 
the SQL session time zone to UTC, as though they carried time zone information. 
This is incorrect behavior because timestamp values are SQL timestamps without 
time zone and should not be internally changed.
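
A rough sketch of the kind of adjustment a read-side work-around would apply; the helper name, the direction of the shift, and where it would be hooked in are all assumptions, not the actual fix:

{code}
import java.sql.Timestamp
import java.util.TimeZone

// Undo the bogus session-zone-to-UTC shift applied when the int96 value was
// written, so the stored value reads back as the original wall-clock timestamp.
// (The sign of the shift depends on the direction of the original adjustment.)
def undoHiveInt96Adjustment(ts: Timestamp, sessionTz: TimeZone): Timestamp =
  new Timestamp(ts.getTime + sessionTz.getOffset(ts.getTime))
{code}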



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12281) Fixed potential exceptions when exiting a local cluster.

2015-12-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12281:


Assignee: Apache Spark

> Fixed potential exceptions when exiting a local cluster.
> 
>
> Key: SPARK-12281
> URL: https://issues.apache.org/jira/browse/SPARK-12281
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> Fixed the following potential exceptions when exiting a local cluster.
> {code}
> java.lang.AssertionError: assertion failed: executor 4 state transfer from 
> RUNNING to RUNNING is illegal
>   at scala.Predef$.assert(Predef.scala:179)
>   at 
> org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:260)
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> {code}
> java.lang.IllegalStateException: Shutdown hooks cannot be modified during 
> shutdown.
>   at 
> org.apache.spark.util.SparkShutdownHookManager.add(ShutdownHookManager.scala:246)
>   at 
> org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:191)
>   at 
> org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:180)
>   at 
> org.apache.spark.deploy.worker.ExecutorRunner.start(ExecutorRunner.scala:73)
>   at 
> org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:474)
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11885) UDAF may nondeterministically generate wrong results

2015-12-11 Thread Milad Bourhani (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053634#comment-15053634
 ] 

Milad Bourhani commented on SPARK-11885:


Sure, I'll give it a go next week :) I'll write the results here.

> UDAF may nondeterministically generate wrong results
> 
>
> Key: SPARK-11885
> URL: https://issues.apache.org/jira/browse/SPARK-11885
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Yin Huai
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.5.3
>
>
> I could not reproduce it in 1.6 branch (it can be easily reproduced in 1.5). 
> I think it is an issue in 1.5 branch.
> Try the following in spark 1.5 (with a cluster) and you can see the problem.
> {code}
> import java.math.BigDecimal
> import org.apache.spark.sql.expressions.MutableAggregationBuffer
> import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{StructType, StructField, DataType, 
> DoubleType, LongType}
> class GeometricMean extends UserDefinedAggregateFunction {
>   def inputSchema: StructType =
> StructType(StructField("value", DoubleType) :: Nil)
>   def bufferSchema: StructType = StructType(
> StructField("count", LongType) ::
>   StructField("product", DoubleType) :: Nil
>   )
>   def dataType: DataType = DoubleType
>   def deterministic: Boolean = true
>   def initialize(buffer: MutableAggregationBuffer): Unit = {
> buffer(0) = 0L
> buffer(1) = 1.0
>   }
>   def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
> buffer(0) = buffer.getAs[Long](0) + 1
> buffer(1) = buffer.getAs[Double](1) * input.getAs[Double](0)
>   }
>   def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
> buffer1(0) = buffer1.getAs[Long](0) + buffer2.getAs[Long](0)
> buffer1(1) = buffer1.getAs[Double](1) * buffer2.getAs[Double](1)
>   }
>   def evaluate(buffer: Row): Any = {
> math.pow(buffer.getDouble(1), 1.0d / buffer.getLong(0))
>   }
> }
> sqlContext.udf.register("gm", new GeometricMean)
> val df = Seq(
>   (1, "italy", "emilia", 42, BigDecimal.valueOf(100, 0), "john"),
>   (2, "italy", "toscana", 42, BigDecimal.valueOf(505, 1), "jim"),
>   (3, "italy", "puglia", 42, BigDecimal.valueOf(70, 0), "jenn"),
>   (4, "italy", "emilia", 42, BigDecimal.valueOf(75 ,0), "jack"),
>   (5, "uk", "london", 42, BigDecimal.valueOf(200 ,0), "carl"),
>   (6, "italy", "emilia", 42, BigDecimal.valueOf(42, 0), "john")).
>   toDF("receipt_id", "store_country", "store_region", "store_id", "amount", 
> "seller_name")
> df.registerTempTable("receipts")
>   
> val q = sql("""
> select   store_country,
>  store_region,
>  avg(amount),
>  sum(amount),
>  gm(amount)
> from receipts
> where    amount > 50
>  and store_country = 'italy'
> group by store_country, store_region
> """)
> q.show
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11497) PySpark RowMatrix Constructor Has Type Erasure Issue

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11497:
--
Target Version/s: 1.5.3, 1.6.1, 2.0.0  (was: 1.6.1, 2.0.0)

> PySpark RowMatrix Constructor Has Type Erasure Issue
> 
>
> Key: SPARK-11497
> URL: https://issues.apache.org/jira/browse/SPARK-11497
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.5.0, 1.5.1, 1.6.0
>Reporter: Mike Dusenberry
>Assignee: Mike Dusenberry
>Priority: Minor
> Fix For: 1.5.3, 1.6.1, 2.0.0
>
>
> Implementing tallSkinnyQR in SPARK-9656 uncovered a bug with our PySpark 
> RowMatrix constructor. As discussed on the dev list 
> [here|http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html],
>  there appears to be an issue with type erasure with RDDs coming from Java, 
> and by extension from PySpark. Although we are attempting to construct a 
> RowMatrix from an RDD[Vector] in 
> [PythonMLlibAPI|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115],
>  the Vector type is erased, resulting in an RDD[Object]. Thus, when calling 
> Scala's tallSkinnyQR from PySpark, we get a Java ClassCastException in which 
> an Object cannot be cast to a Spark Vector. As noted in the aforementioned 
> dev list thread, this issue was also encountered with DecisionTrees, and the 
> fix involved an explicit retag of the RDD with a Vector type. Thus, this PR 
> will apply that fix to the createRowMatrix helper function in PythonMLlibAPI. 
> IndexedRowMatrix and CoordinateMatrix do not appear to have this issue likely 
> due to their related helper functions in PythonMLlibAPI creating the RDDs 
> explicitly from DataFrames with pattern matching, thus preserving the types. 
> The following reproduces this issue on the latest Git head, 1.5.1, and 1.5.0:
> {code}
> from pyspark.mllib.linalg.distributed import RowMatrix
> rows = sc.parallelize([[3, -6], [4, -8], [0, 1]])
> mat = RowMatrix(rows)
> mat._java_matrix_wrapper.call("tallSkinnyQR", True)
> {code}
> Should result in the following exception:
> {code}
> java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to 
> [Lorg.apache.spark.mllib.linalg.Vector;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11497) PySpark RowMatrix Constructor Has Type Erasure Issue

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-11497.
---
   Resolution: Fixed
Fix Version/s: 1.6.1
   1.5.3
   2.0.0

Issue resolved by pull request 9458
[https://github.com/apache/spark/pull/9458]

> PySpark RowMatrix Constructor Has Type Erasure Issue
> 
>
> Key: SPARK-11497
> URL: https://issues.apache.org/jira/browse/SPARK-11497
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.5.0, 1.5.1, 1.6.0
>Reporter: Mike Dusenberry
>Assignee: Mike Dusenberry
>Priority: Minor
> Fix For: 2.0.0, 1.5.3, 1.6.1
>
>
> Implementing tallSkinnyQR in SPARK-9656 uncovered a bug with our PySpark 
> RowMatrix constructor. As discussed on the dev list 
> [here|http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html],
>  there appears to be an issue with type erasure with RDDs coming from Java, 
> and by extension from PySpark. Although we are attempting to construct a 
> RowMatrix from an RDD[Vector] in 
> [PythonMLlibAPI|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115],
>  the Vector type is erased, resulting in an RDD[Object]. Thus, when calling 
> Scala's tallSkinnyQR from PySpark, we get a Java ClassCastException in which 
> an Object cannot be cast to a Spark Vector. As noted in the aforementioned 
> dev list thread, this issue was also encountered with DecisionTrees, and the 
> fix involved an explicit retag of the RDD with a Vector type. Thus, this PR 
> will apply that fix to the createRowMatrix helper function in PythonMLlibAPI. 
> IndexedRowMatrix and CoordinateMatrix do not appear to have this issue likely 
> due to their related helper functions in PythonMLlibAPI creating the RDDs 
> explicitly from DataFrames with pattern matching, thus preserving the types. 
> The following reproduces this issue on the latest Git head, 1.5.1, and 1.5.0:
> {code}
> from pyspark.mllib.linalg.distributed import RowMatrix
> rows = sc.parallelize([[3, -6], [4, -8], [0, 1]])
> mat = RowMatrix(rows)
> mat._java_matrix_wrapper.call("tallSkinnyQR", True)
> {code}
> Should result in the following exception:
> {code}
> java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to 
> [Lorg.apache.spark.mllib.linalg.Vector;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10931) PySpark ML Models should contain Param values

2015-12-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053721#comment-15053721
 ] 

Apache Spark commented on SPARK-10931:
--

User 'evanyc15' has created a pull request for this issue:
https://github.com/apache/spark/pull/10270

> PySpark ML Models should contain Param values
> -
>
> Key: SPARK-10931
> URL: https://issues.apache.org/jira/browse/SPARK-10931
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> PySpark spark.ml Models are generally wrappers around Java objects and do not 
> even contain Param values.  This JIRA is for copying the Param values from 
> the Estimator to the model.
> This can likely be solved by modifying Estimator.fit to copy Param values, 
> but should also include proper unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6518) Add example code and user guide for bisecting k-means

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6518:
-
Issue Type: Sub-task  (was: Documentation)
Parent: SPARK-12285

> Add example code and user guide for bisecting k-means
> -
>
> Key: SPARK-6518
> URL: https://issues.apache.org/jira/browse/SPARK-6518
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Reporter: Yu Ishikawa
>Assignee: Yu Ishikawa
>
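
Since this issue asks for example code but carries no description yet, here is a hedged sketch of the kind of snippet the guide section could include, using the spark.mllib BisectingKMeans added in 1.6; the input data is illustrative and sc is assumed to be a spark-shell SparkContext:

{code}
import org.apache.spark.mllib.clustering.BisectingKMeans
import org.apache.spark.mllib.linalg.Vectors

val vectors = sc.parallelize(Seq(
  Vectors.dense(0.1, 0.1), Vectors.dense(0.3, 0.3),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.2, 9.2)))

// Bisecting k-means splits clusters top-down until k leaf clusters remain.
val model = new BisectingKMeans().setK(2).run(vectors)
model.clusterCenters.foreach(println)
{code}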




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12215) User guide section for KMeans in spark.ml

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12215:
--
Issue Type: Sub-task  (was: Documentation)
Parent: SPARK-12285

> User guide section for KMeans in spark.ml
> -
>
> Key: SPARK-12215
> URL: https://issues.apache.org/jira/browse/SPARK-12215
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: Yu Ishikawa
>
> [~yuu.ishik...@gmail.com] Will you have time to add a user guide section for 
> this?  Thanks in advance!
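
For reference, a hedged sketch of spark.ml KMeans usage such a guide section could build on; the dataset DataFrame and its "features" column are assumptions:

{code}
import org.apache.spark.ml.clustering.KMeans

// dataset: DataFrame with a Vector column named "features".
val kmeans = new KMeans()
  .setK(2)
  .setFeaturesCol("features")
  .setPredictionCol("prediction")

val model = kmeans.fit(dataset)
model.clusterCenters.foreach(println)
{code}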



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12291) Support UnsafeRow in BroadcastLeftSemiJoinHash

2015-12-11 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12291:
--

 Summary: Support UnsafeRow in BroadcastLeftSemiJoinHash
 Key: SPARK-12291
 URL: https://issues.apache.org/jira/browse/SPARK-12291
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-6725) Model export/import for Pipeline API

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6725:
-
Comment: was deleted

(was: User 'anabranch' has created a pull request for this issue:
https://github.com/apache/spark/pull/10179)

> Model export/import for Pipeline API
> 
>
> Key: SPARK-6725
> URL: https://issues.apache.org/jira/browse/SPARK-6725
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This is an umbrella JIRA for adding model export/import to the spark.ml API.  
> This JIRA is for adding the internal Saveable/Loadable API and Parquet-based 
> format, not for other formats like PMML.
> This will require the following steps:
> * Add export/import for all PipelineStages supported by spark.ml
> ** This will include some Transformers which are not Models.
> ** These can use almost the same format as the spark.mllib model save/load 
> functions, but the model metadata must store a different class name (marking 
> the class as a spark.ml class).
> * After all PipelineStages support save/load, add an interface which forces 
> future additions to support save/load.
> *UPDATE*: In spark.ml, we could save feature metadata using DataFrames.  
> Other libraries and formats can support this, and it would be great if we 
> could too.  We could do either of the following:
> * save() optionally takes a dataset (or schema), and load will return a 
> (model, schema) pair.
> * Models themselves save the input schema.
> Both options would mean inheriting from new Saveable, Loadable types.
> *UPDATE: DESIGN DOC*: Here's a design doc which I wrote.  If you have 
> comments about the planned implementation, please comment in this JIRA.  
> Thanks!  
> [https://docs.google.com/document/d/1RleM4QiKwdfZZHf0_G6FBNaF7_koc1Ui7qfMT1pf4IA/edit?usp=sharing]
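
To make the two options above concrete, a hedged sketch of what the interfaces might look like; the trait and method names are illustrative only, not the planned spark.ml API:

{code}
import org.apache.spark.sql.types.StructType

// Option 1: save() optionally takes the input schema alongside the model.
trait MLSaveable {
  def save(path: String, schema: Option[StructType] = None): Unit
}

// Option 2: the saved model carries its input schema, and load() returns both.
trait MLLoadable[M] {
  def load(path: String): (M, Option[StructType])
}
{code}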



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver

2015-12-11 Thread Phil Kallos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053639#comment-15053639
 ] 

Phil Kallos commented on SPARK-11193:
-

Yes, the code looks great to me, thanks JB and Sean. Any indication that this will 
make the 1.6 release?

> Spark 1.5+ Kinesis Streaming - ClassCastException when starting 
> KinesisReceiver
> ---
>
> Key: SPARK-11193
> URL: https://issues.apache.org/jira/browse/SPARK-11193
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Phil Kallos
> Attachments: screen.png
>
>
> After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis 
> Spark Streaming application, and am being consistently greeted with this 
> exception:
> java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast 
> to scala.collection.mutable.SynchronizedMap
>   at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Worth noting that I am able to reproduce this issue locally, and also on 
> Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0).
> Also, I am not able to run the included kinesis-asl example.
> Built locally using:
> git checkout v1.5.1
> mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package
> Example run command:
> bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector 
> https://kinesis.us-east-1.amazonaws.com
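
For illustration, a minimal sketch of the failing pattern behind the exception (not the actual KinesisReceiver code): a plain mutable.HashMap cannot be cast to SynchronizedMap after the fact; the mixin has to be part of the map's type at construction time.

{code}
import scala.collection.mutable

val plain = new mutable.HashMap[String, Int]()
// plain.asInstanceOf[mutable.SynchronizedMap[String, Int]]  // throws the ClassCastException above

// Mixing in SynchronizedMap when the map is created avoids the cast entirely.
val synced = new mutable.HashMap[String, Int]() with mutable.SynchronizedMap[String, Int]
synced.put("receivers", 1)
{code}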



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12282) Document spark.jars

2015-12-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12282:
--
   Priority: Trivial  (was: Major)
Component/s: Documentation

I don't see evidence this is intended to be exposed to end users, so I'd close 
this.

> Document spark.jars
> ---
>
> Key: SPARK-12282
> URL: https://issues.apache.org/jira/browse/SPARK-12282
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Justin Bailey
>Priority: Trivial
>
> The spark.jars property (as implemented in SparkSubmit.scala,  
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L516)
>  is not documented anywhere, and should be.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11959) Document normal equation solver for ordinary least squares in user guide

2015-12-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053743#comment-15053743
 ] 

Joseph K. Bradley commented on SPARK-11959:
---

[~yanboliang] Will you have time to write this guide section?  If not, please 
let me know.

> Document normal equation solver for ordinary least squares in user guide
> 
>
> Key: SPARK-11959
> URL: https://issues.apache.org/jira/browse/SPARK-11959
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>
> Assigning since you wrote the feature, but please reassign as needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11959) Document normal equation solver for ordinary least squares in user guide

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11959:
--
Issue Type: Sub-task  (was: Documentation)
Parent: SPARK-12285

> Document normal equation solver for ordinary least squares in user guide
> 
>
> Key: SPARK-11959
> URL: https://issues.apache.org/jira/browse/SPARK-11959
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>
> Assigning since you wrote the feature, but please reassign as needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11965) Update user guide for RFormula feature interactions

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11965:
--
Issue Type: Sub-task  (was: Documentation)
Parent: SPARK-12285

> Update user guide for RFormula feature interactions
> ---
>
> Key: SPARK-11965
> URL: https://issues.apache.org/jira/browse/SPARK-11965
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> Update the user guide for RFormula to cover feature interactions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12272) Gradient boosted trees: too slow at the first finding best splits

2015-12-11 Thread Wenmin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenmin Wu updated SPARK-12272:
--
Attachment: training-log1.png

> Gradient boosted trees: too slow at the first finding best splits
> -
>
> Key: SPARK-12272
> URL: https://issues.apache.org/jira/browse/SPARK-12272
> Project: Spark
>  Issue Type: Request
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Wenmin Wu
> Attachments: training-log1.png
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12281) Fixed potential exceptions when exiting a local cluster.

2015-12-11 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-12281:


 Summary: Fixed potential exceptions when exiting a local cluster.
 Key: SPARK-12281
 URL: https://issues.apache.org/jira/browse/SPARK-12281
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: Shixiong Zhu


Fixed the following potential exceptions when exiting a local cluster.
{code}
java.lang.AssertionError: assertion failed: executor 4 state transfer from 
RUNNING to RUNNING is illegal
at scala.Predef$.assert(Predef.scala:179)
at 
org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:260)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
{code}
java.lang.IllegalStateException: Shutdown hooks cannot be modified during 
shutdown.
at 
org.apache.spark.util.SparkShutdownHookManager.add(ShutdownHookManager.scala:246)
at 
org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:191)
at 
org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:180)
at 
org.apache.spark.deploy.worker.ExecutorRunner.start(ExecutorRunner.scala:73)
at 
org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:474)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12247:
--
Parent Issue: SPARK-12285  (was: SPARK-8517)

> Documentation for spark.ml's ALS and collaborative filtering in general
> ---
>
> Key: SPARK-12247
> URL: https://issues.apache.org/jira/browse/SPARK-12247
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
>
> We need to add a section in the documentation about collaborative filtering 
> in the dataframe API:
>  - copy explanations about collaborative filtering and ALS from spark.mllib
>  - provide an example with spark.ml's ALS
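
As a seed for that example, a hedged sketch of spark.ml ALS usage; the training/test DataFrames and column names are assumptions, not taken from the JIRA:

{code}
import org.apache.spark.ml.recommendation.ALS

// training, test: DataFrames with (userId: Int, movieId: Int, rating: Float) columns.
val als = new ALS()
  .setMaxIter(10)
  .setRegParam(0.1)
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")

val model = als.fit(training)
val predictions = model.transform(test)
{code}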



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11529) Add section in user guide for StreamingLogisticRegressionWithSGD

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11529:
--
Issue Type: Sub-task  (was: Documentation)
Parent: SPARK-12285

> Add section in user guide for StreamingLogisticRegressionWithSGD
> 
>
> Key: SPARK-11529
> URL: https://issues.apache.org/jira/browse/SPARK-11529
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Reporter: Joseph K. Bradley
>
> [~freeman-lab] Would you be able to do this for 1.6?  Or if there are others 
> who can, could you please ping them?  Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12293) Support UnsafeRow in LocalTableScan

2015-12-11 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12293:
--

 Summary: Support UnsafeRow in LocalTableScan
 Key: SPARK-12293
 URL: https://issues.apache.org/jira/browse/SPARK-12293
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12294) Support UnsafeRow in HiveTableScan

2015-12-11 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12294:
--

 Summary: Support UnsafeRow in HiveTableScan
 Key: SPARK-12294
 URL: https://issues.apache.org/jira/browse/SPARK-12294
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10931) PySpark ML Models should contain Param values

2015-12-11 Thread Evan Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053808#comment-15053808
 ] 

Evan Chen edited comment on SPARK-10931 at 12/11/15 11:51 PM:
--

Hey Joseph,

Thanks for the suggestion. 
What model abstraction and getattr method are you referring to?
I modified every model on the Python side to reflect how it is being done on 
the Scala side. 
Let me know what you think.


was (Author: evanchen92):
Hey Joseph,

Thanks for the suggestion. 
I was wondering what model abstraction and getattr method are you referring to?
I modified every model on the Python side to reflect how it is being done on 
the Scala side. 
Let me know what you think.

> PySpark ML Models should contain Param values
> -
>
> Key: SPARK-10931
> URL: https://issues.apache.org/jira/browse/SPARK-10931
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> PySpark spark.ml Models are generally wrappers around Java objects and do not 
> even contain Param values.  This JIRA is for copying the Param values from 
> the Estimator to the model.
> This can likely be solved by modifying Estimator.fit to copy Param values, 
> but should also include proper unit tests.
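
For illustration only, a minimal sketch (not the actual patch) of copying Param values from a fitted PySpark Estimator onto its Model, assuming the model declares Params with the same names; the helper name is illustrative, not part of the pyspark.ml API:

{code}
# Hedged sketch: copy every Param that is set on the estimator onto the model,
# provided the model declares a Param of the same name.
def copy_params_to_model(estimator, model):
    for param in estimator.params:
        if estimator.isDefined(param) and model.hasParam(param.name):
            model._set(**{param.name: estimator.getOrDefault(param)})
    return model
{code}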



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12296) Feature parity for pyspark.mllib StandardScalerModel

2015-12-11 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-12296:
-

 Summary: Feature parity for pyspark.mllib StandardScalerModel
 Key: SPARK-12296
 URL: https://issues.apache.org/jira/browse/SPARK-12296
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Joseph K. Bradley
Priority: Minor


Some methods are missing, such as ways to access the std, mean, etc.  This JIRA 
is for feature parity for pyspark.mllib.feature.StandardScaler & 
StandardScalerModel
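
As an illustration of the gap (not the eventual implementation), the missing accessors could be surfaced on the Python side by delegating to the wrapped Java model, roughly like:

{code}
from pyspark.mllib.feature import StandardScalerModel

# Hedged sketch: StandardScalerModel is a JavaModelWrapper, so std/mean can be
# exposed by calling the corresponding getters on the underlying Java model.
class StandardScalerModelWithStats(StandardScalerModel):
    @property
    def std(self):
        return self.call("std")

    @property
    def mean(self):
        return self.call("mean")
{code}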



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12296) Feature parity for pyspark.mllib StandardScalerModel

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12296:
--
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-11937

> Feature parity for pyspark.mllib StandardScalerModel
> 
>
> Key: SPARK-12296
> URL: https://issues.apache.org/jira/browse/SPARK-12296
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Some methods are missing, such as ways to access the std, mean, etc.  This 
> JIRA is for feature parity for pyspark.mllib.feature.StandardScaler & 
> StandardScalerModel



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6523) Error when get attribute of StandardScalerModel, When use python api

2015-12-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053830#comment-15053830
 ] 

Joseph K. Bradley commented on SPARK-6523:
--

You're right; sorry, I did not notice that PR when it was merged into Spark.  I just 
created a JIRA specific to your request: [SPARK-12296]

> Error when get attribute of StandardScalerModel, When use python api
> 
>
> Key: SPARK-6523
> URL: https://issues.apache.org/jira/browse/SPARK-6523
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, MLlib, PySpark
>Affects Versions: 1.3.0
>Reporter: lee.xiaobo.2006
>
> test code
> ===
> from pyspark import SparkConf, SparkContext
> from pyspark.mllib.util import MLUtils
> from pyspark.mllib.linalg import Vectors
> from pyspark.mllib.feature import StandardScaler
> conf = SparkConf().setAppName('Test')
> sc = SparkContext(conf=conf)
> data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
> label = data.map(lambda x: x.label)
> features = data.map(lambda x: x.features)
> scaler1 = StandardScaler().fit(features)
> print scaler1.std   # error
> sc.stop()
> ---
> error:
> Traceback (most recent call last):
>   File "/data1/s/apps/spark-app/app/test_ssm.py", line 22, in <module>
> print scaler1.std
> AttributeError: 'StandardScalerModel' object has no attribute 'std'
> 15/03/25 12:17:28 INFO Utils: path = 
> /data1/s/apps/spark-1.4.0-SNAPSHOT/data/spark-eb1ed7c0-a5ce-4748-a817-3cb0687ee282/blockmgr-5398b477-127d-4259-a71b-608a324e1cd3,
>  already present as root for deletion.
> =
> Another question, how to serialize or save the scaler model ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10931) PySpark ML Models should contain Param values

2015-12-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10931:


Assignee: Apache Spark

> PySpark ML Models should contain Param values
> -
>
> Key: SPARK-10931
> URL: https://issues.apache.org/jira/browse/SPARK-10931
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> PySpark spark.ml Models are generally wrappers around Java objects and do not 
> even contain Param values.  This JIRA is for copying the Param values from 
> the Estimator to the model.
> This can likely be solved by modifying Estimator.fit to copy Param values, 
> but should also include proper unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10931) PySpark ML Models should contain Param values

2015-12-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10931:


Assignee: (was: Apache Spark)

> PySpark ML Models should contain Param values
> -
>
> Key: SPARK-10931
> URL: https://issues.apache.org/jira/browse/SPARK-10931
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> PySpark spark.ml Models are generally wrappers around Java objects and do not 
> even contain Param values.  This JIRA is for copying the Param values from 
> the Estimator to the model.
> This can likely be solved by modifying Estimator.fit to copy Param values, 
> but should also include proper unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12287) Support UnsafeRow in MapPartitions/MapGroups/CoGroup

2015-12-11 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12287:
--

 Summary: Support UnsafeRow in MapPartitions/MapGroups/CoGroup
 Key: SPARK-12287
 URL: https://issues.apache.org/jira/browse/SPARK-12287
 Project: Spark
  Issue Type: Epic
Reporter: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12295) Manage the memory used by window function

2015-12-11 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12295:
--

 Summary: Manage the memory used by window function
 Key: SPARK-12295
 URL: https://issues.apache.org/jira/browse/SPARK-12295
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu


The buffered rows for a given frame should use UnsafeRow and be stored as pages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11606) ML 1.6 QA: Update user guide for new APIs

2015-12-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053790#comment-15053790
 ] 

Joseph K. Bradley commented on SPARK-11606:
---

I'll close this now that [SPARK-12285] contains the remaining open tasks.

> ML 1.6 QA: Update user guide for new APIs
> -
>
> Key: SPARK-11606
> URL: https://issues.apache.org/jira/browse/SPARK-11606
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> Check the user guide vs. a list of new APIs (classes, methods, data members) 
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
> * Create a JIRA for that feature, and assign it to the author of the feature
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").
> Note: Now that we have algorithms in spark.ml which are not in spark.mllib, 
> we should make subsections for the spark.ml API as needed. We can follow the 
> structure of the spark.mllib user guide.
> * The spark.ml user guide can provide: (a) code examples and (b) info on 
> algorithms which do not exist in spark.mllib.
> * We should not duplicate info in the spark.ml guides. Since spark.mllib is 
> still the primary API, we should provide links to the corresponding 
> algorithms in the spark.mllib user guide for more info.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12217) Document invalid handling for StringIndexer

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12217:
--
Assignee: Benjamin Fradet  (was: Apache Spark)

> Document invalid handling for StringIndexer
> ---
>
> Key: SPARK-12217
> URL: https://issues.apache.org/jira/browse/SPARK-12217
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Benjamin Fradet
>Assignee: Benjamin Fradet
>Priority: Minor
> Fix For: 1.6.1, 2.0.0
>
>
> Documentation is needed regarding the handling of invalid labels in 
> StringIndexer
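
For context, a hedged usage sketch of the behavior to document, assuming a Spark version in which PySpark's StringIndexer exposes the handleInvalid Param ("error" fails on unseen labels, "skip" drops those rows); column names and DataFrames below are placeholders:

{code}
from pyspark.ml.feature import StringIndexer

# Placeholder column names; handleInvalid availability in the Python API
# depends on the Spark version.
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex",
                        handleInvalid="skip")
indexed = indexer.fit(train_df).transform(test_df)
{code}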



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12183) Remove spark.mllib tree, forest implementations and use spark.ml

2015-12-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053823#comment-15053823
 ] 

Joseph K. Bradley commented on SPARK-12183:
---

Lower priority than both, really.  This is more of a clean-up task.  We could 
still improve the spark.ml code without doing this task, and GBT can be handled 
as a separate JIRA.  I'd say moving GBT code to spark.ml is higher priority 
than this since that is blocking adding more output columns to GBTs 
(rawPrediction, probability).

> Remove spark.mllib tree, forest implementations and use spark.ml
> 
>
> Key: SPARK-12183
> URL: https://issues.apache.org/jira/browse/SPARK-12183
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> This JIRA is for replacing the spark.mllib decision tree and random forest 
> implementations with the one from spark.ml.  The spark.ml one should be used 
> as a wrapper.  This should involve moving the implementation, but should 
> probably not require changing the tests (much).
> This blocks on 1 improvement to spark.mllib which needs to be ported to 
> spark.ml: [SPARK-10064]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12282) Document spark.jars

2015-12-11 Thread Justin Bailey (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053676#comment-15053676
 ] 

Justin Bailey edited comment on SPARK-12282 at 12/11/15 10:18 PM:
--

If you pass {{--conf spark.jars=".."}}, you can set this flag, which is 
actually pretty useful (it's a consistent way to set configuration).

So maybe {{spark-submit}} should warn or throw if this configuration is 
included?


was (Author: m4dc4p):
If you pass `--conf spark.jars=".."`, you can set this flag, which is actually 
pretty useful (its a consistent way to set configuration).

So maybe spark-submit should warn or throw if this configuration is included?

> Document spark.jars
> ---
>
> Key: SPARK-12282
> URL: https://issues.apache.org/jira/browse/SPARK-12282
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Justin Bailey
>Priority: Trivial
>
> The spark.jars property (as implemented in SparkSubmit.scala,  
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L516)
>  is not documented anywhere, and should be.
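
For illustration, the same property can also be set programmatically through SparkConf (the jar paths below are placeholders):

{code}
from pyspark import SparkConf, SparkContext

# spark.jars takes a comma-separated list of jars to make available to the
# driver and executors; the paths here are placeholders.
conf = (SparkConf()
        .setAppName("spark-jars-example")
        .set("spark.jars", "/path/to/libA.jar,/path/to/libB.jar"))
sc = SparkContext(conf=conf)
{code}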



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12282) Document spark.jars

2015-12-11 Thread Justin Bailey (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053676#comment-15053676
 ] 

Justin Bailey commented on SPARK-12282:
---

If you pass `--conf spark.jars=".."`, you can set this flag, which is actually 
pretty useful (it's a consistent way to set configuration).

So maybe spark-submit should warn or throw if this configuration is included?

> Document spark.jars
> ---
>
> Key: SPARK-12282
> URL: https://issues.apache.org/jira/browse/SPARK-12282
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Justin Bailey
>Priority: Trivial
>
> The spark.jars property (as implemented in SparkSubmit.scala,  
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L516)
>  is not documented anywhere, and should be.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12292) Support UnsafeRow in Generate

2015-12-11 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12292:
--

 Summary: Support UnsafeRow in Generate
 Key: SPARK-12292
 URL: https://issues.apache.org/jira/browse/SPARK-12292
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12217) Document invalid handling for StringIndexer

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-12217.
---
   Resolution: Fixed
Fix Version/s: 1.6.1
   2.0.0

Issue resolved by pull request 10257
[https://github.com/apache/spark/pull/10257]

> Document invalid handling for StringIndexer
> ---
>
> Key: SPARK-12217
> URL: https://issues.apache.org/jira/browse/SPARK-12217
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Benjamin Fradet
>Assignee: Benjamin Fradet
>Priority: Minor
> Fix For: 2.0.0, 1.6.1
>
>
> Documentation is needed regarding the handling of invalid labels in 
> StringIndexer



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-11937) Python API coverage check found issues for ML during 1.6 QA

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11937:
--
Comment: was deleted

(was: User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/10085)

> Python API coverage check found issues for ML during 1.6 QA
> ---
>
> Key: SPARK-11937
> URL: https://issues.apache.org/jira/browse/SPARK-11937
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, ML, MLlib, PySpark
>Reporter: Yanbo Liang
>
> Here is the todo list of SPARK-11604 found issues:
> Note: I did not list the SparkR related features (such as 
> ml.feature.Interaction). We have supported RFormula as a wrapper at Python 
> side, I think we should discuss the necessary to support other R related 
> features at Python side.
> * Missing classes
> ** ml.attribute SPARK-8516
> ** ml.feature 
> *** QuantileDiscretizer SPARK-11922
> *** ChiSqSelector SPARK-11923
> ** ml.classification
> *** OneVsRest SPARK-7861
> ** ml.clustering 
> *** LDA SPARK-11940
> ** mllib.clustering
> *** BisectingKMeans SPARK-11944
> * Missing methods/parameters SPARK-11938
> ** ml.classification SPARK-11815 SPARK-11820 
> ** ml.feature SPARK-11925
> ** ml.clustering SPARK-11945
> ** mllib.linalg SPARK-12040 SPARK-12041
> ** mllib.stat.test.StreamingTest SPARK-12042
> * Docs:
> ** ml.classification SPARK-11875



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12272) Gradient boosted trees: too slow at the first round of finding best splits

2015-12-11 Thread Wenmin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053827#comment-15053827
 ] 

Wenmin Wu commented on SPARK-12272:
---

I didn't run a synthetic test; this is the click-through data of my company. There are 28905 
features and 18644639 records in this data.

I trained a GBDT model with 200 trees (equal to the number of iterations) and maxDepth = 7. 
From 'training-log1', you can see that the first split takes 9.7 min. 
However, the single-node xgboost implementation takes less than 10 secs.

At first, I thought this was due to the statistics communication, but I looked 
into the detailed log of a single executor, as 'training-log2' shows. 
You can see that on a single executor these steps take 8 - 9 min.

I persist all the data in memory, as shown in 'training-log3'.

I also looked into the source of the GBDT implementation in Spark and found that the 
time complexity of finding the first split is O(K * N), which is the same as the 
xgboost implementation. So my question is how I can accelerate the training of 
GBDT with Spark.

> Gradient boosted trees: too slow at the first round of finding best splits
> -
>
> Key: SPARK-12272
> URL: https://issues.apache.org/jira/browse/SPARK-12272
> Project: Spark
>  Issue Type: Request
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Wenmin Wu
> Attachments: training-log1.png, training-log2.pnd.png, 
> training-log3.png
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10285) Add @since annotation to pyspark.ml.util

2015-12-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053933#comment-15053933
 ] 

Joseph K. Bradley commented on SPARK-10285:
---

I'll close the issue.  Thanks!

> Add @since annotation to pyspark.ml.util
> 
>
> Key: SPARK-10285
> URL: https://issues.apache.org/jira/browse/SPARK-10285
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Reporter: Xiangrui Meng
>Assignee: Yu Ishikawa
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10285) Add @since annotation to pyspark.ml.util

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-10285.
-
  Resolution: Not A Problem
Target Version/s:   (was: 1.6.0)

> Add @since annotation to pyspark.ml.util
> 
>
> Key: SPARK-10285
> URL: https://issues.apache.org/jira/browse/SPARK-10285
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Reporter: Xiangrui Meng
>Assignee: Yu Ishikawa
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10263) Add @Since annotation to ml.param and ml.*

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10263:
--
Target Version/s:   (was: 1.6.0)

> Add @Since annotation to ml.param and ml.*
> --
>
> Key: SPARK-10263
> URL: https://issues.apache.org/jira/browse/SPARK-10263
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>Assignee: Hiroshi Takahashi
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12301) Remove final from classes in spark.ml trees and ensembles where possible

2015-12-11 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-12301:
-

 Summary: Remove final from classes in spark.ml trees and ensembles 
where possible
 Key: SPARK-12301
 URL: https://issues.apache.org/jira/browse/SPARK-12301
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley


There have been continuing requests (e.g., [SPARK-7131]) for allowing users to 
extend and modify MLlib models and algorithms.

I want this to happen for the next release.  For GBT, this may need to wait on 
some refactoring (to move the implementation to spark.ml).  But it could be 
done for trees already.  This will be broken into subtasks.

If you are a user who needs these changes, please comment here about what 
specifically needs to be modified for your use case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7131) Move tree,forest implementation from spark.mllib to spark.ml

2015-12-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15054000#comment-15054000
 ] 

Joseph K. Bradley commented on SPARK-7131:
--

Yes, I'm sorry about how long this has taken, but I have enough confidence in 
the API now to proceed.  I've created a JIRA for doing this in the next release: 
[SPARK-12301], though I may not be able to look at this issue until January.  
Please post your thoughts there, and ping in early January if there is no 
activity.  Thank you!

> Move tree,forest implementation from spark.mllib to spark.ml
> 
>
> Key: SPARK-7131
> URL: https://issues.apache.org/jira/browse/SPARK-7131
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
> Fix For: 1.5.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> We want to change and improve the spark.ml API for trees and ensembles, but 
> we cannot change the old API in spark.mllib.  To support the changes we want 
> to make, we should move the implementation from spark.mllib to spark.ml.  We 
> will generalize and modify it, but will also ensure that we do not change the 
> behavior of the old API.
> There are several steps to this:
> 1. Copy the implementation over to spark.ml and change the spark.ml classes 
> to use that implementation, rather than calling the spark.mllib 
> implementation.  The current spark.ml tests will ensure that the 2 
> implementations learn exactly the same models.  Note: This should include 
> performance testing to make sure the updated code does not have any 
> regressions. --> *UPDATE*: I have run tests using spark-perf, and there were 
> no regressions.
> 2. Remove the spark.mllib implementation, and make the spark.mllib APIs 
> wrappers around the spark.ml implementation.  The spark.ml tests will again 
> ensure that we do not change any behavior.
> 3. Move the unit tests to spark.ml, and change the spark.mllib unit tests to 
> verify model equivalence.
> This JIRA is now for step 1 only.  Steps 2 and 3 will be in separate JIRAs.
> After these updates, we can more safely generalize and improve the spark.ml 
> implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries

2015-12-11 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053917#comment-15053917
 ] 

holdenk edited comment on SPARK-2870 at 12/12/15 1:27 AM:
--

So this seems to be resolved in Spark 1.6 with {code}createDataFrame{code}

e.g.:

{code}
input = [{"a": 1}, {"b": "coffee"}]
rdd = sc.parallelize(input)
df = sqlContext.createDataFrame(rdd, samplingRatio=1.0)
print df.schema
{code}
Results in 
{code}StructType(List(StructField(a,LongType,true),StructField(b,StringType,true))){code}

Do you think it's OK to close this issue?


was (Author: holdenk):
So this seems to be resolved in Spark 1.6 with `createDataFrame`

e.g.:

{code}
input = [{"a": 1}, {"b": "coffee"}]
rdd = sc.parallelize(input)
df = sqlContext.createDataFrame(rdd, samplingRatio=1.0)
print df.schema
{code}
Results in 
`StructType(List(StructField(a,LongType,true),StructField(b,StringType,true)))`

Do you think its OK to close this issue?

> Thorough schema inference directly on RDDs of Python dictionaries
> -
>
> Key: SPARK-2870
> URL: https://issues.apache.org/jira/browse/SPARK-2870
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Reporter: Nicholas Chammas
>
> h4. Background
> I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. 
> They process JSON text directly and infer a schema that covers the entire 
> source data set. 
> This is very important with semi-structured data like JSON since individual 
> elements in the data set are free to have different structures. Matching 
> fields across elements may even have different value types.
> For example:
> {code}
> {"a": 5}
> {"a": "cow"}
> {code}
> To get a queryable schema that covers the whole data set, you need to infer a 
> schema by looking at the whole data set. The aforementioned 
> {{SQLContext.json...()}} methods do this very well. 
> h4. Feature Request
> What we need is for {{SQLContext.inferSchema()}} to do this, too. 
> Alternatively, we need a new {{SQLContext}} method that works on RDDs of 
> Python dictionaries and does something functionally equivalent to this:
> {code}
> SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x)))
> {code}
> As of 1.0.2, 
> [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema]
>  just looks at the first element in the data set. This won't help much when 
> the structure of the elements in the target RDD is variable.
> h4. Example Use Case
> * You have some JSON text data that you want to analyze using Spark SQL. 
> * You would use one of the {{SQLContext.json...()}} methods, but you need to 
> do some filtering on the data first to remove bad elements--basically, some 
> minimal schema validation.
> * You deserialize the JSON objects to Python {{dict}} s and filter out the 
> bad ones. You now have an RDD of dictionaries.
> * From this RDD, you want a SchemaRDD that captures the schema for the whole 
> data set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries

2015-12-11 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053917#comment-15053917
 ] 

holdenk edited comment on SPARK-2870 at 12/12/15 1:26 AM:
--

So this seems to be resolved in Spark 1.6 with `createDataFrame`

e.g.:

{code}
input = [{"a": 1}, {"b": "coffee"}]
rdd = sc.parallelize(input)
df = sqlContext.createDataFrame(rdd, samplingRatio=1.0)
print df.schema
{code}
Results in 
`StructType(List(StructField(a,LongType,true),StructField(b,StringType,true)))`

Do you think it's OK to close this issue?


was (Author: holdenk):
So this seems to be resolved in Spark 1.6 with `createDataFrame`

e.g.:

{code:python}
input = [{"a": 1}, {"b": "coffee"}]
rdd = sc.parallelize(input)
df = sqlContext.createDataFrame(rdd, samplingRatio=1.0)
print df.schema
{code}
Results in 
`StructType(List(StructField(a,LongType,true),StructField(b,StringType,true)))`

Do you think its OK to close this issue?

> Thorough schema inference directly on RDDs of Python dictionaries
> -
>
> Key: SPARK-2870
> URL: https://issues.apache.org/jira/browse/SPARK-2870
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Reporter: Nicholas Chammas
>
> h4. Background
> I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. 
> They process JSON text directly and infer a schema that covers the entire 
> source data set. 
> This is very important with semi-structured data like JSON since individual 
> elements in the data set are free to have different structures. Matching 
> fields across elements may even have different value types.
> For example:
> {code}
> {"a": 5}
> {"a": "cow"}
> {code}
> To get a queryable schema that covers the whole data set, you need to infer a 
> schema by looking at the whole data set. The aforementioned 
> {{SQLContext.json...()}} methods do this very well. 
> h4. Feature Request
> What we need is for {{SQLContext.inferSchema()}} to do this, too. 
> Alternatively, we need a new {{SQLContext}} method that works on RDDs of 
> Python dictionaries and does something functionally equivalent to this:
> {code}
> SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x)))
> {code}
> As of 1.0.2, 
> [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema]
>  just looks at the first element in the data set. This won't help much when 
> the structure of the elements in the target RDD is variable.
> h4. Example Use Case
> * You have some JSON text data that you want to analyze using Spark SQL. 
> * You would use one of the {{SQLContext.json...()}} methods, but you need to 
> do some filtering on the data first to remove bad elements--basically, some 
> minimal schema validation.
> * You deserialize the JSON objects to Python {{dict}} s and filter out the 
> bad ones. You now have an RDD of dictionaries.
> * From this RDD, you want a SchemaRDD that captures the schema for the whole 
> data set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries

2015-12-11 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053917#comment-15053917
 ] 

holdenk commented on SPARK-2870:


So this seems to be resolved in Spark 1.6 with `createDataFrame`

e.g.:

{code:python}
input = [{"a": 1}, {"b": "coffee"}]
rdd = sc.parallelize(input)
df = sqlContext.createDataFrame(rdd, samplingRatio=1.0)
print df.schema
{code}
Results in 
`StructType(List(StructField(a,LongType,true),StructField(b,StringType,true)))`

Do you think it's OK to close this issue?

> Thorough schema inference directly on RDDs of Python dictionaries
> -
>
> Key: SPARK-2870
> URL: https://issues.apache.org/jira/browse/SPARK-2870
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Reporter: Nicholas Chammas
>
> h4. Background
> I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. 
> They process JSON text directly and infer a schema that covers the entire 
> source data set. 
> This is very important with semi-structured data like JSON since individual 
> elements in the data set are free to have different structures. Matching 
> fields across elements may even have different value types.
> For example:
> {code}
> {"a": 5}
> {"a": "cow"}
> {code}
> To get a queryable schema that covers the whole data set, you need to infer a 
> schema by looking at the whole data set. The aforementioned 
> {{SQLContext.json...()}} methods do this very well. 
> h4. Feature Request
> What we need is for {{SQLContext.inferSchema()}} to do this, too. 
> Alternatively, we need a new {{SQLContext}} method that works on RDDs of 
> Python dictionaries and does something functionally equivalent to this:
> {code}
> SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x)))
> {code}
> As of 1.0.2, 
> [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema]
>  just looks at the first element in the data set. This won't help much when 
> the structure of the elements in the target RDD is variable.
> h4. Example Use Case
> * You have some JSON text data that you want to analyze using Spark SQL. 
> * You would use one of the {{SQLContext.json...()}} methods, but you need to 
> do some filtering on the data first to remove bad elements--basically, some 
> minimal schema validation.
> * You deserialize the JSON objects to Python {{dict}} s and filter out the 
> bad ones. You now have an RDD of dictionaries.
> * From this RDD, you want a SchemaRDD that captures the schema for the whole 
> data set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12298) Infinite loop in DataFrame.sortWithinPartitions(String, String*)

2015-12-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053978#comment-15053978
 ] 

Apache Spark commented on SPARK-12298:
--

User 'ankurdave' has created a pull request for this issue:
https://github.com/apache/spark/pull/10271

> Infinite loop in DataFrame.sortWithinPartitions(String, String*)
> 
>
> Key: SPARK-12298
> URL: https://issues.apache.org/jira/browse/SPARK-12298
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>
> The String overload of DataFrame.sortWithinPartitions calls itself when it 
> should call the Column overload, causing an infinite loop:
> {code}
> Exception in thread "main" java.lang.StackOverflowError
>   at 
> org.apache.spark.sql.DataFrame.sortWithinPartitions(DataFrame.scala:612)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9578) Stemmer feature transformer

2015-12-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9578:
---

Assignee: (was: Apache Spark)

> Stemmer feature transformer
> ---
>
> Key: SPARK-9578
> URL: https://issues.apache.org/jira/browse/SPARK-9578
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Transformer mentioned first in [SPARK-5571] based on suggestion from 
> [~aloknsingh].  Very standard NLP preprocessing task.
> From [~aloknsingh]:
> {quote}
> We have one Scala stemmer in scalanlp%chalk 
> https://github.com/scalanlp/chalk/tree/master/src/main/scala/chalk/text/analyze
>   which can easily be copied (as it is an Apache project) and is in Scala too.
> I think this will be a better alternative than the Lucene englishAnalyzer or 
> opennlp.
> Note: we already use scalanlp%breeze via the Maven dependency, so I think 
> adding a scalanlp%chalk dependency is also an option. But as you said, we 
> can copy the code as it is small.
> {quote}
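
For a sense of the desired behavior only (not a proposed implementation), a toy hedged sketch that stems a tokens column with a UDF and a trivial suffix-stripping rule; a real Stemmer transformer would implement a proper algorithm such as Porter:

{code}
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Toy rule for illustration: strip a trailing "s" from each token.
def crude_stem(tokens):
    return [t[:-1] if t.endswith("s") else t for t in tokens]

stem_udf = udf(crude_stem, ArrayType(StringType()))
# tokenized_df and the "tokens"/"stemmed" column names are placeholders:
# stemmed_df = tokenized_df.withColumn("stemmed", stem_udf("tokens"))
{code}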



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


